1. Introduction
The exponential growth in digital transactions and online insurance services has created a parallel risk of fraudulent activity, particularly in life insurance. According to the Coalition Against Insurance Fraud, insurance fraud accounts for a global loss of USD 80 billion annually, with life insurance claims increasingly manipulated using forged documentation and identity theft. Traditional claim processing methods are resource intensive and offer delayed results, making manual fraud detection ineffective, unsustainable, and prone to errors. These limitations have encouraged the adoption of artificial intelligence (AI) and machine learning (ML) techniques to automate and enhance the fraud detection process in the insurance industry [1,2].
This research is motivated by several pressing limitations: recent trends in digital insurance workflows show an increasing occurrence of synthetic identities, falsified documentation, and abnormal behavioural patterns in claim submissions. These developments demand an intelligent, scalable fraud detection solution that can operate in real time while minimizing false positives and maintaining trust. Additionally, the complexity of high-dimensional insurance datasets and the rarity of fraud cases present challenges that traditional rule-based or shallow learning models cannot effectively address [3]. Therefore, advanced AI-driven systems capable of learning subtle patterns, modelling temporal dependencies, and generating interpretable results are urgently needed to improve operational efficiency, reduce financial losses, and ensure regulatory compliance within the insurance industry.
Despite the growing use of machine learning models for fraud detection, identifying fraudulent transactions within highly imbalanced datasets, where such instances constitute only a small proportion of the overall data, remains a significant challenge. Many existing models optimize accuracy, a metric that can be misleading when identifying rare but high-impact fraudulent claims [3,4]. Furthermore, traditional machine learning techniques like random forest and support vector machine often struggle to uncover latent dependencies, while deep learning models, despite their expressive power, require careful tuning and validation to perform effectively [4,5].
Despite the growing use of machine learning and deep learning models for insurance fraud detection, several important limitations remain in the current literature. First, many studies focus on maximizing overall accuracy, which can be misleading in imbalanced datasets where fraudulent claims form only a small part of the data. A model that classifies all claims as genuine might achieve high accuracy yet fail to identify any fraudulent activity, making it ineffective in real-world applications.
Second, traditional machine learning techniques such as random forest, XGBoost, support vector machine, and logistic regression perform well on structured tabular data and are interpretable. However, these models are limited in their ability to model temporal patterns or evolving fraud behaviours and often struggle to detect latent dependencies that are crucial for fraud detection [2].
Third, while deep learning methods like LSTM and Bi-LSTM can capture temporal dependencies, they lack interpretability. This “black-box” nature makes them difficult to justify in high-stakes, regulated environments like insurance, where decision transparency is essential for auditing and compliance [3].
Fourth, unsupervised models, including autoencoders and variational autoencoders (VAEs), have been used for anomaly detection without labelled data. However, their effectiveness is limited in highly imbalanced datasets because the latent space representations of fraudulent and genuine cases often overlap, reducing their ability to distinguish between them [4].
Lastly, hybrid models such as CNN-LSTM, LSTM-XGBoost, and RF-CNN have shown potential by combining sequence modelling and ensemble learning. Yet, they usually optimize accuracy or precision, often neglecting recall, which is vital for reducing false negatives in fraud detection. Additionally, prior studies have rarely incorporated chaotic transformations in the latent space, which could improve the separation of complex fraud patterns and heighten anomaly sensitivity [5].
These limitations highlight the need for a unified approach that leverages the complementary strengths of these models while explicitly addressing the interpretability–performance trade-off. This study aims to close this gap by proposing a novel hybrid framework, as detailed in Section 3.
This research presents a timely, scalable, adaptive, and explainable AI approach for digital life insurance that flags suspicious claims in real time. While earlier studies explored isolated deep learning or machine learning models, few have examined comparative and hybrid architectures. Given the socio-economic impact of undetected fraud and the costs of false positives (which may harm genuine users), there is a strong need for a balanced, performance-oriented framework [6,7].
There is a gap in the literature: fraud detection models are often optimized for accuracy instead of fraud sensitivity (recall), and their evaluation rarely makes use of curve-based metrics such as ROC and precision–recall curves, which are better suited to imbalanced data. Few studies have examined hybrid models that combine the temporal learning capabilities of Bi-LSTM with the anomaly detection features of VAEs and the interpretability of RF, particularly in the context of life insurance fraud. This research introduces a hybrid architecture that integrates RF and Bi-LSTM and evaluates its performance against independent CVAE and Bi-LSTM models on a life insurance fraud dataset. Unlike prior studies, it emphasizes recall and PR-curve robustness rather than accuracy alone, thereby enhancing its relevance to real-world fraud detection.
This study contributes a comprehensive, interpretable, and recall-sensitive framework for life insurance fraud detection by integrating generative, sequential, and ensemble modelling approaches. Specifically, it introduces a novel hybrid architecture that combines the anomaly detection capabilities of a CVAE, the temporal sequence modelling strength of Bi-LSTM, and the interpretability of random forest classifiers. A carefully constructed and balanced dataset of 4000 life insurance applications was developed to simulate real-world fraud patterns. The proposed models were evaluated using metrics appropriate for imbalanced classification, such as recall, F1-score, and precision–recall curves, rather than the accuracy-centric metrics favoured by traditional models. The hybrid RF + Bi-LSTM model demonstrated improved recall and curve stability, making it suitable for deployment in operational fraud detection systems. Overall, this research provides both methodological advancement and a deployable solution for real-time fraud detection in the insurance sector.
Objectives of the Study
To develop and pre-process a life insurance fraud detection dataset consisting of 4000 applications and 83 features.
To design and implement three distinct models: CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM.
To evaluate the three models using confusion matrices, classification reports, ROC curves, and PR curves.
To compare their performance based on recall, F1-score, and precision consistency rather than only accuracy.
To identify the ideal model for real-time insurance fraud detection by assessing its interpretability, scalability, and ability to accurately detect fraudulent activities.
To address the limitations identified in the existing literature, such as poor fraud sensitivity in imbalanced datasets, the lack of temporal modelling, limited interpretability, and poor latent space separability, this study proposes a unified, data-driven framework integrating generative, sequential, and ensemble learning models. This approach is specifically tailored to life insurance fraud detection, a domain where rare event identification, decision transparency, and real-time applicability are essential. The proposed framework consists of three complementary components:
1. Chaotic variational autoencoder (CVAE):
This generative model is augmented with sinusoidal perturbations within the latent space to enhance the separability between genuine and fraudulent claims. This mechanism improves unsupervised anomaly detection by capturing non-linear dependencies and subtle deviations in fraudulent behaviour that may not be distinguishable through traditional latent representations.
2. Bidirectional long short-term memory (Bi-LSTM):
This is a deep sequential learning model that captures both past and future contextual dependencies in structured claim data. This allows the model to identify evolving repeated fraud behaviour that may otherwise be overlooked by unidirectional or shallow classifiers.
3. Hybrid random forest + Bi-LSTM ensemble:
This is a two-stage architecture in which the Bi-LSTM outputs are concatenated with the original features and passed to a random forest classifier. This fusion combines the Bi-LSTM’s temporal representations with the random forest’s decision-level interpretability and stability, making the model suitable for regulated domains.
These models were implemented and validated using a carefully constructed dataset of 4000 synthetic life insurance applications with 83 domain-relevant features. The dataset was pre-processed through imputation, encoding, normalization, and class balancing using SMOTE to reflect real-world imbalance scenarios. The three models—CVAE, Bi-LSTM, and RF + Bi-LSTM—were evaluated through extensive experimentation involving confusion matrices, classification reports, ROC curves, and precision–recall curves.
In contrast to previous studies that emphasize overall accuracy, this study’s evaluation focuses on recall, F1-score, and PR-curve robustness. These metrics are more appropriate for fraud detection, where false negatives carry a high cost. The results demonstrated that while the CVAE exhibits high overall accuracy, it lacks fraud sensitivity. The Bi-LSTM model significantly improves recall, and the hybrid RF+Bi-LSTM model achieves a better balance between detection performance and interpretability, showing improved PR-curve stability and real-time suitability. Thus, the proposed approach directly aligns with the research objectives by
Developing a realistic, balanced dataset reflecting life insurance claim fraud patterns.
Comparing these models based on sensitivity, precision, and robustness.
Identifying a model suitable for operational deployment in insurance ecosystems.
This framework not only advances methodological innovation but also bridges practical gaps in real-world fraud detection by offering a scalable and interpretable solution deployable in industry settings.
The rest of this paper is divided into five sections. Section 2 reviews the related literature on insurance fraud detection and identifies the research gap. Section 3 provides a detailed explanation of the proposed methodology, beginning with the data collection and preprocessing strategies applied to the custom-built life insurance dataset; it outlines the steps taken to handle class imbalance, missing values, feature encoding, and normalization, followed by descriptions of the three models: chaotic variational autoencoder (CVAE), bidirectional long short-term memory (Bi-LSTM), and the hybrid random forest + Bi-LSTM ensemble. Section 4 focuses on the experimental results, including confusion matrices, classification reports, performance metrics, and curve-based evaluations such as receiver operating characteristic (ROC) and precision–recall (PR) curves. Section 5 presents a comparative discussion interpreting each model’s behaviour with respect to recall, precision, F1-score, and robustness in curve evaluation, and discusses the implications of dataset characteristics, model complexity, and metric-driven analysis with reference to the relevant literature. Finally, Section 6 concludes the paper by summarizing the major findings, identifying the best-performing model, and offering insights for future research directions, including the integration of reinforcement learning and attention-based mechanisms. This structure ensures a logical progression from problem definition to practical implications, making the paper accessible to both academic and applied AI audiences.
2. Literature Survey
Fraud detection in the insurance sector is a critical area, where timely identification of fraudulent claims can greatly reduce operational losses. Various ML and DL techniques have been used to address this issue. Traditional models like random forest (RF), XGBoost, logistic regression, and support vector machine (SVM) perform well on structured data, providing interpretable results and low computational costs. However, these models cannot capture temporal dependencies, making them less effective in identifying complex fraud patterns that evolve over time.
To address these limitations, sequential models such as long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM) have been explored because of their ability to capture dependencies in both directions within sequential insurance claim data. These models are especially effective at identifying fraud schemes that evolve over time, such as repeated claims and timing-based behavioural inconsistencies. However, LSTM-based models lack interpretability, which is crucial in domains like insurance, where audit trails and justification are required for every decision.
On the other hand, autoencoders and variational autoencoders (VAEs) have been employed for unsupervised anomaly detection, learning the latent distribution of normal transactions and flagging outliers. These models perform well in scenarios with sparse labels but often struggle to clearly separate fraud from legitimate claims in the latent space, which limits their effectiveness in the fraud detection task.
Table 1 summarizes the related literature.
To improve accuracy and robustness, hybrid models have emerged, including combinations such as CNN-LSTM, RF-CNN, and LSTM-XGBoost, which aim to integrate sequential modelling with ensemble or convolutional layers. However, these models primarily optimize accuracy or precision and often fail to prioritize recall, which is more important in fraud detection due to the higher cost of false negatives. Moreover, most hybrids are still black boxes, lacking the transparency needed for real-time operational integration.
Research Gap
Despite the proliferation of models in insurance fraud detection, several persistent gaps remain unaddressed. Traditional machine learning algorithms such as random forest, XGBoost, logistic regression, and SVM have proven effective for structured datasets due to their interpretability and high baseline accuracy. However, they fail to model the temporal dynamics present in evolving fraud schemes, making them less effective for long-term behaviour patterns.
Deep learning models like LSTM and Bi-LSTM have attempted to capture these sequential dependencies, but their black-box nature limits interpretability, a critical feature in domains where decision traceability and compliance are paramount.
Additionally, while autoencoders and VAEs have been leveraged for unsupervised anomaly detection, their latent space often lacks separability between normal and fraudulent claims, especially in imbalanced datasets. Hybrid approaches such as CNN-LSTM and RF-CNN have shown improved robustness across various domains, yet they generally prioritize accuracy and precision over recall, which is vital for detecting fraudulent events. Most importantly, none of these studies incorporates chaotic transformations within the latent space, which recent research suggests could significantly enhance fraud separability and detection sensitivity.
In response to the limitations identified in prior studies, including the overemphasis on accuracy rather than recall, the lack of model interpretability, insufficient handling of rare event class imbalance, and the absence of chaotic latent modelling, our study introduces a novel hybrid model that integrates random forest and Bi-LSTM. This integrated approach effectively addresses all the aforementioned challenges by enhancing latent feature separability, capturing temporal fraud patterns, improving recall, and ensuring decision transparency through an interpretable classifier.
3. Methodology
3.1. Data Collection Process
This research utilized a dataset built from scratch to ensure a realistic representation of fraud detection in life insurance applications. The dataset comprises 4000 life insurance applications with 83 features, capturing key aspects such as policyholder demographics, financial history, claim details, transaction behaviour, and fraud risk indicators. The development process included domain knowledge from insurance fraud specialists, statistical analysis of existing fraud trends, and synthetic data generation to balance fraudulent-to-genuine claim ratios.
An initial pool of 120 features was generated based on industry fraud patterns and previous studies to construct the dataset. These features covered policy-related information, claim history, financial risk factors, and transaction behaviour. Following feature selection and correlation analysis, 83 features were finalized based on predictive importance, domain relevance, and data source availability. The dataset was designed with an approximate fraudulent-to-genuine ratio of 15:85, necessitating resampling techniques to address class imbalance issues.
The fraudulent applications were generated using statistical sampling and synthetic data augmentation. Fraud patterns were simulated based on historical trends, including abnormal claim amounts, policyholder inconsistencies, unusual transaction frequencies, and high-risk financial attributes. A controlled proportion of fraudulent cases was introduced into the dataset, mimicking real-world fraud behaviour.
3.2. Dataset Structure and Key Features
The dataset consists of numerical and categorical attributes, ensuring a comprehensive representation of fraud indicators in life insurance applications. These features were cautiously selected based on their predictive significance in fraud detection, statistical correlation analysis, and expert validation from insurance fraud analysts.
Table 2 and Table 3 show the breakdown of the dataset into numerical and categorical feature categories.
3.2.1. Numerical Features (Continuous and Discrete Variables)
Numerical features include policy details, financial indicators, claim history, and behavioural metrics, which provide quantitative insights into fraudulent activity.
3.2.2. Categorical Features (Nominal and Ordinal Variables)
Categorical features include policyholder demographics, claims-related classification, and behaviour patterns, which are useful in detecting fraud risk.
3.2.3. Key Insights on Dataset Structure
Numerical features provide detailed quantitative indicators of fraudulent activity, including financial factors, claim behaviour trends, and transaction anomalies.
Categorical features capture policyholder demographics, claim classifications, and behavioural risk factors, which help to identify fraudulent claims based on historical trends.
Fraudulent activities tend to be characterized by anomalies in both numerical and categorical features, such as unusual claim amounts, high debt ratios, frequent claim filings, and inconsistency in submitting documents.
This structured dataset enables the fraud detection models to leverage both structured numerical insight and categorical risk patterns for improved predictive accuracy.
3.3. Dataset Preprocessing
Dataset preprocessing is essential for ensuring data quality, consistency, and proper feature representation before training fraud detection models. This section outlines the preprocessing techniques applied to the dataset of 4000 life insurance applications, which contains 83 features, incorporating mathematical analysis for handling missing values, duplicates, categorical encoding, scaling, class imbalance, and feature selection.
3.3.1. Handling Missing Values
Missing values were observed in 7.5% of numerical variables and 5.2% of categorical variables. To address this, different imputation techniques were applied based on feature distribution. For normally distributed features (for example, annual income and total outstanding debt), mean imputation was used:

$$\hat{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,$$

where $\hat{x}$ is the imputed value, $N$ is the total number of available values, and $x_i$ represents the existing values. For skewed distributions (for example, claim amount and debt-to-income ratio), median imputation was applied to prevent the influence of outliers in the financial data:

$$\hat{x} = \mathrm{median}(x_1, x_2, \ldots, x_N).$$

For categorical missing values (for example, employment status and education level), mode imputation was used:

$$\hat{x} = \arg\max_{x} f(x),$$

where $f(x)$ is the frequency of occurrence of category $x$. For time-dependent variables (for example, claim submission time stamps), forward-fill and backward-fill techniques were used. For forward fill, a missing value $x_t$ is replaced with the most recent preceding observation, $x_t = x_{t-1}$; for backward fill, it is replaced with the next available observation, $x_t = x_{t+1}$.
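For illustration, a minimal pandas sketch of this imputation strategy follows; the column names are hypothetical placeholders rather than the exact feature names in the dataset.

```python
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Mean imputation for approximately normal features
    for col in ["annual_income", "total_outstanding_debt"]:
        df[col] = df[col].fillna(df[col].mean())
    # Median imputation for skewed financial features
    for col in ["claim_amount", "debt_to_income_ratio"]:
        df[col] = df[col].fillna(df[col].median())
    # Mode imputation for categorical features
    for col in ["employment_status", "education_level"]:
        df[col] = df[col].fillna(df[col].mode()[0])
    # Forward fill, then backward fill, for time-dependent fields
    df = df.sort_values("claim_submission_time")
    df["claim_submission_time"] = df["claim_submission_time"].ffill().bfill()
    return df
```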
3.3.2. Handling Duplication and Data Consistency
Duplicate records were identified based on the combination of policy ID, claim ID, and submission time stamp, ensuring no overlapping fraudulent entries. The duplicate-removal function was formulated as follows:

$$D(x_i) = \mathbb{1}\left[\exists\, j < i : (\text{PolicyID}_j, \text{ClaimID}_j, t_j) = (\text{PolicyID}_i, \text{ClaimID}_i, t_i)\right],$$

where $D(x_i)$ detects duplicate entries and $\mathbb{1}[\cdot]$ is the indicator function, which returns 1 if record $x_i$ duplicates an earlier record and 0 otherwise.
In total, 3.2% of records were identified as duplicates and removed to ensure data consistency.
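The same rule expressed with pandas, again using hypothetical key column names:

```python
# Keep the first occurrence of each (policy ID, claim ID, timestamp) triple;
# later occurrences are the duplicates flagged by D(x)
df = df.drop_duplicates(
    subset=["policy_id", "claim_id", "claim_submission_time"], keep="first")
```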
3.4. Feature Selection Process
From the initial dataset of 83 features, we employed a multi-stage approach to identify the five most influential features for interpretability and model refinement. First, a random forest classifier was trained, and feature importance was computed using the mean decrease in Gini impurity. This yielded a list of features ranked by their contribution to classification accuracy.
Next, we shortlisted the top 10 features and applied a sequential forward selection technique using the hybrid RF + Bi-LSTM model. This method evaluated a combination of features incrementally, selecting those that improve model performance, particularly the F1-score and recall. The optimal subset of five features was finalized based on this performance-driven criterion.
Finally, an interpretability check was conducted to ensure that each selected feature held practical relevance in the context of fraud detection. These included indicators such as high claims-to-premium ratios, unusual hospitalization durations, and temporal anomalies in claim submissions. The selected features are discussed in Supplementary File S1, along with their domain relevance.
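A sketch of this two-stage procedure with scikit-learn follows; X_train and y_train are assumed to be the preprocessed feature matrix and fraud labels. For simplicity, the forward-selection wrapper scores a plain random forest rather than the full hybrid model (which is not a scikit-learn estimator), with recall as the selection criterion to match the performance-driven choice described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Stage 1: rank all 83 features by mean decrease in Gini impurity
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
top10 = np.argsort(rf.feature_importances_)[::-1][:10]

# Stage 2: sequential forward selection of 5 features from the top 10
sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=200, random_state=42),
    n_features_to_select=5, direction="forward", scoring="recall", cv=5)
sfs.fit(X_train[:, top10], y_train)
selected = top10[sfs.get_support()]   # indices of the final five features
```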
3.5. Feature Scaling and Normalization
Feature scaling was applied to improve model convergence and ensure a fair weight distribution among features.
3.5.1. Z-Score Standardization for Normally Distributed Data
Features like annual income, total outstanding loans, and credit score were standardized as follows:

$$x' = \frac{x - \mu}{\sigma},$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
3.5.2. Min–Max Scaling for Bounded Data
For bounded features like the debt-to-income ratio and the number of prior claims, Min–Max scaling was applied as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

ensuring that the values fall within the range [0, 1].
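Both transformations map directly onto scikit-learn preprocessing classes, as sketched below; `normal_cols` and `bounded_cols` are hypothetical index lists for the two feature groups. In deployment, the scalers would be fit on the training split only and reused on the test split.

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Z-score standardization for approximately normal features
X[:, normal_cols] = StandardScaler().fit_transform(X[:, normal_cols])
# Min-max scaling to [0, 1] for bounded features
X[:, bounded_cols] = MinMaxScaler().fit_transform(X[:, bounded_cols])
```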
3.6. Handling Class Imbalance
The dataset had an imbalance ratio of 15:85 (fraudulent-to-genuine cases), leading to biased model predictions. This was addressed using the synthetic minority over-sampling technique (SMOTE).
For each fraud sample $x_i$, a synthetic instance was generated as follows:

$$x_{\text{new}} = x_i + \lambda\,(x_{nn} - x_i),$$

where $x_{nn}$ is one of the $k$-nearest-neighbour fraud instances and $\lambda \in [0, 1]$ is a random weight. After applying SMOTE, the dataset was balanced to a 50:50 fraudulent-to-genuine ratio, improving model sensitivity to fraud cases.
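A minimal sketch with imbalanced-learn, oversampling only the training split so that synthetic instances never leak into evaluation:

```python
from imblearn.over_sampling import SMOTE

# sampling_strategy=1.0 requests a 50:50 fraud-to-genuine ratio
smote = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
```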
3.7. Feature Selection
To reduce dimensionality and retain the most relevant fraud detection features, the following feature selection techniques were applied.
- A. Chi-Square Test for Categorical Variables
The dependence between each categorical feature and the fraud label was assessed using the chi-square statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where $O$ is the observed value and $E$ is the expected value.
- B. Mutual Information for Numerical Variables
Mutual information measures how much knowing one variable reduces uncertainty about the target (fraud status):

$$I(X;Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}.$$
- C. Random Forest Feature Importance Ranking
The final feature importance ranking was computed as follows:

$$\text{Importance}(f) = \frac{1}{T}\sum_{t=1}^{T} \Delta I_t(f),$$

where $T$ is the number of trees and $\Delta I_t(f)$ is the information gain from splits on feature $f$ in tree $t$ of the random forest. The five most predictive features obtained through this process are those described in Section 3.4 and Supplementary File S1.
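The three criteria correspond directly to scikit-learn utilities, as sketched below; `X_cat` (non-negative encoded categorical features), `X_num` (numerical features), and `X` are hypothetical matrices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif

chi2_scores, p_values = chi2(X_cat, y)      # chi-square test (categorical)
mi_scores = mutual_info_classif(X_num, y)   # mutual information (numerical)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
gini_importance = rf.feature_importances_   # mean decrease in Gini impurity
```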
3.8. Final Proposed Dataset
As shown in Table 4, the final preprocessed dataset contained 4000 records with 83 optimized features, balanced for fraud detection. The preprocessing steps eliminated inconsistencies, addressed class imbalance, transformed categorical data, and selected the most predictive attributes, ensuring better model generalization and fraud classification accuracy.
3.9. Proposed Methodology/Proposed Framework
This study introduces three robust methodologies for detecting insurance fraud using deep learning and hybrid architectures: the chaotic variational autoencoder (CVAE), bidirectional long short-term memory (Bi-LSTM), and a hybrid random forest + Bi-LSTM (RF + Bi-LSTM) ensemble. Each method is structured to extract high-level representations and exploit different aspects of the temporal and probabilistic behaviour inherent in fraudulent insurance transactions.
CVAE represents an advanced variant of the traditional variational autoencoder (VAE), tailored to incorporate chaotic perturbation in the latent space. This enables the model to respond sensitively to anomalous patterns by enhancing its latent representations through sinusoidal transformation, which amplifies the divergence between normal and fraudulent inputs. The theoretical justification lies in the ability of chaotic dynamics to capture non-linear interactions among features, making CVAE particularly suitable for anomaly detection in complex fraud data.
Bi-LSTM, a variant of RNN, is designed to address the limitations of unidirectional sequential learning. By processing input sequences in both forward and backward directions, Bi-LSTM captures contextual dependencies that may occur before or after a given point in the sequence. This bidirectional learning mechanism allows the model to better understand feature context, which is especially beneficial in identifying time-dependent fraud patterns.
The hybrid RF + Bi-LSTM model synergizes deep learning’s sequential modelling strengths with ensemble learning’s interpretability and robustness. The Bi-LSTM component extracts temporal features and outputs probability scores, which are then appended to the original feature vector and passed into a random forest classifier. This fusion ensures that the deep temporal features guide the ensemble decision-making, enhancing the model’s capacity to differentiate subtle fraud cases without sacrificing interpretability.
Together, these three models offer complementary perspectives: CVAE excels in detecting anomalous behaviour without requiring labels; Bi-LSTM captures temporal dependencies in claim data; and the hybrid combines predictive strength and explainability, making the architecture suitable for real-world fraud detection systems. Their systematic comparison offers insights into the advantages and limitations of different methodological approaches to insurance fraud detection.
3.9.1. Algorithms for CVAE
The chaotic variational autoencoder (CVAE) enhances the classic VAE framework by incorporating a chaotic function into the latent space to increase the model’s sensitivity to anomalies.
Figure 1 and the following algorithm present the end-to-end CVAE process with its corresponding theoretical and mathematical foundations (Table 5).
Step 1: Input Data Preparation:
Let $x \in \mathbb{R}^{F}$ represent an input vector from the insurance dataset. All features are normalized using standard techniques like Min–Max normalization.
Step 2: Encoding to Latent Space:
The encoder neural network maps the input $x$ to the parameters of a Gaussian distribution: $\mu = f_{\mu}(x)$, $\log \sigma^2 = f_{\sigma}(x)$. A latent vector is then sampled using the reparameterization trick, $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which allows backpropagation through stochastic sampling.
Step 3: Applying Chaotic Perturbation:
To increase robustness to subtle deviations and capture non-linear fraud behaviour, a chaotic sinusoidal transformation is applied to the latent vector:

$$z' = z + \alpha \sin(z),$$

where $\alpha$ is the chaos control parameter that governs the perturbation amplitude.
Step 4: Decoding and Reconstruction:
The decoder attempts to reconstruct the input, $\hat{x} = g(z')$, from the chaotic latent vector $z'$. This encourages the network to learn compact representations capable of capturing key data characteristics.
Step 5: Loss Function Computation:
The objective function of the CVAE combines the reconstruction error and the KL divergence between the learned latent distribution and a standard normal prior:

$$\mathcal{L} = \left\lVert x - \hat{x} \right\rVert^2 + D_{KL}\!\left(q(z \mid x)\,\Vert\,\mathcal{N}(0, I)\right).$$

The first term enforces accurate reconstruction; the second term regularizes the encoder to maintain a smooth latent representation.
Step 6: Model Training:
The loss $\mathcal{L}$ is minimized using gradient descent over multiple epochs; hyperparameters such as the learning rate, batch size, latent dimension, and chaos parameter $\alpha$ are tuned via cross-validation.
Step 7: Anomaly Detection:
After training, the reconstruction loss $\mathcal{L}_{\text{rec}}(x) = \lVert x - \hat{x} \rVert^2$ is computed. This loss quantifies the dissimilarity between the original input and its reconstruction. A high reconstruction error implies that the input deviates significantly from patterns the model considers normal, which may indicate fraudulent activity. To distinguish between genuine and fraudulent cases, an empirical threshold $\tau$ is established:

$$y = \begin{cases} 1, & \mathcal{L}_{\text{rec}}(x) > \tau \\ 0, & \text{otherwise.} \end{cases}$$

Here, $y = 1$ identifies a transaction as potentially fraudulent and $y = 0$ as genuine. The threshold is typically chosen based on validation data and domain expertise to optimize performance.
The CVAE model offers a powerful, flexible, and unsupervised approach to insurance fraud detection. Its ability to model complex non-linear patterns through chaotic transformation, combined with probabilistic latent encoding, makes it highly suitable for scenarios where fraudulent data is sparse and diverse. The integration of anomaly detection via reconstruction loss further enhances its applicability to real-world fraud detection systems. This complete step-by-step framework outlines a scalable methodology that can be extended to other domains where anomalous behaviour is difficult to label or predict.
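To make the algorithm concrete, the following Keras sketch implements Steps 1–7 under stated assumptions: the layer sizes, latent dimension, and chaos parameter alpha are illustrative choices, and the perturbation uses the additive form z + alpha·sin(z) reconstructed above rather than a confirmed configuration from the study.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ChaoticVAE(tf.keras.Model):
    def __init__(self, input_dim=83, latent_dim=8, alpha=0.1):
        super().__init__()
        self.alpha = alpha
        self.enc = tf.keras.Sequential(
            [layers.Dense(64, activation="relu"), layers.Dense(2 * latent_dim)])
        self.dec = tf.keras.Sequential(
            [layers.Dense(64, activation="relu"), layers.Dense(input_dim)])

    def call(self, x):
        mu, log_var = tf.split(self.enc(x), 2, axis=-1)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps      # reparameterization trick (Step 2)
        z = z + self.alpha * tf.sin(z)            # chaotic perturbation (Step 3)
        x_hat = self.dec(z)                       # reconstruction (Step 4)
        # KL divergence, added on top of the compiled reconstruction loss (Step 5)
        kl = -0.5 * tf.reduce_sum(
            1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
        self.add_loss(tf.reduce_mean(kl))
        return x_hat

cvae = ChaoticVAE()
cvae.compile(optimizer="adam", loss="mse")        # MSE = reconstruction term
# cvae.fit(X_train, X_train, epochs=50, batch_size=64)            # Step 6
# errors = tf.reduce_sum(tf.square(X_test - cvae(X_test)), -1)    # Step 7
# y_pred = tf.cast(errors > tau, tf.int32)   # tau chosen on validation data
```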
3.9.2. Bi-LSTM Model
As illustrated in Figure 2, bidirectional long short-term memory (Bi-LSTM) networks are an extension of traditional LSTM networks that process sequential data in both forward and backward directions. This dual processing allows the model to capture contextual information from both past and future time steps, which is particularly valuable in detecting fraud patterns embedded in structured data sequences (Table 6).
Step 1: Input Data Preparation:
Given a dataset with instances $x_i \in \mathbb{R}^{F}$, all features are normalized using standard scaling. Each instance is reshaped into a sequence format suitable for time-series modelling, $x_i \in \mathbb{R}^{T \times F}$, where $T = 1$ (a single time step) and $F = 83$ is the number of features.
Step 2: Embedding into LSTM Structure:
The Bi-LSTM consists of two LSTM layers: a forward layer producing hidden states $\overrightarrow{h_t}$ and a backward layer producing $\overleftarrow{h_t}$. The outputs from both directions are concatenated:

$$h_t = [\overrightarrow{h_t};\, \overleftarrow{h_t}].$$

This representation incorporates temporal context from both the past and the future.
Step 3: Fully Connected Layer and Prediction:
The combined representation is passed through a dense layer, $\hat{y} = \sigma(W h_t + b)$, where $\sigma$ is the sigmoid activation function used for binary classification. The output $\hat{y} \in [0, 1]$ represents the fraud probability.
Step 4: Loss Function and Optimization:
The loss function used is binary cross-entropy (BCE):

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right],$$

where $y_i$ are the true labels and $\hat{y}_i$ are the predicted probabilities.
Step 5: Model Training:
The Bi-LSTM model is trained using the Adam optimizer; the hyperparameters are summarized in Table 8.
Step 6: Evaluation and Thresholding:
After training, the model generates fraud probabilities for the test data. A classification threshold $\tau$ is applied:

$$y = \begin{cases} 1, & \hat{y} > \tau \\ 0, & \text{otherwise.} \end{cases}$$
The predictions are then evaluated using accuracy, precision, recall, F1-score, and ROC and precision–recall curves.
The Bi-LSTM architecture provides a powerful sequential modelling approach for structured fraud detection datasets. Its ability to learn contextual and temporal dependencies allows it to outperform simple feed-forward models in many classification tasks. In insurance fraud, where relationships among application fields may have implicit sequence-like characteristics, Bi-LSTM provides a robust, end-to-end supervised solution that integrates well with both performance and interpretability goals.
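A compact Keras sketch of this architecture follows; the hidden size, dropout rate, and training settings are illustrative assumptions rather than the exact configuration in Table 8.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

T, F = 1, 83   # single time step, 83 features (Step 1)

bilstm = models.Sequential([
    layers.Input(shape=(T, F)),
    layers.Bidirectional(layers.LSTM(64)),  # forward/backward states, concatenated (Step 2)
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # fraud probability (Step 3)
])
bilstm.compile(optimizer="adam", loss="binary_crossentropy",   # BCE (Step 4)
               metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])

# Step 5: X is the (n, 83) scaled feature matrix, y the binary fraud labels
# bilstm.fit(X.reshape(-1, T, F), y, epochs=20, batch_size=32, validation_split=0.2)
# Step 6: y_pred = (bilstm.predict(X_test.reshape(-1, T, F)) > tau).astype(int)
```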
3.9.3. Hybrid Random Forest + Bi-LSTM Model
Figure 3 shows how the hybrid random forest + Bi-LSTM model combines the deep temporal learning ability of Bi-LSTM with the ensemble classification power of random forests. This two-stage pipeline is designed to use the sequential modelling of Bi-LSTM to produce high-quality feature representations, which are then used to enhance the discriminative performance of a traditional machine learning classifier (Table 7).
Step 1: Input Data Preprocessing:
Given a dataset with instances $x_i \in \mathbb{R}^{T \times F}$, where $T = 1$ and $F = 83$.
Step 2: Bi-LSTM Training and Prediction:
A Bi-LSTM model is trained on the reshaped data using the same structure as defined earlier:
Forward and backward hidden states: $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$;
Concatenated output: $h_t = [\overrightarrow{h_t};\, \overleftarrow{h_t}]$;
Dense output layer with sigmoid activation: $\hat{p}_i = \sigma(W h_t + b)$. Each instance $x_i$ generates a predicted fraud probability $\hat{p}_i$.
Step 3: Feature Augmentation:
The predicted probabilities from the Bi-LSTM model are concatenated with the original input features: $x_i' = [x_i;\, \hat{p}_i]$. This augmented feature vector becomes the new input for the random forest model.
Step 4: Random Forest Training:
A random forest classifier is trained on the augmented data $x_i'$. Each decision tree $T_j$ in the ensemble learns to predict a class label, and the final prediction is obtained by majority vote:

$$\hat{y}_i = \text{mode}\left\{T_1(x_i'), T_2(x_i'), \ldots, T_J(x_i')\right\}.$$
Step 5: Prediction and Evaluation:
The test data undergoes the same Bi-LSTM transformation and augmentation. The trained random forest predicts the final class labels.
The hybrid RF + Bi-LSTM model provides a novel mechanism to boost fraud detection performance by integrating the temporal representation power of Bi-LSTM with the ensemble decision-making of random forests. This approach helps in reducing false positives and improving model confidence, offering both depth and interpretability in handling complex fraud detection tasks.
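The two-stage pipeline reduces to a few lines once the Bi-LSTM from the previous sketch is trained; `bilstm`, `X_train`, `X_test`, and `y_train` are assumed to carry over from the earlier steps.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Step 2: Bi-LSTM fraud probabilities for train and test splits
p_train = bilstm.predict(X_train.reshape(-1, 1, 83)).ravel()
p_test = bilstm.predict(X_test.reshape(-1, 1, 83)).ravel()

# Step 3: augment the original features with the predicted probabilities
X_train_aug = np.column_stack([X_train, p_train])
X_test_aug = np.column_stack([X_test, p_test])

# Steps 4-5: train the random forest on the augmented data and predict
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_aug, y_train)
y_pred = rf.predict(X_test_aug)
```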
Table 8 summarizes the architectural configurations and training hyperparameters for the CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM models used in the study. These configurations were chosen based on grid search optimization and validation performance to ensure a robust evaluation of model behaviour in imbalanced insurance fraud detection scenarios.
4. Results
A confusion matrix evaluates fraud detection performance on the insurance dataset, providing insight into each model’s behaviour when identifying minority-class instances, particularly fraudulent activity. This section evaluates the three architectures—CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM—based on their confusion matrices, with rigorous interpretation grounded in the principles of statistical learning and anomaly detection.
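For reference, the evaluation protocol used throughout this section maps onto standard scikit-learn calls, sketched below for a generic model whose fraud probabilities `y_prob` and true labels `y_test` are assumed available.

```python
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, precision_recall_curve,
                             roc_auc_score, roc_curve)

y_pred = (y_prob > 0.5).astype(int)            # thresholded fraud predictions
print(confusion_matrix(y_test, y_pred))        # [[TN, FP], [FN, TP]]
print(classification_report(y_test, y_pred))   # per-class precision/recall/F1

fpr, tpr, _ = roc_curve(y_test, y_prob)        # ROC curve points
print("ROC AUC:", roc_auc_score(y_test, y_prob))
prec, rec, _ = precision_recall_curve(y_test, y_prob)
print("Average precision:", average_precision_score(y_test, y_prob))
```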
- A. CVAE Model (Figure 4):
The CVAE model exhibits strong generalization to the majority class (genuine claims), reflected in a true-negative rate of over 98%. This outcome signifies the model’s robustness in preserving customer experience by minimizing false alarms. However, the recall for the minority class is critically low (~3.3%), revealing the model’s inability to capture subtle fraudulent anomalies in the latent space. While CVAE’s performance aligns with its anomaly detection nature, characterized by unsupervised learning and a focus on reconstructive fidelity, the trade-off in fraud sensitivity limits its stand-alone applicability in high-risk domains. This supports the hypothesis that supervised fine-tuning or ensemble augmentation is needed to boost fraud detection.
The CVAE model demonstrated strong generalization to genuine claims, misclassifying only 12 out of 678 non-fraud cases and yielding a low false-positive rate (~1.8%). However, it detected only 4 out of 122 actual fraud cases (recall ~3.3%). Despite being designed as an unsupervised anomaly detector, the model underperformed in identifying minority-class fraud, emphasizing the need for further latent space tuning or integration with supervised learning.
- B. Bi-LSTM Model:
The Bi-LSTM model captured seven fraud cases, improving recall to ~5.9%. However, it increased the number of false positives (42). The model performs better than CVAE in fraud recognition due to its ability to model temporal dependencies and complex sequential interactions in the data. Nevertheless, the 110 missed fraud cases present a critical limitation, indicating the need for class balancing and attention-enhanced architectures (Figure 5).
The Bi-LSTM model achieves improved recall (5.9%) and a balanced accuracy of approximately 72.2%. However, this comes at a cost: false positives increase to 42, raising the false-positive rate to 6.2%. Precision remains relatively low (14.3%), but the F1-score (~8.4%) is nearly double that of CVAE. Statistically, the Matthews correlation coefficient (MCC ≈ 0.09) is weak but better than random classification. The model’s bidirectional structure aids in extracting temporal dependencies, improving fraud signal recognition.
- C. Hybrid RF + Bi-LSTM Model:
The hybrid model yields results similar to Bi-LSTM, with the same recall (~5.98%) but slightly lower specificity. Accuracy remains high (~80.4%), and precision stands at 12.9%. While the random forest component slightly improves variance handling, its integration does not enhance true-positive detection. The false-positive rate increases to 6.8%, and Cohen’s Kappa remains low (~0.07), confirming minimal gain in predictive power.
All three models demonstrate high proficiency in classifying genuine claims but struggle significantly in detecting fraudulent ones.
The performance degradation on the minority class (Figure 6) confirms the known limitations of deep and ensemble models in imbalanced settings. From a statistical learning perspective, the high false-negative counts observed for CVAE, Bi-LSTM, and the hybrid RF + Bi-LSTM reaffirm the need for targeted optimization, such as the following:
Class rebalancing through SMOTE.
An anomaly-scoring model with threshold calibration.
Model stacking with cost-sensitive learning objectives.
While CVAE showed the lowest false-positive rate, it also had the lowest recall. The Bi-LSTM model improved recall at the cost of increased false positives, and the hybrid model did not significantly outperform Bi-LSTM, indicating saturation (Table 9).
- 2. Performance Metrics:
Table 10 presents the performance metrics, where the CVAE model maintains the highest accuracy (83.75%), primarily due to its highly conservative classification strategy that minimizes false positives (1.77%). However, this comes at the expense of a critically low recall (3.28%) and F1-score (5.79%), making it unsuitable for detecting fraudulent events. In comparison, the Bi-LSTM model achieves a more balanced result, offering a recall of 5.98% (an 82.3% improvement over CVAE) with a higher F1-score (8.43%). This makes Bi-LSTM notably better at identifying fraud while accepting a manageable increase in false positives (FP rate: 6.15%). The hybrid RF + Bi-LSTM model shows nearly identical recall (5.98%) to Bi-LSTM but slightly lower precision (12.96%) and F1-score (8.19%). Its false-positive rate rises further to 6.88%, and it does not improve on Bi-LSTM’s true-positive count despite its added complexity. These marginal gains highlight that simply increasing ensemble complexity does not guarantee significant improvement; targeted feature integration is crucial.
Recall gain: Bi-LSTM improves recall from 3.28% (CVAE) to 5.98%, marking a significant increase in sensitivity.
F1-score increase: The F1-score rises from 5.79% in CVAE to 8.43% in Bi-LSTM, a relative improvement of about 45.6%.
Accuracy trade-offs: Although CVAE appears superior with 83.75% accuracy, its recall and F1-score performance render it impractical for real-world fraud detection. Bi-LSTM achieves a more effective balance with slightly reduced accuracy (81%) but significantly enhanced fraud detection.
The accuracy values reported via the confusion matrices and the metric outputs show mild divergence. For instance, CVAE’s matrix-based calculation (83.75%) aligns precisely with the reported score, but accuracy metrics derived via libraries like scikit-learn may vary with the averaging scheme (macro/micro), class weighting, or data subset splits used during evaluation. Ultimately, the Bi-LSTM and hybrid models hold a statistically justified advantage under conditions where detecting fraud is more important than avoiding false alarms.
- 3. Classification Report Analysis:
The classification reports for the CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM models provide a granular perspective on how each algorithm handles fraud detection within an imbalanced dataset. The analysis focuses primarily on the metrics for the fraud class, where performance is critical and the class is typically underrepresented.
- A. CVAE Performance:
The CVAE model achieves the highest overall accuracy (84%) and best precision (25%) for the fraud class. This suggests that when CVAE predicts a case as fraudulent, it is more likely to be correct. However, its recall is extremely low (3%), indicating that it misses the vast majority of fraudulent claims. The F1-score (6%), the harmonic mean of precision and recall, remains weak, reinforcing its bias towards the majority class. This conservative prediction strategy minimizes false positives but fails to detect true fraud.
- B. Bi-LSTM Performance:
Bi-LSTM is a significant improvement on CVAE in terms of recall (6%), showing its strength in identifying more fraudulent instances by learning sequential dependencies in data. Although it sacrifices precision (14%) and accuracy (81%), the model achieves the highest F1-score (8%).
As Figure 7 shows, this indicates a more balanced trade-off between identifying fraud and minimizing false alarms. It also represents a 100% improvement in recall compared to CVAE, effectively doubling the number of correctly identified fraud cases.
- C. Hybrid Random Forest + Bi-LSTM Performance:
The hybrid model attempts to merge the strengths of Bi-LSTM sequence learning and random forest decision-making robustness. It produces a similar recall (5.98%) and slightly lower precision (12.96%) compared to Bi-LSTM. Its F1-score (8.19%), marginally lower than Bi-LSTM’s, remains within the same operational range.
However, the slight decrease in accuracy (80.37%) and the increased false-positive rate (Figure 8) suggest that the added complexity does not translate into meaningful gains. Thus, the hybrid model’s performance closely mimics that of Bi-LSTM without significantly outperforming it.
From the above results, the CVAE model, despite its high accuracy, exhibits very low fraud detection capability, with a recall of only 3%. This limitation severely affects its usability in real-world scenarios, where undetected fraudulent claims pose a financial risk. Bi-LSTM offers a better balance, doubling the recall rate over CVAE and raising the F1-score by over 45%, an effective compromise between fraud detection and false-positive control. The hybrid model, while theoretically expected to outperform Bi-LSTM, shows only marginal differences in F1-score and similar recall, indicating diminishing returns on model complexity without a smarter integration mechanism.
- 4. ROC Curve Analysis:
The receiver operating characteristic (ROC) curves were evaluated for the three models—CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM—to understand their power to discriminate between genuine and fraudulent insurance claims. The area under the curve (AUC) serves as a comprehensive metric for comparing their performance across thresholds. The CVAE model achieved an AUC of approximately 0.52, indicating that its classification capability is only marginally better than a random guess. This aligns with the earlier performance scores, where CVAE had high accuracy but extremely poor recall (0.0328), failing to detect the minority fraud class effectively (Figure 9).
The Bi-LSTM model, while achieving a similar AUC (0.50), performed better in terms of recall (0.0598) and F1-score (0.0843), confirming its greater sensitivity towards fraudulent patterns despite limited true discrimination. Notably, the hybrid RF + Bi-LSTM model reported a higher AUC of 0.54, suggesting a moderate improvement in class separation.
This model benefits from the complementary strengths of random forest ensemble learning and Bi-LSTM’s sequence awareness. Despite a similar recall to Bi-LSTM, the hybrid model offers a slight edge in overall class separation with a better precision–recall trade-off, reflecting a balanced yet incremental performance improvement.
ROC analysis reinforces the view that while CVAE appears strong through accuracy alone, its ROC curve exposes weak sensitivity. The Bi-LSTM and hybrid models exhibit better alignment with fraud detection goals, albeit still needing performance enhancements for real-world deployment.
- 5. Precision–Recall Curve Analysis:
Given the significant class imbalance in fraud datasets, the precision–recall (PR) curve offers a more informative evaluation than the ROC curve. In this context, the CVAE model showed an initial surge in precision (≈1.0) at near-zero recall, followed by a steep drop-off, eventually stabilizing around 0.17 average precision. This shape suggests that while CVAE is very selective and avoids false positives, it misses most fraud cases, consistent with its low true-positive count (TP = 4) and minimal F1-score (0.057) (Figure 10).
In contrast, the Bi-LSTM model demonstrated a more spread-out PR curve with an average precision of around 0.14, maintaining performance more steadily across a broader range of recall values. This improvement reflects its higher F1-score (0.084) and doubled true-positive count (TP = 7) compared to CVAE. The hybrid RF + Bi-LSTM model improved on this balance, producing a smoother PR curve and an average precision of around 0.15–0.17, with stable performance across recall levels. It matched Bi-LSTM in TP count (7), and despite a slightly higher false-positive count (FP = 47), its average precision across thresholds remained higher.
The PR trends validate that both Bi-LSTM and the hybrid model manage the precision–recall trade-off more effectively than CVAE, a key factor in fraud detection, where recall is critical. The hybrid model’s better average precision and smoother curve mark it as the most balanced among the three, though its advantage is incremental.