1. Introduction
The exponential growth in digital transactions and online insurance services has created a parallel risk of fraudulent activity, particularly in life insurance. According to the Coalition Against Insurance Fraud, insurance fraud accounts for a global loss of USD 80 billion annually, with life insurance claims increasingly manipulated using forged documentation and identity theft. Traditional claim processing methods are resource intensive and offer delayed results, making manual fraud detection ineffective, unsustainable, and prone to errors. These limitations have encouraged the adoption of artificial intelligence (AI) and machine learning (ML) techniques to automate and enhance the fraud detection process in the insurance industry [1,2].
This research is motivated by several pressing limitations: recent trends in digital insurance workflows show an increasing occurrence of synthetic identities, falsified documentation, and abnormal behavioural patterns in claim submissions. These developments demand an intelligent, scalable fraud detection solution that can operate in real time while minimizing false positives and maintaining trust. Additionally, the complexity of high-dimensional insurance datasets and the rarity of fraud cases present challenges that traditional rule-based or shallow learning models cannot effectively address [3]. Therefore, advanced AI-driven systems capable of learning subtle patterns, modelling temporal dependencies, and generating interpretable results are urgently needed to improve operational efficiency, reduce financial losses, and ensure regulatory compliance within the insurance industry.
Despite the growing use of machine learning models for fraud detection, identifying fraudulent transactions within highly imbalanced datasets, where such instances constitute only a small proportion of the overall data, remains a significant challenge. Many existing models optimize accuracy, a metric that can be misleading when identifying rare but high-impact fraudulent claims [3,4]. Furthermore, traditional machine learning techniques like random forest and support vector machine often struggle to uncover latent dependencies, while deep learning models, despite their expressive power, require careful tuning and validation to perform effectively [4,5].
Despite the growing use of machine learning and deep learning models for insurance fraud detection, several important limitations remain in the current literature. First, many studies focus on maximizing overall accuracy, which can be misleading in imbalanced datasets where fraudulent claims form only a small part of the data. A model that classifies all claims as genuine might achieve high accuracy yet fail to identify any fraudulent activity, making it ineffective in real-world applications.
Second, traditional machine learning techniques such as random forest, XGBoost, support vector machine, and logistic regression perform well on structured tabular data and are interpretable. However, these models are limited in their ability to model temporal patterns or evolving fraud behaviours and often struggle to detect latent dependencies that are crucial for fraud detection [2].
Third, while deep learning methods like LSTM and Bi-LSTM can capture temporal dependencies, they lack interpretability. This “black-box” nature makes them difficult to justify in high-stakes, regulated environments like insurance, where decision transparency is essential for auditing and compliance [3].
Fourth, unsupervised models, including autoencoders and variational autoencoders (VAEs), have been used for anomaly detection without labelled data. However, their effectiveness is limited in highly imbalanced datasets because the latent space representations of fraudulent and genuine cases often overlap, reducing their ability to distinguish between them [4].
Lastly, hybrid models such as CNN-LSTM, LSTM-XGBoost, and RF-CNN have shown potential by combining sequence modelling and ensemble learning. Yet, they usually optimize accuracy or precision, often neglecting recall, which is vital for reducing false negatives in fraud detection. Additionally, prior studies have rarely incorporated chaotic transformations in the latent space, which could improve the separation of complex fraud patterns and heighten anomaly sensitivity [5].
These limitations highlight the need for a unified approach that leverages the complementary strengths of these models while explicitly addressing the interpretability–performance trade-off. This study aims to close this gap by proposing a novel hybrid framework, as detailed in Section 3.
This research presents a timely, scalable, adaptive, and explainable AI approach for digital life insurance that flags suspicious claims in real time. While earlier studies explored isolated deep learning or machine learning models, few have examined comparative and hybrid architectures. Given the socio-economic impact of undetected fraud and the costs of false positives (which may harm genuine users), there is a strong need for a balanced, performance-oriented framework [6,7].
There is a gap in the literature: fraud detection models are often optimized for accuracy instead of fraud sensitivity (recall), and their evaluation rarely makes use of curve-based metrics such as ROC and precision–recall curves, which are better suited to imbalanced data. Few studies have examined hybrid models that combine the temporal learning capabilities of Bi-LSTM with the anomaly detection features of VAEs and the interpretability of RF, particularly in the context of life insurance fraud. This research introduces a hybrid architecture that integrates RF and Bi-LSTM and evaluates its performance against independent CVAE and Bi-LSTM models on a life insurance fraud dataset. Unlike prior studies, it emphasizes recall and PR-curve robustness rather than accuracy alone, thereby enhancing its relevance to real-world fraud detection.
This study contributes a comprehensive, interpretable, and recall-sensitive framework for life insurance fraud detection by integrating generative, sequential, and ensemble modelling approaches. Specifically, it introduces a novel hybrid architecture that combines the anomaly detection capabilities of a CVAE, the temporal sequence modelling strength of Bi-LSTM, and the interpretability of random forest classifiers. A carefully constructed and balanced dataset of 4000 life insurance applications was developed to simulate real-world fraud patterns. The proposed models were evaluated using metrics appropriate for imbalanced classification, such as recall, F1-score, and precision–recall curves, rather than the accuracy-centric metrics favoured by traditional models. The hybrid RF + Bi-LSTM model demonstrated improved recall and curve stability, making it suitable for deployment in operational fraud detection systems. Overall, this research provides both methodological advancement and a deployable solution for real-time fraud detection in the insurance sector.
Objectives of the Study
To develop and pre-process a life insurance fraud detection dataset consisting of 4000 applications and 83 features.
To design and implement three distinct models: CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM.
To evaluate the three models using confusion matrices, classification reports, ROC curves, and PR curves.
To compare their performance based on recall, F1-score, and precision consistency rather than only accuracy.
To identify the ideal model for real-time insurance fraud detection by assessing its interpretability, scalability, and ability to accurately detect fraudulent activities.
To address the limitations identified in the existing literature, such as poor fraud sensitivity in imbalanced datasets, the lack of temporal modelling, limited interpretability, and poor latent space separability, this study proposes a unified, data-driven framework integrating generative, sequential, and ensemble learning models. This approach is specifically tailored to life insurance fraud detection, a domain where rare event identification, decision transparency, and real-time applicability are essential. The proposed framework consists of three complementary components:
1. Chaotic variational autoencoder (CVAE):
This generative model is augmented with sinusoidal perturbations within the latent space to enhance the separability between genuine and fraudulent claims. This mechanism improves unsupervised anomaly detection by capturing non-linear dependencies and subtle deviations in fraudulent behaviour that may not be distinguishable through traditional latent representations.
2. Bidirectional long short-term memory (Bi-LSTM):
This is a deep sequential learning model that captures both past and future contextual dependencies in structured claim data. This allows the model to identify evolving repeated fraud behaviour that may otherwise be overlooked by unidirectional or shallow classifiers.
3. Hybrid random forest + Bi-LSTM ensemble:
This is a two-stage architecture in which the Bi-LSTM outputs are concatenated with the original features and passed to a random forest classifier. This fusion combines the Bi-LSTM’s temporal representations with the random forest’s decision-level interpretability and stability, making the model suitable for regulated domains.
These models were implemented and validated using a carefully constructed dataset of 4000 synthetic life insurance applications with 83 domain-relevant features. The dataset was pre-processed through imputation, encoding, normalization, and class balancing using SMOTE to reflect real-world imbalance scenarios. The three models—CVAE, Bi-LSTM, and RF + Bi-LSTM—were evaluated through extensive experimentation involving confusion matrices, classification reports, ROC curves, and precision–recall curves.
In contrast to previous studies that emphasize overall accuracy, this study’s evaluation focuses on recall, F1-score, and PR-curve robustness. These metrics are more appropriate for fraud detection, where false negatives carry a high cost. The results demonstrated that while the CVAE exhibits high overall accuracy, it lacks fraud sensitivity. The Bi-LSTM model significantly improves recall, and the hybrid RF+Bi-LSTM model achieves a better balance between detection performance and interpretability, showing improved PR-curve stability and real-time suitability. Thus, the proposed approach directly aligns with the research objectives by
Developing a realistic, balanced dataset reflecting life insurance claim fraud patterns.
Comparing these models based on sensitivity, precision, and robustness.
Identifying a model suitable for operational deployment in insurance ecosystems.
This framework not only advances methodological innovation but also bridges practical gaps in real-world fraud detection by offering a scalable and interpretable solution deployable in industry settings.
The rest of this paper is divided into five sections. Section 2 reviews the related literature on insurance fraud detection and identifies the research gap. Section 3 provides a detailed explanation of the proposed methodology, beginning with the data collection and preprocessing strategies applied to the custom-built life insurance dataset; it outlines the steps taken to handle class imbalance, missing values, feature encoding, and normalization, followed by descriptions of the three models: chaotic variational autoencoder (CVAE), bidirectional long short-term memory (Bi-LSTM), and the hybrid random forest + Bi-LSTM ensemble. Section 4 focuses on the experimental results, including confusion matrices, classification reports, performance metrics, and curve-based evaluations such as receiver operating characteristic (ROC) and precision–recall (PR) curves. Section 5 presents a comparative discussion interpreting each model’s behaviour with respect to recall, precision, F1-score, and robustness in curve evaluation, and discusses the implications of dataset characteristics, model complexity, and metric-driven analysis with reference to the relevant literature. Finally, Section 6 concludes the paper by summarizing the major findings, identifying the best-performing model, and offering insights for future research directions, including the integration of reinforcement learning and attention-based mechanisms. This structure ensures a logical progression from problem definition to practical implications, making the paper accessible to both academic and applied AI audiences.
2. Literature Survey
Fraud detection in the insurance sector is a critical area, where timely identification of fraudulent claims can greatly reduce operational losses. Various ML and DL techniques have been used to address this issue. Traditional models like random forest (RF), XGBoost, logistic regression, and support vector machine (SVM) perform well on structured data, providing interpretable results and low computational costs. However, these models cannot capture temporal dependencies, making them less effective in identifying complex fraud patterns that evolve over time.
To address these limitations, sequential models such as long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM) have been explored because of their ability to capture dependencies in both directions within sequential insurance claim data. These models are especially effective at identifying fraud schemes that evolve over time, such as repeated claims and timing-based behavioural inconsistencies. However, LSTM-based models lack interpretability, which is crucial in domains like insurance, where audit trails and justification are required for every decision.
On the other hand, autoencoders and variational autoencoders (VAEs) have been employed for unsupervised anomaly detection, learning the latent distribution of normal transactions and flagging outliers. These models perform well in scenarios with sparse labels but often struggle to clearly separate fraud from legitimate claims in the latent space, which limits their effectiveness in the fraud detection task.
Table 1 summarizes the related literature.
To improve accuracy and robustness, hybrid models have emerged, including combinations such as CNN-LSTM, RF-CNN, and LSTM-XGBoost, which aim to integrate sequential modelling with ensemble or convolutional layers. However, these models primarily optimize accuracy or precision and often fail to prioritize recall, which is more important in fraud detection due to the higher cost of false negatives. Moreover, most hybrids are still black boxes, lacking the transparency needed for real-time operational integration.
Research Gap
Despite the proliferation of models in insurance fraud detection, several persistent gaps remain unaddressed. Traditional machine learning algorithms such as random forest, XGBoost, logistic regression, and SVM have proven effective for structured datasets due to their interpretability and high baseline accuracy. However, they fail to model the temporal dynamics present in evolving fraud schemes, making them less effective for long-term behaviour patterns.
Deep learning models like LSTM and Bi-LSTM have attempted to capture these sequential dependencies, but their black-box nature limits interpretability, a critical feature in domains where decision traceability and compliance are paramount.
Additionally, while autoencoders and VAEs have been leveraged for unsupervised anomaly detection, their latent space often lacks separability between normal and fraudulent claims, especially in imbalanced datasets. Hybrid approaches such as CNN-LSTM and RF-CNN have shown improved robustness across various domains, yet they generally prioritize accuracy and precision over recall, which is vital for detecting fraudulent events. Most importantly, none of these studies incorporates chaotic transformations within the latent space, which recent research suggests could significantly enhance fraud separability and detection sensitivity.
In response to the limitations identified in prior studies, including the overemphasis on accuracy rather than recall, the lack of model interpretability, insufficient handling of rare event class imbalance, and the absence of chaotic latent modelling, our study introduces a novel hybrid model that integrates random forest and Bi-LSTM. This integrated approach effectively addresses all the aforementioned challenges by enhancing latent feature separability, capturing temporal fraud patterns, improving recall, and ensuring decision transparency through an interpretable classifier.
3. Methodology
3.1. Data Collection Process
This research utilized a dataset built from scratch to ensure a realistic representation of fraud detection in life insurance applications. The dataset comprises 4000 life insurance applications with 83 features, capturing key aspects such as policyholder demographics, financial history, claim details, transaction behaviour, and fraud risk indicators. The development process included domain knowledge from insurance fraud specialists, statistical analysis of existing fraud trends, and synthetic data generation to balance fraudulent-to-genuine claim ratios.
An initial pool of 120 features was generated based on industry fraud patterns and previous studies to construct the dataset. These features covered policy-related information, claim history, financial risk factors, and transaction behaviour. Following feature selection and correlation analysis, 83 features were finalized based on predictive importance, domain relevance, and data source availability. The dataset was designed with an approximate fraudulent-to-genuine ratio of 15:85, necessitating resampling techniques to address class imbalance issues.
The fraudulent applications were generated using statistical sampling and synthetic data augmentation. Fraud patterns were simulated based on historical trends, including abnormal claim amounts, policyholder inconsistencies, unusual transaction frequencies, and high-risk financial attributes. A controlled proportion of fraudulent cases was introduced into the dataset, mimicking real-world fraud behaviour.
3.2. Dataset Structure and Key Features
The dataset consists of numerical and categorical attributes, ensuring a comprehensive representation of fraud indicators in life insurance applications. These features were cautiously selected based on their predictive significance in fraud detection, statistical correlation analysis, and expert validation from insurance fraud analysts.
Table 2 and Table 3 show the breakdown of the dataset into numerical and categorical feature categories.
3.2.1. Numerical Features (Continuous and Discrete Variables)
Numerical features include policy details, financial indicators, claim history, and behavioural metrics, which provide quantitative insights into fraudulent activity.
3.2.2. Categorical Features (Nominal and Ordinal Variables)
Categorical features include policyholder demographics, claims-related classification, and behaviour patterns, which are useful in detecting fraud risk.
3.2.3. Key Insights on Dataset Structure
Numerical features provide detailed quantitative indicators of fraudulent activity, including financial factors, claim behaviour trends, and transaction anomalies.
Categorical features capture policyholder demographics, claim classifications, and behavioural risk factors, which help to identify fraudulent claims based on historical trends.
Fraudulent activities tend to be characterized by anomalies in both numerical and categorical features, such as unusual claim amounts, high debt ratios, frequent claim filings, and inconsistency in submitting documents.
This structured dataset enables the fraud detection models to leverage both structured numerical insight and categorical risk patterns for improved predictive accuracy.
3.3. Dataset Preprocessing
Dataset preprocessing is essential for ensuring data quality, consistency, and proper feature representation before training fraud detection models. This section outlines the preprocessing techniques applied to the dataset of 4000 life insurance applications, which contains 83 features, incorporating mathematical analysis for handling missing values, duplicates, categorical encoding, scaling, class imbalance, and feature selection.
3.3.1. Handling Missing Values
Missing values were observed in 7.5% of numerical variables and 5.2% of categorical variables. To address this, different imputation techniques were applied based on feature distribution. For normally distributed features (for example, annual income and total outstanding debt), mean imputation was used:

$$\hat{x} = \frac{1}{N}\sum_{i=1}^{N} x_i,$$

where $\hat{x}$ is the imputed value, $N$ is the total number of available values, and $x_i$ represents the existing values. For skewed distributions (for example, claim amount and debt-to-income ratio), median imputation was applied to prevent the influence of outliers in the financial data:

$$\hat{x} = \mathrm{median}(x_1, x_2, \ldots, x_N).$$

For categorical missing values (for example, employment status and education level), mode imputation was used:

$$\hat{x} = \arg\max_{x} f(x),$$

where $f(x)$ is the frequency of occurrence of category $x$. For time-dependent variables (for example, claim submission time stamps), forward-fill and backward-fill techniques were used. For forward fill, a missing value $x_t$ is replaced with the most recent preceding observation, $x_t = x_{t-1}$; for backward fill, it is replaced with the next available observation, $x_t = x_{t+1}$.
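For illustration, a minimal pandas sketch of this imputation strategy follows; the column names are hypothetical placeholders rather than the exact feature names in the dataset.

```python
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Mean imputation for approximately normal features
    for col in ["annual_income", "total_outstanding_debt"]:
        df[col] = df[col].fillna(df[col].mean())
    # Median imputation for skewed financial features
    for col in ["claim_amount", "debt_to_income_ratio"]:
        df[col] = df[col].fillna(df[col].median())
    # Mode imputation for categorical features
    for col in ["employment_status", "education_level"]:
        df[col] = df[col].fillna(df[col].mode()[0])
    # Forward fill, then backward fill, for time-dependent fields
    df = df.sort_values("claim_submission_time")
    df["claim_submission_time"] = df["claim_submission_time"].ffill().bfill()
    return df
```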
3.3.2. Handling Duplication and Data Consistency
Duplicate records were identified based on the combination of policy ID, claim ID, and submission time stamp, ensuring no overlapping fraudulent entries. The duplicate-removal function was formulated as follows:

$$D(x_i) = \mathbb{1}\left[\exists\, j < i : (\text{PolicyID}_j, \text{ClaimID}_j, t_j) = (\text{PolicyID}_i, \text{ClaimID}_i, t_i)\right],$$

where $D(x_i)$ detects duplicate entries and $\mathbb{1}[\cdot]$ is the indicator function, which returns 1 if record $x_i$ duplicates an earlier record and 0 otherwise.
In total, 3.2% of records were identified as duplicates and removed to ensure data consistency.
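The same rule expressed with pandas, again using hypothetical key column names:

```python
# Keep the first occurrence of each (policy ID, claim ID, timestamp) triple;
# later occurrences are the duplicates flagged by D(x)
df = df.drop_duplicates(
    subset=["policy_id", "claim_id", "claim_submission_time"], keep="first")
```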
3.4. Feature Selection Process
From the initial dataset of 83 features, we employed a multi-stage approach to identify the five most influential features for interpretability and model refinement. First, a random forest classifier was trained, and feature importance was computed using the mean decrease in Gini impurity. This yielded a list of features ranked by their contribution to classification accuracy.
Next, we shortlisted the top 10 features and applied a sequential forward selection technique using the hybrid RF + Bi-LSTM model. This method evaluated a combination of features incrementally, selecting those that improve model performance, particularly the F1-score and recall. The optimal subset of five features was finalized based on this performance-driven criterion.
Finally, an interpretability check was conducted to ensure that each selected feature held practical relevance in the context of fraud detection. These included indicators such as high claims-to-premium ratios, unusual hospitalization durations, and temporal anomalies in claim submissions. The selected features are discussed in Supplementary File S1, along with their domain relevance.
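A sketch of this two-stage procedure with scikit-learn follows; X_train and y_train are assumed to be the preprocessed feature matrix and fraud labels. For simplicity, the forward-selection wrapper scores a plain random forest rather than the full hybrid model (which is not a scikit-learn estimator), with recall as the selection criterion to match the performance-driven choice described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Stage 1: rank all 83 features by mean decrease in Gini impurity
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
top10 = np.argsort(rf.feature_importances_)[::-1][:10]

# Stage 2: sequential forward selection of 5 features from the top 10
sfs = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=200, random_state=42),
    n_features_to_select=5, direction="forward", scoring="recall", cv=5)
sfs.fit(X_train[:, top10], y_train)
selected = top10[sfs.get_support()]   # indices of the final five features
```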
3.5. Feature Scaling and Normalization
Feature scaling was applied to improve model convergence and ensure a fair weight distribution among features.
3.5.1. Z-Score Standardization for Normally Distributed Data
Features like annual income, total outstanding loans, and credit score were standardized as follows:

$$x' = \frac{x - \mu}{\sigma},$$

where $\mu$ is the mean and $\sigma$ is the standard deviation.
3.5.2. Min–Max Scaling for Bounded Data
For bounded features like the debt-to-income ratio and the number of prior claims, Min–Max scaling was applied as follows:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$$

ensuring that the values fall within the range [0, 1].
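Both transformations map directly onto scikit-learn preprocessing classes, as sketched below; `normal_cols` and `bounded_cols` are hypothetical index lists for the two feature groups. In deployment, the scalers would be fit on the training split only and reused on the test split.

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Z-score standardization for approximately normal features
X[:, normal_cols] = StandardScaler().fit_transform(X[:, normal_cols])
# Min-max scaling to [0, 1] for bounded features
X[:, bounded_cols] = MinMaxScaler().fit_transform(X[:, bounded_cols])
```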
3.6. Handling Class Imbalance
The dataset had an imbalance ratio of 15:85 (fraudulent-to-genuine cases), leading to biased model predictions. This was addressed using the synthetic minority over-sampling technique (SMOTE).
For each fraud sample $x_i$, a synthetic instance was generated as follows:

$$x_{\text{new}} = x_i + \lambda\,(x_{nn} - x_i),$$

where $x_{nn}$ is one of the $k$-nearest-neighbour fraud instances and $\lambda \in [0, 1]$ is a random weight. After applying SMOTE, the dataset was balanced to a 50:50 fraudulent-to-genuine ratio, improving model sensitivity to fraud cases.
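A minimal sketch with imbalanced-learn, oversampling only the training split so that synthetic instances never leak into evaluation:

```python
from imblearn.over_sampling import SMOTE

# sampling_strategy=1.0 requests a 50:50 fraud-to-genuine ratio
smote = SMOTE(sampling_strategy=1.0, k_neighbors=5, random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
```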
3.7. Feature Selection
To reduce dimensionality and retain the most relevant fraud detection features, the following feature selection techniques were applied.
- A. Chi-Square Test for Categorical Variables
The dependence between each categorical feature and the fraud label was assessed using the chi-square statistic:

$$\chi^2 = \sum \frac{(O - E)^2}{E},$$

where $O$ is the observed value and $E$ is the expected value.
- B. Mutual Information for Numerical Variables
Mutual information measures how much knowing one variable reduces uncertainty about the target (fraud status):

$$I(X;Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}.$$
- C. Random Forest Feature Importance Ranking
The final feature importance ranking was computed as follows:

$$\text{Importance}(f) = \frac{1}{T}\sum_{t=1}^{T} \Delta I_t(f),$$

where $T$ is the number of trees and $\Delta I_t(f)$ is the information gain from splits on feature $f$ in tree $t$ of the random forest. The five most predictive features obtained through this process are those described in Section 3.4 and Supplementary File S1.
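The three criteria correspond directly to scikit-learn utilities, as sketched below; `X_cat` (non-negative encoded categorical features), `X_num` (numerical features), and `X` are hypothetical matrices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, mutual_info_classif

chi2_scores, p_values = chi2(X_cat, y)      # chi-square test (categorical)
mi_scores = mutual_info_classif(X_num, y)   # mutual information (numerical)

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
gini_importance = rf.feature_importances_   # mean decrease in Gini impurity
```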
3.8. Final Proposed Dataset
As shown in Table 4, the final preprocessed dataset contained 4000 records with 83 optimized features, balanced for fraud detection. The preprocessing steps eliminated inconsistencies, addressed class imbalance, transformed categorical data, and selected the most predictive attributes, ensuring better model generalization and fraud classification accuracy.
3.9. Proposed Methodology/Proposed Framework
This study introduces three robust methodologies for detecting insurance fraud using deep learning and hybrid architectures: the chaotic variational autoencoder (CVAE), bidirectional long short-term memory (Bi-LSTM), and a hybrid random forest + Bi-LSTM (RF + Bi-LSTM) ensemble. Each method is structured to extract high-level representations and exploit different aspects of the temporal and probabilistic behaviour inherent in fraudulent insurance transactions.
CVAE represents an advanced variant of the traditional variational autoencoder (VAE), tailored to incorporate chaotic perturbation in the latent space. This enables the model to respond sensitively to anomalous patterns by enhancing its latent representations through sinusoidal transformation, which amplifies the divergence between normal and fraudulent inputs. The theoretical justification lies in the ability of chaotic dynamics to capture non-linear interactions among features, making CVAE particularly suitable for anomaly detection in complex fraud data.
Bi-LSTM, a variant of RNN, is designed to address the limitations of unidirectional sequential learning. By processing input sequences in both forward and backward directions, Bi-LSTM captures contextual dependencies that may occur before or after a given point in the sequence. This bidirectional learning mechanism allows the model to better understand feature context, which is especially beneficial in identifying time-dependent fraud patterns.
The hybrid RF + Bi-LSTM model synergizes deep learning’s sequential modelling strengths with ensemble learning’s interpretability and robustness. The Bi-LSTM component extracts temporal features and outputs probability scores, which are then appended to the original feature vector and passed into a random forest classifier. This fusion ensures that the deep temporal features guide the ensemble decision-making, enhancing the model’s capacity to differentiate subtle fraud cases without sacrificing interpretability.
Together, these three models offer complementary perspectives: CVAE excels in detecting anomalous behaviour without requiring labels; Bi-LSTM captures temporal dependencies in claim data; and the hybrid combines predictive strength and explainability, making the architecture suitable for real-world fraud detection systems. Their systematic comparison offers insights into the advantages and limitations of different methodological approaches to insurance fraud detection.
3.9.1. Algorithms for CVAE
The chaotic variational autoencoder (CVAE) enhances the classic VAE framework by incorporating a chaotic function into the latent space to increase the model’s sensitivity to anomalies.
Figure 1 and the following algorithm present the end-to-end CVAE process with its corresponding theoretical and mathematical foundations (Table 5).
Step 1: Input Data Preparation:
Let $x \in \mathbb{R}^{F}$ represent an input vector from the insurance dataset. All features are normalized using standard techniques like Min–Max normalization.
Step 2: Encoding to Latent Space:
The encoder neural network maps the input $x$ to the parameters of a Gaussian distribution: $\mu = f_{\mu}(x)$, $\log \sigma^2 = f_{\sigma}(x)$. A latent vector is then sampled using the reparameterization trick, $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, which allows backpropagation through stochastic sampling.
Step 3: Applying Chaotic Perturbation:
To increase robustness to subtle deviations and capture non-linear fraud behaviour, a chaotic sinusoidal transformation is applied to the latent vector:

$$z' = z + \alpha \sin(z),$$

where $\alpha$ is the chaos control parameter that governs the perturbation amplitude.
Step 4: Decoding and Reconstruction:
The decoder attempts to reconstruct the input, $\hat{x} = g(z')$, from the chaotic latent vector $z'$. This encourages the network to learn compact representations capable of capturing key data characteristics.
Step 5: Loss Function Computation:
The objective function of the CVAE combines the reconstruction error and the KL divergence between the learned latent distribution and a standard normal prior:

$$\mathcal{L} = \left\lVert x - \hat{x} \right\rVert^2 + D_{KL}\!\left(q(z \mid x)\,\Vert\,\mathcal{N}(0, I)\right).$$

The first term enforces accurate reconstruction; the second term regularizes the encoder to maintain a smooth latent representation.
Step 6: Model Training:
The loss $\mathcal{L}$ is minimized using gradient descent over multiple epochs; hyperparameters such as the learning rate, batch size, latent dimension, and chaos parameter $\alpha$ are tuned via cross-validation.
Step 7: Anomaly Detection:
After training, the reconstruction loss $\mathcal{L}_{\text{rec}}(x) = \lVert x - \hat{x} \rVert^2$ is computed. This loss quantifies the dissimilarity between the original input and its reconstruction. A high reconstruction error implies that the input deviates significantly from patterns the model considers normal, which may indicate fraudulent activity. To distinguish between genuine and fraudulent cases, an empirical threshold $\tau$ is established:

$$y = \begin{cases} 1, & \mathcal{L}_{\text{rec}}(x) > \tau \\ 0, & \text{otherwise.} \end{cases}$$

Here, $y = 1$ identifies a transaction as potentially fraudulent and $y = 0$ as genuine. The threshold is typically chosen based on validation data and domain expertise to optimize performance.
The CVAE model offers a powerful, flexible, and unsupervised approach to insurance fraud detection. Its ability to model complex non-linear patterns through chaotic transformation, combined with probabilistic latent encoding, makes it highly suitable for scenarios where fraudulent data is sparse and diverse. The integration of anomaly detection via reconstruction loss further enhances its applicability to real-world fraud detection systems. This complete step-by-step framework outlines a scalable methodology that can be extended to other domains where anomalous behaviour is difficult to label or predict.
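To make the algorithm concrete, the following Keras sketch implements Steps 1–7 under stated assumptions: the layer sizes, latent dimension, and chaos parameter alpha are illustrative choices, and the perturbation uses the additive form z + alpha·sin(z) reconstructed above rather than a confirmed configuration from the study.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ChaoticVAE(tf.keras.Model):
    def __init__(self, input_dim=83, latent_dim=8, alpha=0.1):
        super().__init__()
        self.alpha = alpha
        self.enc = tf.keras.Sequential(
            [layers.Dense(64, activation="relu"), layers.Dense(2 * latent_dim)])
        self.dec = tf.keras.Sequential(
            [layers.Dense(64, activation="relu"), layers.Dense(input_dim)])

    def call(self, x):
        mu, log_var = tf.split(self.enc(x), 2, axis=-1)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * log_var) * eps      # reparameterization trick (Step 2)
        z = z + self.alpha * tf.sin(z)            # chaotic perturbation (Step 3)
        x_hat = self.dec(z)                       # reconstruction (Step 4)
        # KL divergence, added on top of the compiled reconstruction loss (Step 5)
        kl = -0.5 * tf.reduce_sum(
            1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
        self.add_loss(tf.reduce_mean(kl))
        return x_hat

cvae = ChaoticVAE()
cvae.compile(optimizer="adam", loss="mse")        # MSE = reconstruction term
# cvae.fit(X_train, X_train, epochs=50, batch_size=64)            # Step 6
# errors = tf.reduce_sum(tf.square(X_test - cvae(X_test)), -1)    # Step 7
# y_pred = tf.cast(errors > tau, tf.int32)   # tau chosen on validation data
```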
3.9.2. Bi-LSTM Model
As illustrated in Figure 2, bidirectional long short-term memory (Bi-LSTM) networks are an extension of traditional LSTM networks that process sequential data in both forward and backward directions. This dual processing allows the model to capture contextual information from both past and future time steps, which is particularly valuable in detecting fraud patterns embedded in structured data sequences (Table 6).
Step 1: Input Data Preparation:
Given a dataset with instances $x_i \in \mathbb{R}^{F}$, all features are normalized using standard scaling. Each instance is reshaped into a sequence format suitable for time-series modelling, $x_i \in \mathbb{R}^{T \times F}$, where $T = 1$ (a single time step) and $F = 83$ is the number of features.
Step 2: Embedding into LSTM Structure:
The Bi-LSTM consists of two LSTM layers: a forward layer producing hidden states $\overrightarrow{h_t}$ and a backward layer producing $\overleftarrow{h_t}$. The outputs from both directions are concatenated:

$$h_t = [\overrightarrow{h_t};\, \overleftarrow{h_t}].$$

This representation incorporates temporal context from both the past and the future.
Step 3: Fully Connected Layer and Prediction:
The combined representation is passed through a dense layer, $\hat{y} = \sigma(W h_t + b)$, where $\sigma$ is the sigmoid activation function used for binary classification. The output $\hat{y} \in [0, 1]$ represents the fraud probability.
Step 4: Loss Function and Optimization:
The loss function used is binary cross-entropy (BCE):

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right],$$

where $y_i$ are the true labels and $\hat{y}_i$ are the predicted probabilities.
Step 5: Model Training:
The Bi-LSTM model is trained using the Adam optimizer; the hyperparameters are summarized in Table 8.
Step 6: Evaluation and Thresholding:
After training, the model generates fraud probabilities for the test data. A classification threshold $\tau$ is applied:

$$y = \begin{cases} 1, & \hat{y} > \tau \\ 0, & \text{otherwise.} \end{cases}$$
The predictions are then evaluated using accuracy, precision, recall, F1-score, and ROC and precision–recall curves.
The Bi-LSTM architecture provides a powerful sequential modelling approach for structured fraud detection datasets. Its ability to learn contextual and temporal dependencies allows it to outperform simple feed-forward models in many classification tasks. In insurance fraud, where relationships among application fields may have implicit sequence-like characteristics, Bi-LSTM provides a robust, end-to-end supervised solution that integrates well with both performance and interpretability goals.
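A compact Keras sketch of this architecture follows; the hidden size, dropout rate, and training settings are illustrative assumptions rather than the exact configuration in Table 8.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

T, F = 1, 83   # single time step, 83 features (Step 1)

bilstm = models.Sequential([
    layers.Input(shape=(T, F)),
    layers.Bidirectional(layers.LSTM(64)),  # forward/backward states, concatenated (Step 2)
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),  # fraud probability (Step 3)
])
bilstm.compile(optimizer="adam", loss="binary_crossentropy",   # BCE (Step 4)
               metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.Precision()])

# Step 5: X is the (n, 83) scaled feature matrix, y the binary fraud labels
# bilstm.fit(X.reshape(-1, T, F), y, epochs=20, batch_size=32, validation_split=0.2)
# Step 6: y_pred = (bilstm.predict(X_test.reshape(-1, T, F)) > tau).astype(int)
```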
3.9.3. Hybrid Random Forest + Bi-LSTM Model
Figure 3 shows how the hybrid random forest + Bi-LSTM model combines the deep temporal learning ability of Bi-LSTM with the ensemble classification power of random forests. This two-stage pipeline is designed to use the sequential modelling of Bi-LSTM to produce high-quality feature representations, which are then used to enhance the discriminative performance of a traditional machine learning classifier (Table 7).
Step 1: Input Data Preprocessing:
Given a dataset with instances $x_i \in \mathbb{R}^{T \times F}$, where $T = 1$ and $F = 83$.
Step 2: Bi-LSTM Training and Prediction:
A Bi-LSTM model is trained on the reshaped data using the same structure as defined earlier:
Forward and backward hidden states: $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$;
Concatenated output: $h_t = [\overrightarrow{h_t};\, \overleftarrow{h_t}]$;
Dense output layer with sigmoid activation: $\hat{p}_i = \sigma(W h_t + b)$. Each instance $x_i$ generates a predicted fraud probability $\hat{p}_i$.
Step 3: Feature Augmentation:
The predicted probabilities from the Bi-LSTM model are concatenated with the original input features: $x_i' = [x_i;\, \hat{p}_i]$. This augmented feature vector becomes the new input for the random forest model.
Step 4: Random Forest Training:
A random forest classifier is trained on the augmented data $x_i'$. Each decision tree $T_j$ in the ensemble learns to predict a class label, and the final prediction is obtained by majority vote:

$$\hat{y}_i = \text{mode}\left\{T_1(x_i'), T_2(x_i'), \ldots, T_J(x_i')\right\}.$$
Step 5: Prediction and Evaluation:
The test data undergoes the same Bi-LSTM transformation and augmentation. The trained random forest predicts the final class labels.
The hybrid RF + Bi-LSTM model provides a novel mechanism to boost fraud detection performance by integrating the temporal representation power of Bi-LSTM with the ensemble decision-making of random forests. This approach helps in reducing false positives and improving model confidence, offering both depth and interpretability in handling complex fraud detection tasks.
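The two-stage pipeline reduces to a few lines once the Bi-LSTM from the previous sketch is trained; `bilstm`, `X_train`, `X_test`, and `y_train` are assumed to carry over from the earlier steps.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Step 2: Bi-LSTM fraud probabilities for train and test splits
p_train = bilstm.predict(X_train.reshape(-1, 1, 83)).ravel()
p_test = bilstm.predict(X_test.reshape(-1, 1, 83)).ravel()

# Step 3: augment the original features with the predicted probabilities
X_train_aug = np.column_stack([X_train, p_train])
X_test_aug = np.column_stack([X_test, p_test])

# Steps 4-5: train the random forest on the augmented data and predict
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_aug, y_train)
y_pred = rf.predict(X_test_aug)
```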
Table 8 summarizes the architectural configurations and training hyperparameters for the CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM models used in the study. These configurations were chosen based on grid search optimization and validation performance to ensure a robust evaluation of model behaviour in imbalanced insurance fraud detection scenarios.
4. Results
A confusion matrix evaluates fraud detection performance on the insurance dataset, providing insight into each model’s behaviour when identifying minority-class instances, particularly fraudulent activity. This section evaluates the three architectures—CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM—based on their confusion matrices, with rigorous interpretation grounded in the principles of statistical learning and anomaly detection.
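For reference, the evaluation protocol used throughout this section maps onto standard scikit-learn calls, sketched below for a generic model whose fraud probabilities `y_prob` and true labels `y_test` are assumed available.

```python
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix, precision_recall_curve,
                             roc_auc_score, roc_curve)

y_pred = (y_prob > 0.5).astype(int)            # thresholded fraud predictions
print(confusion_matrix(y_test, y_pred))        # [[TN, FP], [FN, TP]]
print(classification_report(y_test, y_pred))   # per-class precision/recall/F1

fpr, tpr, _ = roc_curve(y_test, y_prob)        # ROC curve points
print("ROC AUC:", roc_auc_score(y_test, y_prob))
prec, rec, _ = precision_recall_curve(y_test, y_prob)
print("Average precision:", average_precision_score(y_test, y_prob))
```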
- A. CVAE Model (Figure 4):
The CVAE model exhibits strong generalization to the majority class (genuine claims), reflected in a true-negative rate of over 98%. This outcome signifies the model’s robustness in preserving customer experience by minimizing false alarms. However, the recall for the minority class is critically low (~3.3%), revealing the model’s inability to capture subtle fraudulent anomalies in the latent space. While CVAE’s performance aligns with its anomaly detection nature, characterized by unsupervised learning and a focus on reconstructive fidelity, the trade-off in fraud sensitivity limits its stand-alone applicability in high-risk domains. This supports the hypothesis that supervised fine-tuning or ensemble augmentation is needed to boost fraud detection.
The CVAE model demonstrated strong generalization to genuine claims, misclassifying only 12 out of 678 non-fraud cases and yielding a low false-positive rate (~1.8%). However, it detected only 4 out of 122 actual fraud cases (recall ~3.3%). Despite being designed as an unsupervised anomaly detector, the model underperformed in identifying minority-class fraud, emphasizing the need for further latent space tuning or integration with supervised learning.
- B. Bi-LSTM Model:
The Bi-LSTM model captured seven fraud cases, improving recall to ~5.9%. However, it increased the number of false positives (42). The model performs better than CVAE in fraud recognition due to its ability to model temporal dependencies and complex sequential interactions in the data. Nevertheless, the 110 missed fraud cases present a critical limitation, indicating the need for class balancing and attention-enhanced architectures (Figure 5).
The Bi-LSTM model achieves improved recall (5.9%) and a balanced accuracy of approximately 72.2%. However, this comes at a cost: false positives increase to 42, raising the false-positive rate to 6.2%. Precision remains relatively low (14.3%), but the F1-score (~8.4%) is nearly double that of CVAE. Statistically, the Matthews correlation coefficient (MCC ≈ 0.09) is weak but better than random classification. The model’s bidirectional structure aids in extracting temporal dependencies, improving fraud signal recognition.
- C. Hybrid RF + Bi-LSTM Model:
The hybrid model yields results similar to Bi-LSTM, with the same recall (~5.98%) but slightly lower specificity. Accuracy remains high (~80.4%), and precision stands at 12.9%. While the random forest component slightly improves variance handling, its integration does not enhance true-positive detection. The false-positive rate increases to 6.8%, and Cohen’s Kappa remains low (~0.07), confirming minimal gain in predictive power.
All three models demonstrate high proficiency in classifying genuine claims but struggle significantly in detecting fraudulent ones.
The performance degradation on the minority class (Figure 6) confirms the known limitations of deep and ensemble models in imbalanced settings. From a statistical learning perspective, the high false-negative counts observed for CVAE, Bi-LSTM, and the hybrid RF + Bi-LSTM reaffirm the need for targeted optimization, such as the following:
Class rebalancing through SMOTE.
An anomaly-scoring model with threshold calibration.
Model stacking with cost-sensitive learning objectives.
While CVAE showed the lowest false-positive rate, it also had the lowest recall. The Bi-LSTM model improved recall at the cost of increased false positives, and the hybrid model did not significantly outperform Bi-LSTM, indicating saturation (Table 9).
- 2. Performance Metrics:
Table 10 presents the performance metrics, where the CVAE model maintains the highest accuracy (83.75%), primarily due to its highly conservative classification strategy that minimizes false positives (1.77%). However, this comes at the expense of a critically low recall (3.28%) and F1-score (5.79%), making it unsuitable for detecting fraudulent events. In comparison, the Bi-LSTM model achieves a more balanced result, offering a recall of 5.98% (an 82.3% improvement over CVAE) with a higher F1-score (8.43%). This makes Bi-LSTM notably better at identifying fraud while accepting a manageable increase in false positives (FP rate: 6.15%). The hybrid RF + Bi-LSTM model shows nearly identical recall (5.98%) to Bi-LSTM but slightly lower precision (12.96%) and F1-score (8.19%). Its false-positive rate rises further to 6.88%, and it does not improve on Bi-LSTM’s true-positive count despite its added complexity. These marginal gains highlight that simply increasing ensemble complexity does not guarantee significant improvement; targeted feature integration is crucial.
Recall gain: Bi-LSTM improves recall from 3.28% (CVAE) to 5.98%, marking a significant increase in sensitivity.
F1-score increase: The F1-score rises from 5.79% in CVAE to 8.43% in Bi-LSTM, a relative improvement of about 45.6%.
Accuracy trade-offs: Although CVAE appears superior with 83.75% accuracy, its recall and F1-score performance render it impractical for real-world fraud detection. Bi-LSTM achieves a more effective balance with slightly reduced accuracy (81%) but significantly enhanced fraud detection.
The accuracy values reported via the confusion matrices and the metric outputs show mild divergence. For instance, CVAE’s matrix-based calculation (83.75%) aligns precisely with the reported score, but accuracy metrics derived via libraries like scikit-learn may vary with the averaging scheme (macro/micro), class weighting, or data subset splits used during evaluation. Ultimately, the Bi-LSTM and hybrid models hold a statistically justified advantage under conditions where detecting fraud is more important than avoiding false alarms.
- 3. Classification Report Analysis:
The classification reports for the CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM models provide a granular perspective on how each algorithm handles fraud detection within an imbalanced dataset. The analysis focuses primarily on the metrics for the fraud class, where performance is critical and the class is typically underrepresented.
- A. CVAE Performance:
The CVAE model achieves the highest overall accuracy (84%) and best precision (25%) for the fraud class. This suggests that when CVAE predicts a case as fraudulent, it is more likely to be correct. However, its recall is extremely low (3%), indicating that it misses the vast majority of fraudulent claims. The F1-score (6%), the harmonic mean of precision and recall, remains weak, reinforcing its bias towards the majority class. This conservative prediction strategy minimizes false positives but fails to detect true fraud.
- B. Bi-LSTM Performance:
Bi-LSTM is a significant improvement on CVAE in terms of recall (6%), showing its strength in identifying more fraudulent instances by learning sequential dependencies in data. Although it sacrifices precision (14%) and accuracy (81%), the model achieves the highest F1-score (8%).
As Figure 7 shows, this indicates a more balanced trade-off between identifying fraud and minimizing false alarms. It also represents a 100% improvement in recall compared to CVAE, effectively doubling the number of correctly identified fraud cases.
- C. Hybrid Random Forest + Bi-LSTM Performance:
The hybrid model attempts to merge the strengths of Bi-LSTM sequence learning and random forest decision-making robustness. It produces a similar recall (5.98%) and slightly lower precision (12.96%) compared to Bi-LSTM. Its F1-score (8.19%), marginally lower than Bi-LSTM’s, remains within the same operational range.
However, the slight decrease in accuracy (80.37%) and the increased false-positive rate (Figure 8) suggest that the added complexity does not translate into meaningful gains. Thus, the hybrid model’s performance closely mimics that of Bi-LSTM without significantly outperforming it.
From the above results, the CVAE model, despite its high accuracy, exhibits very low fraud detection capability, with a recall of only 3%. This limitation severely affects its usability in real-world scenarios, where undetected fraudulent claims pose a financial risk. Bi-LSTM offers a better balance, doubling the recall rate over CVAE and raising the F1-score by over 45%, an effective compromise between fraud detection and false-positive control. The hybrid model, while theoretically expected to outperform Bi-LSTM, shows only marginal differences in F1-score and similar recall, indicating diminishing returns on model complexity without a smarter integration mechanism.
- 4. ROC Curve Analysis:
The receiver operating characteristic (ROC) curves were evaluated for the three models—CVAE, Bi-LSTM, and hybrid RF + Bi-LSTM—to understand their power to discriminate between genuine and fraudulent insurance claims. The area under the curve (AUC) serves as a comprehensive metric for comparing their performance across thresholds. The CVAE model achieved an AUC of approximately 0.52, indicating that its classification capability is only marginally better than a random guess. This aligns with the earlier performance scores, where CVAE had high accuracy but extremely poor recall (0.0328), failing to detect the minority fraud class effectively (Figure 9).
The Bi-LSTM model, while achieving a similar AUC (0.50), performed better in terms of recall (0.0598) and F1-score (0.0843), confirming its greater sensitivity towards fraudulent patterns despite limited true discrimination. Notably, the hybrid RF + Bi-LSTM model reported a higher AUC of 0.54, suggesting a moderate improvement in class separation.
This model benefits from the complementary strengths of random forest ensemble learning and Bi-LSTM’s sequence awareness. Despite a similar recall to Bi-LSTM, the hybrid model offers a slight edge in overall class separation with a better precision–recall trade-off, reflecting a balanced yet incremental performance improvement.
ROC analysis reinforces the view that while CVAE appears strong through accuracy alone, its ROC curve exposes weak sensitivity. The Bi-LSTM and hybrid models exhibit better alignment with fraud detection goals, albeit still needing performance enhancements for real-world deployment.
- 5. Precision–Recall Curve Analysis:
Given the significant class imbalance in fraud datasets, the precision–recall (PR) curve offers a more informative evaluation than the ROC curve. In this context, the CVAE model showed an initial surge in precision (≈1.0) at near-zero recall, followed by a steep drop-off, eventually stabilizing around 0.17 average precision. This shape suggests that while CVAE is very selective and avoids false positives, it misses most fraud cases, consistent with its low true-positive count (TP = 4) and minimal F1-score (0.057) (Figure 10).
In contrast, the Bi-LSTM model demonstrated a more spread-out PR curve with an average precision of around 0.14, maintaining performance more steadily across a broader range of recall values. This improvement reflects its higher F1-score (0.084) and doubled true-positive count (TP = 7) compared to CVAE. The hybrid RF + Bi-LSTM model improved on this balance, producing a smoother PR curve and an average precision of around 0.15–0.17, with stable performance across recall levels. It matched Bi-LSTM in TP count (7), and despite a slightly higher false-positive count (FP = 47), its average precision across thresholds remained higher.
The PR trends validate that both Bi-LSTM and the hybrid model manage the precision–recall trade-off more effectively than CVAE, a key factor in fraud detection, where recall is critical. The hybrid model’s better average precision and smoother curve mark it as the most balanced among the three, though its advantage is incremental.