Article

A Data-Driven Intelligent Supervision System for Generating High-Risk Organized Fraud Clues in Medical Insurance Funds

1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
2 The First Procuratorial Department, Shanghai People’s Procuratorate, Shanghai 200052, China
3 Institute of Cyber Science and Technology, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
4 Shanghai Key Laboratory of Integrated Administration Technologies for Information Security, Shanghai 200240, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3268; https://doi.org/10.3390/electronics14163268
Submission received: 29 June 2025 / Revised: 7 August 2025 / Accepted: 12 August 2025 / Published: 18 August 2025
(This article belongs to the Special Issue Network Security and Cryptography Applications)

Abstract

Medical insurance fraud, especially organized drug resale schemes, has become increasingly sophisticated, challenging traditional supervision methods. This paper presents an AI-powered legal supervision model that automatically detects fraudulent drug resale activities in medical insurance claims. Unlike rule-based approaches, our solution employs multi-dimensional behavioral analysis and adaptive clustering techniques to identify both individual anomalies and organized fraud networks. The proposed model follows a three-stage detection pipeline: (1) automated clue generation through feature aggregation across frequency, cost, and behavioral dimensions; (2) group behavior analysis using spatiotemporal patterns and medication similarity metrics; (3) risk stratification via FLASC clustering to dynamically determine suspicion thresholds. Key innovations include a data-driven threshold generation mechanism that eliminates expert bias and a cross-dimensional fraud pattern recognition system that connects individual outliers with group behaviors. Validated on real-world medical insurance data (8917 insurance cards, 1.1 million records), the model achieved 89% precision, 42% recall, and 87% accuracy in detecting high-risk fraud cases while uncovering previously unnoticed organized fraud rings. This research provides a scalable framework for intelligent healthcare fund supervision, with potential applications in other social security domains.

1. Introduction

In the digital era, the continuous maturation of cutting-edge technologies such as big data, blockchain, and artificial intelligence (AI) has created significant opportunities and strong momentum for the deep integration of digital procuratorial work and AI. In particular, in the field of medical insurance fund supervision, traditional manual review and oversight mechanisms are increasingly inadequate [1,2], given the vast number of regulatory targets, the scale of funds involved, and the massive volume of settlement data. Therefore, it is important to employ intelligent technologies to construct an efficient medical insurance supervision model. Medical insurance drug resale refers to the act of exploiting medical insurance policies to obtain medications in excessive quantities or through fraudulent means, which are subsequently resold for profit. This form of fraud imposes substantial economic losses and social harm. It not only results in the depletion of medical insurance funds but also severely undermines fair market competition and infringes upon the rights and interests of insured individuals. As one of the most prevalent forms of medical insurance fraud, it presents significant challenges to the security of healthcare funds and the effectiveness of regulatory oversight [3].
Traditional medical insurance fraud detection methods primarily include rule-based approaches and statistics-based distribution methods [4]. The rule-based approach formalizes expert knowledge into predefined rules and relies on automated computerized decision-making processes. Thornton et al. [5] developed a multi-dimensional data model and analysis technique based on the extension of expert rules to predict potential medical insurance fraud. However, expert rules must be designed separately for different scenarios, leading to certain limitations and dependence on expert judgment. In contrast, statistics-based distribution methods operate under the assumption that data conforms to a specific, statistically meaningful distribution. Deviations from this distribution are flagged as potential anomalies, which may indicate fraud risks. Peng et al. [6] observe that the Apriori algorithm generates a large number of candidate item sets and that the FP-Growth algorithm spends considerable time repeatedly constructing and releasing trees. Building on the original FP-Growth algorithm, they first construct maximal cliques and then mine frequent patterns within each maximal clique to discover specific diagnostic and treatment patterns.
In recent years, with the development of deep learning technologies, numerous deep learning solutions have been applied to medical insurance fraud detection [7,8,9]. Wang et al. [10] improve group and partition methods through label propagation algorithms in graph neural networks, enabling their application to multiple types of fraud detection. Ma et al. [11] propose that fraudulent accounts form within community networks, where nodes in generated subgraphs are considered abnormal. Based on this, they construct GraphRAD using the idea of label propagation to detect risky accounts in online shopping through community partitioning, achieving significant efficiency improvements compared to baseline methods. Tan et al. [12] address graph data with extremely low proportions of anomalous labels by modifying the Graph Convolutional Network (GCN) [13] structure, fusing node labels, node features, and edge information. They introduce the Label Propagation Algorithm (LPA) as a regularization term into GCN to enable learnable edge weights and design trainable weights for different features to enhance model expressiveness. However, existing supervision models still exhibit numerous limitations. For instance, most models can only simplistically simulate manual judgment processes, with their effectiveness heavily reliant on threshold settings for key fields. This makes them vulnerable to extreme data values, resulting in unstable judgment outcomes. Additionally, some supervisory logic depends on experts to set fixed thresholds. However, due to variations in medical insurance policies and socioeconomic environments across different regions, a unified threshold-based decision-making logic often proves inapplicable, thereby limiting the cross-regional applicability of these models. This further hinders the deep integration of intelligent legal supervision.
Considering the above limitations, we propose a legal supervision model for scenarios of drug resale. Our solution conducts comprehensive multi-dimensional data analysis and adaptively detects medical insurance fraud involving drug resale by following a three-stage logic of clue generation, group analysis, and risk stratification. The main contributions of our proposed method are as follows:
  • Construction of a Legal Supervision Model Based on Real Medical Insurance Data: We construct a novel legal supervision model based on real-world medical insurance data, following the stages of clue generation, group detection, and anomaly judgment. Through systematic data processing and analysis, the model significantly enhances the efficiency and accuracy of medical insurance fraud detection.
  • Multi-dimensional Group Aggregation Analysis: To address the group aggregation patterns typical of drug resale fraud, the model conducts a comprehensive analysis across temporal, spatial, and drug similarity dimensions. By combining group-level aggregation features with individual-level outliers, it achieves more accurate identification of potential fraud risks, thereby ensuring the security of medical insurance funds.
  • Adaptive Risk Stratification Based on Clustering Methods: In contrast to traditional approaches that rely on expert-defined rules to determine anomalies, the proposed model utilizes clustering algorithms to automatically generate decision thresholds. This data-driven strategy eliminates subjective biases introduced by expert-defined rules and improves the model’s adaptability to diverse data distributions.
The rest of this paper is organized as follows. In Section 2, we provide an overview of the relevant literature and previous studies related to our proposed method. In Section 3, we elaborate on our medical insurance fraud model for drug resale, which is based on clue generation, group detection, and adaptive risk assessment. In Section 4, we present comprehensive experimental results demonstrating the effectiveness and superiority of our approach. In Section 5, we discuss the limitations of our approach. Finally, in Section 6, we summarize the main contributions and briefly recap our method.

2. Related Work

The current mainstream solutions for medical insurance fraud detection can be categorized into traditional rule-based and statistical analysis approaches or solutions based on supervised, unsupervised, and semi-supervised learning. Rule-based methods extract expert knowledge to formulate computational rules combined with statistical analysis to mine data distributions under specific rule sets. These approaches then identify anomalous states based on predefined thresholds. Purely rule-based and statistical analysis approaches suffer from poor transferability and strong subjectivity. These methods cannot adapt to variations caused by different regional healthcare policies, and their detection efficiency and accuracy heavily rely on expert experience.
Machine learning-based approaches can be categorized into supervised, unsupervised, and semi-supervised methods. Supervised anomaly detection techniques train models on labeled datasets and subsequently employ the trained models to classify unlabeled data. Typical methods include Bayesian networks, decision trees, and support vector machines (SVMs) [14]. For instance, Kumaraswamy et al. [15] propose a Bayesian Belief Network (BBN) model for healthcare fraud detection, which demonstrated superior scalability and interpretability compared to baseline models when detecting anomalies in Texas Medicaid prescription claims. Nalluri et al. [16] integrate four machine learning methods—Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), and Multilayer Perceptron (MLP)—to perform feature extraction and learning on 19 key features. Unsupervised machine learning methods typically employ clustering analysis to examine the inter-sample proximity within datasets, partitioning the data collection into distinct groups. This approach amplifies the disparity between normal and anomalous data points, thereby facilitating anomaly detection. Van Capelleveen et al. [17] propose an unsupervised outlier detection technique to analyze post-payment medical claims and develop a decision support tool, enabling domain experts to make prompt determinations regarding potential fraudulent cases. De Meulemeester et al. [18] employ categorical embedding to process high-cardinality categorical variables, utilize Isolation-based Nearest-Neighbor Ensembles (INNE) [19] for anomaly detection, and apply Shapley Additive Explanation (SHAP) to interpret model outputs. Semi-supervised learning methods operate without external interaction by automatically leveraging unlabeled samples to enhance learning from an entirely normal training set. These approaches enable joint learning between limited labeled samples and abundant unlabeled data. For instance, Tan et al. [12] construct patient-label and medical-behavior graph relationships, detecting healthcare fraud through graph-based anomaly detection methods. Clustering methods are frequently integrated with anomaly detection processes. Prova et al. [20] employ an ensemble learning approach by integrating multiple machine learning models, including Random Forest, XGBoost, and SVM, as base models. These models are subsequently combined through a meta-model to detect fraudulent activities in healthcare systems. Mohammed et al. [21] propose a novel system architecture employing machine learning to detect and subsequently prevent fraudulent transactions in blockchain networks. The study effectively analyzes medical data collected from sensors and blockchain transactions through an ensemble of machine learning algorithms, thereby blocking anomalous data and flagging suspicious transactions. Xiao et al. [22] integrate Bayesian Networks (BNs) into Extreme Gradient Boosting (XGBoost). The obtained medical fraud predictions are subsequently applied to risk management decision-making to minimize costs for medical insurance institutions. In our approach, the adaptive risk assessment phase employs the FLASC clustering algorithm to dynamically generate thresholds. Clustering algorithms can be categorized by methodology into partition-based clustering, density-based clustering, and hierarchical clustering. Common partition-based methods such as k-means [23,24] and its variants require predefined cluster numbers and perform poorly when the cluster count is undefined.
This limitation motivated density-based approaches like DBSCAN [25] and OPTICS [26], which can identify arbitrarily shaped clusters but exhibit sensitivity to density parameter selection [27]. Hierarchical clustering reveals cluster hierarchies, as demonstrated by HDBSCAN [28,29]. The FLASC [30] clustering method builds upon HDBSCAN, performing initial clustering through data point density analysis and subsequently identifying branch structures via intra-cluster connectivity analysis.

3. Methods

In this section, we introduce a comprehensive data-driven approach designed to detect high-risk organized fraud in medical insurance claims with a focus on drug resale schemes. In Section 3.1, we detail the multi-dimensional clue generation process that aggregates features across frequency, cost, and behavioral dimensions to identify potential fraudulent activities. This step is crucial for extracting clues from large datasets by leveraging both expert knowledge and statistical analysis. In Section 3.2, we describe the spatio-temporal group anomaly analysis technique, which identifies organized fraud networks by analyzing patterns of behavior across time, location, and medication similarity. This method enhances our ability to uncover coordinated fraudulent activities that might not be apparent through individual-level analysis alone. Finally, in Section 3.3, we present a dynamic adaptive threshold risk stratification assessment model that utilizes an ensemble of Entropy Weight Method (EWM), Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), and FLASC [30]. This model dynamically adjusts thresholds based on multi-dimensional aggregated indicators, ensuring robust and adaptable risk classification that minimizes false positives and maximizes detection accuracy. As shown in Figure 1, the overall architecture of our proposed method consists of: (1) an iterative clue generation module for updating individual and group anomalies, (2) anomaly score aggregation using EWM+TOPSIS-based weighting, and (3) adaptive threshold determination via the FLASC clustering approach.

3.1. Multi-Dimensional Clue Generation

The key to detecting individual abnormal insurance cards lies in mining clue features of drug resale behaviors and constructing corresponding mathematical expressions. On this basis, appropriate supervision thresholds are designed for each judgment logic by integrating expert experience. Through this data-driven approach, clues related to drug resale by abnormal cards are extracted from medical insurance card visit records. This method summarizes low-probability events commonly found in similar cases from medical insurance record data and case documents. Subsequently, by integrating the logical features corresponding to each event, they are transformed into more specific manifestations. For example, “when the drug purchase interval is shorter than the drug administration cycle, there is a suspicion of hoarding and reselling drugs in large quantities,” and its corresponding specific manifestation is “frequent purchases of the same type of drug in a short period,” thus requiring supervision of drug purchase frequency. In the model corresponding to the above process, this is achieved by calculating the drug purchase frequency and determining whether it exceeds the threshold.
In the analysis of multi-dimensional supervision rules, to achieve abnormal card detection for medical insurance drug resale supervision, this study proposes setting supervision rules along three dimensions: consultation frequency, consultation cost, and consultation behavior. Each medical insurance card is traversed under these rules to count the cases where its corresponding records hit each rule, forming a clue form. To improve the computational efficiency and stability of subsequent steps, only card records that hit at least one rule are counted, and all rows whose column values are all 0 are removed after the form is generated. Each dimension is further refined with detailed rules, as presented in Table 1. In the consultation frequency dimension, rules such as the upper limit of daily consultations and the monthly consultation count are established to identify abnormally high-frequency consultation behaviors. In the consultation cost dimension, rules including single-consultation cost thresholds and monthly cumulative cost thresholds are implemented to detect abnormally high medical expenses. Finally, in the consultation behavior dimension, rules governing cross-regional consultations and frequent changes of medical institutions are formulated to capture abnormal medical trajectories.
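As an illustration, the following minimal Python sketch shows how such a clue form can be assembled with pandas. The column names (card_id, settle_date, total_cost, district) and the threshold constants are hypothetical placeholders, not the calibrated limits of Table 1.

```python
import pandas as pd

# Illustrative thresholds only; the study's actual limits come from Table 1.
MAX_DAILY_VISITS = 3       # consultation frequency: daily visit cap
MAX_MONTHLY_COST = 5000.0  # consultation cost: monthly cumulative cap

def build_clue_table(records: pd.DataFrame) -> pd.DataFrame:
    """Count per-card rule hits and drop cards that hit no rule."""
    records = records.copy()
    records["day"] = records["settle_date"].dt.date
    records["month"] = records["settle_date"].dt.to_period("M")

    # Frequency rule: days on which a card exceeds the daily visit cap.
    daily = records.groupby(["card_id", "day"]).size()
    freq_hits = (daily > MAX_DAILY_VISITS).groupby(level="card_id").sum()

    # Cost rule: months in which cumulative cost exceeds the monthly cap.
    monthly = records.groupby(["card_id", "month"])["total_cost"].sum()
    cost_hits = (monthly > MAX_MONTHLY_COST).groupby(level="card_id").sum()

    # Behavior rule: visits spread over multiple districts (cross-region).
    region_hits = records.groupby("card_id")["district"].nunique() - 1

    clues = pd.DataFrame({
        "freq_hits": freq_hits,
        "cost_hits": cost_hits,
        "region_hits": region_hits,
    }).fillna(0).astype(int)

    # Remove all-zero rows, as described above.
    return clues[(clues != 0).any(axis=1)]
```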

3.2. Spatio-Temporal Group Anomaly Analysis

Although the abnormal medical insurance card detection process can identify suspicious clues of individual cards in drug resale, with the continuous escalation of illegal activities, some criminals have begun to utilize groups of individuals eligible for medical insurance drug purchases to illegally obtain drugs through coordinated multi-person drug purchase schemes. In such cases, each medical insurance card within the group may appear normal in isolation, requiring cross-correlation and comparison of these cards to reveal abnormal clues of organized activities. The definition of similar behaviors encompasses multiple dimensions, such as drug purchase time, location, institution, and amount. By analyzing the behavioral patterns of medical insurance cards across multiple dimensions, hidden group drug purchase behaviors can be identified, enabling further detection and correlation of abnormal drug purchase activities within the group. This approach enhances the capability to investigate complex criminal methods. The mining process of abnormal co-frequency cards in medical insurance funds is shown in Figure 2. Following the steps of “co-frequency card screening–normal card exclusion–correlation calculation”, abnormal co-frequency card sets are filtered out by combining threshold values. The input is the detected abnormal card numbers and business data, and the output is the co-frequency card set matching the input abnormal cards. In the co-frequency card screening stage, spatio-temporal feature analysis technology is comprehensively used to focus on screening co-frequency cards with adjacent settlement times in the same medical institution. Meanwhile, to avoid misjudgment, normal cards are excluded by using amount features. Finally, the correlation between medical insurance cards is calculated in the time dimension and content dimension, and the co-frequency card group set is obtained.
In response to the aforementioned process, this study proposes a screening strategy for co-frequency cards based on geographical location and temporal dimensions, combined with a correlation calculation algorithm incorporating both temporal and content-based dimensions. By mining the similarities between the temporal sequences of patient behaviors and their behavioral objectives, an ensemble of abnormal co-frequency cards is generated. Specifically, let the set of medical institutions be $H = \{h_1, h_2, \ldots, h_n\}$ and the set of settlement times be $T = \{t_1, t_2, \ldots, t_n\}$. The screening conditions for co-frequency cards are shown in Formula (1), where $\Delta t$ is a preset time threshold used to determine the adjacency of settlement times.
$(h_i, t_i) \in H \times T,\; (h_j, t_j) \in H \times T, \quad \text{s.t.} \quad h_i = h_j \,\wedge\, |t_i - t_j| \le \Delta t, \qquad (1)$
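For concreteness, a sketch of this screening condition under assumed pandas column names (card_id, inst_code, settle_time) follows; the 30-minute window is an illustrative choice of $\Delta t$, which the paper leaves unspecified.

```python
import pandas as pd

def cofrequency_pairs(records: pd.DataFrame,
                      seed_cards: set,
                      delta_t: pd.Timedelta = pd.Timedelta("30min")) -> set:
    """Return pairs (seed, other) settled at the same institution within
    delta_t of each other, i.e., the condition of Formula (1)."""
    seeds = records[records["card_id"].isin(seed_cards)]
    # Self-join on institution, then keep only time-adjacent settlements.
    joined = seeds.merge(records, on="inst_code", suffixes=("_a", "_b"))
    mask = (
        (joined["card_id_a"] != joined["card_id_b"])
        & ((joined["settle_time_a"] - joined["settle_time_b"]).abs() <= delta_t)
    )
    return set(zip(joined.loc[mask, "card_id_a"], joined.loc[mask, "card_id_b"]))
```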
Subsequently, in the normal card exclusion stage, normal medical insurance cards are excluded through preset rules based on amount features such as the medical insurance settlement amount and the proportion of drug expenses. After rule filtering is complete, behavioral comparisons are conducted between associated cards along the temporal and content dimensions, and correlation degrees are further calculated from the comparison results. In the temporal dimension, the similarity of the cards' medical insurance transaction time records over the past year is statistically analyzed, where the temporal features of medical insurance card A are set as $T_A = \{t_{A1}, t_{A2}, \ldots, t_{Am}\}$, with $m$ denoting the feature length; the features include indicators such as transaction frequency and transaction time interval. The similarity between medical insurance cards A and B in the temporal dimension is calculated by Formula (2), where $i$ indexes the $i$-th calculation dimension, $S_{t_i}$ is the similarity function for the $i$-th dimension, and $\alpha_i$ is the corresponding weight. Similarly, a similarity function $S_c$ is constructed for the content dimension and used to calculate the similarity of the medical insurance items the cards participated in over the past year, involving elements such as the type, amount, and usage frequency of the items. For each $S_{t_i}$, we specifically employ cosine similarity, Euclidean distance, and Jaccard similarity. Given the multidimensional similarity inherent in the temporal feature distributions of medical insurance card users, Euclidean distance is used to measure the temporal differences between groups of time vectors; cosine similarity assesses the directional similarity of two vectors, indicating whether their activity patterns are consistent; and Jaccard similarity treats the time vectors as sets of dates, comparing their intersection and union to quantify the overlap of repeated transactions. Compared with complex nonlinear methods, linear weighting has lower computational complexity, making it suitable for large-scale data, and the contribution of each measure is explicit, facilitating interpretation, adjustment, and optimization based on domain knowledge.
$S_t(A, B) = \sum_{i=1}^{N} \alpha_i\, S_{t_i}(A, B). \qquad (2)$

$S_{t_1}(A, B) = \dfrac{A \cdot B}{\|A\| \times \|B\|} = \dfrac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}, \qquad (3)$

$S_{t_2}(A, B) = e^{-\sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}}, \qquad (4)$

$S_{t_3}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}. \qquad (5)$
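A direct transcription of Formulas (2)-(5) into Python is shown below; the weights $\alpha_i$ are illustrative, since the paper does not report their values.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Formula (3): directional similarity of the two time vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Formula (4): Euclidean distance mapped into (0, 1] via exp(-d).
    return float(np.exp(-np.linalg.norm(a - b)))

def jaccard_sim(dates_a: set, dates_b: set) -> float:
    # Formula (5): overlap of the transaction-date sets.
    return len(dates_a & dates_b) / len(dates_a | dates_b)

def temporal_similarity(a: np.ndarray, b: np.ndarray,
                        dates_a: set, dates_b: set,
                        alphas=(0.4, 0.3, 0.3)) -> float:
    """Formula (2): linear weighting of the three measures (weights assumed)."""
    sims = (cosine_sim(a, b), euclidean_sim(a, b), jaccard_sim(dates_a, dates_b))
    return sum(alpha * s for alpha, s in zip(alphas, sims))
```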
After calculating the similarities of the two dimensions, the similarity scores of the temporal and content dimensions are comprehensively evaluated and compared with preset thresholds. When the comprehensive score exceeds the preset threshold, the medical insurance card is identified as an abnormal co-frequency card, indicating potential violations. Finally, by iterating the above process, a set of co-frequency cards with similar violations to the input abnormal cards is continuously mined. Additionally, after completing the division of co-frequency card groups, all medical insurance cards within the co-frequency card group are deemed to have suspected drug resale and are directly classified as medium-to-high-risk cards.
Based on the above processes of “individual abnormal card detection” and “group co-frequency card mining”, medical insurance records are classified into three categories: individual abnormal cards, individual normal cards, and abnormal co-frequency cards. Specifically, individual abnormal cards refer to medical insurance cards with abnormal usage behaviors within a specific time period; individual normal cards denote those with usage behaviors conforming to conventional patterns; abnormal co-frequency cards represent groups of medical insurance cards exhibiting similar abnormal usage patterns within the same time period. Subsequently, the three categories of medical insurance card data are utilized to generate clues for drug resale. Figure 2 systematically illustrates the individual card status update strategy based on co-frequency cards. The rationale here is that some normal cards may demonstrate legitimate drug purchase behaviors at the individual level but exhibit suspicious clustered purchasing patterns at the group level. Therefore, based on the status of co-frequency cards, if a normal card is identified within an abnormal co-frequency card group, its status is updated to abnormal, and it is further added to the clue set for drug resale.
In summary, the individual abnormal card detection stage can identify medical insurance cards with obvious individual abnormalities and extract clues to drug resale. However, some individual cards that appear to behave normally may actually be involved in organized and clustered resale activities. Such cards are difficult to detect directly through individual-level analysis of consultation, drug purchase, and consumption records, requiring joint judgment by correlating with co-frequency cards exhibiting similar behaviors. The group co-frequency card mining process designed in this model effectively addresses this need.

3.3. Adaptive Risk Stratification Assessment

Most existing legal supervision models rely on preset thresholds to screen medium-to-high-risk clues, overlooking the differences across regions and medical institutions. Additionally, as the generated clues contain multi-dimensional information, medium-to-high-risk clues filtered directly based on single-dimensional information neglect the extreme circumstances of patient visits. To address these issues, this study proposes generating aggregated indicators using multi-dimensional clue information and constructing indicator thresholds by integrating adaptive abnormal grade classification algorithms. Based on the aggregated indicators and threshold values, medium-to-high-risk clues are further screened out. In this section, our approach presents a risk stratification assessment model, and the overall process of this model is as follows: In the clue generation step, tabular structured data recording legal supervision clues is generated, where each row represents the specific information of clues for each medical insurance card, and each column represents the number of times or confidence levels that clues for each medical insurance card “hit” each rule. During clue risk assessment, following the process shown in Algorithm 1, based on the generated clue table, the aggregated weights of each column of data are calculated to further obtain aggregated indicators. Combined with an adaptive grade classification algorithm, indicator thresholds are calculated to determine whether the clues belong to medium-to-high-risk categories.
Algorithm 1 Risk Stratification Assessment Model
Require: Medical insurance data table $D$, rule hit counts and confidence levels.
Ensure: High-risk and medium-risk clues.
1:  procedure EntropyWeightMethod (EWM)
2:      for each column $j$ in $D$ do
3:          Normalize column values $p_{ij}$ to the range $[0, 1]$.
4:          Calculate entropy value $E_j = -\frac{1}{\ln m} \sum_{i=1}^{m} p_{ij} \ln(p_{ij})$.
5:          Compute information quantity $1 - E_j$.
6:          Calculate weight $w_j = \frac{1 - E_j}{\sum_{j=1}^{n} (1 - E_j)}$.
7:      end for
8:  end procedure
9:  procedure Technique for Order Preference by Similarity to Ideal Solution (TOPSIS)
10:     Construct the weighted normalized decision matrix using $w_j$.
11:     Determine the positive ideal solution $A^+$ and negative ideal solution $A^-$.
12:     for each alternative $i$ do
13:         Compute distance to the positive ideal solution $\mathrm{Score}_i^+ = \sqrt{\sum_{j=1}^{n} w_j (\bar{p}_{ij}^{+} - \bar{p}_{ij})^2}$.
14:         Compute distance to the negative ideal solution $\mathrm{Score}_i^- = \sqrt{\sum_{j=1}^{n} w_j (\bar{p}_{ij}^{-} - \bar{p}_{ij})^2}$.
15:         Calculate comprehensive evaluation score $\mathrm{Score}_i = \frac{\mathrm{Score}_i^-}{\mathrm{Score}_i^+ + \mathrm{Score}_i^-}$.
16:     end for
17: end procedure
18: procedure FLASC($\mathrm{Score}_i$)
19:     Apply FLASC clustering on $\mathrm{Score}_i$.
20:     Dynamically determine threshold $\theta$ to distinguish high-risk from medium-risk.
21:     if $\mathrm{Score}_i > \theta$ then
22:         Mark the corresponding clue as high-risk.
23:     else
24:         Mark the corresponding clue as medium-risk.
25:     end if
26: end procedure
Before model execution, data cleaning and preprocessing are required. For data columns representing the number of rule hits, normalization processing is needed, i.e., standardizing the number of hits for all medical insurance cards in the column to adjust their numerical range to the [0, 1] interval. This approach eliminates dimensional differences between indicators. Second, for data columns representing the confidence level of medical insurance cards hitting rules, since their values are already within a standard range, no additional processing is required. These confidence values can be directly used in subsequent aggregation calculations to maintain the original characteristics and accuracy of the data. Through the above steps, the quality and consistency of the data are further ensured, preparing for subsequent aggregated indicator calculation and indicator threshold calculation.
In calculating the aggregated indicators, a multi-dimensional and multi-level indicator aggregation strategy is proposed. The core of this strategy lies in leveraging the Entropy Weight Method (EWM) for data-driven indicator aggregation, thereby automatically evaluating the importance of each rule and effectively avoiding the influence of subjective preferences. As shown in Formula (6), $p_{ij}$ represents the value in the $j$-th column for the $i$-th medical insurance card. By computing the proportion of each indicator and its logarithm, the entropy value of the feature is obtained; each indicator's content is weighted and summed, and the normalization factor $\frac{1}{\ln m}$ ensures the entropy value falls within the standardized range. This strategy optimizes the indicator aggregation effect by considering the mutual influence among indicators of different dimensions, significantly enhancing the accuracy and efficiency of decision-making on complex data.
$E_j = -\dfrac{1}{\ln m} \sum_{i=1}^{m} p_{ij} \ln p_{ij} \qquad (6)$
Next, the indicator aggregation weight for the $j$-th supervision rule is calculated using Formula (7). First, the amount of information carried by the $j$-th supervision rule is computed as $1 - E_j$, where a larger value indicates that this dimension's indicator contains more effective information, and $E_j$ denotes the entropy value of the $j$-th supervision rule. The weight of the $j$-th supervision dimension is then its share of the total information amount across all supervision rules. Finally, after the aggregated indicators of all supervision dimensions are calculated, an anomaly index $\mathrm{Score}_i$ is obtained through weighted summation, as shown in Formula (8).
$w_j = \dfrac{1 - E_j}{n - \sum_{j=1}^{n} E_j} \qquad (7)$
$\mathrm{Score}_i = \sum_{j} w_j\, p_{ij} \qquad (8)$
In the aforementioned process, if the indicator of a supervision dimension is extremely large or small, it may cause extreme value interference in the indicator aggregation process, thus affecting the judgment. To achieve a more comprehensive indicator evaluation, this study proposes to combine the TOPSIS algorithm to effectively avoid extreme value interference. TOPSIS is a commonly used comprehensive evaluation method that can fully utilize the information in the original data, and its results can accurately reflect the differences between evaluation schemes. The TOPSIS algorithm first calculates the ideal solution of the overall score based on the maximum values of each indicator and then calculates the difference from the actual score, as shown in Formula (9). Similarly, it calculates the negative ideal solution of the overall score based on the minimum values of each indicator and then calculates the difference from the actual score, as shown in Formula (10). Finally, the final score, i.e., the aggregated indicator, is calculated according to Formula (11). This score reflects the proximity of each evaluation indicator to the ideal and negative ideal solutions, enabling a more accurate assessment of the comprehensive performance of each solution.
$\mathrm{Score}_i^{+} = \sqrt{\sum_{j=1}^{n} w_j \times \left(p_{ij}^{+} - p_{ij}\right)^2} \qquad (9)$

$\mathrm{Score}_i^{-} = \sqrt{\sum_{j=1}^{n} w_j \times \left(p_{ij}^{-} - p_{ij}\right)^2} \qquad (10)$

$\mathrm{Score}_i = \dfrac{\mathrm{Score}_i^{-}}{\mathrm{Score}_i^{+} + \mathrm{Score}_i^{-}} \qquad (11)$
In the process of calculating aggregated indicators, the introduction of the TOPSIS algorithm not only effectively avoids extreme value interference but also enhances the comprehensiveness and accuracy of indicator evaluation, providing a more reliable basis for subsequent risk indicator classification.
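A compact NumPy sketch of the EWM-TOPSIS aggregation (Formulas (6)-(11)) is given below, assuming the clue matrix has already been min-max normalized to [0, 1] and that all columns are benefit-type (larger = more suspicious) indicators.

```python
import numpy as np

def ewm_topsis(X: np.ndarray) -> np.ndarray:
    """Aggregate a clue table (rows = cards, columns = rules) into Score_i."""
    m, n = X.shape
    eps = 1e-12  # guards against log(0) and zero denominators

    # Column-wise proportions p_ij of each card within each rule.
    P = (X + eps) / (X + eps).sum(axis=0)

    # Formula (6): entropy of each rule column.
    E = -(P * np.log(P)).sum(axis=0) / np.log(m)

    # Formula (7): weight from the information quantity 1 - E_j.
    w = (1.0 - E) / ((1.0 - E).sum() + eps)

    # Formulas (9)-(10): weighted distances to the ideal solutions.
    p_pos, p_neg = X.max(axis=0), X.min(axis=0)
    d_pos = np.sqrt((w * (p_pos - X) ** 2).sum(axis=1))
    d_neg = np.sqrt((w * (p_neg - X) ** 2).sum(axis=1))

    # Formula (11): relative closeness; higher means more anomalous.
    return d_neg / (d_pos + d_neg + eps)
```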
In the calculation of aggregated scores for clue data, due to the typically significant non-uniformity of data distribution and the universal presence of extreme values, direct adoption of traditional fixed threshold classification methods has the following limitations: First, fixed thresholds are susceptible to extreme values, causing extremely high or low indicators to mislead overall evaluation results and ignoring risk characteristics in the intermediate range. Second, fixed thresholds cannot accurately reflect the true distribution characteristics of data, easily leading to unbalanced risk classification. Existing clustering methods, such as the K-means algorithm, require pre-specifying the number of clusters and are sensitive to the selection of initial centroids, making them difficult to adapt to the complex distribution characteristics of medical insurance data. The DBSCAN algorithm, while capable of identifying clusters of arbitrary shapes, is highly sensitive to density parameter selection and prone to ambiguous clustering boundaries when processing medical insurance data. Hierarchical clustering algorithms, although able to generate hierarchical clustering structures, have high computational complexity and struggle to meet the processing requirements of large-scale medical insurance data.
To address these issues, the FLASC (Flare-Sensitive Clustering Algorithm) is employed here. This algorithm performs preliminary clustering by detecting the distribution density of data points and further identifies the branch structures of clusters based on connectivity analysis within clusters. Specifically, the FLASC algorithm offers the following advantages: (1) it can adaptively generate risk levels according to the actual distribution characteristics of data, avoiding the subjectivity of manually setting fixed thresholds; (2) it effectively handles noise points and outliers in data through a density-sensitive mechanism; (3) it can automatically identify natural clustering structures in data without requiring pre-specified numbers of clusters; (4) it exhibits high computational efficiency when processing large-scale data.
The FLASC algorithm identifies subpopulations in data by detecting branches within clusters. It builds upon traditional density-based clustering algorithms and introduces the concept of “eccentricity.” Compared to conventional clustering algorithms, FLASC can distinguish topologically connected branches without obvious density separations. Eccentricity, a key concept in the FLASC algorithm, is used to describe the position of data points within a cluster: the higher the eccentricity, the farther the data point is from the cluster center. Its definition is as follows:
$e(x_i) = d\left(\bar{x}_{C_j},\, x_i\right) \qquad (12)$
where $\bar{x}_{C_j}$ is the centroid of cluster $C_j$. FLASC first takes high-density regions as clusters according to the density distribution of the aggregated scores. For each detected cluster $C_j$, FLASC calculates the eccentricity $e(x_i)$ of every point $x_i$, defined as the distance from $x_i$ to the centroid of $C_j$. FLASC then constructs two approximate graphs to describe intra-cluster connectivity: a full approximate graph and a core approximate graph. The full approximate graph includes edges whose endpoint distance is at most the maximum edge length in the cluster’s minimum spanning tree (MST), while the core approximate graph includes edges whose endpoint distance is at most the larger of the endpoints’ core distances. Finally, the closest clusters are merged hierarchically by distance, forming a branch hierarchy that covers the aggregated scores of all points. Each branch structure represents a risk level, thereby avoiding interference from distributional extremes and enabling comprehensive generation of medium-to-high-risk clues.
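The threshold step can then be sketched as follows. The FLASC reference implementation accompanying [30] is distributed as the pyflasc package with an hdbscan-style interface; the import path and estimator signature below are assumptions to be checked against that package's documentation.

```python
import numpy as np
from flasc import FLASC  # assumed import from the pyflasc package [30]

def risk_threshold(scores: np.ndarray, min_cluster_size: int = 25) -> float:
    """Derive the cutoff theta from the cluster/branch structure of the
    aggregated scores instead of a fixed percentile."""
    clusterer = FLASC(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(scores.reshape(-1, 1))

    # Mean score per cluster, ignoring noise points (label == -1).
    ids = np.unique(labels[labels >= 0])
    means = {c: scores[labels == c].mean() for c in ids}

    # Treat the top-scoring cluster as the high-risk branch; theta is its
    # lowest member, so Score_i > theta marks a clue as high-risk.
    top = max(means, key=means.get)
    return float(scores[labels == top].min())
```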
In summary, this section first outlines the overall process of generating medium-to-high-risk clues, then details how the constructed legal supervision risk assessment model further mines medium-to-high-risk clues based on the generated clues. From the next section onward, the supervisory models developed in this study will be described in detail. Although different supervisory models are constructed for each supervisory scenario, the risk assessment model designed in this study is applicable to all subsequent supervisory models. Therefore, the risk assessment steps will not be repeated in the following content. After generating clues, each supervisory model described below can obtain medium-to-high-risk clues through the medium-to-high-risk clue generation steps outlined above.

4. Results

In this section, the experimental details of our method are introduced, including descriptions of the dataset used and evaluation metrics. Subsequently, through comprehensive comparative experiments and ablation experiments, we analyze the superiority and effectiveness of the proposed method.

4.1. Experimental Details

In this study, we construct a dataset using a subset of real medical insurance data from a city during the period of 2023 to 2024. The original medical insurance data exhibited three key issues: (1) duplication, errors, and omissions in fields; (2) storage across multiple tables, leading to slow multi-table query speeds and long indexing times; (3) isolation of fields with insufficient inter-field associations, making it difficult to provide adequate information for clue generation. To address these challenges, we propose a processing pipeline of "data cleaning-data association-data encoding" for the original data, and designed data partitioning and annotation strategies to obtain a dataset suitable for drug resale supervision. In compliance with privacy regulations, the dataset is also de-identified. Finally, based on the detailed item codes and expense category fields in the original data, records containing medical-related fields such as drug items and drug payments are filtered and included in the drug resale dataset. The dataset we used includes the following 22 fields: Transaction ID, Card Number, Name, ID Number, Institution Code, Institution Name, Department Code, Department Name, Physician ID, Physician Name, Expense Category, Item Code, Item Name, Unit Price, Item Quantity, Transaction Cost, Medical Insurance Settlement Cost, Item Unit, Settlement Date/Time, Visit Type, Institution District/County, and Institution Type. The dataset comprises 1,107,985 medical records from 8917 medical insurance cards, of which 1782 were flagged as anomalous. To ensure compliance with privacy protection requirements while constructing a legal supervision model, we apply data anonymization to sensitive information, ensuring that the processed data cannot be traced back to specific patients or physicians. The specific data anonymization scheme is shown in Table 2.
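Table 2's full anonymization scheme is not reproduced here; as a minimal sketch of one standard technique consistent with it, the snippet below pseudonymizes direct identifiers with salted hashes so that records remain linkable per card without being traceable to a person. The field names follow the 22-field list above.

```python
import hashlib
import pandas as pd

SENSITIVE = ["Card Number", "Name", "ID Number", "Physician ID", "Physician Name"]
SALT = b"replace-with-a-secret-salt"  # kept separate from the released data

def pseudonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Replace direct identifiers with salted SHA-256 digests."""
    out = df.copy()
    for col in SENSITIVE:
        out[col] = out[col].astype(str).map(
            lambda v: hashlib.sha256(SALT + v.encode("utf-8")).hexdigest()[:16]
        )
    return out
```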
In the evaluation of our model’s performance, precision, recall, and accuracy serve as crucial metrics. Precision P measures the proportion of true positive predictions among all positive predictions made by the model, indicating the accuracy of positive identifications. It is calculated as follows:
$P = \dfrac{TP}{TP + FP}, \qquad (13)$

where $TP$ represents the number of true positives, and $FP$ denotes the number of false positives. Recall $R$, also known as sensitivity or the true positive rate, quantifies the model’s ability to correctly identify all positive instances within the dataset. Its formula is given by:

$R = \dfrac{TP}{TP + FN}. \qquad (14)$

Accuracy refers to the proportion of correctly predicted positive and negative instances out of the total number of instances. It is expressed as:

$ACC = \dfrac{TP + TN}{TP + TN + FP + FN}. \qquad (15)$
These metrics collectively offer a comprehensive assessment of our model’s effectiveness in detecting the target instances accurately.
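The three metrics reduce to a few lines of Python given the confusion-matrix counts:

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int):
    """Precision, recall, and accuracy per Formulas (13)-(15)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy
```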

4.2. Experimental Results

We design comparative experiments against three baseline methods to demonstrate that our proposed method achieves better performance.
  • KMeans assigns proximate samples to the same subset, minimizing intra-subset dissimilarity while maximizing inter-subset dissimilarity. It starts by randomly selecting K initial cluster centers, calculates the distance from each sample point to these centers, and assigns each point to the nearest center. The centers are then iteratively updated.
  • HDBSCAN [28] is based on density-based clustering concepts, defining clusters as high-density regions separated by low-density areas. It uses core distance and mutual reachability distance to describe data point connectivity, avoiding the need to specify the number of clusters in advance. It constructs a density contour tree by analyzing the hierarchy of merging clusters as the density threshold decreases, and simplifies this tree using a minimum cluster size to obtain the final clustering result.
  • XGBoost [31] is based on a gradient boosting framework and constructs additive regression trees to minimize the prediction loss by iteratively fitting residual errors. It uses second-order gradient information for node splitting, introduces regularization terms to prevent overfitting, and supports parallel computation for efficiency. The algorithm starts with an initial constant prediction, and each new tree aims to correct the errors of the previous ensemble. We use the specific parameters and adjustment ranges shown in Table 3.
To demonstrate that our FLASC-based method is more suitable for adaptive threshold generation in risk assessment than other clustering algorithms, we substituted FLASC with k-means and with HDBSCAN, respectively, during the clustering-based threshold analysis. As shown in Table 4, our method achieves higher precision and recall than both k-means and HDBSCAN, indicating that the adaptive thresholds we generate better reflect the actual distribution characteristics of the data. For the XGBoost approach, we preprocessed numerical attributes in the raw data through cleaning and imputation, then extracted statistical features, including means, standard deviations, and maximum/minimum values, as model inputs. The dataset was partitioned into 10 subsets according to the distribution of fraud labels, with each subset used to train a separate XGBoost model before final result aggregation (see the sketch after Table 3). As evidenced by Table 4, XGBoost demonstrates inferior performance compared to our proposed method, primarily due to its lack of rule-based information integration and group anomaly detection capabilities.
Table 3. XGBoost parameters and adjustment ranges.

Parameter Name       Value    Range
max depth            5        3–10
learning rate        0.01     0.01–0.2
subsample            0.8      0.5–1.0
colsample bytree     0.8      0.5–1.0
estimators           50       50–200
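The XGBoost baseline setup described above can be sketched as follows; the stratified 10-way partitioning and probability averaging are one plausible reading of the reported procedure, and the hyperparameters are taken from the "Value" column of Table 3.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def xgb_baseline(X: np.ndarray, y: np.ndarray, n_parts: int = 10) -> np.ndarray:
    """Train one model per label-stratified partition and collect
    out-of-partition fraud probabilities."""
    skf = StratifiedKFold(n_splits=n_parts, shuffle=True, random_state=0)
    probs = np.zeros(len(y))
    for train_idx, test_idx in skf.split(X, y):
        model = XGBClassifier(
            max_depth=5, learning_rate=0.01, subsample=0.8,
            colsample_bytree=0.8, n_estimators=50,  # values from Table 3
        )
        model.fit(X[train_idx], y[train_idx])
        probs[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return probs
```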

4.3. Ablation Studies

To verify the effectiveness of each module, we design ablation studies, whose configurations are as follows:
  • Without multi-dimension: To validate the effectiveness of our proposed multi-dimensional clue generation module, we designed corresponding ablation studies. Specifically, we conducted experiments using only individual dimensions (frequency, cost, and behavioral dimensions), as well as their pairwise combinations for clue generation, thereby demonstrating the necessity of each rule dimension.
  • Without GA: To assess the effectiveness of Group Anomaly analysis proposed in Section 3.2, we only apply anomaly analysis on a single card in multi-dimensional clue generation. Additionally, to demonstrate the necessity of combining temporal and spatial dimensions in our proposed group anomaly detection approach, we designed ablation experiments using solely the spatial dimension or the temporal dimension in isolation.
  • Without indicator aggregating: To validate the effectiveness of our proposed indicator aggregation module, we conducted an ablation study by removing the EWM-TOPSIS component, where the statistically derived anomaly scores were directly fed into the subsequent FLASC algorithm for threshold generation.
  • Without adaptive threshold: To demonstrate the necessity of the adaptive threshold generation module, we set the threshold to a fixed value corresponding to the top 10% of statistical scores.
As evidenced in Figure 3, when employing either single-dimensional or merely two-dimensional rules, the recall and precision rates both decline due to incomplete supervisory dimensions and inferior clue quality. This conclusively demonstrates the necessity of our proposed multi-dimensional rule framework, where each dimensional rule proves indispensable.
The ablation study on group anomaly detection demonstrates that updating the anomaly clue set based on group anomalies significantly enhances supervision effectiveness, with a notable improvement in recall rate. This indicates the existence of group fraud phenomena in medical insurance fraud scenarios involving drug reselling and reveals the limited effectiveness of individual anomaly detection alone in identifying group fraud, thereby confirming the necessity of the proposed group anomaly detection module.

Furthermore, we conducted corresponding ablation experiments on the temporal and spatial dimensions in group anomaly detection. As shown in Figure 4, the precision significantly decreases when either dimension is removed. This demonstrates that the joint analysis of both temporal and spatial dimensions is essential for accurately identifying group anomalous behaviors. Without this combined approach, individual spatial or temporal anomaly detection may yield false positives, potentially misclassifying some normal cases as anomalous.

In the ablation study for the adaptive risk assessment module, we first removed the EWM-TOPSIS indicator aggregation component and simply summed the violation scores instead. As shown in Figure 5, both precision and recall decreased when the indicator aggregation was eliminated. Finally, to demonstrate the effectiveness of our clustering-based adaptive threshold generation approach, we classified the top 10% of anomaly scores as high-risk and the remainder as low-risk. As shown in the experimental results, simply relying on statistical score percentiles proves inadequate for accurately identifying medication resale fraud risks, since their characteristics exhibit more complex multi-dimensional patterns. This confirms that clustering algorithms are more appropriate for distinguishing between anomalous and normal clusters by capturing their inherent aggregation features.
The system is built on the Spring Boot v2.1.17 and Spring Security v5.7.11 frameworks and is deployed and evaluated in relevant institutions. The persistence layer is implemented using MyBatis v3.5.1, while the presentation layer employs Vue.js v2.6.10 and Element-UI v2.15.7 to achieve a responsive, multi-device interface. The framework integrates a Java-based rapid development framework with a code generator, which not only supports efficient development but also ensures code consistency and standardization. Based on this framework, backend standards for transaction management, exception handling, system logging, scheduled tasks, and system interfaces are defined and implemented.

5. Discussion

Balancing Privacy Protection and Data Utilization: The application of big data and AI technologies in medical insurance fraud detection presents a critical challenge: maximizing data utility while ensuring robust protection of personal privacy [32]. To effectively identify complex fraudulent activities, systems require access to extensive sensitive information, including but not limited to medical records and medication purchase histories. Consequently, a key research direction involves developing technical frameworks capable of both preserving patient privacy with high efficiency and accurately detecting insurance fraud. This endeavor necessitates not only technological innovation but also multidisciplinary coordination across legal, ethical, and policy-making dimensions.
Legal and Ethical Considerations of AI in Medical Insurance Oversight: The widespread adoption of artificial intelligence (AI) technologies in medical insurance supervision has brought increasingly prominent legal and ethical challenges [33,34]. For instance, when AI systems determine potential fraudulent activities, ensuring transparent and interpretable decision-making processes becomes critical. Furthermore, questions arise regarding liability attribution if algorithmic biases lead to misjudgments. These issues not only impact the protection of individual rights but also involve public trust and social equity. Particularly in cases involving vulnerable populations or uneven resource distribution, AI applications may exacerbate existing societal inequalities. Therefore, establishing comprehensive legal and ethical guidelines is essential to foster the responsible development of AI in medical insurance oversight. This includes, but is not limited to, defining clear accountability mechanisms, enhancing algorithmic transparency, and strengthening the monitoring and management of potential biases.
Biometric fraud detection: Biometric fraud also poses certain challenges in the deployment of medical insurance supervision systems. Criminals may forge medical insurance records or even steal personal privacy information through fake profiles and identity theft. To address this, many approaches focus on protecting the authentication process through patterns such as pattern locks [35], facial recognition verification [36], and voiceprint recognition [37]. However, with the rapid development of large AI models, attack techniques are also constantly advancing. As a result, effectively identifying more sophisticated biometric fraud remains a challenge for various fraud detection and supervision systems.

6. Conclusions

In this paper, we present an AI-driven legal supervision model to combat increasingly sophisticated medical insurance fraud, particularly organized drug resale schemes. By integrating multi-dimensional behavioral analysis with adaptive clustering techniques (FLASC), the proposed framework effectively identifies both individual anomalies and coordinated fraud networks, overcoming the limitations of rule-based detection methods. The three-stage pipeline—automated clue generation, group behavior analysis, and dynamic risk stratification—enables precise fraud detection while eliminating reliance on subjective expert thresholds. Experimental validation on real-world insurance data (8917 cards, 1.1 million records) demonstrates the model’s robustness, achieving 89% precision, 42% recall, and 87% accuracy in high-risk case detection and uncovering previously hidden fraud rings.

Author Contributions

Methodology, Q.H., C.Z. and N.L.; Software, Q.H.; Validation, Q.H. and C.Z.; Investigation, Q.D.; Resources, Q.D. and N.L.; Data curation, Q.H. and Q.D.; Visualization, Q.H.; Writing—original draft, Q.H. and W.L.; Writing—review & editing, Q.H., L.P. and W.L.; Project administration, L.P. and N.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Plan in China (No. 2023YFC3306100).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Najar, A.V.; Alizamani, L.; Zarqi, M.; Hooshmand, E. A global scoping review on the patterns of medical fraud and abuse: Integrating data-driven detection, prevention, and legal responses. Arch. Public Health 2025, 83, 43. [Google Scholar] [CrossRef]
  2. Wang, Z.; Chen, X.; Wu, Y.; Jiang, L.; Lin, S.; Qiu, G. A robust and interpretable ensemble machine learning model for predicting healthcare insurance fraud. Sci. Rep. 2025, 15, 218. [Google Scholar] [CrossRef]
  3. Safitri, A.; Nurcihikita, T. The analysis of the implementation of the national health insurance fraud prevention program. J. Health Manag. Adm. Public Health Policies HealthMAPs 2024, 2, 52–63. [Google Scholar] [CrossRef]
  4. Hamid, Z.; Khalique, F.; Mahmood, S.; Daud, A.; Bukhari, A.; Alshemaimri, B. Healthcare insurance fraud detection using data mining. BMC Med. Inform. Decis. Mak. 2024, 24, 112. [Google Scholar] [CrossRef]
  5. Thornton, D.; Brinkhuis, M.; Amrit, C.; Aly, R. Categorizing and describing the types of fraud in healthcare. Procedia Comput. Sci. 2015, 64, 713–720. [Google Scholar] [CrossRef]
  6. Peng, J.; Li, Q.; Li, H.; Liu, L.; Yan, Z.; Zhang, S. Fraud Detection of Medical Insurance Employing Outlier Analysis. In Proceedings of the 2018 IEEE 22nd International Conference on Computer Supported Cooperative Work in Design (CSCWD), Nanjing, China, 9–11 May 2018; pp. 341–346. [Google Scholar]
  7. Mao, Y.; Li, Y.; Xu, B.; Han, J. XGAN: A Medical Insurance fraud Detector based on GAN with XGBoost. J. Inf. Hiding Multim. Signal Process. 2024, 15, 36–52. [Google Scholar]
  8. Zhang, R.; Cheng, D.; Yang, J.; Ouyang, Y.; Wu, X.; Zheng, Y.; Jiang, C. Pre-trained online contrastive learning for insurance fraud detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2024; Volume 38, pp. 22511–22519. [Google Scholar]
  9. Alam, M.S.; Rai, P.; Tiwari, R.K. Machine Learning for Healthcare Fraud Detection: A Comprehensive Review Literature. In Leveraging Futuristic Machine Learning and Next-Generational Security for e-Governance; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 229–254. [Google Scholar]
  10. Wang, J.; Guo, Y.; Wen, X.; Wang, Z.; Li, Z.; Tang, M. Improving graph-based label propagation algorithm with group partition for fraud detection. Appl. Intell. 2020, 50, 3291–3300.
  11. Ma, J.; Zhang, D.; Wang, Y.; Zhang, Y.; Pozdnoukhov, A. GraphRAD: A graph-based risky account detection system. In Proceedings of the ACM SIGKDD Conference, London, UK, 19–23 August 2018; Volume 9.
  12. Tan, X.; Yang, J.; Zhao, Z.; Xiao, J.; Li, C. Improving Graph Convolutional Network with Learnable Edge Weights and Edge-Node Co-Embedding for Graph Anomaly Detection. Sensors 2024, 24, 2591.
  13. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11.
  14. Arockiam, J.M.; Pushpanathan, A.C.S. MapReduce-iterative support vector machine classifier: Novel fraud detection systems in healthcare insurance industry. Int. J. Electr. Comput. Eng. IJECE 2023, 13, 756.
  15. Kumaraswamy, N.; Ekin, T.; Park, C.; Markey, M.K.; Barner, J.C.; Rascati, K. Using a Bayesian Belief Network to detect healthcare fraud. Expert Syst. Appl. 2024, 238, 122241.
  16. Nalluri, V.; Chang, J.R.; Chen, L.S.; Chen, J.C. Building prediction models and discovering important factors of health insurance fraud using machine learning methods. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 9607–9619.
  17. Van Capelleveen, G.; Poel, M.; Mueller, R.M.; Thornton, D.; van Hillegersberg, J. Outlier detection in healthcare fraud: A case study in the Medicaid dental domain. Int. J. Account. Inf. Syst. 2016, 21, 18–31.
  18. De Meulemeester, H.; De Smet, F.; van Dorst, J.; Derroitte, E.; De Moor, B. Explainable unsupervised anomaly detection for healthcare insurance data. BMC Med. Inform. Decis. Mak. 2025, 25, 14.
  19. Xu, H.; Pang, G.; Wang, Y.; Wang, Y. Deep Isolation Forest for Anomaly Detection. IEEE Trans. Knowl. Data Eng. 2023, 35, 12591–12604.
  20. Islam Prova, N.N. Healthcare Fraud Detection Using Machine Learning. In Proceedings of the 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), Coimbatore, India, 28–30 August 2024; pp. 1119–1123.
  21. Mohammed, M.A.; Boujelben, M.; Abid, M. A Novel Approach for Fraud Detection in Blockchain-Based Healthcare Networks Using Machine Learning. Future Internet 2023, 15, 250.
  22. Xiao, F.; Li, H.X.; Wang, X.K.; Wang, J.Q.; Chen, S.X. Predictive analysis for healthcare fraud detection: Integration of probabilistic model and interpretable machine learning. Inf. Sci. 2025, 719, 122499.
  23. Sinaga, K.P.; Yang, M.S. Unsupervised K-means clustering algorithm. IEEE Access 2020, 8, 80716–80727.
  24. Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210.
  25. Singh, H.V.; Girdhar, A.; Dahiya, S. A literature survey based on DBSCAN algorithms. In Proceedings of the 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 25–27 May 2022; pp. 751–758.
  26. Tang, C.; Wang, H.; Wang, Z.; Zeng, X.; Yan, H.; Xiao, Y. An improved OPTICS clustering algorithm for discovering clusters with uneven densities. Intell. Data Anal. 2021, 25, 1453–1471.
  27. Kanagala, H.K.; Krishnaiah, V.J.R. A comparative study of K-Means, DBSCAN and OPTICS. In Proceedings of the 2016 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 7–9 January 2016; pp. 1–6.
  28. Campello, R.J.G.B.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Trans. Knowl. Discov. Data 2015, 10, 5.
  29. Stewart, G.; Al-Khassaweneh, M. An implementation of the HDBSCAN* clustering algorithm. Appl. Sci. 2022, 12, 2405.
  30. Bot, D.M.; Peeters, J.; Liesenborgs, J.; Aerts, J. FLASC: A flare-sensitive clustering algorithm. PeerJ Comput. Sci. 2025, 11, e2792.
  31. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD '16: 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  32. Parikh, D.; Radadia, S.; Eranna, R.K. Privacy-Preserving Machine Learning Techniques, Challenges and Research Directions. Int. Res. J. Eng. Technol. 2024, 11, 499.
  33. Miller, S. Machine learning, ethics and law. Australas. J. Inf. Syst. 2019, 23, 1–13.
  34. Galiana, L.I.; Gudino, L.C.; González, P.M. Ethics and artificial intelligence. Rev. Clín. Esp. Engl. Ed. 2024, 224, 178–186.
  35. Chen, Y.; Ni, T.; Xu, W.; Gu, T. SwipePass: Acoustic-based Second-factor User Authentication for Smartphones. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2022, 6, 106.
  36. Kong, J.; Song, X.; Huai, S.; Xu, B.; Luo, J.; He, Y. Do Not DeepFake Me: Privacy-Preserving Neural 3D Head Reconstruction Without Sensitive Images. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; Volume 39, pp. 4383–4391.
  37. Duan, D.; Sun, Z.; Ni, T.; Li, S.; Jia, X.; Xu, W.; Li, T. F2Key: Dynamically Converting Your Face into a Private Key Based on COTS Headphones for Reliable Voice Interaction. In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (MobiSys '24), Tokyo, Japan, 3–7 June 2024; Association for Computing Machinery: New York, NY, USA, 2024.
Figure 1. The workflow of our method.
Figure 2. Generation of individual abnormal clues and update of group aggregation detection.
Figure 3. Ablation study of multi-dimensional clue generation.
Figure 4. Ablation study of group abnormal analysis.
Figure 5. Ablation study of adaptive risk stratification assessment.
Table 1. Supervision dimensions and corresponding rules.

Dimension          Rule
Medical Frequency  Monthly OP count ≥ 15
                   ≥4 daily consultations for ≥3 days
                   Annual OP count > 100
Medical Cost       Monthly OP expenses ≥ 5000 RMB
                   Annual OP + EM expenses ≥ 25,000 RMB
                   Annual total insurance > 30,000 RMB, drug > 80%, inspection < 10%
Medical Behavior   Same drug at ≥3 institutions in 1 week
                   >10 drug types at multiple institutions in 1 week

OP: Outpatient; EM: Emergency.
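For concreteness, the rules in Table 1 can be read as threshold predicates over per-card aggregates. The sketch below illustrates such a first-pass screen; it is not the paper's implementation, and all column names (op_count_month, drug_share, and so on) are hypothetical stand-ins for the settlement schema.

```python
import pandas as pd

def screen_cards(agg: pd.DataFrame) -> pd.DataFrame:
    """First-pass rule screen per insurance card, mirroring Table 1.

    Column names are illustrative placeholders, not the actual schema.
    """
    # Medical Frequency rules.
    freq = (
        (agg["op_count_month"] >= 15)
        | (agg["days_with_4plus_visits"] >= 3)
        | (agg["op_count_year"] > 100)
    )
    # Medical Cost rules (amounts in RMB).
    cost = (
        (agg["op_expense_month"] >= 5_000)
        | (agg["op_em_expense_year"] >= 25_000)
        | (
            (agg["insurance_total_year"] > 30_000)
            & (agg["drug_share"] > 0.80)        # drug > 80% of total
            & (agg["inspection_share"] < 0.10)  # inspection < 10% of total
        )
    )
    # Medical Behavior rules (assumed precomputed over 1-week windows).
    behavior = (
        (agg["inst_same_drug_week"] >= 3)
        | (agg["drug_types_multi_inst_week"] > 10)
    )
    out = agg.assign(freq_flag=freq, cost_flag=cost, behavior_flag=behavior)
    out["any_flag"] = freq | cost | behavior
    return out
```

Under this reading, a card flagged on any dimension becomes an individual anomaly clue feeding the group behavior analysis, rather than a confirmed fraud label.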
Table 2. Data anonymization scheme.

Field Name                     Anonymization Strategy
Patient/Physician name         Retain only the last name
ID number                      Retain the first 6 digits and digits 7 to 10
Home address                   Remove
Medical Insurance Card Number  Map to a random string of characters
Institution Code               Map to a random string of characters
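A minimal sketch of how the strategies in Table 2 could be realized; the helper names are hypothetical, and the name handling assumes surname-first (Chinese) name order. The card-number mapping is deliberately persistent, so records from the same card stay linkable after anonymization, which the group aggregation analysis requires.

```python
import secrets

# Persistent card-number -> token map (illustrative; in production this
# would live in a protected store, not in process memory).
_card_map: dict[str, str] = {}

def anonymize_name(full_name: str) -> str:
    # Retain only the last name; assumes surname-first (Chinese) order.
    return full_name[0] + "*" * (len(full_name) - 1)

def anonymize_id(id_number: str) -> str:
    # Retain the first 6 digits (region code) and digits 7-10 (birth
    # year); mask everything after digit 10.
    return id_number[:10] + "*" * max(len(id_number) - 10, 0)

def anonymize_card(card_number: str) -> str:
    # Map each card number to a random string, reusing the same token on
    # repeat occurrences so per-card behavior remains analyzable.
    if card_number not in _card_map:
        _card_map[card_number] = secrets.token_hex(8)
    return _card_map[card_number]
```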
Table 4. Comparison of different methods.

Method      Precision  Recall  Accuracy  F1    TP   FP   TN    FN
Our Method  0.89       0.42    0.87      0.57  748   92  7043  1034
HDBSCAN     0.73       0.34    0.84      0.47  611  226  6909  1171
K-means     0.73       0.38    0.85      0.50  674  251  6884  1108
XGBoost     0.82       0.41    0.86      0.55  730  160  6975  1052
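The headline metrics in Table 4 follow directly from the confusion-matrix counts; for instance, the first row totals the 8917 insurance cards in the dataset (748 + 92 + 7043 + 1034). A quick check:

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# First row of Table 4 (our method):
print(metrics(tp=748, fp=92, tn=7043, fn=1034))
# {'precision': 0.890..., 'recall': 0.419..., 'accuracy': 0.873..., 'f1': 0.570...}
```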