Medical Health Benefit Management System for Real-Time Notification of Fraud Using Historical Medical Records

This paper presents a novel framework for fraud detection in healthcare systems that self-learns from historical medical data. Historical medical records are required for training and testing machine learning models. The main problem faced by both private and government-supported health schemes is a rapid rise in the number of claims by beneficiaries, mostly based on fraudulent billing. Detecting fraudulent transactions in healthcare systems is a strenuous task because of the intricate relationships among dynamic elements, including doctors, patients and services. In light of these challenges in health support programs, there is a need to develop intelligent fraud detection models that trace the procedural loopholes which may lead to successful reimbursement of fraudulent medical bills. To address fraud in healthcare programs, we propose a framework based on three entities (patient, doctor, service). First, the framework computes association scores for the three elements of the healthcare ecosystem, namely patients, doctors and services, and uses these scores to filter out identified cases. Confidence values are then computed, after G-means clustering of the transactional data, for each service in each specialty, and rules are generated from the confidence values of the services for each specialty. The identified cases are then evaluated using the rule engine. The framework classifies cases as fraudulent based on the value of the similarity bit. The framework is validated on the transactional data of local hospital employees, which includes many reported cases of fraudulent activity in addition to some introduced anomalies.


Introduction
'Fraud' and 'abuse', these two phrases are generally used to identify the major medical reimbursement issues that defeat the ultimate objective of a valid claim. We divide the healthcare frauds into two major categories, service_availing patterns and service_providing patterns. Any fraud can occur, either in the service_providing patterns or in service_availing patterns. Figure 1 explains these two categories of the healthcare frauds. The service_availing patterns capture all the services availed by the patients, duplication of either services (actually not availed) or claims against those services. In simple words, a misrepresentation of the services (or products) for which, the bills are generated but actually not availed. For example, an insurance claim provided by the patient can be inconsistent with his age or gender. There is a possibility that one patient is availing the same service again and again or he/she is availing the service less frequently. In such a case, the frequency of the visits of patients to the hospitals or doctors is either quite high or low. The service_providing patterns refer to the misrepresentation of facts by the doctors, pharmacies or hospitals. There is a possibility that these service_providers generate duplicate bills for the same provided service. The doctors or hospitals can prescribe unnecessary treatments to the patients; the pharmacies can charge patients twice for the same medicine whereas the doctors can prescribe or perform unnecessary procedures and the providers may allow the medical card's misutilization. Though many companies normally maintain their 'Special Investigation Departments' to control all the frauds and abuses in the re-imbursement of the medical bills but this is not enough to fulfil the purpose. Such departments get the guidelines from multiple sources and apply 'Conventional Surveillance Techniques' [1]. 
Whenever these departments detect any fraudulent payments, they proceed for the recovery of funds and then try to introduce the controls to avoid a future occurrence of such misrepresented billings. Once any claimed case is identified as a fraud/abuse, it can be recognized as an identified pattern. Such identified patterns are then utilized to make the adjustments in the billing policies of the existing system in order to prevent the reoccurrence of fraudulent activities. This type of approach commonly known as 'pay and chase', is not an efficient manner of detecting a fraud as it only generates an extra expense [2]. It is of partial use against the healthcare fraud cases because there are high degrees of variations in the clinical practices and billing patterns due to the complex healthcare services. For example a variation can be noticed in the doctors fee structures despite the fact that they are working in the same specialized departments. Many studies have demonstrated variations as high as 400 percent in the frequency of the major procedures among different doctors of the same hospital. There are four categories of the claim analysis. The first one is claim-centric which identifies whether the provided services are according to a patient's age, gender and diagnosis. The second is member-centric which identifies whether the provided services are according to the specialty of the doctor. The third one is provider-centric which identifies whether the claimed services are provided by the specific hospital and the last category is the 'network analysis' which is based on the combination of the member-centric and provider-centric analysis [3]. Our research is focused precisely on claim-centric and member-centric. In recent past, several studies proposed techniques to develop fraud detection systems. Many of these studies used payment-based analysis to detect frauds. They use one of the healthcare elements to identify any one type of healthcare fraud. 
To the best of our knowledge, no prior work has considered the delivered or availed service as a separate element. Our proposed framework, in contrast, detects fraudulent activities using all three main elements, namely doctor, patient and service. The central part of the framework is the rule engine, which processes over five years of original transactional data of the employees of a local hospital to generate rules. Moreover, the self-learned fraud detection system detects patient-, doctor- and service-level frauds.
The fraud-related claims in healthcare are the sources of burden and inconvenience to the overall society. A fraud in healthcare, affects both, the public as well as private sector employers in the form of high-cost over-runs. There are many victims of healthcare frauds who are exploited by the unnecessary treatments. In some cases the patient's data is compromised to generate any fraudulent claims. It will be a meaningless statement if we say that the healthcare fraud is not a crime or there are no victims of this fraud. Accordingly, a detection of fraud in healthcare is a hot topic of research, nowadays. There is a need to cut down this increasing cost as the victims of the healthcare fraud are none other than a common man. In most of the countries including Pakistan, the government has just initiated medical support programs through several national-level initiatives. One of these initiatives is the establishment of Prime Minister Task Force on IT and Telecom in 2018 to lay down the foundation of the data standards and annotations for incorporating the improved plans in healthcare service delivery to the common man. Our work is part of this program, proposing a framework that can be adopted for this national initiative. The major concern is to reduce/prevent the chance of fraudulent activities in such programs. This can only be achieved by implementing different adaptive-modelling techniques for detecting fraud through the healthcare data. For this we have utilized last five years insurance claim data of employees of one of the largest and well-equipped hospital of Pakistan. They provided us sufficient details for supporting this National level objective. According to the provided statistics each day thousands of patients visit this hospital and it has 62 different specialties.

Research Contributions
In recent years, the focus is more on fraud detection in the healthcare as the people in well-developed countries think that the fraud increases an overall expenditure and makes the health insurance problematic for the genuine people. In most of the developing countries, the government has started medical support programs and if such programs face any victimization of the healthcare fraud then there will be no support for genuine patients. This paper presents a novel framework for the fraud detection in healthcare; which considers all three main elements, namely, Patient (service-consumer), Doctors (service_providers) and Services (lab tests and treatments). Our proposed framework provides following significant contributions required in any health care fraud detection scheme.

•
The framework provides a self-learned knowledge base system built on five years of original transactional data from a local hospital.

•
The novel concept of generating association scores between doctors, patients and services is introduced. The association scores are computed from the frequency of visits between these elements and are then used to detect anomalies.

•
Another novel idea, generating confidence values for all services in each specialty of a local hospital, is introduced. According to domain knowledge, only a cardiologist can recommend an ECG, whereas in practice even an ENT specialist can recommend one; the framework therefore computes the confidence value that the service named ECG has in the ENT specialty. Similarly, even a paediatrician can recommend a kidney ultrasound, and the framework computes the confidence value of the service named "kidney ultrasound" in the paediatrics specialty. There are many other such examples. The rule engine is generated from these confidence values.

•
Another contribution is that this work is part of the national medical support program.
We consider a private hospital as our pilot project because, in our country, electronic health records are not well maintained in public sector hospitals due to a lack of resources, whereas private sector hospitals use automated and autonomous Electronic Health Record Systems and the availability of patient data from the private sector is also better. For this reason, we consider the transactional data of a private sector hospital. The public sector programs normally run parallel to the private sector, but this research represents the private sector in the National Health Programs.

Related Work
A drastic rise in the healthcare expenditures for the treatment of patients, has led to an introduction of fraud controlling techniques in the hospitals so as to ensure the delivery of more efficient and high quality healthcare services. In many developing countries, there is no such substantial system developed yet, for handling insurance or fraud issues in healthcare. To get better understanding of the fraud detection in healthcare, there is a need to conduct a detailed literature survey of the existing frameworks, techniques and approaches. After conducting a detailed review of the related literature, we get to know that many authors have proposed solutions for the fraud detection in healthcare and have also applied the data-mining approaches and machine-learning algorithms. Large numbers of fraud detection researches are successfully conducted around the globe from local to national levels to control healthcare frauds. The researches vary in data sets, type of healthcare frauds and analysis scale and techniques. The research by [4] is based on a detailed survey of the statistical approaches and these approaches are still being applied for the identification and classification of fraud in healthcare. Another research study, related to the application of supervised and unsupervised learning techniques for healthcare fraud, is conducted [5]. Fraud detection systems are implemented in many other industries and detailed survey is performed by [6]. Ortega [7] designed a system which applied multi-layer perceptron neural networks on the data of Chilean private health insurance company to detect the fraudulent activities; the detection rate of this system is 75 frauds per month. Another framework is proposed [8], which introduced an adaptable model using clinical ways for an automatic fraud detection. The framework for the fraud detection using unsupervised learning for a detection of the outliers in Medicaid insurance-claimed data is proposed [9]. 
Qi Liu [10] considered a clustering model which is based on the geographical location of Medicaid service_providers and clients to identify fraudulent claims. Moreover, a detection of fraud without considering the roles of the providers and clients is proposed [11]. It is the machine learning based system which involves the hierarchical processing along with assigning the weight to actors, the expectation maximization clustering technique is applied to find out the related groups of the actors. Thorton [12] applied the multidimensional data models and approaches for the prediction of fraudulent claims in Medicaid and the proposed system detected fraud cases. Many recent studies have utilized Public Use Files (PUF) data from CMS for detecting any fraudulent activities using the data mining techniques [13][14][15][16][17][18][19]. All the researches have focused on 'PART-B' of this above-mentioned data (PUF). Many statistical techniques are also used to generate decision rules and k-means clustering is applied on a time series-based insurance claim data for the identification of anomalies and outliers. These disease-based outliers are used to detect the fraud related activities [20].

1.
Most of the researches related to an anomaly detection in the healthcare, have considered the clinical processes for a particular disease and utilized prior knowledge and applied the unsupervised models [21][22][23].

2.
Many researchers have focused on the statistical financial data and performed analytics using a variety of tools. Fuzzy and Neurofuzzy analysis is performed in the multiple researches for extracting interesting patterns [24][25][26].

3.
Many frameworks for the fraud detection are proposed and the focus of the authors is on the correlation of the medicines, diseases and patients. Frauds are detected by assigning weights to highly correlated data. Many authors have utilized the concept of 'graph theory' for connecting the patients, diseases and medicines. Most of the times, the studies are supplemented with the prior knowledge of the medicines that were being used for the various diseases and they established a correlation between the reference set (the original knowledge) and the candidate set (the extracted knowledge) [27,28].

4.
To identify the joint fraudsters, the clustering technique is applied. The similarity adjacency graph is used along with group mining for distinguishing the normal behaviour from abnormal behaviours [29]. The treatment model for different diseases, that of, assessing the doctors' trustworthiness, which is one of the critical metrics for detecting fraud at the provider level, is introduced [30]. The association rule mining is a very important technique which generates rules for the frequently occurring items. This technique is being utilized in many previous researches for generating rules from the domain knowledge provided by the domain's experts. This technique generates the rules out of which some are significant and some are insignificant [31]. There are two most useful parameters to analyze the strength of the association rules namely: confidence and support [32][33][34]. The characteristics like uniqueness, understandability, applicability and reliability for assessing the generated rules are discussed [35]. The classification of the insurance claims are performed by using the Genetic Support Vector Machine [36]. Table 1 is providing detailed comparison between the exisiting systems and the proposed framework.
All aforementioned researches focus on disease correlations and medications whereas our proposed framework generates associations between the doctors, patients and services. Most of the existing studies use the domain knowledge to make the knowledge base but our system learns knowledge from the five years transactional data of a local hospital using the machine learning techniques. Based on this knowledge base, we classify the fraud cases. Most of the previous researches are based on the financial analysis for the detection of fraudulent activities but our research identifies anomalies using the association scores and performs the rule-engine analysis for the identification of the fraud cases. We analysed some of the most relevant researches with respect to data mining techniques used to detect types of frauds in Table 1. The comparative analysis highlights that most of the researches lack inclusion of all three key elements (doctors, patients, service) during analytical processing of data. The payment based analysis is utilized to detect patient level frauds and medication/disease associations are analyzed for detecting doctor level frauds. The most critical element missing in all these recent researches are services, which are either provided or availed.

Dataset Details
The analysis is conducted on five years (2013, 2014, 2015, 2016, 2017, 2018, 2019) of insurance claim transactional data from a local hospital. The claimants are hospital employees who avail insurance policies provided by the hospital management; the policies are allocated to each employee based on designation. The size of the transactional data is shown in Table 2. The initial framework was proposed in [37]. The attributes that describe the availed and provided services are shown in Table 3.
The framework involves the implementation of three phases for detecting fraudulent activities:

•
Association scores generation and threshold application

•
Rule generation engine

•
Similarity function

We have implemented the fraud detection system by incorporating these three phases. Detecting fraud in healthcare data is essentially the identification of outliers in such records. In the first phase, we identify the "outliers" and "need to be investigated" cases. In the second phase, we implement the rule engine for further analysis of the cases identified in the first phase. In the third phase, we check each current transaction against the generated rules. The proposed framework is depicted in Figure 2. The associations between the doctors, patients and services are computed, and whenever a case of fraud is identified, the rating score of that element is reduced. The association scores are derived from the number of visits and give an in-depth understanding of the behaviour of each element.

Association Scores Computation and Threshold Application
The three main elements of the proposed framework are patients, providers (doctors, pharmacies, hospitals) and services. These three elements are associated with each other, and we need to find the association score of each element with every other element. The association scores are computed based on the frequency of visits or the frequency of prescriptions. For example, if a patient visits frequently to avail a specific service (e.g., X-rays, ECGs) and is prescribed X-rays again and again by one doctor, this is considered an outlier. We compute the association scores based on the frequency of patient visits to the providers and services. The purpose of this step is to forward to the rule engine only those patient records that are identified as "outliers" or "need to be investigated". We compute the association scores using Equations (1)-(4).

•
Doctor (Association score) Y is computed by denoting i as the number of times patient $P_k$ is checked by doctor $D_j$, and $D_n$ as the total number of patients checked by doctor $D_j$, as shown in Equation (1):

$$Y = \frac{i}{D_n} \tag{1}$$

•
Patient with services (Association score) is computed by denoting $S_x$ as the number of times patient $P_k$ availed service $S_h$, and m as the number of patients who availed service $S_h$ (for all patients):

$$S_d = \frac{S_x}{m} \tag{2}$$

•
Service with Doctor (Association score) is computed by denoting T as the number of times doctor $D_j$ prescribed service $S_h$, and $S_n$ as the number of times all doctors prescribed service $S_h$ (for all services):

$$S_p = \frac{T}{S_n} \tag{3}$$

•
Patient (Association score) is computed by denoting G as the number of times doctor $D_j$ examined patient $P_k$, and $P_n$ as the total number of visits of patient $P_k$:

$$F = \frac{G}{P_n} \tag{4}$$
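As a minimal sketch of how such ratio-based scores could be computed from visit records, the snippet below assumes that each transaction is reduced to a (patient, doctor, service) triple and that every score is a pair count divided by the total count for the fixed element, so that it lies between 0 and 1. The function name `association_scores`, the record layout and the sample data are hypothetical and are not taken from the hospital system.

```python
from collections import Counter


def association_scores(pairs):
    """Return {(a, b): count(a, b) / total count for a} for (a, b) pairs.

    Assumption: a score is the share of element a's activity that
    involves element b, which keeps every score in (0, 1].
    """
    pairs = list(pairs)
    pair_counts = Counter(pairs)
    totals = Counter(a for a, _ in pairs)
    return {(a, b): n / totals[a] for (a, b), n in pair_counts.items()}


# Hypothetical de-identified transactions: (patient, doctor, service).
transactions = [
    ("P1", "D1", "ECG"),
    ("P1", "D1", "ECG"),
    ("P1", "D1", "XRAY"),
    ("P2", "D1", "ECG"),
    ("P2", "D2", "XRAY"),
]

# Doctor score: share of a doctor's visits that involve a given patient.
doctor_scores = association_scores((d, p) for p, d, s in transactions)
# Patient score: share of a patient's visits made to a given doctor.
patient_scores = association_scores((p, d) for p, d, s in transactions)
# Service-with-doctor score: share of a service's prescriptions by a doctor.
service_doctor_scores = association_scores((s, d) for p, d, s in transactions)
# Patient-with-service score: share of a service's uses by a patient.
service_patient_scores = association_scores((s, p) for p, d, s in transactions)

print(doctor_scores[("D1", "P1")])  # 3 of D1's 4 visits involve P1
```

In this toy data, patient P1 accounts for three of the four visits to doctor D1, which is the kind of concentration the thresholds in the next step are meant to surface.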
The association scores lie between 0 and 1. After computing the association scores, we calculate a threshold as the average of all the association scores for each provider, service or patient. Scores that are greater than the minimum threshold and less than or equal to the average are considered normal cases, whereas scores that are greater than the average but less than the maximum threshold are labelled "need to be investigated". The minimum and maximum threshold values are set to identify the outliers. The minimum threshold reflects the view that anything that has happened just once is an anomaly: if a patient visits a provider only once (or a doctor prescribes a service just once to only one patient), that can be an anomaly. We have therefore set the minimum threshold to 0.011. Similarly, we have chosen the maximum threshold by considering that if a patient visits the same doctor in more than 70 of a total of 100 visits, there could be an anomaly. That is why we have set the maximum threshold to 0.7. All association scores that are less than the minimum threshold or greater than the maximum threshold are identified as outliers. The flow of this first phase is shown in Figure 3. Patient association scores are denoted as F, doctor association scores are denoted as Y, and the association scores of services with respect to doctors and patients are denoted as $S_p$ and $S_d$, respectively. The thresholds are set for all association scores as discussed above. A hash algorithm is applied for the de-identification of patient records.
The variables Y, $S_p$, $S_d$ and F hold the association scores of patient with respect to doctor, service with respect to doctor, service with respect to patient and doctor with respect to patient, respectively. The threshold is computed separately for each type of association score. The variable Z represents the function or container that holds the values of all four types of association scores after the threshold values have been computed. We apply four checks, on Y, $S_p$, $S_d$ and F separately, and based on these checks the "outlier" and "need to be investigated" cases are identified. The rating score is initially set to 100 for each element of the framework and, after the first phase, is updated according to the number of identified cases: each time an identified case is found, the rating of that particular element is decremented, as shown in Figure 3. The "need to be investigated" and "outlier" cases are analyzed in the second phase.
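The first-phase checks described above can be sketched as follows. The thresholds 0.011 and 0.7 and the initial rating of 100 come from the text; the function names, the handling of scores exactly equal to a threshold, the one-point decrement per identified case and the sample scores are assumptions made for illustration only.

```python
from statistics import mean

MIN_THRESHOLD = 0.011  # minimum threshold stated in the text
MAX_THRESHOLD = 0.7    # maximum threshold stated in the text


def classify(score, average):
    """Label one association score against the average and fixed thresholds."""
    if score < MIN_THRESHOLD or score > MAX_THRESHOLD:
        return "outlier"
    if score > average:
        return "need to be investigated"
    return "normal"  # between the minimum threshold and the average


def first_phase(scores, initial_rating=100):
    """Classify all scores of one element and decrement its rating.

    Assumption: the rating drops by one point per identified case.
    """
    average = mean(scores.values())
    labels = {key: classify(value, average) for key, value in scores.items()}
    identified = sum(label != "normal" for label in labels.values())
    return labels, initial_rating - identified


# Hypothetical association scores of five patients for one service.
scores = {"P1": 0.02, "P2": 0.05, "P3": 0.30, "P4": 0.75, "P5": 0.005}
labels, rating = first_phase(scores)
for patient, label in labels.items():
    print(patient, label)
print("rating:", rating)
```

Only the transactions labelled "outlier" or "need to be investigated" would then be forwarded to the rule engine.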

Rule Engine Generation
The second phase of the proposed framework generates rules for each specialty of the local hospital. As already mentioned, the proposed framework is validated on original data from a local hospital, which has 62 specialties. The following two main tasks are executed in this phase:

•
We perform hashing on the patients' data by assigning separate identification numbers to every service, every doctor and every specialty.

•
We cluster the transactional data and generate association rules.
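The first task above could look like the following minimal sketch. The text does not name the hash algorithm, so the use of salted SHA-256, the truncation length, the helper name `pseudonymize` and the sample record are all assumptions for illustration.

```python
import hashlib


def pseudonymize(identifier, salt, length=12):
    """Map a raw identifier to a stable pseudonym with salted SHA-256.

    Assumption: a truncated hex digest is an acceptable identifier here.
    """
    digest = hashlib.sha256((salt + ":" + identifier).encode("utf-8"))
    return digest.hexdigest()[:length]


SALT = "example-secret-salt"  # hypothetical; a real salt must be kept secret

# Hypothetical raw record; field names are illustrative only.
record = {"patient": "MRN-000123", "doctor": "Dr. A", "service": "ECG"}

# Salting per field keeps patient, doctor and service pseudonyms separate.
deidentified = {
    field: pseudonymize(value, SALT + field) for field, value in record.items()
}
print(deidentified)
```

Because the mapping is deterministic, repeated visits by the same patient still receive the same pseudonym, which the frequency-based association scores rely on.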
During cluster analysis we found outliers within different clusters. For this purpose, after clustering the transactional data, we applied the concepts of support and confidence to the generated clusters. We applied three different clustering algorithms to the transactional data: G-means, X-means and Fuzzy C-means. The G-means clustering algorithm is an extension of k-means that tries to find subsets of the data that each fit a Gaussian distribution. G-means executes k-means and increments the value of k hierarchically until the data assigned to each centroid are Gaussian. Prior research reports that G-means is an improved form of clustering that has provided intrusion detection with a high detection rate and a low false positive rate. The technique can approximate the number of clusters in the data and initialize the centroids, which results in fast convergence of the algorithm [38]. X-means [39] executes k-means multiple times and, during each run, takes local decisions on whether to split the current centroid; the splitting decision is taken by computing the Bayesian Information Criterion (BIC) [40]. We compare the clusters generated by all three algorithms in Table 4. For this comparison, we took one cluster and computed its mean, which we call the actual center. Centroids are generated by Fuzzy C-means, G-means and X-means, and for each algorithm we pick the centroid that is closest to the computed mean; this is the computed center. The difference is obtained by subtracting the actual center from the centroid computed by the algorithm. Based on our analysis, G-means clustering is more efficient than the other two clustering techniques for this transactional data.
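The comparison procedure can be sketched as below. The cluster points and the centroids attributed to each algorithm are hypothetical values chosen only to illustrate the steps; they are not the results reported in Table 4. Using the Euclidean distance as the "difference" is also an assumption.

```python
from math import dist
from statistics import fmean


def cluster_mean(points):
    """Actual center: coordinate-wise mean of the points in one cluster."""
    return tuple(fmean(axis) for axis in zip(*points))


def closest_centroid(centroids, center):
    """Computed center: the algorithm's centroid nearest to the actual center."""
    return min(centroids, key=lambda c: dist(c, center))


# Hypothetical cluster and hypothetical centroids per algorithm.
cluster = [(1.0, 2.0), (2.0, 2.0), (3.0, 5.0)]
centroids_by_algorithm = {
    "G-means": [(2.1, 3.0), (8.0, 8.0)],
    "X-means": [(2.5, 3.5), (7.5, 8.0)],
    "Fuzzy C-means": [(1.5, 2.5), (8.2, 7.9)],
}

actual = cluster_mean(cluster)
differences = {}
for name, centroids in centroids_by_algorithm.items():
    computed = closest_centroid(centroids, actual)
    differences[name] = dist(computed, actual)  # assumed difference measure

best = min(differences, key=differences.get)
print(actual, differences, best)
```

The algorithm with the smallest difference is the one whose centroid best matches the actual center of the selected cluster.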

Rule Engine Algorithm
The following steps are used to generate the rule engine.

Step 1
Perform de-identification of patient records.

•
Each patient is assigned a unique patient number (patient_n)

•
Each doctor/specialization is assigned a unique doctor identifier (doctor_n)

•
Each service is assigned a unique service identifier (service_n)

Step 2
Group the patient records based on the specialty_id from which they availed the service. Gaussian-based clustering is used for the identification of clusters, as shown in Figure 4.
Here, $cluster_n$ is the total number of elements in the clusters, and $Transaction_{c_n}$ represents the transactions of the patients $P_k$ that are identified as one of two separate cases, namely "need to be investigated" and "outliers". All transactions identified with these labels are transferred to the rule engine for further analysis. We compute a confidence value for each service within the clusters. We then apply a threshold to the confidence values of all members within the clusters, and all members whose confidence values lie on the boundaries are identified as anomalies. The flowchart for the second phase is shown in Figure 4, which shows how the clusters are processed to generate rules. We find the support count of each specialty $D_j$ in all clusters and then the support count of each service $S_h$ for this specialty $D_j$. Finally, these support counts are used to compute the confidence values. The last condition checks whether confidence values have been computed for all specialties. Based on the computed confidence values, rules are generated and stored in the database for the third phase.

Figure 5 describes the complete fraud detection system. In Figure 5 there are three main elements, and each element receives transactional data from different hospital servers. Each element (patient, provider, service) has its own storage. Association scores are computed between each pair, namely service with respect to doctor, service with respect to patient, patient with respect to doctor, and doctor with respect to patient. Once the association scores are computed and the thresholds are applied, we obtain the set of identified cases. Transactions are identified as one of two cases, "outliers" or "need to be investigated". The rating of each element (patient, provider, service) whose transactions are found to be suspicious is decremented. These cases are used as input to the rule engine.
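The support-count and confidence computation described above can be sketched as follows, using the standard association-rule definition of confidence (support of specialty and service together, divided by support of the specialty). The sketch skips the clustering step and works directly on (specialty_id, service_id) pairs; the function names and sample transactions are hypothetical. The default threshold of 0.001 is the value reported in the case study, while the larger threshold in the usage lines is chosen only so that the toy data shows a service being excluded.

```python
from collections import Counter, defaultdict


def confidence_values(transactions):
    """Return {specialty_id: {service_id: confidence}}.

    confidence = support(specialty, service) / support(specialty)
    """
    transactions = list(transactions)
    specialty_support = Counter(spec for spec, _ in transactions)
    pair_support = Counter(transactions)
    confidence = defaultdict(dict)
    for (spec, service), count in pair_support.items():
        confidence[spec][service] = count / specialty_support[spec]
    return dict(confidence)


def generate_rules(confidence, threshold=0.001):
    """One rule per specialty: the services whose confidence exceeds threshold."""
    return {
        spec: {svc for svc, value in services.items() if value > threshold}
        for spec, services in confidence.items()
    }


# Hypothetical (specialty_id, service_id) transactions.
transactions = (
    [(620, 11)] * 6 + [(620, 12)] * 3 + [(620, 13)] * 1 + [(101, 11)] * 2
)
confidence = confidence_values(transactions)
rules = generate_rules(confidence, threshold=0.15)  # illustrative threshold
print(confidence[620])
print(rules)
```

Each resulting rule lists the services that a specialty is expected to provide, which is what the third phase compares incoming transactions against.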
The rule engine further analyzes these transactions; if the cases are detected as fraud, the rating score of the involved element remains the same, otherwise the rating score is updated. A set of rules is generated for each specialty_id. Whenever a patient visits the hospital to avail a particular set of services, the system first checks which specialty_id the patient visits and then evaluates the transaction against the rules already computed for that specialty_id.

The third phase of the proposed framework is shown in Figure 6. The similarity function is used to compute the similarity between the current transaction c and the generated rule R. The similarity bit is denoted by a and the similarity function by H. After the similarity computation, the similarity bit is

$$a = \begin{cases} 1, & \text{if } \mathrm{Size}(c) = \mathrm{Size}(H) \\ 0, & \text{otherwise} \end{cases}$$

If the similarity bit is not 1, the transaction is marked as fraud; otherwise it is marked as normal.
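A minimal sketch of the third-phase check is given below. It assumes that the similarity function H returns the services of the transaction that also appear in the rule (a set intersection), which is one reading of the description; the function names, the rule contents and the service identifiers are hypothetical.

```python
def similarity(transaction, rule):
    """H: services of the transaction that are also present in the rule.

    Assumption: similarity is modelled as a set intersection.
    """
    return set(transaction) & set(rule)


def similarity_bit(transaction, rule):
    """a = 1 if Size(c) equals Size(H), otherwise 0."""
    matched = similarity(transaction, rule)
    return 1 if len(set(transaction)) == len(matched) else 0


def label(transaction, rule):
    """Mark a transaction as normal only when the similarity bit is 1."""
    return "normal" if similarity_bit(transaction, rule) == 1 else "fraud"


# Hypothetical rule for one specialty_id and two hypothetical transactions.
rule = {11, 12, 13}
print(label([11, 12], rule))     # every service is covered by the rule
print(label([11, 12, 2], rule))  # service 2 is not in the rule
```

Under this reading, a single service outside the rule is enough to set the similarity bit to 0 and flag the whole transaction.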

Case Study
The five years (2013, 2014, 2015, 2016, 2017, 2018, 2019) of annotated insurance claim transactional data of the employees of a local hospital are considered for this analysis. The problem addressed is the constant year-on-year increase in employee insurance coverage expenditure, as depicted in Figure 7, which could grow exponentially in the coming years due to the increase in healthcare fraud. The fraud detection model is applied to analyze this dataset; only a subset of the results is shown in the figures to give a better understanding of the work.

First Phase
In the first phase, the association scores are computed between each pair of elements. A few cases are shown in this section to explain how the association scores are actually computed. In this phase we identified two separate kinds of cases, "outliers" and "need to be investigated"; the examples below illustrate the latter. The association scores between the service Optical Coherence Tomography (OCT) scan and patients are shown in Figure 8. In total, 21 patients availed the OCT scan service, and the average of all association scores is 0.052. We set this average value as the threshold. It can be seen from Figure 8 that two patients are identified as "need to be investigated", and the rating of this service is decremented from 100 to 98, where 100 is the maximum rating score. The association scores for all other services and patients are computed in the same manner, and the rating scores are adjusted accordingly. Figure 10 shows the association scores of services with respect to doctors. The service Routine Electroencephalogram (EEG) is prescribed by 50 different doctors, and six cases are identified as "need to be investigated". The threshold value is 0.0476. The rating score of this service is decreased by 6, to 94. Complete output is shown in Appendix A.1.

Second Phase
All records identified as "outlier" or "need to be investigated" are forwarded to the rule engine for further investigation. In total, 62 association rules are generated from this data set, one for each specialization, and specialty_id is used as the identifier of each specialization. The rule engine generates rules that describe which specialization can provide which specific services. We generated each rule by computing confidence values for every service in a particular specialization. Using this knowledge, we can evaluate whether each transaction is normal or fraudulent by applying the similarity function. We selected the specialty Urology, with specialty_id: 620, and obtained all services provided by this specialty_id, as shown in Figure 12. Each service has a confidence value for each specialty_id. The relationship between service and specialty is depicted in Figure 13, which plots the confidence values of all service_ids for the specialty_ids. The confidence value gives an estimate of the probability that the considered service is prescribed in this specialty_id. Based on this estimate, resources can also be allocated and budgets planned. Table 5 shows the confidence values of a few services for different specialties. A rule contains only those service_ids whose confidence values are above 0.001; alternatively, users can define the threshold according to their own scenario. The rule generated by the rule engine can be explained with an example. Consider the specialty Pediatric Cardiology. Table 6 shows the services availed from this specialty together with their confidence values, and Table 7 shows the rule for this specialty. Suppose the Abdomen upper service is availed in a transaction. The similarity function first checks whether this service is present in the rule of Pediatric Cardiology, as shown in Table 6.
If it is not present, the case is identified as fraud. Likewise, if a transaction avails a service whose confidence value for the considered specialty is less than 0.001, it is also identified as fraud and passed to the analyst dashboard for further investigation. The rules are generated from the historical medical data.
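The rule generation described above can be illustrated with a minimal sketch. It assumes the standard association-rule definition of confidence, that is, the fraction of a specialty's historical transactions that include a given service; the exact formula used by the framework is not stated in this section. The function name `build_rules`, the input layout and the sample service_ids are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict


def build_rules(transactions, threshold=0.001):
    """Build one rule per specialty from historical transactions.

    transactions: iterable of (specialty_id, services) pairs, where
    services is a collection of service_ids (assumed layout).
    Returns {specialty_id: {service_id: confidence}} and keeps only
    services whose confidence is above the threshold.
    """
    specialty_counts = Counter()  # transactions seen per specialty
    pair_counts = Counter()       # transactions per (specialty, service)

    for specialty_id, services in transactions:
        specialty_counts[specialty_id] += 1
        for service_id in set(services):  # count a service once per transaction
            pair_counts[(specialty_id, service_id)] += 1

    rules = defaultdict(dict)
    for (specialty_id, service_id), n in pair_counts.items():
        # Assumed definition: share of the specialty's transactions with this service.
        confidence = n / specialty_counts[specialty_id]
        if confidence > threshold:
            rules[specialty_id][service_id] = confidence
    return dict(rules)


# Illustrative data only; these service_ids are not taken from the paper.
history = [
    (620, {1, 3}),
    (620, {1}),
    (620, {1, 3}),
    (620, {3, 7}),
    (100, {2}),
]
rules = build_rules(history, threshold=0.3)
```

With the default threshold of 0.001 almost every observed service stays in the rule; the larger threshold in the example is used only so that the filtering effect is visible on a handful of records.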

Third Phase
The following example explains how the similarity bit is computed using the rule already generated for specialty_id: 620. In the current transaction c, the patient avails three services from specialty_id: 620, and it can be seen from Table 5 that this specialty has no service with service_id: 2. The similarity function is evaluated as follows. Let c be the current transaction and H the set of services in c that also appear in the rule. If Size(c) and Size(H) are equal, the similarity bit a is 1; otherwise a = 0. Here one service of c is missing from the rule, so Size(H) is smaller than Size(c) and a = 0, which means this transaction is a fraud case. The current transaction is normal only if the similarity bit is equal to 1. The rule engine is generated from five years of annotated transactional data; based on the association scores, we identified cases and evaluated them against the rules, and thereby found the already tagged fraud cases. After this analysis, we arrived at the final status and obtained separate ratings for doctors, patients and services.
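The similarity check can be sketched in a few lines. This is an illustration under stated assumptions rather than the framework's actual code: H is taken to be the set of services in the current transaction that also appear in the rule, and the bit is 1 only when every service of the transaction is covered. The function name `similarity_bit` and all service_ids other than service_id 2 are hypothetical.

```python
def similarity_bit(current_services, rule_services):
    """Return 1 if every service in the transaction is in the rule, else 0."""
    c = set(current_services)
    # Assumed reading of H: services of c that are also present in the rule.
    h = c & set(rule_services)
    return 1 if len(c) == len(h) else 0


# Hypothetical rule for specialty_id 620; it does not contain service_id 2.
rule_620 = {1, 3, 5}

# A transaction with three services, one of which (service_id 2) is not in the rule.
suspicious = similarity_bit({1, 2, 3}, rule_620)

# A transaction whose services are all covered by the rule.
normal = similarity_bit({1, 3}, rule_620)
```

Under these assumptions the first transaction yields a similarity bit of 0 and would be flagged, while the second yields 1 and would be treated as normal.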

Detected Frauds
After the third phase, the fraud cases have been detected, and we can check the status and rating of each element. As already mentioned, because of the large size of the data, only a subset of records is shown in the screenshots that depict the performance of our system. One point must be clarified here: we considered employee insurance claim data, so the detected cases are few in number because the patients are either employees or their beneficiaries. We discuss the detected fraud cases using the screenshots. Figure 14 depicts the final status and ratings of doctors. Three doctors have been identified by our system as fraudulent, and their rating scores have been adjusted accordingly. Doctor_id 2301, 551 and 31 are identified as the fraud cases. The initial status and ratings of these doctors, generated in the first phase of the proposed system, are depicted in Figures 15-17. Figure 15 shows the association score of doctor_id: 2301. The initial rating of this doctor is 95, as five identified cases of this doctor are forwarded to the rule engine (second phase). The complete output of Figure 15 is provided in Appendix A.2. The third phase identified doctor_id: 2301 as a fraud, and the updated rating score of this doctor is 99, as shown in Figure 14. Figure 16 shows the initial rating of doctor_id: 31, which is −9 (negative). More than 100 identified cases of this doctor are forwarded to the rule engine (second phase). The doctor_id: 31 is also identified as a fraud, and the final rating is 99, as shown in Figure 14. Figure 17 shows that the initial first-phase rating of doctor_id: 551 is 7 and that 93 cases of this doctor are forwarded to the second phase for analysis. In the third phase, this doctor is identified as a fraud and the updated rating score is 98. Figure 18 shows that eight fraud cases are identified in the service_availing patterns; this figure shows only a subset of the complete output.
We can also check the initial ratings and association scores of each of these patients in the first phase, as we did for the doctors. For the service element, service_id 221 and service_id 250 are detected as fraud, as shown in Figures 19 and 20.
Thus, we analysed this transactional data in terms of the three elements of the proposed framework and detected a different number of already tagged fraud cases for each element. The rule engine was designed on the basis of five years of transactional data; each specialty_id (a specialization such as cardiology or urology) has a set of services with confidence levels that define its rule.

Conclusions
Many countries have recently initiated government medical support programs, and such programs have no tolerance for fraudulent claims. There is a critical need for a system that captures and identifies fraud cases in the day-to-day transactions of the healthcare industry. Many research studies have been conducted in the last decade, but most of them are based on financial analysis and disease/medication analysis. We have proposed a framework that considers patients, doctors (providers) and services as its main elements. We computed the relationships between these elements by calculating association scores. By learning from the historical transactional data, we generated a rule engine. First, the dataset is filtered based on the association scores of the elements, and the identified cases are then forwarded to the rule engine for further analysis. The fraud cases are finally identified, and the ratings of all three elements are updated after evaluation by the rule engine. We validated this framework for detecting fraudulent transactions on annotated transactional data from a local hospital and successfully identified eight fraud cases for the patient element, two cases for the service element and three cases for the doctor element of the proposed system. We communicated our findings to the hospital management.
In the future, the proposed methodology can be further improved by extracting the sequences of services availed from each specialty using data mining techniques. Once a set of sequences has been found for every specialty, fraud detection will be more effective.

The complete output of Figure 15 is depicted in Figures A5 and A6.