A Decision Support System with Artificial Intelligence and Natural Language Processing to Mitigate the Deduction Rate of Health Insurance Claims

Abstract: Globally, 20% to 40% of medical resources are wasted, which could be avoided through professional audit of health insurance claims. A professional audit can pinpoint excessive use of unnecessary medicines and medical examinations. When examining health insurance claims, Taiwan's National Health Insurance Bureau (TNHIB) deducts the weight carried by medical resources that it regards as unnecessary or abused. The ratio of the deducted weight to the total weight claimed by a hospital is defined as the health insurance claim deduction rate (HICDR). A high HICDR increases the operating expenses of the hospital. In addition, preparing and filing appeals against the deductions consumes considerable hospital resources. This study aims to: (1) minimize the weight deducted by the TNHIB for a hospital; and (2) facilitate efficient appeals against claim denials. It is expected that the HICDR will be reduced through big data analytics. In this study, evidence-based medicine (EBM) is used to clarify the debates, dilemmas, and conflicts of interest in examining health insurance claims. A natural language processing method, latent Dirichlet allocation (LDA), was used to analyze patients' medical records. The topics derived from the LDA are used as factors in a logistic regression model to estimate the probability of each claim being deducted. The experimental results on various medical departments show that the proposed predictive model produces accurate results and leads to a reduction of more than 41.7% in the deduction of health insurance claims, equivalent to a saving of more than 750 thousand NT dollars per year. The efficiency of the application is validated against the manual process, which is time-consuming and labor-intensive. Moreover, it is expected that this study will compensate for the shortcomings of traditional methods and offer a new and effective solution to reduce the deduction rate.


Introduction
The World Health Organization (WHO) pointed out in its annual report that a high percentage of medical resources are wasted globally [1]. For example, in 2016 the United States spent more on health care than any other country, with costs approaching 18% of the gross domestic product (GDP), and approximately 30% of health care spending may be considered waste during 2012-2019 [2]. The amount of waste in health care remains an important issue in healthcare finance. Healthcare professionals perceive substance abuse patients as individuals suffering from self-inflicted illness who are unworthy of medical treatment, a burden to the medical system, and a waste of medical expense [3].
In Taiwan, a payment system for associated groups in medical diagnosis has been implemented since 2010 to use medical resources effectively. In most cases, after

• Recognizing key words, phrases, and sentences written without specific formats, and understanding descriptive topics in patient records;
• Developing an effective mechanism and strategy from an economic perspective that yields high success rates with minimum human work;
• Identifying the key factors that influence the weight deduction, including examination items, types of prescribed drugs, and patients' symptoms.
In addition, diverse group-specific focuses may lead to debates, dilemmas, and conflicts of interest, which need to be clarified when examining claims. Evidence-based medicine (EBM) is adopted in this study to address the deficiencies arising from diversity, difference, and political bias. EBM can draw on various research methods that not only integrate collaborative and participatory approaches but also define problems and consensus-based solutions.
To analyze patient records, text mining is used in this study to analyze the cases, determine the rules by which the weight deduction is made, and establish a database of the words used in medical records. In this way, doctors can quickly and easily follow the rules to solve weight deduction problems. The objective of this study is to help hospitals minimize the deducted weight by establishing a database of successful and failed cases through big data analytics technologies, which could help avoid wrong or unsuitable weight claims and provide supporting information for an appeal against a medical claim denial.
This study focuses on the three morbid entities with the highest HICDR, namely E11 (type 2 diabetes mellitus), N18 (chronic kidney disease), and K21 (gastro-esophageal reflux disease), and excludes hospitalized records. Section 2 surveys the literature related to claim examination and approaches to reduce the deduction rates of health insurance claims. The proposed solution approach is presented in Section 3. The case of Pu-Chi Hospital is studied in Section 4. Section 5 concludes this study. The contribution of this study to hospitals is a solution that reduces the deduction rate of health insurance claims and avoids the waste of medical resources.

Literature Review
Several major studies have addressed claim examination and analysis. Sacks et al. [4] noted that rising health insurance premiums represent a rapidly increasing burden on employer-sponsors of health insurance and their employees, and proposed a web-based nutrition program to reduce health care costs in employees with cardiac risk factors. The Institute of Medicine (IOM) in the United States reported that prescribing the wrong medication is a serious problem whose effects can sometimes be fatal. To address this problem, Miller and Mansingh designed and implemented a distributed intelligent mobile agent-based system called "OptiPres" [5]. It assists doctors in making more informed decisions by either choosing the optimal solution from a repository of past decisions or presenting a set of possible drugs and using a specific criterion to identify the optimal prescription. Huang et al. analyzed a probabilistic model for reducing medication errors, using various thresholds to reduce inappropriate prescriptions [6]. Sung et al. conducted a bibliometric and text mining analysis on PubMed of studies using Taiwan's National Health Insurance claims data [7]. Chen et al. applied data mining technology to build a prediction model and compared the performance of a logistic regression model with that of a neural network model; their sensitivity analysis showed that an overall check-up at the administrative audit stage can greatly reduce the load of professional audit [8]. Maass et al. validated that timely access to up-to-date diagnostic information reduced redundant clinical re-appointments, repeated tests, and mail orders for missing data [9].
Chien et al. [10] developed a model that, while doctors are writing medical records, informs them of the most suitable items with high unexpected costs and of the medical expenses that may exceed the payment quota. This model improves the writing of medical records and provides supplemental information for the main diagnosis and possible complications. It even includes detailed reasons for the high costs of a service or item, thus avoiding cost deduction in health insurance sample review. Chen discussed the influence of health insurance weight claim deduction on outpatient medical practices and interviewed staff at a regional teaching hospital in Taiwan through qualitative research [11]. Huang investigated the cases of approved insurance coverage for emergency treatments and analyzed the denied cases through a mixed method that combined a quantitative method using modified Delphi questionnaires with a qualitative method using in-depth interviews, in order to increase the success rate of health insurance coverage claims [12]. Yi-Ting Cheng used medical records to produce retrospective professional judgements and interviewed doctors for their opinions on medical records with high outpatient rates and low deduction rates. The results provide a reference for raising the quality index of outpatient medical records and reducing medical cost deduction [13].
Most of the above-mentioned research was conducted via expert interviews and questionnaire surveys. Only a few studies used big data systems and text mining to determine the topics and words related to the deducted items. Moreover, doctors do not maintain medical records in a uniform format. The audit committee judges whether a medical treatment is reasonable according to the correlation between patients' medical records and prescribed drugs. From this perspective, research on recognizing and identifying important words and topics in medical records is greatly needed. Text mining can be used to pinpoint the important words and phrases, which can then be used as logistic regression model variables to investigate whether significant correlations exist between these variables and other medical variables [14][15][16]. The text content analysis model and the logistic regression model are proposed in Section 3.
In addition, evidence-based medicine emerged in the early 1990s and is defined as 'the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients, based on skills which allow the doctor to evaluate both personal experience and external evidence in a systematic and objective manner' [17]. Therefore, this study develops an evidence-based tool that can crawl/collect topics based on the five core actions of evidence-based medicine, namely question formulation, evidence search, critical appraisal, evidence application, and outcome evaluation [17,18], as references for decision makers.

Methodology
The research procedure of this study is shown in Figure 1. After data preprocessing, documentary evidence is crawled and used to support topic recognition. Structural topic modeling was used to extract latent topics from the medical text data. The results of the topic model are used to train and test the professional audit prediction model. Lastly, the factors influencing the weight deduction are determined based on the test results.

Only accurate data lead to good analysis results, so the original data must be preprocessed to ensure their accuracy and conformity with the research needs before analysis. In this study, data preprocessing consisted of four parts:
1. Data format conversion: Raw data exported from different systems needed to be converted into a uniform format for subsequent data cleaning, integration, and analysis. Common data storage formats include xml, csv, json, kml, xls, pdf, ods, txt, and zip. In this study, the health care expenditure declaration data are stored in xml format. The research team converted all data into xlsx format for the convenience of subsequent data integration and analysis.
2. Data integration: To ensure the integrity and accuracy of the aggregated data, a unique ID across all data sources should be selected to perform data integration. After integration, the data should have the following properties:
• Uniqueness: a serial number serves as the primary key;
• Integrity: there is no missing value in any column or row;
• Validity: for example, the value of a month in a spreadsheet should be between 1 and 12.
3. Building a thesaurus database: The doctors' orders served as the original data in this study. However, they contained many abbreviations and misspelled words arising from doctors' different writing habits. It was therefore necessary to compile the key words into a thesaurus by manual review, which was later submitted to the doctors to confirm its accuracy.
4. Text cleaning: Text cleaning is an important step in this study because it facilitates the subsequent LDA analysis. Step 1 converts all letters in the text to lowercase; Step 2 imports the reviewed key words to replace their synonyms; and Step 3 deletes all special characters and numbers in the text and combines specific words.
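The three text-cleaning steps can be illustrated with a minimal Python sketch. The thesaurus entries, the phrase list, and the function name `clean_text` are hypothetical placeholders and not the study's actual thesaurus; joining combined words with an underscore is likewise an assumption about how "combining specific words" might be done.

```python
import re

# Hypothetical thesaurus: abbreviation or variant -> reviewed key word.
THESAURUS = {
    "dm": "diabetes mellitus",
    "htn": "hypertension",
}

# Hypothetical multi-word terms to be combined into single tokens.
PHRASES = ["diabetes mellitus"]


def clean_text(text, thesaurus=THESAURUS, phrases=PHRASES):
    """Apply the three cleaning steps to one free-text medical record."""
    # Step 1: convert all letters to lowercase.
    text = text.lower()

    # Step 2: replace whole-word synonyms with the reviewed key words.
    text = re.sub(
        r"\b[a-z]+\b",
        lambda m: thesaurus.get(m.group(0), m.group(0)),
        text,
    )

    # Step 3a: delete special characters and numbers, then collapse spaces.
    text = re.sub(r"[^a-z\s]", " ", text)
    text = " ".join(text.split())

    # Step 3b: combine specific multi-word terms into one token.
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


print(clean_text("Pt. with DM, HTN x3 yrs!"))
```

Applying the synonym replacement before stripping punctuation follows the step order given above; a real thesaurus containing abbreviations with embedded punctuation would need a more careful matching rule.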

Evidence Based Medicine
The research framework of evidence-based medicine emphasizes knowledge rather than position or privilege. This study applies this framework to explore the relationships between factors, to evaluate them effectively, and to reframe the target problem as one of recognizing and interpreting the nature of evidence from the perspective of evidence-based medicine. This reflects a sense of responsibility in the professional understanding of the research object.
A strength of evidence-based medicine is the opportunity it affords to directly compare the effectiveness of different interventions; the resulting system can present the evidence to different research objects based on background conditions, relationships, and interactions, which is also called the process of art in [19]. In the original evidence-based medicine (EBM), clinicians need to develop skills to evaluate research (critical appraisal skills) and keep up to date with research findings [20]. Five core actions, namely question formulation, evidence search, critical appraisal, evidence application, and outcome evaluation [17,18], are regarded as the five explicit steps of evidence-based practice. Based on the five steps of EBM [18,20], this study proposes five practical steps to develop an evidence-based approach to support claim examination (Figure 2). Specifically, for the task of searching evidence, such as the definition, description, type, and measurement of indicators, this study follows the concept of "the 5S evolution of information services" [21]. The 5S (studies, syntheses, synopses, summaries, systems) pyramid model is used to search for the best empirical literature evidence.
Step 1: Collect/crawl topics.
Step 2: Search for the best empirical literature evidence. The search follows the hierarchy of the 5S pyramid model [21]. The 5S in this study refer to five stages (Figure 3), including:
• Summaries: obtain empirical conclusions on the problem/topic;
• Storage sub-system: develop a data lake [22] to store the evidence.
Step 3: Analyze the literature to develop the topic base.
Step 4: Professional review.
Step 5: Build the prediction model based on the evidence.

Topic Model
A topic model is an unsupervised generative model that is widely used in word frequency analysis and text classification. The latent Dirichlet allocation (LDA) proposed by Blei et al. [23] is one of the most classic topic models [24]. This study considered the background information of each document to be important for the analysis. Incorporating the background information into the model as covariates can effectively improve the performance of the model and make its results easier to interpret. The LDA allows researchers to analyze the relationship between a topic and the background information of a text. Therefore, the LDA was employed as the probabilistic topic model in this research to better explore the latent topics in the text data.
Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
LDA assumes the following generative process for each document w in a corpus D:
1. Choose N ∼ Poisson(ξ).
2. Choose θ ∼ Dir(α).
3. For each of the N words w_n:
i. Choose a topic z_n ∼ Multinomial(θ);
ii. Choose a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
Several simplifying assumptions are made in this basic model. First, the dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic variable z) is assumed known and fixed. Second, the word probabilities are parameterized by a k × V matrix β, where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$, which is treated as a fixed quantity to be estimated. Finally, the Poisson assumption is not critical to anything that follows, and more realistic document length distributions can be used as needed. Furthermore, note that N is independent of all the other data generating variables (θ and z). It is thus an ancillary variable, and its randomness is generally ignored in the subsequent development.
A k-dimensional Dirichlet random variable θ can take values in the (k − 1)-simplex (a k-vector θ lies in the (k − 1)-simplex if θ_i ≥ 0 and $\sum_{i=1}^{k} \theta_i = 1$), and has the following probability density on this simplex:
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1},$$
where the parameter α is a k-vector with components α_i > 0, and where Γ(x) is the gamma function. The Dirichlet is a convenient distribution on the simplex: it is in the exponential family, has finite dimensional sufficient statistics, and is conjugate to the multinomial distribution. These properties facilitate the development of inference and parameter estimation algorithms for LDA [23]. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta).$$
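To make the generative process concrete, the following Python sketch samples one synthetic document from it. It is an illustration under stated assumptions rather than the study's implementation: the function names, the example α and β values, and the choice to pass the document length N as an argument (instead of drawing it from a Poisson distribution, since N is ancillary) are all hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw one index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, n_words, rng):
    """Sample (theta, z, w) for one document.

    alpha   : list of k positive Dirichlet parameters
    beta    : k x V matrix, beta[i][j] = p(word j | topic i)
    n_words : document length N (fixed here; assumed given)
    """
    theta = sample_dirichlet(alpha, rng)          # topic mixture
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # topic for this word
        w = sample_categorical(beta[z], rng)      # word given the topic
        topics.append(z)
        words.append(w)
    return theta, topics, words


# Illustrative parameters: k = 3 topics over a vocabulary of V = 4 words.
example_alpha = [0.5, 0.5, 0.5]
example_beta = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
theta, z, w = generate_document(example_alpha, example_beta, 20, random.Random(0))
```

Fitting LDA reverses this process: given only the observed words, inference estimates θ for each document and β for the corpus.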

Regression Analysis
The logistic regression (LG) model is an efficient and commonly used classification model for binary outcomes, first proposed by Cox [25]. Logistic regression is widely used, e.g., in analyzing the effect of the patient-physician relationship [26] and the effect of web-based exercise as a complementary treatment for patients [27]. Furthermore, a regression model can typically predict a dependent variable well when a good number of independent variables are available. In this research, logistic regression models are developed to predict the probability that a given health insurance claim is denied. The independent variables used in the logistic regression analysis include the topics derived from the LDA model and medical information such as treatments, medicines, and prices. A list of binary variables is defined based on the results of the LDA model. For each binary variable, the value one indicates that a case covers a certain topic. To determine whether a topic is covered in a document, the number of high-frequency words of the topic that are used in the document is counted. The common thresholds are three or four, varying across departments.
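The construction of the binary topic variables can be sketched in Python as follows. The topic word lists are invented for illustration, and counting each distinct high-frequency word once (rather than counting repeated occurrences) is an assumption, since the text does not specify which is meant.

```python
def topic_indicators(tokens, topic_words, threshold=3):
    """Return {topic: 0/1}, where 1 means the document covers the topic.

    tokens      : list of cleaned words from one medical record
    topic_words : {topic name: list of the topic's high-frequency words}
    threshold   : minimum number of distinct topic words required
                  (three or four in the text, depending on department)
    """
    token_set = set(tokens)
    return {
        topic: int(len(token_set & set(words)) >= threshold)
        for topic, words in topic_words.items()
    }


# Hypothetical topics and record, for illustration only.
example_topics = {
    "topic_1": ["insulin", "glucose", "hba1c", "metformin"],
    "topic_2": ["creatinine", "dialysis", "egfr", "proteinuria"],
}
example_record = ["glucose", "hba1c", "metformin", "followup", "egfr"]
print(topic_indicators(example_record, example_topics, threshold=3))
```

The resulting 0/1 values can then be placed alongside the other medical variables as predictors in the regression model.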
Like LDA, the logistic regression model is built for each department. Because of the large number of predictive variables, a variable selection algorithm is used to enhance the regression model. Zou and Hastie proposed a regularization and variable selection method for general linear regression called the elastic net [28]. This algorithm can be used to address several common issues such as overfitting and collinearity.
To realize the regression model, this research builds a generalized linear model with the elastic-net penalty by using the R package glmnet [29]. In this study, glmnet solves the following problem:
$$\min_{\beta_0, \beta} \; \frac{1}{N} \sum_{i=1}^{N} l\left(y_i, \beta_0 + \beta^{T} x_i\right) + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right], \qquad (1)$$
where y_i is the outcome of case i, x_i is the vector of predictor variables for case i, such as topic, disease, and drug, and l is the negative log-likelihood function.
The hyper-parameter λ in Equation (1) controls the overall strength of the penalty. This research performed cross validation to find the best λ, which gives the minimum mean misclassification error. Figure 4, with the department of family medicine as an example, shows how the binomial deviance varies as λ changes.
The elastic-net penalty controlled by α in Equation (1) bridges the gap between lasso (α = 1) and ridge (α = 0). Because the ridge penalty shrinks the coefficients of correlated predictors towards each other while the lasso tends to pick one of them and discard the others, this research tried many different values to find the α that best limits overfitting.
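The roles of λ and α in the penalized objective described above can be seen in a small pure-Python sketch. It only evaluates the objective for given coefficients and does not perform the optimization or call glmnet; the function names and the toy data are assumptions made for illustration, and unit observation weights are assumed.

```python
import math


def neg_log_likelihood(y, eta):
    """Logistic negative log-likelihood for y in {0, 1} and linear predictor eta.

    Equals log(1 + exp(eta)) - y * eta, computed in a numerically stable way.
    """
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta))) - y * eta


def elastic_net_penalty(beta, lam, alpha):
    """lam * [ (1 - alpha) / 2 * ||beta||_2^2 + alpha * ||beta||_1 ]."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) * l2_squared / 2.0 + alpha * l1)


def objective(beta0, beta, X, y, lam, alpha):
    """Mean negative log-likelihood plus the elastic-net penalty."""
    n = len(y)
    loss = sum(
        neg_log_likelihood(yi, beta0 + sum(b * x for b, x in zip(beta, xi)))
        for xi, yi in zip(X, y)
    ) / n
    return loss + elastic_net_penalty(beta, lam, alpha)


# alpha = 1 gives the pure lasso penalty, alpha = 0 the pure ridge penalty.
coefs = [1.0, -2.0]
print(elastic_net_penalty(coefs, lam=0.5, alpha=1.0))  # 0.5 * (|1| + |-2|)
print(elastic_net_penalty(coefs, lam=0.5, alpha=0.0))  # 0.5 * (1 + 4) / 2
```

In practice, a grid of α values would each be paired with a cross-validated search over λ, and the pair with the best validation performance retained.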
The predictive model was built on the training dataset and validated on the testing dataset. The performance of the model on the testing dataset is measured by precision, recall, and area under the curve (AUC). "Precision" represents the proportion of actually denied claims among the predicted denials, calculated as precision = TP/(TP + FP), where TP is the true positive count and FP is the false positive count. "Recall" represents the proportion of predicted denials among the actually denied claims, calculated as recall = TP/(TP + FN), where FN is the false negative count. A receiver operating characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied; the sensitivity and specificity are used as the coordinates to construct the curve. The AUC is a statistical index that indicates the classification power of the predictive model: the larger the AUC, the better the classification. Besides the ROC curve and its AUC, the F_β score is the weighted harmonic mean of precision (P) and recall (R), calculated as
$$F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R},$$
where β is the weight of recall, indicating that recall is β times as important as precision [30]. The F_β score is used not only to measure the model performance at a given classification threshold but also to determine the optimal threshold for decision making. Once the weight β is determined, the optimal threshold can be set to the level that yields the largest F_β score.
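The threshold selection described above can be sketched in Python as follows. The labels, scores, and candidate thresholds are made-up illustrative values, and the function names are hypothetical; a label of 1 denotes a denied claim.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN when predicting 'denied' for scores >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted_denied = s >= threshold
        if predicted_denied and y == 1:
            tp += 1
        elif predicted_denied and y == 0:
            fp += 1
        elif not predicted_denied and y == 1:
            fn += 1
    return tp, fp, fn


def f_beta(tp, fp, fn, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R); 0 if there are no TPs."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


def best_threshold(y_true, scores, beta, candidates):
    """Return the candidate threshold with the largest F_beta score."""
    return max(
        candidates,
        key=lambda t: f_beta(*confusion_counts(y_true, scores, t), beta=beta),
    )


# Illustrative data: 1 = denied claim, scores = predicted denial probability.
labels = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
print(best_threshold(labels, probs, beta=2.0, candidates=[0.1, 0.5, 0.85]))
print(best_threshold(labels, probs, beta=0.5, candidates=[0.1, 0.5, 0.85]))
```

With these toy values, a recall-heavy weight (β = 2) favors the lowest threshold, while a precision-heavy weight (β = 0.5) favors the highest, which shows how the choice of β steers the operating point.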

Data
In this study, data from 2016-2019 provided by Pu-Chi Hospital are analyzed. The focus is on the three morbid entities with the highest health insurance claim deduction rates: N18 (chronic kidney disease), K21 (gastro-esophageal reflux disease), and E11 (type 2 diabetes mellitus).
First, the medical records are preprocessed to eliminate duplicated data, remove symbols and numbers, and convert uppercase letters to lowercase for the topic model; Table 1 shows an example for morbid entity E11. Second, we crawled a large number of documents related to each morbid entity to capture keywords. Based on the frequency of these keywords, the weight of each document is computed by the LDA model presented next. The documents with the highest weights are used in this study.

Topic Model
The topic model, LDA, is applied to the documents crawled in Section 4.2 to produce exclusive topics. After repeated trial and error, the number of topics was set to six based on the relationship between the number of topics and perplexity, as shown for E11 in Figure 5.
The topic model is developed with LDA, and the ggplot2 package of R is used for visualization. The 20 most frequent words of each of the six topics are shown, for example for E11, in Figures 6-8. The topics and words produced by LDA are output to csv files, which can be used in the prediction model described next, for example for E11 in Figure 9. The distribution of topics in the deducted cases, as well as in the approved cases, is provided by the LDA model, for example for E11 in Figure 10. This helps to understand the relationship among topics in deducted cases and approved cases.

Prediction Model
The medical records are split into a training data set (70%) and a testing data set (30%). For example, for E11 (type 2 diabetes mellitus), 3154 records are used as training data and 1351 records as testing data. This study aims to predict which medical records will be judged as deducted because the medical service is regarded as "wasted and unnecessary" by the audit committee. However, deducted medical records account for only 5.22% of all records in Pu-Chi Hospital. Such a small number of deducted records relative to the large number of approved records leads to low accuracy in training, since deducted records occur with low probability. The synthetic minority oversampling technique (SMOTE) is therefore used to increase the percentage of deducted cases [31]. For E11 (type 2 diabetes mellitus), the AUC improves from 0.677 to 0.782 after applying the SMOTE procedure.
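The split sizes and the idea behind SMOTE can be sketched as follows. This is a simplified, standard-library illustration of the interpolation step of SMOTE, not the implementation used in the study; the function name `smote`, its parameters, the toy feature vectors, and the rounding rule for the 70% split are assumptions.

```python
import math
import random

# The reported E11 counts are consistent with a 70/30 split of 4505 records
# (assumption: the training share is rounded up).
total = 3154 + 1351
n_train = math.ceil(total * 7 / 10)
n_test = total - n_train

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical two-feature vectors for deducted (minority) cases.
deducted = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_cases = smote(deducted, n_new=6, k=2)
print(n_train, n_test, len(new_cases))
```

Each synthetic case lies on a line segment between two real deducted cases, so the minority class grows without simply duplicating records.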
Table 2 shows the five factors that may lead to weight deduction, obtained from the results of regression analysis: (1) Object (data of doctors' judgement and examinations), (2) D19 (primary diagnostic code), (3) D20~d23 (secondary diagnostic code), (4) p4 (codes of drugs and examinations), and (5) price (total weight). Text-based information such as "object" is transformed into binary variables by the research team and entered into the regression model for prediction. Moreover, D19 (primary diagnostic code) and D20~d23 (secondary diagnostic code) were used to observe whether a disease with a high possibility of being deducted (far higher than the deduction rate of outpatient cases at Puli Christian Hospital) is relevant to weight deduction. The price (total weight) was used to observe whether the denial of a medical claim is related to the amount involved in the claim. Lastly, p4 (codes of drugs and examinations) was used to see whether a drug has a high denial rate of medical claims.
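One way to picture the binary variables is as 0/1 indicators, one per code or topic. The sketch below is only an illustration: the vocabulary entries, the `field:value` naming scheme, and the function `to_binary_features` are hypothetical and are not taken from Table 2.

```python
def to_binary_features(record_codes, vocabulary):
    """Return one 0/1 indicator per vocabulary entry, in vocabulary order."""
    present = set(record_codes)
    return [1 if code in present else 0 for code in vocabulary]

# Hypothetical vocabulary: field name and value joined by ":".
vocabulary = ["D19:E11", "D20:I10", "p4:drug_A", "object:topic_2"]

# A hypothetical record with primary diagnosis E11 whose text matches topic 2.
record = ["D19:E11", "object:topic_2"]
print(to_binary_features(record, vocabulary))  # [1, 0, 0, 1]
```

Keeping the vocabulary order fixed means every record maps to a vector of the same length, which is what a regression model expects.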

After SMOTE was applied and the prediction model was trained, the model was tested on the 30% testing data set. The regression coefficients were obtained, where a positive or negative sign refers to deducted or not deducted, respectively; see, for example, E11 in Figure 11.
The factors that led to medical claim denial were selected for logistic regression modeling. The model training results show that the area under the curve (AUC) of the model is large enough for both the training set and the testing set. This indicates that the discrimination power shown by the ROC curve is sufficient for the model to be adopted.
In Figure 12, the E11 recall increased from 0.167 to 0.417, which shows that the proposed prediction model identifies an additional 25 percentage points of the deducted cases.
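For readers less familiar with the two metrics, the following sketch computes recall and a pairwise form of AUC in plain Python. The labels, predictions, and scores are hypothetical; in particular, the count of 12 deducted cases is chosen only because 2/12 and 5/12 round to the reported recalls of 0.167 and 0.417, and is not a figure from the study.

```python
def recall(y_true, y_pred):
    """Share of truly deducted cases (label 1) that the model flags as deducted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def auc(y_true, scores):
    """Probability that a random deducted case scores higher than a random
    approved case; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test set: 12 deducted cases followed by 8 approved cases.
y_true = [1] * 12 + [0] * 8
before = [1] * 2 + [0] * 10 + [0] * 8  # 2 of 12 deducted cases found
after = [1] * 5 + [0] * 7 + [0] * 8    # 5 of 12 deducted cases found

print(round(recall(y_true, before), 3), round(recall(y_true, after), 3))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The difference between the two recall values is 0.25, which corresponds to 25 percentage points more deducted cases being caught.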

Model Performance
After building the logistic regression model, the research team evaluated model performance for all morbid entities using the test dataset. Table 3 reports the model performance.
The benefits of the proposed solution approach are compared to the traditional manual approach in Table 4. records × recall × NT $ 0.9.

Discussion
The model established in this study can be used to predict the probability that a medical claim with item weights will be deducted during professional audit. In the preliminary analysis by the system, if a claim is identified as having a low probability of denial or weight deduction, it can be sent to the committee of TNHIB without further review. Otherwise, the claim must be reviewed and modified based on the system's suggestions before being sent to the committee, to prevent claim denial. This study finds that not only can the proposed system provide modification suggestions pertinent to all weight deductions, but a rule base for record-writing guidance can also give critical suggestions that increase the chance of passing the professional audit.
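The routing logic described above might look like the following sketch. The coefficient values, the intercept, the variable names, and the 0.5 threshold are all hypothetical; the study does not state the cut-off used to separate low-risk from high-risk claims.

```python
import math

def deduction_probability(features, coefficients, intercept):
    """Logistic regression: p = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def route_claim(probability, threshold=0.5):
    """Send low-risk claims directly; hold high-risk claims for internal review."""
    if probability >= threshold:
        return "review before submission"
    return "submit directly"

# Hypothetical fitted model with three binary variables.
coefficients = {"topic_3": 1.2, "price_high": 0.8, "topic_5": -0.6}
intercept = -1.5

claim_a = {"topic_3": 1, "price_high": 1, "topic_5": 0}
claim_b = {"topic_3": 0, "price_high": 0, "topic_5": 1}

for claim in (claim_a, claim_b):
    p = deduction_probability(claim, coefficients, intercept)
    print(round(p, 3), route_claim(p))
```

A lower threshold would send more claims to internal review, trading staff time for fewer missed deductions; the right balance depends on the hospital's review capacity.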
In this study, the topics explored by the LDA and the medical treatments are used as the variables of the logistic regression model. The coefficients of the variables are used to analyze the influence of topics, diagnostic results, and drugs on the denied medical claims. A variable with a positive coefficient indicates that its presence increases the probability of weight deduction, while a variable with a negative coefficient indicates the reverse effect. For example, all variables in E11 except Topic 2 have significant positive or negative effects on whether a claim is deducted. The topic model therefore plays an important role in the prediction modeling.
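The sign interpretation can be made concrete through odds ratios: in logistic regression, the presence of a binary variable multiplies the odds of deduction by the exponential of its coefficient. The coefficient values and variable names below are hypothetical and are not the values shown in Figure 11.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds of deduction when the variable is present."""
    return math.exp(coefficient)

def describe(name, coefficient):
    if coefficient > 0:
        direction = "raises"
    elif coefficient < 0:
        direction = "lowers"
    else:
        direction = "does not change"
    return f"{name}: odds ratio {odds_ratio(coefficient):.2f}, {direction} the odds of deduction"

# Hypothetical coefficients for two topic variables.
print(describe("topic_3", 1.2))
print(describe("topic_5", -0.6))
```

An odds ratio above 1 corresponds to a positive coefficient and a higher deduction risk; an odds ratio below 1 corresponds to a negative coefficient and a lower risk.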
In addition, the results for the diagnosis of disease I669 show that medical records with M5116 are at greater risk of weight deduction. In this respect, it is recommended that doctors provide more supporting information when writing medical records that contain such topics, drugs, and diagnoses, especially when little information is related to the main diagnosis code or when costly drugs and examinations are required.
Moreover, the disease classification code (Codes for International Classification of Diseases II) was also found to be significant to the probability. It can be speculated that the disease classification code influences the audit committee's approval of a claim. Apparently, only when a known disease classification exists can the symptoms of a disease be described or recorded properly based on the doctor's diagnosis.

Conclusions
The professional audit and weight deduction by TNHIB are performed to achieve sustainable development of health insurance. However, they have become a significant burden on the management of medical institutions. Effectively recording the medical process and evaluating in detail the impact of examination data on weight deduction help lessen this burden. This study identifies the key factors that can lead to weight deduction by investigating the denied medical claims using patients' reports, doctors' objective statements, examination items, prescribed medicines, and patient symptoms. After preprocessing of the unformatted texts, an LDA model is developed for each department to classify the text-based medical records into different topics. These topics are combined with other variables, such as weight, medicine bottles, and diseases, as binary variables to establish a logistic regression model. In the regression model, the variables with significant impact on weight deduction are determined.
To optimize the deduction rates, further studies are required:
1. Extend beyond the three morbid entities to formalize the medical records and prevent doctors from making mistakes in documentation.
2. Extend beyond the Pu-Chi Hospital to all local hospitals to develop a platform that reduces the waste of medical resources.
3. The data from Pu-Chi Hospital are not of sufficient quantity, which affects accuracy and machine learning. Extending to the medical records of all hospitals is a possible way to apply deep learning to improve the prediction modeling.

Figure 3. The 5S Pyramid Model. • Studies: search the related papers; • Syntheses: synthesize those studies in a systematic review; • Synopses: abstract reviews of individual research and retrospective documents; • Summaries: obtain empirical conclusions on the problem/topic; • Storage sub-system: develop a data lake [22] to store the evidence.


Figure 4. The hyper-parameter λ of the Department of Family Medicine.

Figure 6. Word frequency distribution of 20 words in six topics (E11 Topics 1 and 4).

As reported in Table 3, the recalls were 0.955, 0.955, and 0.849, and the AUCs were 0.782, 0.776, and 0.852 for E11, N18, and K21, respectively. Recall indicates the percentage of deducted cases that were identified by the model, while AUC indicates the power of the model to distinguish a deducted case from an approved case. The results show that the model performs very well in predicting the binary outcomes.

Table 1. Results of the data preprocessing.

Table 2. Definition of variables.


Table 4. Comparison of the proposed solution approach with the traditional manual approach.