Peer-Review Record

Research on Integrated Learning Fraud Detection Method Based on Combination Classifier Fusion (THBagging): A Case Study on the Foundational Medical Insurance Dataset

Electronics 2020, 9(6), 894; https://doi.org/10.3390/electronics9060894

by Jibing Gong^1,2,*, Hekai Zhang^1,* and Weixia Du^1

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 28 April 2020 / Revised: 17 May 2020 / Accepted: 22 May 2020 / Published: 27 May 2020

(This article belongs to the Special Issue Recent Trends and Applications in Cybersecurity)

Round 1

Reviewer 1 Report

In their article the authors address a topic that has been relevant for quite some time. It is about fraud attempts in the insurance business and how they can be identified. With the huge amounts of data, it is impossible to carry out sufficiently precise checks with manual controls. Digital analysis methods, on the other hand, can systematically examine large data sets and narrow them down to suspicious cases, which can then be further processed with reasonable effort.

Several such approaches have been presented in the literature. The authors present a further procedure. A comparison of the proposed approach to common algorithms is carried out with the help of a database of a medical insurance company. With the database it is known which cases were regular and which were attempted frauds. It can therefore be used as a reference for testing.

One merit of the article is that it introduces an additional aspect into the procedure to increase the effectiveness in detecting possible fraud attempts. The argumentation is sometimes difficult to understand, as a lot of insider knowledge is required. Many acronyms are used whose meaning is not immediately apparent. A list of abbreviations would be helpful.

Details:

Title: The title describes the content of the article in a slightly vague way. It would be better if the title were to make it clearer that an extended procedure for detecting attempted fraud is proposed here.

L13 Keywords are separated by semicolons and written in lower case

L20 Social health -> social health

L26++ “..These methods have achieved certain results, but because the technical means are relatively backward, they are in many aspects There are limitations, including the inability to dig out the common characteristics of offenders, the need for manual participation in the identification process leading to low efficiency and low identification accuracy and recall [11][12].” Sentence structure is unusual

L44 Missing space before “This paper”

L54 second-leve feature -> second-level feature

L58 “Gradient Boosting Decision Tree (GBDT) + LR..” -> What is the meaning of LR? A reference would be good.

L61 “In order to facilitate further research on this task, we plan to make the source code and data of the model publicly available as contributions to the community” -> For a published article, the text should state where the source code and data are available rather than “we plan to make”.

L79 The reference is not correct. In L79 “He et al. [22]” and in L488 “ 22. Junhua, H…”

L80 CBM: acronym not explained

L81 “Eclat and Apriori” not explained

L84 Whether -> whether

L89 Training -> training

L91 BP: acronym not explained

L106 Invented -> invented

L108 “..to detect Outlier detection..” -> sentence structure is unusual

L115 Details -> details

L115-116 “..such as medical insurance fund fraud violations and violations.” -> why the second “violation”?

L125 “Therefore, in the process of data processing, this article will eliminate unnecessary data.” What is the meaning of this sentence? Why “article”? Do you mean: “Therefore, this step eliminates unnecessary data.”?

L150 “The proportion of data samples.” What is the meaning of this sentence?

L150 “The smote sampling algorithm requires..” here a reference is needed.

L154 Kmeans' notation should be consistent throughout the paper. L149/172 Kmean, L154/155/161/162 kmeans, Table 2 k-means, L86 K-means, a reference should be mentioned at the first mention

L162 “..kmeans function in Python Change trend..” A reference should be mentioned.

L185 “..by the tree model classification algorithm..” A reference should be mentioned.

L192 “Train and get the parameters of each part of the fusion classification model.” sentence structure is unusual.

L199 GBDT, XGBoost, LightGBM explanations/references are needed

L200 “..random forest algorithm..” reference is needed

L209/210/212/228 Are there references for these statements/methods?

L252 Does W mean ω?

L254 What does I(x ∊ R_mj) mean?

L261 Missing dot after “feature” and before “The importance..”

L261/262 ”The importance of and the average of all the trees is the relative importance of the featureref.” Sentence structure is unusual.

L270 Reference is needed “Open IE baseline methods”.

L280 ture->true

L286 Reference is needed “Macro-F1”

L290 Reference is needed “ROC”, “AOC-ROC”

L293 “The best model possible model is..” -> “The best possible model is..”

L305 greedyly -> greedily

L309 t3he model -> the model

L329 “Python 3.7.0.Python” missing space

L356 “..extraction feature , Indicating that..” -> “..extraction feature, indicating that..”

L392 SVM, KNN, DT, and LR references are needed

L397 THBaggingmod -> THBagging_mod

L425 “…insurance fraud is analyzed and analyzed.” ->“…insurance fraud is analyzed and discussed.”

Author Response

Dear Reviewer,

Thank you very much for your conscientious review!

We have revised the article as you requested.

Please see the attachment.

Best wishes

Author Response File: Author Response.pdf

Reviewer 2 Report

The submitted article focuses on a fraud detection method, which has a potentially high contribution for research and practice. However, the article in its current version suffers from a few weak points that should be reworked.

Here are some examples of incorrect or incomprehensible formulations.

“The behavior of medical insurance fraud criminals is changeable, […]” (line 20-21)
“[…] the technical means are relatively backward, they are in many aspects” (line 26-27)
“[…] we propose a second-leve feature […]” (line 54)

The introduction should be revised, and the problem formulation should be clarified and strengthened. Additionally, it would be interesting to read something about related or existing approaches within this context. The proposed contributions are formulated very redundantly (“Aiming at […]” three times), and they are not picked up again later in the article. Additionally, the contributions should be clarified (e.g. which accuracy is improved).

Within the section on related work, the approaches are described somewhat superficially and a comprehensible storyline is missing (this also holds true for the subsequent sections). It is hard to understand why the mentioned approaches are related. In addition, it is not clear which approaches, algorithms or systems are currently state of the art. The text could be restructured and revised in order to showcase the relevant approaches and the corresponding use cases.

The description of the dataset should also be revised. The data stems from “[…] some areas of the country […]” (line 114): more specific information is needed. It would also be interesting to know how the data was collected. The ratio of positive to negative samples (19:1) is superfluous because it follows directly from the aforementioned numbers of examples (19,000 normal people and 1,000 fraudsters).

Data preprocessing is described extensively, yet the presentation is not ideal and some choices seem very subjective. This lowers both reproducibility and comprehensibility. More clarity is needed within this section. The tabular overview (table 1) is interesting. However, the layout of the table should be revised, and I think it is only valuable if the dataset is published. The chart (figure 1) regarding the numbers of clustering categories should also be edited. The numbers should be shown without the ending “e+12”. The proposed approach is described quite well, yet the whole section should be more specific. Additionally, it would be interesting to see how the proposed approach to the problem of sample imbalance performs in comparison with other approaches, for instance, SMOTE (Synthetic Minority Over-sampling Technique).
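For readers unfamiliar with the SMOTE baseline suggested here, the following is a minimal sketch of its core idea: each synthetic minority sample is an interpolation between a minority sample and one of its nearest minority neighbours. The function name `smote_like`, its parameters, and the toy data are hypothetical and chosen only for illustration; this is not the authors' method, and a real comparison would more likely rely on an established library implementation.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric tuples (needs at least 2 samples).
    k:        number of nearest minority neighbours to pick from (assumed value).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment x -> neighbour.
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Toy minority class: the four corners of a unit square.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
synthetic = smote_like(minority, n_new=5)
```

Because every synthetic point lies on a segment between two existing minority samples, the oversampled class stays inside the region already covered by the minority data, which is the property a comparison with the proposed approach would test.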

Within the section on the experiments I do not see the added value of presenting the formulas for precision, recall and F-measure, since these are the standard and widely known formulas. The whole metrics section in its current state could be shortened, as it mostly consists of information that does not deliver added value.

The compared models in section 5.2 could also be better presented in a tabular overview rather than bullet points. Figure 4 represents the feature importance ranking through feature names such as “111” or “74”, which is not informative. The description of the y-axis could be renamed: what are the values, percentages? In addition, referring to feature importance, the authors could think about including some literature on explainable artificial intelligence in their discussion, since the proposed methods for fraud detection are complex and need transparency (also in future research). I advise citing the following publications:
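As a compact reminder of the standard definitions referred to here, the sketch below computes per-class precision, recall and F1 and averages F1 over classes to obtain macro-F1. The function names and the toy labels are illustrative assumptions; the 19:1 class ratio merely mirrors the ratio reported for the dataset.

```python
def per_class_prf(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_prf(y_true, y_pred, l)[2] for l in labels) / len(labels)


# Toy example with a 19:1 class ratio: a model that never predicts fraud (label 1).
y_true = [0] * 19 + [1]
y_pred = [0] * 20
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case the accuracy is 0.95 while macro-F1 is below 0.5, because the minority class contributes an F1 of zero. This is why macro-F1 is a common choice for imbalanced data, and also why a short reference to the definitions may suffice in place of the full formulas.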

Meske, C. and Bunde, E. (2020). Transparency and Trust in Human-AI-Interaction: The Role of Model-Agnostic Explanations in Computer Vision-Based Decision Support, Proceedings of the Conference on Artificial Intelligence in Human-Computer Interaction.
Adadi A, Berrada M (2018) Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access, 6:52138-52160
Kühl, N., Lobana, J. and Meske, C. (2019). Do you comply with AI? — Personalized explanations of learning algorithms and their impact on employees' compliance behavior. Proceedings of the 40th International Conference on Information Systems (ICIS).

The implementation details could be presented in a more concise way. Another problem is table 3 which is presented first and only afterwards mentioned in the text.

Additionally, the contribution regarding the improvement of the accuracy should be clarified, as it is mentioned in section 5.4.2 that the proposed model does not achieve the highest recall or precision rate. I am also not really convinced that the improvement of one particular metric (macro F1) on one particular dataset is a substantial contribution. However, this could also be a problem that results from an unfavorable presentation and formulation. I would also recommend redesigning the different charts, following best practices, and adjusting the tables.

In summary, the presented approach and the use case are interesting. I encourage the authors to revise their manuscript. Through more precise language and description as well as an improved presentation of the content, the manuscript can be improved.

Author Response

Dear Reviewer,

Thank you very much for your conscientious review!

We have revised the article as you requested.

Please see the attachment.

Best wishes

Author Response File: Author Response.pdf

Reviewer 3 Report

Gong et al. have presented a detailed study on a fraud detection method based on integrated learning deployed on the cloud edge. In their work they propose a new fusion algorithm (THBagging) that addresses the problems of imbalanced datasets, overfitting and low recognition rate. The authors argue that the proposed method delivers better results than traditional methods.

The manuscript is well prepared and presented with detailed analysis. The language is simple, yet the depth of study is present. There are, though, a couple of questions and improvements that could make this manuscript better, such as how the different hyperparameters (for example, the number of nodes or classifiers, or the depth) are selected for the different layers, and why two layers are selected. Is there any detailed analysis behind the choice of those numbers? Also, are there analogous methods in deep learning to compare with? With these improvements I would suggest accepting the manuscript for publication.

Author Response

Dear Reviewer,

Thank you very much for your conscientious review!

We have revised the article as you requested.

Please see the attachment.

Best wishes

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Through the last revision, the paper has been improved by the authors.
