A Novel Approach for Fraud Detection in Blockchain-Based Healthcare Networks Using Machine Learning
Abstract
1. Introduction
- I. Supervised Learning: assigns labels to data by sorting it into key groups/categories;
- II. Unsupervised Learning: works on datasets that do not have labels, for example by clustering them. It also uses previously gained knowledge to recognize patterns in the data;
- III. Reinforcement Learning: the key characteristic of this type is that it collects and improves knowledge by interacting with external entities. A penalty or reward is then assigned according to the action taken [9].
2. Related Works
Author and Ref. | Year | Main Idea | Pros | Cons | ML Algorithm | Dataset | Result |
---|---|---|---|---|---|---|---|
Sarwar et al. [8] | 2016 | Proposes a set of countermeasures to mitigate SC attacks | Analyzes 10 security tools to detect vulnerabilities in SCs | More research and evaluation are needed for the proposed countermeasures | No | No | Not all vulnerabilities were detected |
Haozhe et al. [12] | 2022 | Analysis of 13 vulnerabilities in Ethereum SCs | Investigates security tools | Detects only pre-defined vulnerabilities | No | No | SmartCheck is the best |
Han Liu et al. [13] | 2018 | Proposes S-gram, a semantic-aware security auditing technique | Investigates different types of potential vulnerabilities | Used only with Solidity SCs | No | SCs from the Etherscan repository | Accuracy 90% |
Christof et al. [14] | 2020 | ÆGIS tool that protects and detects new vulnerabilities in SCs | Shields vulnerable SCs against attacks | Limited evaluation and lack of real-world testing | No | No | Accuracy 93% |
Mohamed et al. [11] | 2019 | The DOORchain framework to improve the security of blockchain systems | Detects malicious SCs | Does not discuss the potential challenges of implementation | CNN | Etherscan platform, (SWC) registry, (SCSA) | Accuracy over 90% |
Yasser et al. [15] | 2022 | Secure authentication approach | Improves security and reduces latency | Less transmission rate | KNN | Transmitting request medical data | Accuracy 96% |
Syed et al. [16] | 2021 | A system based on AI algorithms to protect SCs | Predicts cyber-attacks and protects SCs from them | Needs to be implemented in real and complex environments | Naive Bayes | Iris flowers, Pima Diabetes, and heart disease | Accuracy 94%, 87%, 62% |
Deebak et al. [17] | 2021 | Privacy-preserving in SCs | Risk assessment | Less accuracy | Decision-tree, Naive Bayes, KNN | No | Accuracy 79% |
Yingjie et al. [18] | 2021 | Analyzes the vulnerabilities of SCs based on ML | Detects many kinds of SC vulnerabilities | Cannot locate the line of code in the SC where the vulnerability occurs | KNN | SmartBugs, SolidiFi-benchmark, SmartBugs-wild | Accuracy 91% |
Kruthika et al. [19] | 2021 | A tamperproof and transparent healthcare system | Eliminates fraudulent insurance claims | Less accuracy | KNN, Random Forest | No | Accuracy 80% |
Wesley et al. [20] | 2019 | Model to identify weaknesses of SCs using ML | Detects new attack trends relatively quickly | Cannot provide additional insights for analysis | Long short-term memory (LSTM) | Ethereum blockchain dataset | Accuracy 92% |
Rajesh et al. [21] | 2020 | Investigates various tools and AI techniques for SC privacy protection | Computational intelligence to create robust cipher hashes | Not implemented to evaluate the performance parameters | No | No | No |
Soumya Ray et al. [22] | 2022 | A novel algorithm to detect DDoS attacks in the healthcare system | Works to prevent the access of attackers to the system | Detects one type of DDoS attack | No | No | Prevents the DDoS attack |
3. Proposed System
3.1. Architectural Layers of BC-Based Healthcare System
- A. Sensor layer: the sensors and devices that collect health data from patients. This layer is critical for enabling real-time monitoring of patient health and providing personalized care;
- B. Application layer: the applications and interfaces that allow users to interact with the system, including applications for patients, healthcare providers, and other stakeholders;
- C. Blockchain layer: the distributed ledger technology that underpins the system. This layer provides a secure and transparent way to store and share healthcare data, while ensuring privacy and confidentiality;
- D. Access layer: the layer that provides access to the various components and services of the system, such as sensors, devices, networks, databases, and applications. It plays a critical role in enabling users to interact with the system, retrieve and analyze data, and control various aspects of the system.
3.2. First Stage: Classifying Data outside the BC
- Data collection: collect data (Electronic Health Records (EHRs)) from various sources such as wearables and medical sensors;
- Data processing: pre-process and transform the collected data into a format suitable for analysis;
- AI model training: train an AI model using the pre-processed data to classify the collected data into normal or abnormal based on pre-defined parameters;
- AI model evaluation: evaluate the performance of the AI model using a test dataset to ensure its accuracy;
- Data classification: use the trained AI model to classify the collected data into normal or abnormal;
- Data transmission: transmit the classified data to the healthcare application, typically through an application programming interface (API) or other data integration tools.
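The six steps above can be sketched end to end in a few lines. This is a minimal illustration only: the vital-sign readings are invented, the nearest-centroid classifier is a simple stand-in for the models evaluated in Section 4, and the JSON payload format is a hypothetical placeholder for whatever API the healthcare application exposes.

```python
import json
from statistics import mean, pstdev

# Step 1 (collection): hypothetical readings, invented for illustration.
# Features: (heart rate, respiratory rate, body temperature in C, SpO2 in %).
# Label 0 = normal, 1 = abnormal.
TRAIN = [
    ((72, 16, 36.8, 98), 0), ((65, 14, 36.6, 99), 0), ((80, 18, 37.0, 97), 0),
    ((130, 28, 39.5, 88), 1), ((120, 26, 38.9, 90), 1), ((140, 30, 40.1, 86), 1),
]
TEST = [((75, 15, 36.7, 98), 0), ((125, 27, 39.2, 89), 1)]


def fit_scaler(rows):
    """Step 2 (processing): per-feature mean and standard deviation."""
    return [(mean(col), pstdev(col) or 1.0) for col in zip(*rows)]


def scale(row, scaler):
    """Standardize one reading with the statistics learned from training data."""
    return [(x - m) / s for x, (m, s) in zip(row, scaler)]


def fit_centroids(rows, labels):
    """Step 3 (training): one mean vector per class (nearest-centroid model)."""
    centroids = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(row, centroids):
    """Step 5 (classification): label of the closest class centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(row, centroids[label]))


train_x = [r for r, _ in TRAIN]
train_y = [y for _, y in TRAIN]
scaler = fit_scaler(train_x)
centroids = fit_centroids([scale(r, scaler) for r in train_x], train_y)

# Step 4 (evaluation): accuracy on the held-out readings.
test_pred = [predict(scale(r, scaler), centroids) for r, _ in TEST]
accuracy = sum(p == y for p, (_, y) in zip(test_pred, TEST)) / len(TEST)

# Step 6 (transmission): serialize the classified reading for the application.
new_reading = (128, 27, 39.0, 89)
label = "abnormal" if predict(scale(new_reading, scaler), centroids) == 1 else "normal"
payload = json.dumps({"reading": new_reading, "label": label})
```

The scaler is fitted on the training readings only and reused for test and new data, so the evaluation step does not leak information from the test set.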
3.3. Second Stage: Classifying Transactions inside the BC
- 7. Receive data: The blockchain node acts as a mediator between the application layer and the blockchain network. When the application layer sends data to the blockchain node, the node receives the data and verifies its validity;
- 8. Analyze transactions: To validate a transaction in a blockchain network, a digital signature is created using the sender’s private key, and each receiving node verifies it using the sender’s public key. The NODEID of the broadcasting node is compared against the nodes in the network to ensure authenticity, and timestamps may be used as well. Transactions that fail any of these checks are rejected and not added to the blockchain;
- 9. Transaction classification: it consists of two parts, as follows:
  - Transaction classification using a marked attacks database: a catalog of known attack patterns is used to identify and classify transactions, preventing fraudulent transactions in the blockchain. Characteristics such as transaction type and timestamp are used to check transactions against the database and flag them as high-risk (blocked) or low-risk;
  - Transaction classification using machine learning: the proposed system can automate the transaction classification process, allowing for fraud detection, transaction monitoring, and optimization of the blockchain network. It collects and processes transaction data, extracts relevant features, trains and evaluates the model, and deploys it in real time.
- 10. Transactions are stored in the blockchain depending on their classification as normal or abnormal. Normal transactions are added to the blockchain by nodes after validation;
- 11. Abnormal transactions flagged as high-risk are blocked and sent to the database of marked attacks. The database is continually updated to protect against new threats, maintaining the integrity of the blockchain system.
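The flow in steps 7 to 11 can be sketched as a single function. Everything named here is a hypothetical placeholder: the `Transaction` fields, the `(sender_id, tx_type)` pattern key, the `max_skew` timestamp tolerance, and the toy model are assumptions made for illustration, and the `signature_valid` flag stands in for real public-key signature verification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    sender_id: str         # NODEID of the broadcasting node
    tx_type: str
    timestamp: int         # seconds
    signature_valid: bool  # stand-in for verifying with the sender's public key


KNOWN_NODES = {"node-a", "node-b"}  # hypothetical registered node IDs


def is_valid(tx, now, max_skew=300):
    """Step 8: signature, node identity, and timestamp checks."""
    return (tx.signature_valid
            and tx.sender_id in KNOWN_NODES
            and abs(now - tx.timestamp) <= max_skew)


def process(tx, now, marked_attacks, ledger, model):
    """Steps 7-11: validate, classify, then store or block one transaction."""
    if not is_valid(tx, now):
        return "rejected"
    pattern = (tx.sender_id, tx.tx_type)  # assumed pattern key
    # Step 9: check the marked attacks database first, then the ML model.
    if pattern in marked_attacks or model(tx) == "abnormal":
        marked_attacks.add(pattern)       # step 11: update the database
        return "blocked"
    ledger.append(tx)                     # step 10: store normal transactions
    return "stored"


def toy_model(tx):
    """Placeholder for a trained classifier; flags one invented type."""
    return "abnormal" if tx.tx_type == "bulk_refund" else "normal"


ledger, marked = [], set()
r1 = process(Transaction("node-a", "record_update", 1000, True), 1000, marked, ledger, toy_model)
r2 = process(Transaction("node-x", "record_update", 1000, True), 1000, marked, ledger, toy_model)
r3 = process(Transaction("node-b", "bulk_refund", 1000, True), 1000, marked, ledger, toy_model)
```

Because blocked patterns are written back to the database, a later transaction with the same pattern is blocked by the lookup alone, without consulting the model.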
4. Performance Evaluation
- Model selection and assessment: the process of choosing the model type (regression model, classification model, etc.) and seeing which one performs better;
- Model training: ML models tackle two kinds of problems, classification and regression. Some algorithms are specific to one kind, while others can be used for both (for example, Decision Tree and KNN). Training is the most important step in designing an ML algorithm: the prepared data is passed to the ML model so that it can find patterns and make predictions. With more training, the model gets better at predicting;
- Model evaluation: after training, the final stage evaluates model performance in terms of different metrics, such as accuracy and precision [22].
- Logistic regression: a supervised ML algorithm used for classification problems. It learns a function that maps the dataset’s features to the targets in order to predict the probability that a new example belongs to one of the target classes [27];
- K Nearest Neighbors (KNN): it classifies a data point according to the labels of its k nearest neighbors in the feature space, typically measured with the Euclidean distance between data points;
- Support Vector Machine (SVM): SVM reduces the problem of overfitting by applying the concept of structural risk minimization. It searches for the optimal hyperplane separating the two classes [28];
- Naive Bayes: it is called naive because the calculation of the probabilities for each class is simplified to make it tractable. It is a classification algorithm for binary (two-class) and multiclass classification problems;
- Decision tree: it combines a series of basic tests efficiently and cohesively, where a numeric feature is compared to a threshold value in each test [29];
- Random forest: it is used for classification and regression and works by gathering the predictions of several decision trees. In the end, it outputs the mode of the predicted classes (classification) or the mean prediction (regression) [30].
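As an illustration of one of the algorithms listed above, the following is a minimal KNN sketch using Euclidean distance and majority voting. The two-dimensional points, the labels, and the choice of `k=3` are invented for the example and are not taken from the datasets used in this study.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Invented toy data: two well-separated groups of points.
train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["normal"] * 3 + ["abnormal"] * 3

near_first_group = knn_predict(train_x, train_y, (2, 2))
near_second_group = knn_predict(train_x, train_y, (7, 8))
```

An odd `k` avoids tied votes in the two-class case; in practice `k` is usually tuned on validation data.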
- True Positive (TP): the number of cases where the actual value was positive and the model predicted a positive value;
- False Positive (FP): the number of cases where the actual value was negative, but the model predicted a positive value;
- True Negative (TN): the number of cases where the actual value was negative and the model predicted a negative value;
- False Negative (FN): the number of cases where the actual value was positive, but the model predicted a negative value.
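The four counts defined above can be obtained by comparing actual and predicted labels pairwise. The sketch below assumes binary labels with `1` as the positive class; the label vectors are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary classification problem."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
```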
- 1. Accuracy: the percentage of correct classifications a trained machine learning model achieves. It is the number of correct predictions divided by the total number of predictions across all classes, as in the following equation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

- 2. Precision: we used precision to measure the model’s ability to classify positive values correctly. It is the number of true positives divided by the total number of predicted positive values, as in the following equation:

$$\text{Precision} = \frac{TP}{TP + FP}$$

- 3. Recall: it refers to how many actual positive cases the model could predict correctly. Recall is the number of true positives divided by the total number of actual positive values:

$$\text{Recall} = \frac{TP}{TP + FN}$$

- 4. F1-Score: the harmonic mean of recall and precision. It is useful when the two need to be combined into a single value, and it equals precision and recall when they are equal, as in the following equation:

$$\text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

- 5. Execution time: the amount of time it takes for an algorithm to complete its operation and produce its output, including data pre-processing, data splitting, and model evaluation [33]. In general, shorter execution times are desirable since they indicate faster performance and better efficiency of machine learning models or algorithms.
- 6. Receiver Operating Characteristic-Area Under Curve (ROC-AUC): it measures the ability of a binary classifier to distinguish between positive and negative classes across all possible threshold settings. It is widely used in machine learning to evaluate the performance of binary classification models, particularly in cases where the classes are imbalanced or the costs of false positives and false negatives differ [34]. It ranges from 0 to 1, with higher values indicating better performance. A value of 0.5 indicates that the classifier is performing at random, while a value of 1 indicates perfect classification.
- 7. Scalability: a model’s ability to handle increasing amounts of data without requiring significant increases in processing time and resources [35]. It is crucial in applications involving large datasets, such as big data analytics, image recognition, natural language processing, and speech recognition. A scalable machine learning model can efficiently handle large datasets, making it practical for real-world use.
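The first four metrics follow directly from the confusion-matrix counts, and ROC-AUC can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below is illustrative only: the counts and scores are invented, and returning `0.0` for an empty denominator is a convention chosen here.

```python
def accuracy(tp, fp, tn, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if p + r else 0.0


def roc_auc(y_true, scores):
    """Pairwise-ranking form of ROC-AUC for binary labels (1 = positive)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented counts: 8 TP, 2 FP, 85 TN, 5 FN.
acc = accuracy(8, 2, 85, 5)
prec = precision(8, 2)
rec = recall(8, 5)
f1 = f1_score(prec, rec)

# Invented scores: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

With these imbalanced counts, accuracy is high while recall is noticeably lower, which is why the metrics are reported together rather than relying on accuracy alone.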
- Scenario 1: Dataset 1
- Heart rate (HR): The number of times a person’s heart beats per minute;
- Blood pressure (BP): The force exerted by circulating blood on the walls of blood vessels, usually measured as systolic and diastolic pressures;
- Respiratory rate (RR): The number of breaths a person takes per minute;
- Body temperature (Temp): The internal temperature of the body, usually measured in degrees Celsius or Fahrenheit;
- Oxygen saturation (SpO2): The percentage of oxygen-bound hemoglobin in the blood, indicating how well oxygen is being transported to various parts of the body.
- Scenario 2: Dataset 2
5. Security Analysis
- I. Identity theft: Identity theft in a blockchain network is the fraudulent acquisition of a user’s identity to manipulate their assets or data. The attacker typically steals the user’s private key to impersonate them and perform unauthorized transactions. The machine learning algorithms in the proposed system can help detect identity theft attacks by analyzing user behavior patterns and detecting anomalies that may indicate fraudulent activity;
- II. Distributed Denial of Service (DDoS): A DDoS attack on a blockchain network overwhelms the network with traffic, rendering it unavailable to users. Blockchain networks can use consensus mechanisms that require participants to perform computational work before contributing to the network, making it more difficult for attackers to mount a DDoS attack. The proposed approach can detect attempts to flood the network with traffic to disrupt its normal functioning;
- III. Sybil attacks: Sybil attacks in a blockchain network involve creating multiple identities to control a significant portion of the network’s resources. The proposed system can detect attempts to create multiple fake identities by using identity verification processes. Additionally, it can detect fake nodes that try to manipulate the network’s consensus mechanism by analyzing network traffic and identifying clusters of nodes that behave abnormally or have suspicious communication patterns;
- IV. Routing attacks: Routing attacks in a blockchain network involve an attacker manipulating the network’s routing protocol to redirect traffic to malicious nodes, allowing the attacker to intercept, modify, or block network traffic. The proposed approach can help detect this attack because it analyzes network traffic patterns and identifies potential attacks or suspicious behavior;
- V. Decentralized Autonomous Organization (DAO) attacks: The DAO attack occurs when the attacker exploits a vulnerability in the code of the DAO SC, allowing them to siphon off a large amount of digital currency. The proposed system can detect this type of attack because it analyzes transactions to detect abnormal or fraudulent behavior, such as repeatedly requesting refunds using different accounts;
- VI. Billing fraud: Billing fraud is when someone submits false or misleading information to receive payment for services that were not provided or were misrepresented. In healthcare blockchain applications, it can occur when a healthcare provider submits false claims for reimbursement. Our system can detect billing fraud by analyzing data and identifying abnormal billing patterns, such as unusually high rates of certain procedures or services, or frequent billing for services that are not typically provided;
- VII. Reentrancy attack: A reentrancy attack is a type of SC vulnerability where an exploiter contract leverages a loophole in the victim contract to continuously withdraw from it until the victim contract goes bankrupt. Using machine learning algorithms that analyze the transactions, the proposed system can detect and identify patterns and behaviors of known reentrancy attacks.
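To make the idea of "abnormal billing patterns" concrete, the sketch below flags providers whose claim counts are unusually high relative to their peers. It uses a simple z-score rule as a stand-in for the machine learning models of the proposed system; the provider names, claim counts, and the threshold of 2.0 are all invented for illustration.

```python
from statistics import mean, pstdev


def flag_outliers(claims_per_provider, threshold=2.0):
    """Return providers whose claim count lies more than `threshold`
    standard deviations above the mean across all providers."""
    values = list(claims_per_provider.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all providers bill identically; nothing stands out
    return sorted(provider for provider, count in claims_per_provider.items()
                  if (count - mu) / sigma > threshold)


# Invented monthly counts of one procedure per provider.
claims = {
    "prov-1": 10, "prov-2": 12, "prov-3": 11, "prov-4": 9,
    "prov-5": 13, "prov-6": 10, "prov-7": 60,
}
suspicious = flag_outliers(claims)
```

A flagged provider is only a candidate for review; a rule this simple cannot distinguish fraud from a legitimately busy practice.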
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zhang, J.; Long, J.; von Schaewen, A.M.E. How Does Digital Transformation Improve Organizational Resilience? —Findings from PLS-SEM and fsQCA. Sustainability 2021, 13, 11487.
- Tian, S.; Yang, W.; Le Grange, J.M.; Wang, P.; Huang, W.; Ye, Z. Smart healthcare: Making medical care more intelligent. Glob. Health J. 2019, 3, 62–65.
- Mohanty, S.P.; Choppali, U.; Kougianos, E. Everything you wanted to know about smart cities: The Internet of things is the backbone. IEEE Consum. Electron. Mag. 2016, 5, 60–70.
- Zeadally, S.; Siddiqui, F.; Baig, Z.; Ibrahim, A. Smart healthcare: Challenges and potential solutions using internet of things (IoT) and big data analytics. PSU Res. Rev. 2020, 4, 149–168.
- Al Omar, A.; Jamil, A.K.; Khandakar, A.; Uzzal, A.R.; Bosri, R.; Mansoor, N.; Rahman, M.S. A Transparent and Privacy-Preserving Healthcare Platform with Novel SC for Smart Cities. IEEE Access 2021, 9, 90738–90749.
- Bishta, S.; Bishta, N.; Singha, P.; Dasilaa, S.; Nisar, K.S. Smart healthcare using blockchain technologies: The importance, applications, and challenges. Blockchain Appl. Healthc. Inform. 2022, 163–180.
- Sodhro, A.H.; Sennersten, C.; Ahmad, A. Towards Cognitive Authentication for Smart Healthcare Applications. Sensors 2022, 22, 2101.
- Sayeed, S.; Marco-Gisbert, H.; Caira, T. SC: Attacks and Protections. IEEE Access 2020, 8, 1.
- Available online: https://www.h-x.technology/blog/top-3-smart-contract-audit-tools (accessed on 23 March 2023).
- Truong, T.C.; Diep, Q.B.; Zelinka, I. Artificial Intelligence in the Cyber Domain: Offense and Defense. Symmetry 2020, 12, 410.
- El-Dosuky, M.A.; Eladl, G.H. DOORchain: Deep Ontology-Based Operation Research to Detect Malicious SCs. In New Knowledge in Information Systems and Technologies; Springer Nature: Basel, Switzerland, 2019.
- Zhou, H.; Fard, A.M.; Makanju, A. The State of Ethereum SC Security: Vulnerabilities, Countermeasures, and Tool Support. J. Cybersecur. Priv. 2022, 2, 358–378.
- Liu, H.; Liu, C.; Zhao, W.; Jiang, Y.; Sun, J. S-gram: Towards Semantic-Aware Security Auditing for Ethereum SCs. In Proceedings of the 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE18), Montpellier, France, 3–7 September 2018.
- Torres, C.F.; Baden, M.; Norvill, R.; Pontiveros, B.B.F.; Jonker, H.; Mauw, S. ÆGIS: Shielding Vulnerable SCs Against Attacks. arXiv 2020, arXiv:2003.05987.
- Al-Otaibi, Y.D. K-nearest neighbour-based SC for internet of medical things security using blockchain. Comput. Electr. Eng. 2022, 101, 108129.
- Badruddoja, S.; Dantu, R.; He, Y.; Upadhayay, K.; Thompson, M. Making SCs Smarter. In Proceedings of the 2021 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), Virtual, 3–6 May 2021.
- Deebak, B.D.; AL-Turjman, F. Privacy-preserving in SCs using blockchain and artificial intelligence for cyber risk measurements. J. Inf. Secur. Appl. 2021, 58, 102749.
- Xu, Y.; Hu, G.; You, L.; Cao, C. A Novel Machine Learning-Based Analysis Model for SC Vulnerability. Secur. Commun. Netw. 2021, 2021, 5798033.
- Alnavar, K.; Babu, D.C.N. Blockchain-based SC with Machine Learning for Insurance Claim Verification. In Proceedings of the 2021 5th International Conference on Electrical, Electronics, Communication, Computer Technologies and Optimization Techniques, Mysuru, India, 10–11 December 2021.
- Tann, W.J.-W.; Han, X.J.; Gupta, S.S.; Ong, Y.-S. Towards Safer SCs: A Sequence Learning Approach to Detecting Security Threats. arXiv 2019, arXiv:1811.06632.
- Gupta, R.; Tanwer, S.; AL-Turjman, F.; Italiya, P.; Nauman, A.; Kim, S.W. Smart Contract Privacy Protection Using AI in Cyber-Physical Systems: Tools, Techniques and Challenges. IEEE Access 2020, 8, 24746–24772.
- Ray, S.; Mishra, K.N.; Dutta, S. Detection and prevention of DDoS attacks on M-healthcare sensitive data: A novel approach. Int. J. Inf. Technol. 2022, 14, 1333–1341.
- Udupa, P. Smart home for elder care using wireless sensor. Circuit World 2018, 44, 69–77.
- Available online: https://www.kaggle.com/datasets/engrarri21/human-vital-signs (accessed on 27 March 2023).
- Available online: https://www.kaggle.com/datasets/rupakroy/ethereum-fraud-detection (accessed on 29 March 2023).
- Zhang, S.; Zhang, C.; Yang, Q. Data Preparation for Data Mining. Appl. Artif. Intell. 2010, 17, 375–381.
- Thabtah, F.; Abdelhamid, N.; Peebles, D. A machine learning autism classification based on logistic regression analysis. Health Inf. Sci. Syst. 2019, 7, 12.
- Available online: https://machinelearningmastery.com/method-of-lagrange-multipliers-the-theory-behind-support-vector-machines-part-3-implementing-an-svm-from-scratch-in-python/ (accessed on 7 April 2023).
- Jijo, B.T.; Abdulazeez, A.M. Classification Based on Decision Tree Algorithm for Machine Learning. J. Appl. Sci. Technol. Trends 2021, 2, 20–28.
- Kurdi, F.T.; Amakhchan, W.; Gharineiat, Z. Random Forest Machine Learning Technique for Automatic Vegetation Detection and Modelling in LiDAR Data. Int. J. Environ. Sci. Nat. Resour. 2021, 28, 556234.
- Yuvalı, M.; Yaman, B.; Tosun, Ö. Classification Comparison of Machine Learning Algorithms Using Two Independent CAD Datasets. Mathematics 2022, 10, 311.
- Available online: https://www.simplilearn.com/tutorials/machine-learning-tutorial/confusion-matrix-machine-learning#:~:text=A%20confusion%20matrix%20presents%20a,actual%20values%20of%20a%20classifier (accessed on 18 April 2023).
- AlZoman, R.M.; Alenazi, M.J.F. A Comparative Study of Traffic Classification Techniques for Smart City Networks. Sensors 2021, 21, 4677.
- Mandrekar, J.N. Receiver Operating Characteristic Curve in Diagnostic Test Assessment. J. Thorac. Oncol. 2010, 5, 1315–1316.
- Cheng, D.; Zhang, H.; Xia, F.; Li, S.; Zhang, Y. The Scalability for Parallel Machine Learning Training Algorithm: Dataset Matters. arXiv 2020, arXiv:1910.11510.
- Aziz, R.M.; Baluch, M.F.; Patel, S.; Ganie, A.H. LGBM: A machine learning approach for Ethereum fraud detection. Int. J. Inf. Technol. 2022, 14, 3321–3331.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
---|---|---|---|---|
Logistic Regression | 97.79 | 98.55 | 98.63 | 98.59 |
KNN | 99.66 | 99.81 | 99.84 | 99.83 |
SVM | 99.76 | 99.81 | 99.83 | 99.88 |
Gaussian NB | 97.5 | 98.65 | 97.92 | 98.28 |
Decision Tree | 99.76 | 99.89 | 99.79 | 99.84 |
Random Forest | 99.82 | 99.89 | 99.87 | 99.88 |
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
---|---|---|---|---|
Logistic Regression | 84.05 | 88.48 | 32.26 | 47.2 |
KNN | 95.68 | 92.6 | 87.51 | 89.97 |
SVM | 84.55 | 80.55 | 39.88 | 53.34 |
Gaussian NB | 30.88 | 23.99 | 97.88 | 38.54 |
Decision Tree | 97.02 | 93.2 | 93.39 | 93.28 |
Random Forest | 98.2 | 98.64 | 93.16 | 95.82 |
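As a consistency check on the table directly above, the F1-score of each model can be recomputed from its reported precision and recall using the harmonic mean. This is a small sketch, not part of the original evaluation; because the inputs are already rounded to two decimals, differences of a few hundredths between reported and recomputed values are expected.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Logistic Regression": (88.48, 32.26, 47.2),
    "KNN": (92.6, 87.51, 89.97),
    "SVM": (80.55, 39.88, 53.34),
    "Gaussian NB": (23.99, 97.88, 38.54),
    "Decision Tree": (93.2, 93.39, 93.28),
    "Random Forest": (98.64, 93.16, 95.82),
}

for model, (p, r, f1) in reported.items():
    print(f"{model}: reported {f1}, recomputed {f1_from(p, r):.2f}")
```

The Gaussian NB row shows why F1 is informative here: a recall near 98% paired with a precision near 24% gives an F1 below 40%.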
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mohammed, M.A.; Boujelben, M.; Abid, M. A Novel Approach for Fraud Detection in Blockchain-Based Healthcare Networks Using Machine Learning. Future Internet 2023, 15, 250. https://doi.org/10.3390/fi15080250