Different Scales of Medical Data Classification Based on Machine Learning Techniques: A Comparative Study
Abstract
1. Introduction
- Continuous monitoring of the patient’s health and state. In the case of an unusual event, an alarm is sent to the patient’s doctor for early intervention.
- Early detection of disease.
- Prediction of new diseases.
Challenges of Big Data in Healthcare
- Collecting patient data continuously from different sources, which leads to a high volume of data.
- Medical data are mostly unstructured or semi-structured.
- Medical data are not easily interpretable by everyone.
- Handling the huge volume of medical data.
- Extracting useful information from medical big data.
2. Related Work
3. Methodology
3.1. DT
- The max_depth parameter represents the maximum depth of the tree. If this parameter is not defined, nodes keep expanding until all of the leaves are pure, which can produce a very deep, overfitted tree. We set it to 20.
- The criterion parameter represents the measure of the split’s quality. Here, we use entropy, which measures the information gain. (A minimal configuration sketch follows this list.)
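As an illustration only, the following minimal sketch shows how these two settings could be expressed with a scikit-learn DecisionTreeClassifier; the library choice, the toy breast cancer dataset, and the train/test split are assumptions made for this sketch, not part of the original experiments.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder data: a small medical toy dataset stands in for the study's datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

dt = DecisionTreeClassifier(
    criterion="entropy",  # split quality measured via information gain
    max_depth=20,         # cap the tree depth, as described above
)
dt.fit(X_train, y_train)
print("DT accuracy:", accuracy_score(y_test, dt.predict(X_test)))
```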
3.2. RF
- The n_estimators parameter represents the number of trees that are built before voting. A larger number of trees generally yields better accuracy but increases the running time. We use 50 trees for an efficient classification.
- The max_depth parameter represents the maximum depth of the trees. If we do not assign a value for max_depth, the trees may grow until every leaf is pure, producing very deep trees. We set it to 20. (A minimal configuration sketch follows this list.)
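The sketch below shows a possible scikit-learn equivalent of this configuration; the RandomForestClassifier, the toy dataset, and the fixed random_state are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Placeholder data standing in for the study's medical datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(
    n_estimators=50,  # number of trees voting on the final class
    max_depth=20,     # cap on the depth of each tree
    random_state=0,
)
rf.fit(X_train, y_train)
print("RF accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```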
3.3. J48
3.4. LM
3.5. R
- The max_iter parameter represents the maximum number of iterations allowed for convergence. We set it to 20.
- The random_state parameter controls the random values used in shuffling the data.
- The classes attribute holds the list of class labels used in the classification. (A minimal configuration sketch follows this list.)
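Since the exact regression implementation is not specified here, the sketch below assumes a scikit-learn LogisticRegression as a stand-in; the dataset and split are placeholders. Note that random_state only matters for solvers that use data shuffling, and a low max_iter may trigger a convergence warning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the study's medical datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LogisticRegression(
    max_iter=20,     # cap on solver iterations; may warn if not yet converged
    random_state=0,  # fixes any shuffling/randomness used by the solver
)
reg.fit(X_train, y_train)
print("Class labels:", reg.classes_)   # the classes attribute described above
print("Accuracy:", reg.score(X_test, y_test))
```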
3.6. GBT
- The max_depth parameter represents the maximum depth of the trees. If we do not assign a value for max_depth, the trees may grow very deep. We set it to 20.
- The learning_rate parameter is set to 0.01. (A minimal configuration sketch follows this list.)
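A minimal sketch of a gradient boosted tree classifier with these two settings, assuming scikit-learn's GradientBoostingClassifier; the dataset and split are placeholders and the remaining parameters are left at their defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Placeholder data standing in for the study's medical datasets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gbt = GradientBoostingClassifier(
    max_depth=20,        # cap on the depth of each boosted tree
    learning_rate=0.01,  # small learning rate, as described above
    random_state=0,
)
gbt.fit(X_train, y_train)
print("GBT accuracy:", gbt.score(X_test, y_test))
```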
3.7. NB
3.8. CFS
4. Comparison between Medical Data Classification Methods
- Machine learning needs to be trained on huge datasets, and these should be unbiased as well as of good quality. As a result, there may be periods during which we must wait for new data to be produced.
- Sufficient time is required for the algorithms to learn to achieve their purpose accurately. Machine learning also requires massive resources to function, which leads to additional computing power requirements [37].
- I. Data Acquisition
- II. Data Preprocessing
- III. Feature Selection
- IV. Data Classification (a hedged sketch of these four stages follows this list)
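The sketch below strings the four stages together as a single scikit-learn pipeline. The data, the preprocessing steps (SimpleImputer, StandardScaler), and the SelectKBest filter used here in place of the CFS algorithm are all stand-ins chosen for illustration, not the study's actual implementation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

# Stage I - data acquisition: a toy medical dataset stands in for the real sources.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipeline = Pipeline([
    # Stage II - preprocessing: fill missing values and scale the features.
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    # Stage III - feature selection: a simple filter used here in place of CFS.
    ("select", SelectKBest(mutual_info_classif, k=10)),
    # Stage IV - classification.
    ("classify", RandomForestClassifier(n_estimators=50, max_depth=20, random_state=0)),
])

pipeline.fit(X_train, y_train)
print("Pipeline accuracy:", pipeline.score(X_test, y_test))
```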
5. Results and Discussion
5.1. Dataset Description
5.2. Results
5.3. Discussion
- Preprocessing helps increase the data quality, since good data quality results in good accuracy values.
- The application of a data reduction algorithm reduces the processing time.
The limitations or challenges of our model are as follows:
- Time complexity: the execution requires a large amount of time, since the more the data grow, the more time the execution requires.
- CPU and memory issues: high-volume data require a major part of the computer memory.
- Adjusting the parameters’ values (such as the number of layers, max_depth, etc.) requires significant effort and time, since it is a trial-and-error process.
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Maleki, N.; Zeinali, Y.; Niaki, S. A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection. Expert Syst. Appl. 2021, 164, 113981.
2. Bichri, A.; Kamzon, M.; Abderafi, S. Artificial neural network to predict the performance of the phosphoric acid production. Procedia Comput. Sci. 2020, 177, 444–449.
3. Aurelia, J.; Rustam, Z.; Wirasati, I.; Hartini, S.; Saragih, G. Hepatitis classification using support vector machines and random forest. IAES Int. J. Artif. Intell. (IJ-AI) 2021, 10, 446–451.
4. Ehatisham-ul-Haq, M.; Malik, M.; Azam, M.; Naeem, U.; Khalid, A.; Ghazanfar, M. Identifying Users with Wearable Sensors based on Activity Patterns. Procedia Comput. Sci. 2020, 177, 8–15.
5. Ye, Y.; Shi, J.; Zhu, D.; Su, L.; Huang, J.; Huang, Y. Management of medical and health big data based on integrated learning-based health care system: A review and comparative analysis. Comput. Methods Programs Biomed. 2021, 209, 106293.
6. Nandhini, S.; JeenMarseline, K.S. Performance Evaluation of Machine Learning Algorithms for Email Spam Detection. In Proceedings of the International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE), Vellore, India, 24–25 February 2020; pp. 1–4.
7. Nasiri, S.; Khosravani, M. Machine learning in predicting mechanical behavior of additively manufactured parts. J. Mater. Res. Technol. 2021, 14, 1137–1153.
8. Jalota, C.; Agrawal, R. Analysis of Educational Data Mining using Classification. In Proceedings of the International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (Com-IT-Con), Faridabad, India, 14–16 February 2019; pp. 1–5.
9. Rumsfeld, J.; Joynt, K.; Maddox, T. Big data analytics to improve cardiovascular care: Promise and challenges. Nat. Rev. Cardiol. 2016, 13, 350–359.
10. Lee, C.; Yoon, H. Medical big data: Promise and challenges. Kidney Res. Clin. Pract. 2017, 36, 3–11.
11. Costa, R.; Moreira, J.; Pintor, P.; dos Santos, V.; Lifschitz, S. A Survey on Data-driven Performance Tuning for Big Data Analytics Platforms. Big Data Res. 2021, 25, 100206.
12. Gavai, G.; Nabi, M.; Bobrow, D.; Shahraz, S. Heterogenous Knowledge Discovery from Medical Data Ontologies. In Proceedings of the IEEE International Conference on Healthcare Informatics, Park City, UT, USA, 23–26 August 2017; pp. 519–523.
13. Ansari, M.F.; Alankar, B.; Kaur, H. A Prediction of Heart Disease Using Machine Learning Algorithms. In Proceedings of the International Conference on Image Processing and Capsule Networks, Bangkok, Thailand, 6–7 May 2020; pp. 497–504.
14. Singh, J.; Bagga, S.; Kaur, R. Software-based Prediction of Liver Disease with Feature Selection and Classification Techniques. Procedia Comput. Sci. 2020, 167, 1970–1980.
15. Kondababu, A.; Siddhartha, V.; Kumar, B.; Penumutchi, B. A comparative study on machine learning based heart disease prediction. In Materials Today: Proceedings; Elsevier: Amsterdam, The Netherlands, 2021; Volume 10, pp. 1–5.
16. Ali, M.; Paul, B.; Ahmed, K.; Bui, F.; Quinn, J.; Moni, M. Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput. Biol. Med. 2021, 136, 104672.
17. Subasi, A.; Radhwan, M.; Kurdi, R.; Khateeb, K. IOT Based Mobile Healthcare System for Human Activity Recognition. In Proceedings of the 15th Learning and Technology Conference (L&T), Jeddah, Saudi Arabia, 25–26 February 2018; pp. 29–34.
18. Jan, M.; Awan, A.; Khalid, M.; Nisar, S. Ensemble approach for developing a smart heart disease prediction system using classification algorithms. Res. Rep. Clin. Cardiol. 2018, 9, 33–45.
19. Khan, N.; Husain, S.M.; Tripathi, M.M. Analytical Study of Big Data Classification. In Proceedings of the ACEIT Conference Proceeding, Garden City, Bengaluru, March 2016; pp. 143–146.
20. Mercaldo, F.; Nardone, V.; Santone, A. Diabetes Mellitus Affected Patients Classification and Diagnosis through Machine Learning Techniques. Procedia Comput. Sci. 2017, 112, 2519–2528.
21. Arul Jothi, K.; Subburam, S.; Umadevi, V.; Hemavathy, K. Heart disease prediction system using machine learning. Mater. Today Proc. 2021, 12, 1–3.
22. Arumugam, K.; Naved, M.; Shinde, P.; Leiva-Chauca, O.; Huaman-Osorio, A.; Gonzales-Yanac, T. Multiple disease prediction using Machine learning algorithms. Mater. Today Proc. 2021, 7, 1–4.
23. Pinto, A.; Ferreira, D.; Neto, C.; Abelha, A.; Machado, J. Data Mining to Predict Early Stage Chronic Kidney Disease. Procedia Comput. Sci. 2020, 177, 562–567.
24. Mateo, J.; Rius-Peris, J.; Maraña-Pérez, A.; Valiente-Armero, A.; Torres, A. Extreme gradient boosting machine learning method for predicting medical treatment in patients with acute bronchiolitis. Biocybern. Biomed. Eng. 2021, 41, 792–801.
25. Sabeena, B.; Sivakumari, S.; Amudha, P. A technical survey on various machine learning approaches for Parkinson’s disease classification. Mater. Today Proc. 2020, 10, 1–5.
26. Analytics Vidhya. Available online: https://www.analyticsvidhya.com/blog/2021/05/25-questions-to-test-your-skills-on-decision-trees/ (accessed on 31 December 2021).
27. Muhammad, L.J.; Islam, M.M.; Usman, S.S.; Ayon, S.I. Predictive Data Mining Models for Novel Coronavirus (COVID-19) Infected Patients’ Recovery. SN Comput. Sci. 2020, 1, 200–206.
28. Genuer, R.; Poggi, J.M. Random Forests. In Random Forest in R; Springer Nature: Cham, Switzerland, 2020; pp. 33–51.
29. Medium. Available online: https://medium.com/m/globalidentity?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Frandom-forests-an-ensemble-of-decision-trees-37a003084c6c (accessed on 31 December 2021).
30. Ihya, R.; Namir, A.; El Filali, S.; Daoud, M.A.; Guerss, F. J48 algorithm of machine learning for predicting user’s the acceptance of an E-orientation systems. In Proceedings of the 4th International Conference, Casablanca, Morocco, 2 October 2019; pp. 1–9.
31. Nykodym, T.; Kraljevic, T.; Wang, A. Generalized Linear Modeling with H2O, 6th ed.; Bartz, A., Ed.; H2O.ai, Inc.: Mountain View, CA, USA, 2017; pp. 14–45.
32. Boateng, E.Y.; Abaye, D.A.A. Review of the Logistic Regression Model with Emphasis on Medical Research. J. Data Anal. Inf. Processing 2019, 7, 190–207.
33. Saberian, M.; Delgado, P.; Raimond, Y. Gradient Boosted Decision Tree Neural Network. arXiv 2019, arXiv:1910.09340.
34. Dai, Y.; Sun, H. The naive Bayes text classification algorithm based on rough set in the cloud platform. J. Chem. Pharm. Res. 2014, 6, 1636–1643.
35. Zhang, Y.; Wang, S.; Yang, X.; Dong, Z.; Liu, G.; Phillips, P.; Yuan, T. Pathological brain detection in MRI scanning by wavelet packet Tsallis entropy and fuzzy support vector machine. SpringerPlus 2015, 4, 201–209.
36. Sudirman, I.D.; Nugraha, D.Y. Naive Bayes Classifier for Predicting the Factors that Influence Death Due to COVID-19 in China. J. Theor. Appl. Inf. Technol. 2020, 98, 1686–1696.
37. CIS. Available online: https://www.cisin.com/coffee-break/enterprise/highlights-the-advantages-and-disadvantages-of-machine-learning.html (accessed on 31 December 2021).
38. Qiu, P.; Niu, Z. TCIC_FS: Total correlation information coefficient-based feature selection method for high-dimensional data. Knowl.-Based Syst. 2021, 231, 107418.
39. Banos, O.; Garcia, R.; Holgado-Terriza, J.A.; Damas, M.; Pomares, H.; Rojas, I.; Saez, A.; Villalonga, C. mHealthDroid: A novel framework for agile development of mobile health applications. In Proceedings of the 6th International Work-conference on Ambient Assisted Living and Active Ageing, Belfast, UK, 2–5 December 2014; pp. 91–98.
40. Kaggle: Your Machine Learning and Data Science Community. Available online: https://www.kaggle.com/brandao/diabetes?select=diabetic_data.csv (accessed on 11 October 2021).
41. Catalog.data.gov. Available online: https://catalog.data.gov/dataset/heart-disease-mortality-data-among-us-adults-35-by-state-territory-and-county-2016-2018 (accessed on 13 October 2021).
Reference | Year | Dataset | Algorithm | Accuracy | Advantages/Disadvantages |
---|---|---|---|---|---|
Khan et al. [19] | 2016 | Adult | NB, C4.5 | 98.7% | The accuracy of the model decreased as the volume of data increased; NB performs well on a small dataset. |
Mercaldo et al. [20] | 2017 | Diabetes Pima Indian | J48, MLP, Hoeffding Tree, JRip, Bayes Net, RF | 77.6% | They used various algorithms but no single algorithm provided a sufficient accuracy value. They need to experiment with new classification algorithms which provide high accuracy. |
Subasi et al. [17] | 2018 | M-health | SVM, RF | 86% | The RF algorithm is most efficient with a high amount of data. It results in a high accuracy value. |
Jan et al. [18] | 2018 | Cleveland and Hungarian. | RF, NB, R, NN, SVM | 98.136% | While RF provides very high accuracy, the regression algorithm provides the lowest accuracy value with a high volume of data. |
Singh et al. [14] | 2020 | Indian Liver Patient Dataset (ILPD) | LR, SMO, RF, NB, J48, IBk | 77.4% | The best accuracy result was from the LR with feature selection. We suggest using the CFS algorithm for feature selection to enhance the accuracy value. |
Pinto et al. [23] | 2020 | Chronic Kidney Disease | J48 | 97.66% | They developed a system that can categorize the chronic condition of kidney diseases. The J48 algorithm is suitable for the small or medium volume of data. |
Ansari et al. [13] | 2020 | UCI Heart Disease | LR, PCA | 86% | LR with PCA achieved the best accuracy. |
Ali et al. [16] | 2021 | Kaggle Heart Disease | MLP, RF, DT, KNN | 100% | Three classifications based on KNN, DT, and RF algorithms have the highest accuracy value. |
Jothi et al. [21] | 2021 | Heart Disease | DT, KNN | 81% | DT achieved 81% accuracy, while the KNN algorithm achieved 67%. We assume that the Random Forest algorithm would be more efficient for the proposed work. |
Arumugam et al. [22] | 2021 | Diabetes based Heart Disease | NB, SVM, DT | 90% | DT has the highest accuracy value. The DT model consistently has higher accuracy than NB and SVM models. |
Mateo et al. [24] | 2021 | Acute Bronchiolitis | GBT, KNN, NB, SVM, DT | 94% | The XGB model has the highest prediction accuracy. Implementing data reduction is important to enhance the prediction accuracy. |
Kondababu et al. [15] | 2021 | UCI Heart Disease | RF, LM | 88.4% | We suggest the use of suitable data preprocessing steps and a reduction algorithm as the CFS to enhance the accuracy value. |
Dataset | Number of Attributes | Number of Instances | Missing Data (Y/N) | Redundancy (Y/N) | Noise (Y/N) |
---|---|---|---|---|---|
M-Health [39] | 24 | 161281 | Y | Y | Y |
Diabetes [40] | 50 | 101797 | Y | Y | N |
Heart Disease [41] | 19 | 59077 | Y | N | Y |
Dataset | Encoding (Y/N) | Feature Selection (Y/N) | Number of Attributes after Selection |
---|---|---|---|
M-Health | N | Y | 12 |
Diabetes | Y | Y | 29 |
Heart Disease | Y | Y | 10 |
Algorithm | Accuracy | Relative Error | Precision | Sensitivity | Time (s) |
---|---|---|---|---|---|
NB | 66.2 | 33.8 | 75.2 | 89.56 | 1.39 |
LM | 69.96 | 30.04 | 70.43 | 72.67 | 3.9 |
R | 74.4 | 25.6 | 69.73 | 70.02 | 8.04 |
DT | 75 | 25 | 78.43 | 78.9 | 4.7 |
RF | 75.2 | 24.8 | 89.52 | 99.06 | 9.12 |
GBT | 74.1 | 25.9 | 87.4 | 90.02 | 16.79 |
J48 | 72.9 | 27.1 | 70.42 | 84.1 | 86.83 |
Algorithm | Accuracy | Relative Error | Precision | Sensitivity | Time (s) |
---|---|---|---|---|---|
NB | 77.5 | 22.5 | 86.7 | 65 | 0.078 |
LM | 82.5 | 17.5 | 78.3 | 90 | 3 |
R | 90 | 10 | 86.4 | 95 | 2.95 |
DT | 80 | 20 | 73.1 | 95 | 2.99 |
RF | 90 | 10 | 83.3 | 100 | 3 |
GBT | 90 | 10 | 83.3 | 100 | 16 |
J48 | 84.56 | 15.44 | 84.6 | 84.6 | 85.5 |
Algorithm | Accuracy | Relative Error | Precision | Sensitivity | Time (s) |
---|---|---|---|---|---|
NB | 70.82 | 29.18 | 68.75 | 74.15 | 9.07 |
LM | 65.67 | 34.33 | 72.36 | 71.08 | 12.08 |
R | 70.39 | 29.61 | 68.87 | 69.17 | 10 |
DT | 70.04 | 29.96 | 79.6 | 76.75 | 12.613 |
RF | 79.2 | 20.8 | 75.46 | 80.05 | 16.814 |
GBT | 74.25 | 27.75 | 76.21 | 77.8 | 685.47 |
J48 | 60.4 | 39.6 | 65.8 | 60.7 | 91.65 |
Algorithm | Accuracy | Relative Error | Precision | Sensitivity | Time (s) |
---|---|---|---|---|---|
NB | 82.25 | 17.75 | 88.2 | 70.04 | 7.683 |
LM | 70.44 | 29.65 | 74.31 | 82.52 | 10.158 |
R | 77.1 | 22.9 | 81.1 | 89 | 9.351 |
DT | 79.04 | 20.96 | 66.23 | 90.2 | 9.926 |
RF | 83.59 | 16.41 | 74.3 | 93.1 | 13.481 |
GBT | 79.29 | 20.71 | 72.13 | 91.4 | 446.8 |
J48 | 66.39 | 33.61 | 61.2 | 54.2 | 72.42 |
Algorithm | Accuracy | Relative Error | Precision | Sensitivity | Time (s) |
---|---|---|---|---|---|
NB | 93.62 | 6.38 | 87.9 | 90.18 | 27.026 |
LM | 91.73 | 8.27 | 81.6 | 83.6 | 31.4231 |
R | 80.09 | 19.91 | 68.06 | 74.5 | 31.753 |
DT | 92.53 | 7.47 | 85.5 | 89.086 | 33.9 |
RF | 93.96 | 6.04 | 90.26 | 91.47 | 58.205 |
GBT | 91.07 | 8.93 | 89.69 | 87.05 | 123.45 |
J48 | 70.23 | 29.77 | 70.1 | 72.06 | 90.64 |
Algorithm | Accuracy | Relative Error | Precision | Sensitivity | Time (s) |
---|---|---|---|---|---|
NB | 96.16 | 3.84 | 83.35 | 91.43 | 22.374 |
LM | 94.4 | 5.6 | 81.03 | 76.98 | 26.564 |
R | 85.31 | 14.96 | 79.01 | 95 | 29.064 |
DT | 96.85 | 3.15 | 85.8 | 95.48 | 33.667 |
RF | 97.58 | 2.42 | 86.39 | 97.14 | 56.508 |
GBT | 96.55 | 3.45 | 85.08 | 95.08 | 117.845 |
J48 | 73.828 | 26.172 | 73.5 | 75.1 | 87.24 |