A Review of Data Engineering in United States Healthcare Infrastructure
Abstract
1. Introduction
- Assessing the Role of Data Engineering in AI and ML Implementation in the Healthcare Industry: Evaluating how data engineering supports the deployment of AI and ML models in the healthcare sector, focusing on the unique data requirements for high-performance, reliable, and scalable AI solutions.
- Identifying Challenges in Healthcare Data Management: Investigating the specific challenges healthcare institutions face in managing, structuring, and utilizing large volumes of data, including issues related to data quality, accessibility, integration, and regulatory compliance.
- Assessing the Future Potential and Limitations of AI in Healthcare: Critically evaluating the future trajectory of AI and ML in healthcare, including potential breakthroughs and ongoing limitations, with a focus on how advancements in data engineering can address existing roadblocks and accelerate adoption.
2. Background
2.1. Data Engineering
2.2. Data Gaps in Healthcare
2.3. Machine Learning in Healthcare
3. Materials and Methods
3.1. Overview of the Literature Review
3.2. AI in Oncology Research
3.3. AI in Cardiovascular and Metabolic Disease Prediction
3.4. AI for Infectious Disease Detection and Public Health
3.5. AI in Neurological and Cognitive Disorders
3.6. AI in Medical Imaging and Computer Vision
3.7. Algorithmic Innovations and Framework Development
3.8. Summary of Methodological Approach
4. Discussion
4.1. Data Engineering as the Practical Bottleneck Solution for Clinical AI
4.2. Risk, Liability, and Privacy Constraints Shape Technical Choices
4.3. Evaluation Gaps: Interpretability, Generalizability, and Bias
4.4. Interoperability and “Fragmented Architectures” as a Core Systems Problem
4.5. Future Directions
- End-to-end data pipeline design, with an emphasis on reusable data channels that support cleaning, labeling, provenance tracking, and continuous updates rather than one-off dataset construction. This direction would examine the approaches and methods used to implement interoperability successfully across healthcare ecosystems.
- Methods and techniques for external validation, in particular multi-site evaluation and the reporting of performance across different healthcare ecosystems. This matters most when data are gathered from different systems and the pooled data are used for decision-making in healthcare.
- Methods and techniques for integrating routine subgroup auditing, documentation of missingness, and bias checks into data engineering workflows. Missingness and bias are rarely investigated in data engineering applied to healthcare.
- Approaches that enable learning from distributed data while respecting privacy constraints, paired with clear governance structures. Further investigation is needed into improving data governance while maintaining privacy and security.
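As an illustration of the second and third directions above, the following is a minimal Python sketch of a subgroup missingness audit and per-site performance reporting. It is not a method taken from the reviewed studies. The record layout, the field names (`site`, `sex`, `hba1c`, `bmi`, `label`, `score`), the toy values, and the function names are all hypothetical and chosen only for illustration. AUC is computed with the standard pairwise-ranking definition so that the example needs no external libraries.

```python
from collections import defaultdict

def missingness_by_group(records, group_key, fields):
    """Return {group: {field: fraction of records where the field is None or absent}}."""
    missing = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for rec in records:
        group = rec.get(group_key, "unknown")
        totals[group] += 1
        for field in fields:
            if rec.get(field) is None:
                missing[group][field] += 1
    return {g: {f: missing[g][f] / totals[g] for f in fields} for g in totals}

def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a site has only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_site(records, site_key="site", label_key="label", score_key="score"):
    """Report AUC separately for each site instead of one pooled number."""
    by_site = defaultdict(list)
    for rec in records:
        by_site[rec[site_key]].append((rec[label_key], rec[score_key]))
    return {
        site: auc([y for y, _ in pairs], [s for _, s in pairs])
        for site, pairs in by_site.items()
    }

# Hypothetical toy records; all values are invented for illustration only.
records = [
    {"site": "A", "sex": "F", "hba1c": 6.1, "bmi": None, "label": 1, "score": 0.9},
    {"site": "A", "sex": "M", "hba1c": None, "bmi": 27.0, "label": 0, "score": 0.2},
    {"site": "A", "sex": "F", "hba1c": 5.4, "bmi": 22.5, "label": 0, "score": 0.4},
    {"site": "B", "sex": "M", "hba1c": None, "bmi": None, "label": 1, "score": 0.3},
    {"site": "B", "sex": "F", "hba1c": 7.2, "bmi": 31.0, "label": 0, "score": 0.6},
    {"site": "B", "sex": "M", "hba1c": 5.9, "bmi": 24.1, "label": 1, "score": 0.8},
]

missing_report = missingness_by_group(records, "sex", ["hba1c", "bmi"])
site_report = auc_by_site(records)
print(missing_report)
print(site_report)
```

Reporting the two dictionaries side by side makes it visible when a model that looks adequate in aggregate performs poorly at one site, or when one subgroup contributes most of the missing values. A production workflow would likely add confidence intervals and persist these reports alongside dataset provenance.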
4.6. Limitations
5. Conclusions
- Identifying the need for reliable AI systems and the problems associated with reliability,
- Exploring the challenges the healthcare ecosystem faces in adopting AI systems, and
- Identifying some of the key limitations of data engineering with respect to the use of AI in healthcare.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| AUC | Area Under Curve |
| CAD | Computer-Aided Diagnosis |
| CE-CT | Contrast-Enhanced Computed Tomography |
| CNN | Convolutional Neural Network |
| CQ | Classical-Quantum |
| CT | Computed Tomography |
| DCGAN | Deep Convolutional Generative Adversarial Network |
| DCNN | Deep Convolutional Neural Network |
| EHRs | Electronic Health Records |
| EMD | Empirical Mode Decomposition |
| GAN | Generative Adversarial Network |
| GCN | Graph Convolutional Network |
| GLCM | Gray-Level Co-occurrence Matrix |
| GWAS | Genome-Wide Association Studies |
| LIME | Local Interpretable Model-agnostic Explanations |
| ML | Machine Learning |
| MRI | Magnetic Resonance Imaging |
| ROC | Receiver Operating Characteristic |
| SHAP | SHapley Additive exPlanations |
| TBI | Traumatic Brain Injury |
References
- Spyns, P.; Meersman, R.; Jarrar, M. Data modelling versus ontology engineering. ACM SIGMOD Rec. 2002, 31, 12–17. [Google Scholar] [CrossRef]
- Zikopoulos, P. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data; McGraw-Hill: New York, NY, USA, 2012. [Google Scholar]
- Sacco, A.Y.; Self, Q.R.; Worswick, E.L.; Couperus, C.J.; Kolli, S.S.; Muñoz, S.A.; Carney, J.K.; Repp, A.B. Patients’ perspectives of diagnostic error: A qualitative study. J. Patient Saf. 2021, 17, e1759–e1764. [Google Scholar] [CrossRef] [PubMed]
- Ruini, C.; Schlingmann, S.; Jonke, Ž.; Avci, P.; Padrón-Laso, V.; Neumeier, F.; Koveshazi, I.; Ikeliani, I.U.; Patzer, K.; Kunrad, E.; et al. Machine Learning Based Prediction of Squamous Cell Carcinoma in Ex Vivo Confocal Laser Scanning Microscopy. Cancers 2021, 13, 5522. [Google Scholar] [CrossRef] [PubMed]
- Jang, H.J.; Lee, S.H. AI-Driven Digital Pathology: Deep Learning and Multimodal Integration for Precision Oncology. Int. J. Mol. Sci. 2026, 27, 379. [Google Scholar] [CrossRef]
- Naeem, A.; Anees, T.; Khalil, M.; Zahra, K.; Naqvi, R.A.; Lee, S.W. SNC_Net: Skin Cancer Detection by Integrating Handcrafted and Deep Learning-Based Features Using Dermoscopy Images. Mathematics 2024, 12, 1030. [Google Scholar] [CrossRef]
- Mobiny, A.; Singh, A.; Van Nguyen, H. Risk-Aware Machine Learning Classifier for Skin Lesion Diagnosis. J. Clin. Med. 2019, 8, 1241. [Google Scholar] [CrossRef]
- Liu, L.; Parker, K.J.; Jung, S.H. Design and Analysis Methods for Trials with AI-Based Diagnostic Devices for Breast Cancer. J. Pers. Med. 2021, 11, 1150. [Google Scholar] [CrossRef]
- Kuno, M.; Osumi, H.; Udagawa, S.; Yoshikawa, K.; Ooki, A.; Shinozaki, E.; Ishikawa, T.; Oba, J.; Yamaguchi, K.; Sakurada, K. Artificial Intelligence in Clinical Oncology: From Productivity Enhancement to Creative Discovery. Curr. Oncol. 2025, 32, 588. [Google Scholar] [CrossRef]
- Daimiel Naranjo, I.; Gibbs, P.; Reiner, J.S.; Lo Gullo, R.; Sooknanan, C.; Thakur, S.B.; Jochelson, M.S.; Sevilimedu, V.; Morris, E.A.; Baltzer, P.A.T.; et al. Radiomics and Machine Learning with Multiparametric Breast MRI for Improved Diagnostic Accuracy in Breast Cancer Diagnosis. Diagnostics 2021, 11, 919. [Google Scholar] [CrossRef]
- Danala, G.; Maryada, S.K.; Islam, W.; Faiz, R.; Jones, M.; Qiu, Y.; Zheng, B. A Comparison of Computer-Aided Diagnosis Schemes Optimized Using Radiomics and Deep Transfer Learning Methods. Bioengineering 2022, 9, 256. [Google Scholar] [CrossRef]
- Adebiyi, M.O.; Arowolo, M.O.; Mshelia, M.D.; Olugbara, O.O. A Linear Discriminant Analysis and Classification Model for Breast Cancer Diagnosis. Appl. Sci. 2022, 12, 11455. [Google Scholar] [CrossRef]
- Soltan, A.; Washington, P. Challenges in Reducing Bias Using Post-Processing Fairness for Breast Cancer Stage Classification with Deep Learning. Algorithms 2024, 17, 141. [Google Scholar] [CrossRef]
- Huhulea, E.N.; Huang, L.; Eng, S.; Sumawi, B.; Huang, A.; Aifuwa, E.; Hirani, R.; Tiwari, R.K.; Etienne, M. Artificial Intelligence Advancements in Oncology: A Review of Current Trends and Future Directions. Biomedicines 2025, 13, 951. [Google Scholar] [CrossRef] [PubMed]
- Troisi, J.; Tafuro, M.; Lombardi, M.; Scala, G.; Richards, S.M.; Symes, S.J.K.; Ascierto, P.A.; Delrio, P.; Tatangelo, F.; Buonerba, C.; et al. A Metabolomics-Based Screening Proposal for Colorectal Cancer. Metabolites 2022, 12, 110. [Google Scholar] [CrossRef] [PubMed]
- Mostafa, F.; Hasan, E.; Williamson, M.; Khan, H. Statistical Machine Learning Approaches to Liver Disease Prediction. Livers 2021, 1, 294–312. [Google Scholar] [CrossRef]
- Heinzelmann, E.; Piraino, F. AI-Enhanced Patient-Derived Cancer Organoids: Integrating Machine Learning for Precision Oncology. Organoids 2025, 4, 30. [Google Scholar] [CrossRef]
- Sultan, L.R.; Cary, T.W.; Al-Hasani, M.; Karmacharya, M.B.; Venkatesh, S.S.; Assenmacher, C.A.; Radaelli, E.; Sehgal, C.M. Can Sequential Images from the Same Object Be Used for Training Machine Learning Models? A Case Study for Detecting Liver Disease by Ultrasound Radiomics. AI 2022, 3, 739–750. [Google Scholar] [CrossRef]
- Read, A.J.; Zhou, W.; Saini, S.D.; Zhu, J.; Waljee, A.K. Prediction of Gastrointestinal Tract Cancers Using Longitudinal Electronic Health Record Data. Cancers 2023, 15, 1399. [Google Scholar] [CrossRef]
- Dunn, B.; Pierobon, M.; Wei, Q. Automated Classification of Lung Cancer Subtypes Using Deep Learning and CT-Scan Based Radiomic Analysis. Bioengineering 2023, 10, 690. [Google Scholar] [CrossRef]
- Nasrullah, N.; Sang, J.; Alam, M.S.; Mateen, M.; Cai, B.; Hu, H. Automated Lung Nodule Detection and Classification Using Deep Learning Combined with Multiple Strategies. Sensors 2019, 19, 3722. [Google Scholar] [CrossRef]
- Wang, T.W.; Wang, C.K.; Hong, J.S.; Chao, H.S.; Chen, Y.M.; Wu, Y.T. Deep Learning in Thoracic Oncology: Meta-Analytical Insights into Lung Nodule Early-Detection Technologies. Cancers 2025, 17, 621. [Google Scholar] [CrossRef]
- Shehata, M.; Alksas, A.; Abouelkheir, R.T.; Elmahdy, A.; Shaffie, A.; Soliman, A.; Ghazal, M.; Abu Khalifeh, H.; Salim, R.; Abdel Razek, A.A.K.; et al. A Comprehensive Computer-Assisted Diagnosis System for Early Assessment of Renal Cancer Tumors. Sensors 2021, 21, 4928. [Google Scholar] [CrossRef]
- Latif, G.; Ben Brahim, G.; Iskandar, D.N.F.A.; Bashar, A.; Alghazo, J. Glioma Tumors’ Classification Using Deep-Neural-Network-Based Features with SVM Classifier. Diagnostics 2022, 12, 1018. [Google Scholar] [CrossRef]
- Barhoumi, Y.; Fattah, A.H.; Bouaynaya, N.; Moron, F.; Kim, J.; Fathallah-Shaykh, H.M.; Chahine, R.A.; Sotoudeh, H. Robust AI-Driven Segmentation of Glioblastoma T1c and FLAIR MRI Series and the Low Variability of the MRIMath© Smart Manual Contouring Platform. Diagnostics 2024, 14, 1066. [Google Scholar] [CrossRef]
- Onakpojeruo, E.P.; Mustapha, M.T.; Ozsahin, D.U.; Ozsahin, I. A Comparative Analysis of the Novel Conditional Deep Convolutional Neural Network Model, Using Conditional Deep Convolutional Generative Adversarial Network-Generated Synthetic and Augmented Brain Tumor Datasets for Image Classification. Brain Sci. 2024, 14, 559. [Google Scholar] [CrossRef]
- Wang, H.Y.; Chen, C.H.; Shi, S.; Chung, C.R.; Wen, Y.H.; Wu, M.H.; Lebowitz, M.S.; Zhou, J.; Lu, J.J. Improving Multi-Tumor Biomarker Health Check-Up Tests with Machine Learning Algorithms. Cancers 2020, 12, 1442. [Google Scholar] [CrossRef]
- Liang, Y.; Gharipour, A.; Kelemen, E.; Kelemen, A. Homogeneous Ensemble Feature Selection for Mass Spectrometry Data Prediction in Cancer Studies. Mathematics 2024, 12, 2085. [Google Scholar] [CrossRef]
- Pashaei, E. An Efficient Binary Sand Cat Swarm Optimization for Feature Selection in High-Dimensional Biomedical Data. Bioengineering 2023, 10, 1123. [Google Scholar] [CrossRef] [PubMed]
- Bulić, L.; Brlek, P.; Hrvatin, N.; Brenner, E.; Škaro, V.; Projić, P.; Rogan, S.A.; Bebek, M.; Shah, P.; Primorac, D. AI-Driven Advances in Precision Oncology: Toward Optimizing Cancer Diagnostics and Personalized Treatment. AI 2026, 7, 11. [Google Scholar] [CrossRef]
- Dweekat, O.Y.; Lam, S.S. Cervical Cancer Diagnosis Using an Integrated System of Principal Component Analysis, Genetic Algorithm, and Multilayer Perceptron. Healthcare 2022, 10, 2002. [Google Scholar] [CrossRef] [PubMed]
- Saeed, Z.; Bouhali, O.; Ji, J.X.; Hammoud, R.; Al-Hammadi, N.; Aouadi, S.; Torfeh, T. Cancerous and Non-Cancerous MRI Classification Using Dual DCNN Approach. Bioengineering 2024, 11, 410. [Google Scholar] [CrossRef] [PubMed]
- Yang, H.W.; Hsiao, C.Y.; Peng, Y.Q.; Lin, T.Y.; Tsai, L.W.; Lin, C.; Lo, M.T.; Shih, C.M. Identification of Patients with Potential Atrial Fibrillation during Sinus Rhythm Using Isolated P Wave Characteristics from 12-Lead ECGs. J. Pers. Med. 2022, 12, 1608. [Google Scholar] [CrossRef] [PubMed]
- Khan Mamun, M.M.R.; Elfouly, T. Detection of Cardiovascular Disease from Clinical Parameters Using a One-Dimensional Convolutional Neural Network. Bioengineering 2023, 10, 796. [Google Scholar] [CrossRef]
- Decoodt, P.; Liang, T.J.; Bopardikar, S.; Santhanam, H.; Eyembe, A.; Garcia-Zapirain, B.; Sierra-Sosa, D. Hybrid Classical–Quantum Transfer Learning for Cardiomegaly Detection in Chest X-rays. J. Imaging 2023, 9, 128. [Google Scholar] [CrossRef] [PubMed]
- Decoodt, P.; Sierra-Sosa, D.; Anghel, L.; Cuminetti, G.; De Keyzer, E.; Morissens, M. Transfer Learning Video Classification of Preserved, Mid-Range, and Reduced Left Ventricular Ejection Fraction in Echocardiography. Diagnostics 2024, 14, 1439. [Google Scholar] [CrossRef]
- Lei, N.; Kareem, M.; Moon, S.K.; Ciaccio, E.J.; Acharya, U.R.; Faust, O. Hybrid Decision Support to Monitor Atrial Fibrillation for Stroke Prevention. Int. J. Environ. Res. Public Health 2021, 18, 813. [Google Scholar] [CrossRef]
- Chen, J.; Ji, Y.; Su, T.; Jin, M.; Yuan, Z.; Peng, Y.; Zhou, S.; Bao, H.; Luo, S.; Wang, H.; et al. Prediction of Adverse Outcomes in De Novo Hypertensive Disorders of Pregnancy: Development and Validation of Maternal and Neonatal Prognostic Models. Healthcare 2022, 10, 2307. [Google Scholar] [CrossRef]
- Perišić, M.M.; Vladimir, K.; Karpov, S.; Štorga, M.; Mostashari, A.; Khanin, R. Polygenic Risk Score and Risk Factors for Preeclampsia and Gestational Hypertension. J. Pers. Med. 2022, 12, 1826. [Google Scholar] [CrossRef]
- Prabhakar, A.J.; Prabhu, S.; Agrawal, A.; Banerjee, S.; Joshua, A.M.; Kamat, Y.D.; Nath, G.; Sengupta, S. Use of Machine Learning for Early Detection of Knee Osteoarthritis and Quantifying Effectiveness of Treatment Using Force Platform. J. Sens. Actuator Netw. 2022, 11, 48. [Google Scholar] [CrossRef]
- Sohail, M.N.; Jiadong, R.; Muhammad, M.U.; Chauhdary, S.T.; Arshad, J.; Verghese, A.J. An Accurate Clinical Implication Assessment for Diabetes Mellitus Prevalence Based on a Study from Nigeria. Processes 2019, 7, 289. [Google Scholar] [CrossRef]
- Ahsan, M.M.; Mahmud, M.A.P.; Saha, P.K.; Gupta, K.D.; Siddique, Z. Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance. Technologies 2021, 9, 52. [Google Scholar] [CrossRef]
- Mariani, M.C.; Biney, F.; Tweneboah, O.K. Analyzing Medical Data by Using Statistical Learning Models. Mathematics 2021, 9, 968. [Google Scholar] [CrossRef]
- Latif, G.; Morsy, H.; Hassan, A.; Alghazo, J. Novel Coronavirus and Common Pneumonia Detection from CT Scans Using Deep Learning-Based Extracted Features. Viruses 2022, 14, 1667. [Google Scholar] [CrossRef] [PubMed]
- Pradhan, A.; Prabhu, S.; Chadaga, K.; Sengupta, S.; Nath, G. Supervised Learning Models for the Preliminary Detection of COVID-19 in Patients Using Demographic and Epidemiological Parameters. Information 2022, 13, 330. [Google Scholar] [CrossRef]
- Le, N.; Sorensen, J.; Bui, T.; Choudhary, A.; Luu, K.; Nguyen, H. Enhance Portable Radiograph for Fast and High Accurate COVID-19 Monitoring. Diagnostics 2021, 11, 1080. [Google Scholar] [CrossRef]
- Khaloufi, H.; Abouelmehdi, K.; Beni-Hssane, A.; Rustam, F.; Jurcut, A.D.; Lee, E.; Ashraf, I. Deep Learning Based Early Detection Framework for Preliminary Diagnosis of COVID-19 via Onboard Smartphone Sensors. Sensors 2021, 21, 6853. [Google Scholar] [CrossRef]
- Abbaspour, S.; Robbins, G.K.; Blumenthal, K.G.; Hashimoto, D.; Hopcia, K.; Mukerji, S.S.; Shenoy, E.S.; Wang, W.; Klerman, E.B. Identifying Modifiable Predictors of COVID-19 Vaccine Side Effects: A Machine Learning Approach. Vaccines 2022, 10, 1747. [Google Scholar] [CrossRef]
- Cho, Y.S.; Hong, P.C. Applying Machine Learning to Healthcare Operations Management: CNN-Based Model for Malaria Diagnosis. Healthcare 2023, 11, 1779. [Google Scholar] [CrossRef]
- Khafaga, D.S.; Ibrahim, A.; El-Kenawy, E.S.M.; Abdelhamid, A.A.; Karim, F.K.; Mirjalili, S.; Khodadadi, N.; Lim, W.H.; Eid, M.M.; Ghoneim, M.E. An Al-Biruni Earth Radius Optimization-Based Deep Convolutional Neural Network for Classifying Monkeypox Disease. Diagnostics 2022, 12, 2892. [Google Scholar] [CrossRef]
- Bangyal, W.H.; Rehman, N.U.; Nawaz, A.; Nisar, K.; Ibrahim, A.A.A.; Shakir, R.; Rawat, D.B. Constructing Domain Ontology for Alzheimer Disease Using Deep Learning Based Approach. Electronics 2022, 11, 1890. [Google Scholar] [CrossRef]
- Mandal, P.K.; Mahto, R.V. Deep Multi-Branch CNN Architecture for Early Alzheimer’s Detection from Brain MRIs. Sensors 2023, 23, 8192. [Google Scholar] [CrossRef]
- Huynh, N.; Yan, D.; Ma, Y.; Wu, S.; Long, C.; Sami, M.T.; Almudaifer, A.; Jiang, Z.; Chen, H.; Dretsch, M.N.; et al. The Use of Generative Adversarial Network and Graph Convolution Network for Neuroimaging-Based Diagnostic Classification. Brain Sci. 2024, 14, 456. [Google Scholar] [CrossRef]
- Zhang, Y.; Dong, Z.; Wang, S.; Ji, G.; Yang, J. Preclinical Diagnosis of Magnetic Resonance (MR) Brain Images via Discrete Wavelet Packet Transform with Tsallis Entropy and Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM). Entropy 2015, 17, 1795–1813. [Google Scholar] [CrossRef]
- Ozkan, H. A Comparison of Classification Methods for Telediagnosis of Parkinson’s Disease. Entropy 2016, 18, 115. [Google Scholar] [CrossRef]
- Dhillon, N.S.; Sutandi, A.; Vishwanath, M.; Lim, M.M.; Cao, H.; Si, D. A Raspberry Pi-Based Traumatic Brain Injury Detection System for Single-Channel Electroencephalogram. Sensors 2021, 21, 2779. [Google Scholar] [CrossRef] [PubMed]
- Lenkala, S.; Marry, R.; Gopovaram, S.R.; Akinci, T.C.; Topsakal, O. Comparison of Automated Machine Learning (AutoML) Tools for Epileptic Seizure Detection Using Electroencephalograms (EEG). Computers 2023, 12, 197. [Google Scholar] [CrossRef]
- Guan, Y.; Cheng, C.H.; Chen, W.; Zhang, Y.; Koo, S.; Krengel, M.; Janulewicz, P.; Toomey, R.; Yang, E.; Bhadelia, R.; et al. Neuroimaging Markers for Studying Gulf-War Illness: Single-Subject Level Analytical Method Based on Machine Learning. Brain Sci. 2020, 10, 884. [Google Scholar] [CrossRef] [PubMed]
- Pérez-Cano, L.; Boccuto, L.; Sirci, F.; Hidalgo, J.M.; Valentini, S.; Bosio, M.; Liogier D’Ardhuy, X.; Skinner, C.; Cascio, L.; Srikanth, S.; et al. Characterization of a Clinically and Biologically Defined Subgroup of Patients with Autism Spectrum Disorder and Identification of a Tailored Combination Treatment. Biomedicines 2024, 12, 991. [Google Scholar] [CrossRef]
- Zhang, H.; Li, Z.; Zhao, H.; Li, Z.; Zhang, Y. Attentive Octave Convolutional Capsule Network for Medical Image Classification. Appl. Sci. 2022, 12, 2634. [Google Scholar] [CrossRef]
- Zou, R.; Wang, Q.; Wen, F.; Chen, Y.; Liu, J.; Du, S.; Yuan, C. An Interactive Image Segmentation Method Based on Multi-Level Semantic Fusion. Sensors 2023, 23, 6394. [Google Scholar] [CrossRef]
- Oghalai, T.P.; Long, R.; Kim, W.; Applegate, B.E.; Oghalai, J.S. Automated Segmentation of Optical Coherence Tomography Images of the Human Tympanic Membrane Using Deep Learning. Algorithms 2023, 16, 445. [Google Scholar] [CrossRef]
- Abuhussein, M.; Robinson, A. Obscurant Segmentation in Long Wave Infrared Images Using GLCM Textures. J. Imaging 2022, 8, 266. [Google Scholar] [CrossRef]
- Jamjoom, M.; Mahmoud, A.M.; Abbas, S.; Hodhod, R. Gaussian Mixture with Max Expectation Guide for Stacked Architecture of Denoising Autoencoder and DRBM for Medical Chest Scans and Disease Identification. Electronics 2023, 12, 105. [Google Scholar] [CrossRef]
- Collazo, C.; Vargas, I.; Cara, B.; Weinheimer, C.J.; Grabau, R.P.; Goldgof, D.; Hall, L.; Wickline, S.A.; Pan, H. Synergizing Deep Learning-Enabled Preprocessing and Human–AI Integration for Efficient Automatic Ground Truth Generation. Bioengineering 2024, 11, 434. [Google Scholar] [CrossRef] [PubMed]
- Rosenberg, G.; Brubaker, J.K.; Schuetz, M.J.A.; Salton, G.; Zhu, Z.; Zhu, E.Y.; Kadıoğlu, S.; Borujeni, S.E.; Katzgraber, H.G. Explainable Artificial Intelligence Using Expressive Boolean Formulas. Mach. Learn. Knowl. Extr. 2023, 5, 1760–1795. [Google Scholar] [CrossRef]
- Ghimire, A.; Amsaad, F. A Parallel Approach to Enhance the Performance of Supervised Machine Learning Realized in a Multicore Environment. Mach. Learn. Knowl. Extr. 2024, 6, 1840–1856. [Google Scholar] [CrossRef]
- Dharavath, M. Transforming Healthcare Through Data Engineering, Predictive Analytics, and AI Models. Int. J. Res. Comput. Appl. Inf. Technol. (IJRCAIT) 2024, 7, 1710–1718. [Google Scholar]
- Dash, S.; Shakyawar, S.K.; Sharma, M.; Kaushik, S. Big data in healthcare: Management, analysis and future prospects. J. Big Data 2019, 6, 54. [Google Scholar] [CrossRef]
- Palanisamy, V.; Thirunavukarasu, R. Implications of big data analytics in developing healthcare frameworks—A review. J. King Saud Univ.-Comput. Inf. Sci. 2019, 31, 415–425. [Google Scholar] [CrossRef]
- Zhang, A.; Xing, L.; Zou, J.; Wu, J.C. Shifting machine learning for healthcare from development to deployment and from models to data. Nat. Biomed. Eng. 2022, 6, 1330–1345. [Google Scholar] [CrossRef]
- Nasir, A.; Gurupur, V.; Liu, X. A new paradigm to analyze data completeness of patient data. Appl. Clin. Inform. 2016, 7, 745–764. [Google Scholar] [CrossRef] [PubMed]
- Nasir, A.; Liu, X.; Gurupur, V.; Qureshi, Z. Disparities in patient record completeness with respect to the health care utilization project. Health Inform. J. 2019, 25, 401–416. [Google Scholar] [CrossRef] [PubMed]
- Gurupur, V.P.; Shelleh, M. Machine learning analysis for data incompleteness (madi): Analyzing the data completeness of patient records using a random variable approach to predict the incompleteness of electronic health records. IEEE Access 2021, 9, 95994–96001. [Google Scholar] [CrossRef]
- Biswas, P.; Gharami, P.P.; Islam, M.R. XMP-Net: An XAI-Based Modified Xception Model for Recognizing Monkeypox and Other Skin Diseases. BioMed Res. Int. 2026, 2026, 1113178. [Google Scholar]
- Adegoke, K.; Adegoke, A.; Dawodu, D.; Adekoya, A.; Bayowa, A.; Kayode, T.; Singh, M. Interoperability as a Catalyst for Digital Health and Therapeutics: A Scoping Review of Emerging Technologies and Standards (2015–2025). Int. J. Environ. Res. Public Health 2025, 22, 1535. [Google Scholar] [CrossRef] [PubMed]
- Gazzarata, R.; Almeida, J.; Lindsköld, L.; Cangioli, G.; Gaeta, E.; Fico, G.; Chronaki, C.E. HL7 Fast Healthcare Interoperability Resources (HL7 FHIR) in digital healthcare ecosystems for chronic disease management: Scoping review. Int. J. Med. Inform. 2024, 189, 105507. [Google Scholar] [CrossRef]
- Awaysheh, A.; Wilcke, J.; Elvinger, F.; Rees, L.; Fan, W.; Zimmerman, K. A review of medical terminology standards and structured reporting. J. Vet. Diagn. Investig. 2018, 30, 17–25. [Google Scholar] [CrossRef]
- Mollerus, F.; Lynch, C.; Bruining, H. Data interoperability for a systems approach to developmental conditions. Neurosci. Biobehav. Rev. 2025, 176, 106245. [Google Scholar] [CrossRef]



| Article | Journal | Focus Keywords |
|---|---|---|
| Skin cancer detection with hybrid CNN features [6] | Mathematics | Deep learning; dermoscopy; CAD; CNN |
| Breast MRI radiomics for tumor classification [10] | Diagnostics | Radiomics; MRI; dynamic contrast; breast cancer |
| Renal cancer CAD system [23] | Sensors | CE-CT; morphology; texture; functionality |
| Multi-tumor biomarker tests with ML algorithms [27] | Cancers | Screening; health check-ups; machine learning |

| Article | Journal | Keywords |
|---|---|---|
| Atrial fibrillation prediction via ECG features [33] | J. Pers. Med. | Cardiovascular diagnosis; signal processing; EMD; ML |
| Preeclampsia and gestational hypertension risk [39] | J. Pers. Med. | Pregnancy; polygenic score; ML; GWAS |
| Knee osteoarthritis detection via force platform [40] | JSAN | Machine learning; balance metrics; biomechanics |

| Article | Journal | Key Concepts |
|---|---|---|
| COVID-19 detection from CT scans [44] | Viruses | Deep learning; image classification; CNN |
| Smartphone sensor framework for COVID prediction [47] | Sensors | On-device AI; mobile sensors; real-time inference |
| ML model for vaccine side-effect prediction [48] | Vaccines | Time-of-day effects; allergy; explainable ML |

| Article | Journal | Focus Keywords |
|---|---|---|
| Deep learning for Alzheimer’s MRI [52] | Sensors | Brain imaging; CNN; disease detection |
| GAN–GCN neuroimaging classification [53] | Brain Sciences | Resting-state fMRI; GAN; GCN; deep learning |
| Autism spectrum disorder subgroup characterization [59] | Biomedicines | Precision medicine; ASD; transcriptomics |

| Article | Journal | Focus Keywords |
|---|---|---|
| Capsule network for medical image classification [60] | Applied Sciences | Attention mechanism; octave convolution; CNN |
| Interactive segmentation via semantic fusion [61] | Sensors | Image segmentation; multi-level features; deep learning |
| Whole-slide image preprocessing and normalization [65] | Bioengineering | WSI; annotation; pathology; preprocessing |

| Article | Journal | Keywords |
|---|---|---|
| Explainable AI using Boolean logic [66] | MAKE | Interpretable ML; Boolean search; ILP; QUBO |
| Bias reduction in breast cancer classification [13] | Algorithms | Fairness; post-processing; equalized odds |
| Parallel learning performance in multicore systems [67] | MAKE | Ensemble model; multicore computing; performance |

| Clinical Domain | Methods | Key Finding | Limitation |
|---|---|---|---|
| Oncology | CNN, radiomics, transfer learning, ensemble classifiers | Deep learning and radiomics achieve strong diagnostic accuracy across cancer types; multimodal data integration further improves performance | Models trained on single-site or narrow imaging datasets; limited generalizability |
| Cardiovascular & Metabolic Disease | 1D CNN, ECG signal processing, transfer learning, federated approaches | AI reliably detects arrhythmias and metabolic risk from ECG and clinical parameters; quantum–classical hybrid models show emerging promise | Most studies use structured EHR or single-modality signals; real-world deployment remains limited |
| Infectious Disease | CNN, SMOTE balancing, explainability tools (SHAP, LIME), mobile sensors | COVID-19 accelerated AI diagnostics in low-resource settings; explainability techniques are increasingly integrated to support public health decision-making | Heavy reliance on pandemic-era datasets; unclear generalizability to endemic or novel pathogens |
| Neurological & Cognitive Disorders | Deep CNN, GAN, GCN, wavelet transforms, AutoML | AI shows strong potential for early detection of Alzheimer’s, Parkinson’s, and epilepsy; synthetic data augmentation partially compensates for small neuroimaging datasets | Small and demographically homogeneous cohorts; limited external validation |
| Medical Imaging & Computer Vision | Capsule networks, attention mechanisms, semantic segmentation, denoising autoencoders | Automated preprocessing and annotation pipelines can dramatically reduce manual expert effort; segmentation methods generalize across organ systems | Computational cost and variability in imaging protocols across institutions |
| Algorithmic Frameworks | Explainable AI (Boolean logic, SHAP), fairness post-processing, parallel ensemble learning | Interpretability and bias correction are technically feasible but remain underutilized in clinical AI pipelines; fairness requires data governance, not just algorithmic fixes | Adoption of explainability and fairness tools is inconsistent across the field |

| Publication | Description |
|---|---|
| Dharavath [68] | The author focuses on the amount of data and systems associated with electronic health record systems. |
| Dash et al. [69] | The authors focus on different applications associated with big data within healthcare systems. |
| Palanisamy et al. [70] | The authors list and describe different data frameworks associated with healthcare systems across the globe. |
| Zhang et al. [71] | In this review article, the authors provide a description of different types of data that can be incorporated in developing machine learning models for healthcare decision-making. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Trader, E.A.; Hooshmand, S.; Abedin, P.; Park, J.; Gurupur, V. A Review of Data Engineering in United States Healthcare Infrastructure. Healthcare 2026, 14, 1401. https://doi.org/10.3390/healthcare14101401

