Review

Recent Trends in Machine Learning for Healthcare Big Data Applications: Review of Velocity and Volume Challenges

by Doaa Yaseen Khudhur 1,2,*, Abdul Samad Shibghatullah 1, Khalid Shaker 2, Aliza Abdul Latif 1 and Zakaria Che Muda 3

1 Department of Informatics, College of Computing & Information Technology, Universiti Tenaga Nasional, Putrajaya Campus, Jalan IKRAM-UNITEN, Kajang 43000, Selangor, Malaysia
2 Departments of Artificial Intelligence & Information Technology, College of Computer Science and Information Technology, University of Anbar, Ramadi 31001, Iraq
3 Faculty of Engineering and Quantity Surveying, INTI International University, Nilai 71800, Negeri Sembilan, Malaysia
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(12), 772; https://doi.org/10.3390/a18120772
Submission received: 16 October 2025 / Revised: 27 November 2025 / Accepted: 28 November 2025 / Published: 8 December 2025

Abstract

The integration and emerging adoption of machine learning (ML) algorithms in healthcare big data have revolutionized clinical decision-making, predictive analytics, and real-time medical diagnostics. However, the application of machine learning in healthcare big data faces computational challenges, particularly in efficiently processing and training on the large-scale, high-velocity data generated by healthcare organizations worldwide. In response to these issues, this study critically reviews current state-of-the-art advancements in machine learning algorithms and big data frameworks within healthcare analytics, with a particular emphasis on solutions addressing data volume and velocity. The reviewed literature is categorized into three key areas: (1) efficient techniques, arithmetic operations, and dimensionality reduction; (2) advanced and specialized processing hardware; and (3) clustering and parallel processing methods. Key research gaps and open challenges are identified based on the evaluation of the literature across these categories, and important future research directions are discussed in detail. Among the proposed solutions are federated learning and decentralized data processing, efficient parallel processing through big data frameworks such as Apache Spark, neuromorphic computing, and multi-swarm large-scale optimization algorithms; these highlight the importance of interdisciplinary innovations in algorithm design, hardware efficiency, and distributed computing frameworks, which collectively contribute to faster, more accurate, and resource-efficient AI-driven healthcare big data analytics and applications. This research supports UNSDG 3 (Good Health and Well-Being) and UNSDG 9 (Industry, Innovation and Infrastructure) by integrating machine learning into healthcare big data and by promoting product innovation in the healthcare industry, respectively.

1. Introduction

Data-driven predictive analytics and machine learning algorithms in human health have emerged as transformative technologies with significant potential to enhance the precision and efficiency of anesthetic treatments [1]. Machine learning (ML) algorithms have become progressively significant in the development of software applications and product innovation within the healthcare industry. ML techniques enable the analysis and extraction of valuable insights from diverse data sources collected by healthcare organizations, such as genomic information, patient health sensors, CT scans, and medical images [2,3]. Notable examples of ML applications include developing predictive models for certain diseases, enhancing clinical decision support systems, biomarker and drug target discovery, operational workflow optimization, and treatment plan customization. For instance, ML can assist in predictive analytics, automated documentation, and clinical decision support for Electronic Health Records (EHRs), thereby improving patient management. In MRI and CT scan data, ML algorithms are widely employed in image reconstruction, anomaly detection, and risk evaluation of lesions, thereby expediting and enhancing diagnosis [3,4]. Additionally, ML leverages DNA sequencing to identify disease-causing mutations, enabling faster human genome analysis and advancing personalized medicine through specific treatment strategies.
The integration of ML across these domains enhances efficacy, minimizes misdiagnosis, and improves clinical outcomes, marking a significant shift toward data-centric and AI-based medicine [5,6]. Overall, the implementation of ML holds significant potential for improving healthcare delivery and patient outcomes and reducing costs through analysis of the enormous volumes of data produced daily within the healthcare system [7]. However, the application of machine learning in healthcare big data faces several critical challenges, such as the efficient processing of large-scale, high-velocity, and high-volume healthcare data [4,8]. The medical field generates data at an unprecedented rate, presenting a significant challenge for ML algorithms to deliver real-time analysis and support decision-making. Moreover, the volume of data is expanding at an unprecedented scale, rendering traditional methods inadequate for processing petabyte-scale data [9,10,11,12]. To address this complexity, ML requires robust and innovative solutions capable of efficiently, effectively, and seamlessly managing such data. While more advanced learning models such as deep learning show promise in processing high-volume and high-velocity data, further research is essential to fully realize their potential in healthcare big data.
Efficient classification of rapid, high-volume data streams is critical for real-time medical analytics and decision-making [11,12]. Machine learning has the potential to derive greater value from big data through the development of efficient paradigms that accelerate model training and prediction on large datasets. Addressing the challenges of velocity and volume will lead to the development of more efficient and robust algorithms capable of improving patient diagnosis and treatment planning [12,13]. Within the context of ML, Velocity refers to the rate at which models are trained, retrained, and updated to continuously gain insights from rapidly generated high-velocity data. The vertical scaling of learning models is largely dependent on their theoretical structure and the construction of their arithmetic operations. Although hardware accelerators such as GPUs and FPGAs can significantly enhance performance, their high cost and algorithm-specific nature limit their applicability across all learning models. One potential approach involves feature selection methods, since reducing dimensionality minimizes computational costs; however, feature selection poses an optimization challenge, as it requires a balance between model prediction accuracy and computational efficiency. Moreover, while reducing the number of features reliably lowers the computational cost of a model, it does not necessarily improve prediction accuracy.
Similarly, the Volume of data created in healthcare settings poses unique computational challenges, given that traditional ML algorithms are often incapable of efficiently processing large-scale datasets. A potential solution to this challenge is horizontal scaling, in which ML models are distributed across several computing nodes using frameworks such as Apache Spark, Hadoop, and Dask. Although distributed computing facilitates parallel processing and improved resource allocation, certain machine learning algorithms are difficult—if not infeasible—to distribute due to their sequential arithmetic nature. In addition, many ML models require training on the entire dataset to make accurate predictions, making it difficult and impractical to partition data across distributed nodes. There are also additional challenges from inter-node communication overhead and parameter synchronization, both of which increase latency and degrade performance. Value-based healthcare systems stand to greatly benefit from innovative and efficient AI solutions, making it essential to overcome these challenges.
Despite the progress achieved in applying ML algorithms to healthcare applications, existing methods remain insufficient to meet the rising challenges of scalable, complex, and operationally demanding healthcare big data environments. For instance, learning models suffer from training performance degradation when trained on rapidly growing and frequently updated datasets. While distributed and parallel computing frameworks offer horizontal scaling advantages, only a small subset of ML models can benefit from such frameworks due to inherent sequential mathematical operations and interdependent parameter updates. Furthermore, the fields of healthcare and ML have recently seen wide adoption of hardware-accelerated solutions capable of high computational throughput; however, their application remains limited by implementation cost, specialized programming skills, and substantial infrastructure requirements. These constraints, among several others, indicate that there remains a clear need for scalable and efficient ML approaches to fully address the challenges posed by large-scale healthcare big data processing.
This review investigates recent advancements in ML algorithms and big data frameworks within healthcare big data analytics, with particular focus on scalability and efficiency challenges. It critically examines the current literature on recent methods and algorithmic designs that address the volume and velocity challenges facing ML in healthcare big data applications. Emphasis is first given to the use of efficient learning techniques, parallel processing, distributed computing big data frameworks, and clustering techniques. The study then explores acceleration techniques enabled by advanced hardware such as Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). Key research gaps and open challenges are highlighted based on the evaluation of the reviewed literature across three primary categories, and important future research directions and solutions are discussed in detail. Among the proposed key solutions are federated learning and decentralized data processing, efficient parallel processing using big data frameworks such as Apache Spark, neuromorphic computing, and multi-swarm large-scale optimization algorithms. These highlight the importance of interdisciplinary innovations in algorithm design, hardware efficiency, and distributed computing frameworks, which collectively contribute to faster, more accurate, and resource-efficient AI-driven healthcare big data analytics and applications. For guided reading, this review article is organized as depicted in Figure 1.
This research supports UNSDG 3 (Good Health and Well-Being) and UNSDG 9 (Industry, Innovation and Infrastructure) by integrating machine learning into healthcare big data and by promoting product innovation in the healthcare industry, respectively.

2. Background

2.1. Volume and Velocity in the Era of Machine Learning and Healthcare Big Data

Massive datasets in healthcare big data can be challenging to process effectively due to the inherent complexities of big data systems. Leveraging these systems to enhance healthcare services, such as accurate patient diagnosis, timely personalized treatment, longitudinal monitoring, preventive care, and medical research, requires highly efficient machine learning algorithms [8]. The generation of big data in healthcare has evolved from recording medical data to emphasizing medical outcomes and patient-centric metrics. This trend is well reflected in the academic literature, where healthcare systems have transitioned from disease-based to value-based systems [8,10,14]. Notably, the characteristics of big data, such as volume, velocity, and variety, are now used to frame and address key problems in healthcare big data systems [8,14,15,16]. For example, Ohlhorst defined big data as information that grows beyond the capacity of traditional systems to process it efficiently, highlighting the challenges posed by processing massive volumes of data with traditional ML algorithms and frameworks. Similarly, Laney defined big data as data that is rich in information and available in real time, emphasizing the concept of velocity, which refers to the challenge of fast learning from a large number of streaming data sources [16,17,18].
The selection of a scalable learning technique for large medical datasets is critical due to the numerous computational and structural considerations involved. One key issue relates to the effective splitting of datasets across many nodes or processors. Efficient data splitting and segmentation are necessary to balance computational workloads, enabling parallel processing of dataset segments. These approaches must ensure even distribution of the workload, minimal communication costs, and data integrity. Big data frameworks such as MapReduce can assist in this process [19,20]. MapReduce divides large datasets into smaller, more manageable portions and concurrently processes them across distributed systems. While MapReduce has been implemented in several research projects focusing on big medical data in healthcare [20,21], healthcare datasets often present challenges such as missing values, data entry errors, mislabeled records, and misdocumented medical conditions [22,23]. This compounds the already challenging issue of processing large-scale datasets, as it can lead to misinterpretation of patient data by ML models, potentially resulting in incorrect diagnoses and inappropriate treatment recommendations.
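To make the pattern concrete, the following minimal Python sketch emulates the MapReduce flow described above on partitioned patient records: a map phase counts diagnosis codes within each partition in parallel, and a reduce phase merges the partial counts. It illustrates the programming model only, not Hadoop's actual API; the record structure and field names are hypothetical.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_partition(records):
    """Map phase: count diagnosis codes within one data partition."""
    return Counter(r["diagnosis"] for r in records)

def merge_counts(c1, c2):
    """Reduce phase: merge partial counts from two partitions."""
    return c1 + c2

if __name__ == "__main__":
    # Toy stand-in for a large dataset split into balanced partitions.
    partitions = [
        [{"diagnosis": "I10"}, {"diagnosis": "E11"}],
        [{"diagnosis": "E11"}, {"diagnosis": "J45"}],
    ]
    with Pool() as pool:
        partials = pool.map(map_partition, partitions)  # parallel map
    totals = reduce(merge_counts, partials)             # sequential reduce
    print(totals)  # e.g., Counter({'E11': 2, 'I10': 1, 'J45': 1})
```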
Furthermore, certain ML models, such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, require the entire dataset during the training phase [18]. In addition, most ML models lack inherent structures for parallel training, which necessitates the use of other techniques such as ensemble methods [24]. The selection of appropriate ensemble methods varies depending on the model's theoretical foundation, the type of task at hand (classification versus regression), the underlying data distribution, and the available computational resources. These complexities often outweigh the advantages offered by ensemble methods [24].
In summary, the increasing complexity of volume and velocity challenges is evident in most of the recent healthcare big data reviews summarized in Table 1. Existing studies tend to focus on disease-specific challenges within healthcare big data systems rather than reflecting a broader, holistic view of volume and velocity challenges. While such studies are valuable within their respective domains, this narrowing of scope may be attributed to the increased complexity of these challenges and the absence of directly applicable solutions.

2.2. Challenges Surrounding Large Volumes of Healthcare Big Data

Machine learning is crucial to the analysis of large medical datasets and their integration into the decision-making processes of healthcare systems [8,10]. The high dimensionality of large medical sample data, such as DNA methylation and genomic data, which often contain a few hundred thousand features, poses significant scalability challenges for most ML algorithms. This is further compounded by the complex mathematical structures, arithmetic operations, and theoretical constructs involved in the data segmentation and distribution of these models [3,33]. To meet the growing complexities of modern healthcare systems, research and development efforts must prioritize the design and implementation of fast, scalable learning algorithms. For instance, logistic and linear regression models scale linearly with data size and offer high interpretability—an essential feature in a clinical context—making them suitable and preferable for medical decision-support applications, in contrast to deep learning models, which often consist of millions to billions of parameters and demand substantial computational resources [10,18].
Arithmetic functions, particularly matrix multiplications, are central to the scalability of ML models, as they are used extensively in feature transformation, linear regression, model training, and dimensionality reduction techniques [34]. The utilization of specialized frameworks such as TensorFlow and PyTorch, as well as hardware such as GPUs and TPUs, can greatly improve the efficiency of matrix operations [35,36,37,38]. However, efficiently executing matrix operations in parallel for ML models applied to large-scale datasets remains a challenge. In neural networks, for instance, the sequential dependency between layers can leave parallel computing nodes idle if synchronization is not maintained. Such synchronization is essential to ensure calculation accuracy and to prevent error propagation through the network. The management of multiple layers with backpropagation, as in Deep Neural Networks (DNNs), further increases the complexity, as gradients must be computed layer by layer and propagated backward from the output layer through the network [39,40].
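As an illustration of how hardware accelerates the matrix operations discussed above, the following PyTorch sketch times the same matrix multiplication on the CPU and, when a CUDA device is available, on the GPU. The matrix sizes are illustrative assumptions, and measured speedups will vary with hardware.

```python
import time
import torch

# Illustrative sizes; real feature matrices in healthcare ML are far larger.
a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

t0 = time.perf_counter()
c = a @ b                              # matrix multiplication on the CPU
print(f"CPU matmul: {time.perf_counter() - t0:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()  # one-off host-to-device transfer
    torch.cuda.synchronize()           # exclude transfer time from the timing
    t0 = time.perf_counter()
    c_gpu = a_gpu @ b_gpu              # same product, dispatched to the GPU
    torch.cuda.synchronize()           # wait for the asynchronous kernel
    print(f"GPU matmul: {time.perf_counter() - t0:.3f}s")
```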
Furthermore, the theoretical foundation of a learning algorithm also influences its scalability. For instance, the K-means algorithm, a widely used unsupervised learning method, is highly scalable. K-means partitions a dataset into N clusters based on distance, typically using Euclidean distance [18]. The cost of calculating the distance matrix for all data points grows linearly with the number of points in the dataset and can be reduced substantially with parallel processing, such as divide-and-conquer techniques. In contrast, hierarchical algorithms, such as decision trees, are generally less amenable to distribution and parallel computation. Due to their sequential branching structure, distributing them places heavy demands on processing resources and network bandwidth and is therefore inefficient for large-scale datasets [18,41].
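One widely available way to exploit K-means' scalability in practice is scikit-learn's mini-batch variant, which updates centroids from small random batches rather than the full distance matrix. The sketch below is illustrative only, using synthetic data as a stand-in for a large patient-feature matrix; it is not the divide-and-conquer parallelization described above, but it demonstrates the same principle of bounding the per-step cost as the dataset grows.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for a large patient-feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 20))

# Mini-batch K-means touches only a small sample per iteration,
# so memory use and per-step cost stay bounded as the dataset grows.
km = MiniBatchKMeans(n_clusters=8, batch_size=4096, random_state=0)
km.fit(X)
print(km.cluster_centers_.shape)  # (8, 20)
```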

2.3. Challenges Surrounding High Velocity Healthcare Big Data

Laney’s definition of big data as a rapidly created and content-rich dataset captures the contemporary understanding of “velocity” in healthcare big data research and emphasizes the importance of real-time processing of data generated from diverse medical sources [8,41,42]. In a fast-changing healthcare environment, where data streams originate from medical devices, electronic health records, and wearable health monitors, the swift and accurate processing of incoming medical data using ML models is essential. The capability to rapidly process extensive data streams and support real-time decisions is highly critical within healthcare systems [42]; this aligns with the core objective of healthcare systems, which is delivering the right care at the right time. Delays in diagnosis, treatment, or clinical decisions negatively impact patients and the efficiency of the healthcare system. As a result, near real-time response is a fundamental expectation of healthcare systems [42].
Volume- and velocity-related issues in ML are distinct. High-velocity data streams introduce different difficulties from those associated with large data volumes. Data streams can temporally alter the data distribution and the relationships between input and output variables, causing model concept drift and generalization failures [3,42]. In many cases, addressing these challenges requires full model retraining, which is computationally inefficient and rarely feasible in healthcare big data. Online learning, a subfield of ML, allows incremental updates of a model’s parameters as new data arrives, making it well suited to dynamic, fast-paced information [43,44]. The integration of online learning algorithms with real-time big data processing systems, such as Apache Kafka, can greatly improve system performance where fault tolerance and scalability are essential [44,45].
Due to its single-pass learning capability, online learning is not only an ideal solution for high-velocity data streams, but also a suitable choice for the dynamic medical domain, which continually changes with the progression of medical knowledge and technology and the emergence of new diseases [18,23]. Under single-pass learning, each data point is processed only once, making it efficient for applications involving high-velocity data streams [46].
However, online learning faces three major challenges in healthcare big data settings. First, extensive data preprocessing, such as data cleaning and restructuring, is required. Second, online learning is ineffective in scenarios where retraining is required [22,46]. Third, without retaining historical data, online learning can be highly susceptible to concept drift and gradual performance degradation over time; this lack of long-term memory further challenges the reliability of online learning on healthcare’s continuously evolving medical data [43,46]. In contrast, incremental learning models retain historical data, thereby improving model stability and adaptability over time. Incremental learning algorithms evolve gradually over time as new data arrives, reducing the risk of performance loss in comparison to online learning [47]. However, this advantage comes at the cost of increased memory and computational resource utilization [47]. Transitioning from online to incremental learning in healthcare big data entails increased demand for computational resources; therefore, understanding the trade-off between the two models is essential to ensure sustainable implementation in healthcare big data systems [43,48].
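A minimal sketch of the online-learning pattern discussed above uses scikit-learn's `partial_fit` interface with a stochastic-gradient-descent classifier: each mini-batch is seen once and then discarded, mirroring single-pass learning on a stream. The synthetic batch source here is a stand-in for a real feed such as a Kafka consumer.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")  # logistic regression trained by SGD
classes = np.array([0, 1])              # all labels must be declared up front

# Each iteration mimics a mini-batch arriving from a data stream;
# partial_fit updates the weights without revisiting past batches.
for _ in range(100):
    X_batch = rng.normal(size=(64, 10))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(256, 10))
y_test = (X_test[:, 0] > 0).astype(int)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```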

3. Materials and Methods

3.1. Survey Methodology

The growing integration of ML in healthcare big data analytics has led to a notable surge in research publications in this domain. Several review articles have examined the significant growth and proliferation of publications in the domains of ML, artificial intelligence, and big data analytics within the context of healthcare big data [49,50,51,52]. These studies provide an analytical perspective on how ML-driven healthcare research expanded between 2020 and 2025 and validate the increase in publications in this domain. This study provides a comprehensive review of ML- and AI-driven solutions for healthcare big data analytics, with particular emphasis on challenges related to scalability, velocity, and volume. The insights drawn from this study will guide future investigations into emerging ML methodologies, algorithmic advancements, and efficient real-world healthcare applications. This study follows structured review guidelines to ensure a comprehensive, transparent, and unbiased evaluation of the existing literature.

3.2. Literature Selection and Sources

To ensure relevance, credibility, and academic rigor, a structured search strategy was conducted across multiple high-impact databases, including IEEE Xplore, PubMed, Springer, Elsevier (ScienceDirect), ACM Digital Library, and Google Scholar. The search focused on articles published within the last four years, with particular emphasis on ML and healthcare big data. Additionally, the search was conducted using the following search terms: “healthcare and big data”, “recent trends and advancements in healthcare big data”, “healthcare big data applications challenges and issues”, “challenges and issues related to machine learning and healthcare big data applications”, “review of machine learning challenges in healthcare big data”, “efficient computational techniques for machine learning in big data”, “fast classifications for high velocity healthcare big data”, “hardware accelerated machine learning algorithms for healthcare big data”, “review of big data analytics framework in healthcare big data”, “review of big data machine learning for healthcare”, “efficient clustering and scalability techniques for machine learning in healthcare big data”, and “enhanced machine learning using big data frameworks for healthcare big data applications”. All book chapters, correspondence, letters, short communications, proceedings, and workshops were excluded.

3.3. Inclusion and Exclusion Criteria

In this review, studies were included or excluded based on the following criteria:

3.3.1. Inclusion Criteria

  • Peer-reviewed articles discussing AI/ML applications in healthcare big data analytics.
  • Studies that investigated distributed computing solutions (e.g., Apache Spark, federated learning, and neuromorphic computing) for scalable model training.
  • Research with a focus on improving machine learning model velocity, scalability, model/data parallelism, and efficient utilization of resources and power.
  • Research addressing challenges related to model velocity, volume, and optimization in large-scale healthcare big data systems.

3.3.2. Exclusion Criteria

  • Disease-specific reviews and learning models’ evaluations.
  • Studies focused on general AI and machine learning applications without healthcare context.
  • Articles with limited technical/theoretical contributions.
  • Articles that relied on outdated methods or lacked performance validation in real-world healthcare environments.
  • Articles lacking a comprehensive assessment sufficient to ensure the replicability of the proposed method.
  • Articles with limited evaluation, such as no reported runtime or scalability assessment.

3.4. Categorization and Thematic Analysis

The selected literature was categorized into three areas aligned with the research themes identified during the literature review. The first category includes research focused on the use of efficient techniques and classification methods to address the volume and velocity challenges associated with healthcare big data. Advanced and accelerated hardware-based learning methods and big data framework-based methods are examined in detail in the second and third categories, respectively.

4. Literature Review

The literature review examined the body of research encompassing healthcare big data, with particular emphasis on tackling the challenges facing ML with respect to data volume and velocity. The aim of this review is to provide a comprehensive overview of the evolving landscape of healthcare big data analysis and the innovative approaches and methodologies that have emerged in ML to address the intricate issues linked to the volume and velocity of healthcare data. The review is divided into three categorical sections, as follows:

4.1. Efficient Techniques, Arithmetic Operations, and Improved Dimensionality Reduction

The rapid growth of big data applications within healthcare necessitates sophisticated techniques for the effective and efficient scaling of machine learning models. Improvements in power consumption and resource utilization can substantially enhance the sustainability of machine learning models in healthcare applications. Moreover, efficient techniques, such as optimized arithmetic operations and dimensionality reduction, can significantly reduce models’ power consumption, making them feasible for long-term use in the healthcare domain. Furthermore, power-aware enhancements are particularly beneficial for resource-constrained medical devices, such as those used in monitoring and caring for diabetic patients [52,53,54]. This subsection examines the articles aligned with these objectives.
Le et al. developed a scalable tree-based automated machine learning model focused on biomedical analytics, specifically addressing the bioinformatics issues of big data [55]. The Tree-based Pipeline Optimization Tool (TPOT) was created using strongly typed Genetic Programming (GP) to automate and improve the data analysis pipeline. Model analysis was performed to identify features associated with depression severity using available RNA-Seq expression datasets of 19,968 annotated protein-coding genes for depression disorder. The analysis examined and proposed two models, TPOT and TPOT-FSS, the latter including a Feature Set Selector (FSS) aimed at reducing computational cost in big data analysis. The evaluation results showed that the standard TPOT model required 18.5 h under high-performance computing conditions (256 GB RAM and a 28-core Intel Xeon 2.60 GHz CPU) compared to 65 min for the TPOT-FSS model, which completed the same tasks 17 times faster. In addition, each simulation of the standard TPOT model on RNA-Seq data took 13 h on average compared to 40 min with TPOT-FSS, a speed increase of almost 20 times [55].
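For readers unfamiliar with TPOT, the following minimal sketch shows the standard `TPOTClassifier` workflow on a small public dataset. The search budget and dataset are illustrative assumptions only; they do not reproduce the study's RNA-Seq analysis, and the Feature Set Selector of TPOT-FSS is a specialized operator not shown here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Small search budget for illustration; the reviewed study used far
# larger populations and long-running HPC jobs.
tpot = TPOTClassifier(generations=5, population_size=20,
                      random_state=42, verbosity=2)
tpot.fit(X_tr, y_tr)                  # GP search over full pipelines
print(tpot.score(X_te, y_te))
tpot.export("best_pipeline.py")       # emits the winning pipeline as code
```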
Zheng et al. [56] demonstrated that a specific class of learning algorithms, the Extreme Learning Machine (ELM), is particularly efficient for classifying data streams. ELM-based and ELM-ensemble-based frameworks outperform other models in speed due to their high efficiency, universal approximation capability, and generalization. These models are essential for making prompt and accurate medical decisions, such as matching donor organs to recipients. Sangeetha et al. [57] developed a prediction model for analyzing the medical history of recipients and the smart contracts of prospective donors using a hybrid extreme learning machine modified CNN (HEL-MCNN). This model facilitates optimal matching decisions for donor-recipient pairs, thus minimizing the waiting time for organ donations. Two factors predominantly contributed to the high processing speed: the Prairie Dog Optimization Algorithm for feature selection and the use of a more efficient ELM classifier. The ELM classifier achieved a substantial reduction in processing time while preserving robust generalization, owing to its low number of parameterized neural nodes and less expensive computational training. The proposed model was tested against three real-time dataset streams covering liver, heart, and lung transplant donors and recipients. The datasets comprised 100,000 medical records and 80 attributes. The accuracy of the MCNN-HELM model was reported at 97.5%, achieved in a computation time of only 2.2 s, indicating much greater efficiency in comparison with other techniques.
In a study by Lahoura et al. [58], a cloud-based system for monitoring breast cancer patients was developed using an extreme learning machine. The proposed approach is provided as Software as a Service (SaaS) designed to supervise patients, especially the elderly and disabled in remote areas, in settings with limited healthcare infrastructure. Lahoura explained that, in an ELM, the weights of hidden nodes can be set randomly and never change. Furthermore, the ELM model performs noticeably faster than networks trained with backpropagation, as evidenced by training a 100-hidden-node ELM on the Wisconsin Breast Cancer Diagnosis (WBCD) dataset. The ELM model outperformed all other models, with an accuracy of 96%, while AdaBoost, KNN, Naïve Bayes, Perceptron, and SVM achieved 92%, 90%, 84%, 83%, and 92%, respectively. In addition, the proposed cloud-based SaaS ELM achieved better results than a stand-alone ELM environment with 36 vCPUs and more than 60 GB of RAM, yielding an 18% reduction in execution time.
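The speed of ELMs comes from exactly the property Lahoura describes: the hidden-layer weights are drawn randomly and frozen, so training reduces to a single least-squares solve for the output weights instead of iterative backpropagation. The following NumPy sketch implements that basic idea on toy data; it is a minimal illustration of the technique, not the study's cloud-based system.

```python
import numpy as np

def elm_train(X, y, n_hidden=100, seed=0):
    """Train a basic ELM: random, fixed hidden layer + least-squares output."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random weights, never updated
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)                       # hidden activations
    beta = np.linalg.pinv(H) @ y                 # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy binary task: labels encoded as +/-1, predictions thresholded at 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 30))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=500))
W, b, beta = elm_train(X, y)
acc = np.mean(np.sign(elm_predict(X, W, b, beta)) == y)
print(f"training accuracy: {acc:.2f}")
```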
Healthcare systems faced many challenging issues during the COVID-19 pandemic. First, early detection was critical to limit the spread of the virus in the population; this would decrease infection rates and lead to significant cost benefits in terms of reducing the number of testing kits as well as healthcare personnel. Second, a lack of confidentiality and trust hindered the sharing of patients’ medical data between centers. Malik et al. [59] proposed a new framework called “Stream integration of Multi-source Data for COVID-19 patient Care”. The framework is a three-tier architecture wherein each tier is responsible for one function. The first tier is responsible for data cleansing and normalization. The second tier is a multi-model training method using blockchain and federated learning for anonymity. The last tier is a Capsule Network (CapsNet) based on an Incremental Extreme Learning Machine (IELM) for the classification of COVID-19 patients [4]. CT scan images are processed, features are extracted, and the results are sent to the IELM for training and classification. The IELM performed well in terms of training and classification speed, achieving 98.99% accuracy. The IELM neural network does not need a predefined learning rule; instead, it autonomously adjusts and tunes its weights, biases, and activation function to improve the model’s accuracy and reduce training errors.
Rajendran et al. presented an efficient big data classification model that integrates Chaotic Pigeon Inspired Optimization (CPIO) for feature selection with a Deep Belief Network (DBN) classifier whose hyperparameters are optimized using the Harris Hawks Optimization (HHO) algorithm [60]. The model was implemented in Hadoop’s MapReduce programming framework to enable the processing of larger datasets and was validated on two datasets: the Epsilon dataset (500,000 instances and 2000 features) and the enormously scaled ECBDL14-ROS dataset (67 million instances and 631 features). The experimental results confirmed the applicability of the proposed method to classifying large datasets. They further demonstrated that feature selection by CPIO enhanced classification performance, and the HHO-optimized Deep Belief Network yielded a minimum runtime of only 160.90 s, outperforming the much longer runtimes of other classifiers such as Support Vector Machine (SVMC), Logistic Regression (LRC), and Naive Bayes. The CPIO-based feature selection in this work achieves a reasonable trade-off between accuracy and computational efficiency and is, therefore, a suitable option for complex big data classification challenges [60].
In the work of Goswami et al., the authors sought to improve deep machine learning model performance through quantization and data bit reduction, reducing processing time without affecting model performance [61]. Quantization is the reduction in the number of bits used to store each data value by rounding or truncating. The study proposed a novel quantile transformer that learns the data distribution, transforms it, and then applies quantization from 64-bit to 32-bit through the astype function. For the heart disease dataset, KNN achieved 87.38% accuracy in 0.0029 s, which dropped to 86.41% (0.0023 s) with float32 and 84.47% (0.0014 s) with int32. SVM started at 89.47% (0.0165 s), which reduced to 88.5% (0.0055 s) with float32 and 80.12% (0.0032 s) with int32. For the breast cancer dataset, KNN recorded 96.49% (0.0258 s), with float32 adjusting to 95.18% (0.0142 s) and int32 dropping to 67.74% (0.0025 s). SVM started at 95.52% (0.0022 s), with float32 maintaining accuracy at 95.52% (0.0022 s) while int32 dropped to 62.38% (0.0013 s) [61].
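The bit-reduction step itself is straightforward, as the sketch below illustrates: casting a dataset with NumPy's `astype` halves the storage per value and typically shortens distance computations, at a possible cost in accuracy (the study's int32 results show how severe that cost can become for continuous features). This is a minimal illustration of the casting mechanism on a public dataset, not the authors' quantile transformer.

```python
import time
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)  # features stored as float64

for dtype in (np.float64, np.float32):
    Xq = X.astype(dtype)                    # bit-width reduction via astype
    clf = KNeighborsClassifier().fit(Xq, y)
    t0 = time.perf_counter()
    acc = clf.score(Xq, y)                  # time the distance-heavy scoring
    dt = time.perf_counter() - t0
    print(f"{dtype.__name__}: accuracy={acc:.4f}, scoring time={dt:.4f}s")
```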
Sharada et al. designed an adaptive classifier to automate arrhythmia diagnosis [62]. The study adopted several techniques, such as abnormality classification using profile curve contouring, ECG signal preprocessing and feature extraction, and QRS beat detection. The research considered several mathematical methods to improve ECG signal clarity through noise suppression before classification. The Shifting Window Mean Technique reduces high-frequency noise by averaging signal values within a dynamically adjusted window around R-peak proximity. The Fast Fourier Transform (FFT) is then utilized to shift signals to the frequency domain, attenuate low-frequency noise such as baseline wander and motion artifacts, and restore the signal to the time domain. Such preprocessing techniques enhance the signals before classification algorithms such as gated recurrent units (GRUs) are applied, thereby increasing the dependability and accuracy of arrhythmia diagnosis. The study used two datasets for testing: the Physionet database and the Kaggle datasets. The Physionet database has 12,378 signals with 202 signal values and 63 inputs used, while the Kaggle datasets have 3694 signals with 789 signal values and 24 inputs used. Both datasets used for ECG signal analysis were divided into 80% training and 20% testing. The proposed method achieved a notable processing time of 3694 milliseconds, demonstrating its efficiency for large-scale healthcare datasets [62].
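To illustrate the FFT-based denoising step, the sketch below removes a synthetic low-frequency baseline-wander component from a toy signal by zeroing the corresponding frequency bins and transforming back to the time domain. The cutoff frequency and sampling rate are illustrative assumptions, not parameters taken from the study.

```python
import numpy as np

def fft_highpass(signal, fs, cutoff_hz=0.5):
    """Suppress low-frequency drift (e.g., baseline wander) via the FFT."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[freqs < cutoff_hz] = 0               # zero out the drift band
    return np.fft.irfft(spectrum, n=len(signal))  # back to the time domain

fs = 360                                      # typical ECG sampling rate (Hz)
t = np.arange(10 * fs) / fs
ecg_like = np.sin(2 * np.pi * 8 * t)          # stand-in for QRS-band content
drift = 0.8 * np.sin(2 * np.pi * 0.2 * t)     # synthetic baseline wander
cleaned = fft_highpass(ecg_like + drift, fs)
print(np.allclose(cleaned, ecg_like, atol=0.1))  # drift largely removed
```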
Chronic kidney disease (CKD) is one of the leading diseases worldwide today. Although proactive steps have been taken in the field of CKD, the cost implications, along with the increasing incidence of CKD, continue to diminish overall life expectancy. Rahman et al. [63] studied eight ensemble learning methods for the diagnosis of chronic kidney disease using medical datasets. To improve disease classification accuracy, the study utilized the MICE imputation method to address missing values and proposed the Borderline-SMOTE technique to minimize the imbalance of patient data. The results demonstrated that the proposed method achieved a relatively high accuracy of 99.75%, 4.65% higher than the results of other studies. Adaptive boosting, along with GBDT and bagging ensembles, were the slowest, with average runtimes of 3.12, 2.67, and 2.34 s, respectively. On the other hand, XGBoost and LightGBM performed best for extensive medical applications, given their speeds of 0.53 s and 0.10 s, respectively [63].
Narwane et al. proposed a solution to class imbalance in medical datasets in the context of healthcare big data [64]. Given the sensitivity and complexity of the healthcare domain, addressing class imbalance in machine learning within a reasonable time is difficult given the large-scale nature of the data. This study, in particular, sought to determine how socioeconomic differences affect the healthcare data of diabetic patients, with the aim of improving predictions for disease diagnosis [64]. The first step in this process was to reduce the dimensions of the imbalanced dataset using Principal Component Analysis (PCA). Then, SVM, LR, and ANN were applied to the reduced imbalanced dataset, and their performances were assessed. Following the result analysis and classifier selection, the CDR-KNN method was applied to balance the data. A prediction accuracy gain was observed on the balanced datasets, along with a reduction in training time due to PCA dimensionality reduction [64].
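The general PCA-before-classifier pattern used in this study can be sketched in a few lines of scikit-learn. The pipeline below is a generic illustration with logistic regression on a public dataset; the component count is an assumption, and the study's CDR-KNN balancing step is omitted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Project onto the top principal components before classification;
# fewer inputs means cheaper training, at a possible cost in accuracy.
pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=10),
                     LogisticRegression(max_iter=1000))
print(f"5-fold CV accuracy: {cross_val_score(pipe, X, y, cv=5).mean():.3f}")
```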
Kumar et al. investigated the classification of COVID-19 chest X-ray images using a tuned IELM and PCA. The system accomplished an accuracy of 98.11% on a dataset of 13,808 images [65], similar to Malik’s work in [59]. The approach was designed to increase the classification velocity of already stored medical data rather than streaming data. The hidden node parameters in the IELM are determined from the information returned by PCA during training. In comparative evaluation, the IELM achieved the highest accuracy (96%), while AdaBoost, KNN, Naïve Bayes, Perceptron, and SVM achieved 92%, 90%, 84%, 83%, and 92%, respectively [65].

4.2. Advanced and Specialized Processing Hardware

Healthcare analytics systems face increasing difficulties in keeping pace with the rapid growth and complexity of healthcare data. Specialized parallel processors, such as Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), and Graphics Processing Units (GPUs), are instrumental in improving data processing throughput and computational efficiency. A representative example is the use of GPUs in deep learning-based medical imaging prediction and analytics, where their capacity for fast matrix computations significantly accelerates model training and inference. In contrast, the parallel processing nature of an FPGA enables real-time processing of data streams with high power efficiency. The implementation of specialized hardware-based machine learning frameworks enables high-speed processing of large-scale medical data specific to the healthcare domain [66,67,68].
Sharma et al. proposed a system for early detection and diagnosis of skin disease using an optimized Convolutional Neural Network (CNN) architecture [69]. To increase classification speed, the authors modified EfficientNet B3 by substituting the Mobile Inverted Bottleneck with Depthwise Separable Convolution, thereby reducing the number of trainable parameters and lowering the computational load. The proposed EffSVMNet model is based on a CNN architecture but incorporates an SVM classifier to enhance feature extraction and improve model decisions. The model was trained on the DermNet dataset, comprising 4285 dermatological images covering acne, atopic dermatitis, bullous disease, and eczema. The input images were resized to 300 × 300 × 3 prior to feature extraction. The evaluation of EffSVMNet on a system with 12 GB RAM and a 15 GB GPU reached 87% classification accuracy with an efficient training time of 90 min [69].
Sakthivel et al. presented an effective hardware architecture based on ensemble deep learning for COVID-19 detection using chest X-ray (CXR) images [70]. In contrast to typical single-model methods, the proposed ensemble is composed of the top five performing deep learning models—IRCNN, MobileNet, ResNet, EfficientNet, and FitNet. The architecture utilizes reconfigurable ASIC hardware designed with embedded parallel processing pipelines to enhance processing speed and efficiency and reduce overall latency. The proposed hardware design, developed using TSMC 90 nm technology, features data-aware processing elements (PEs) that significantly reduce computational effort and power consumption [70]. The model was trained on the COVID-19 Radiography Database, composed of 3616 images. With a 99% accuracy score, the proposed model outperformed the 93% accuracy previously established in other studies. The results also show a 40% reduction in latency and required processing clock cycles, demonstrating how hardware-accelerated learning can significantly impact healthcare big data applications, particularly in scenarios that require real-time decision-making [70].
In another study, Cheng et al. designed a deep learning 1D U-Net model for pixel-wise classification of ECG signals using an optimized hardware framework [71]. To reduce the computational cost, a two-stage pipelined Winograd convolution structure was used. This approach not only reduced the number of multiplications by a factor of three but also enhanced throughput. Additionally, a 3D Processing Element (PE) array was introduced to improve memory data access. The model achieved a classification accuracy of 95.55%—when evaluated on the MIT-BIH Arrhythmia dataset—for five types of ECG beats on the Xilinx Zynq ZC706 board, with an inference time of 383.89 microseconds and 76,778 clock cycles [71]. The hardware achieved a resource efficiency of 8.27 GOPS per kLUT while demonstrating a computation efficiency of 123% at a clock rate of 200 MHz [71]. Although power efficiency metrics were not explicitly reported, the architecture proposed in this study is well suited to portable healthcare devices requiring real-time ECG classification.
The performance of deep learning models is heavily dependent on the quality and quantity of training images, and increasing the number of training samples dramatically raises the computational cost. Notably, less than 10% of deep learning studies in radiology utilize more than 10,000 images in training [72]. To address this challenge, Draelos et al. implemented parallel pipeline streams to process DICOM files and transform different slices of CT scans into a single 3D array using the PyTorch framework [73]. Moreover, lossless zip compression was implemented to mitigate storage space issues, reducing the overall size of the dataset from 9.2 terabytes to 2.8 terabytes. Two NVIDIA Titan XP GPUs (each equipped with 11.9 GiB of memory) were used in parallel to train the CT-Net neural model. The study achieved an AUC score of 0.90 and a training time of 15 days, a performance superior to other deep learning approaches in radiology [73].
Aruna et al. proposed an FPGA-based Deep Convolutional Neural Network (DCNN) for ECG classification aimed at early detection of cardiac arrhythmias, the leading cause of cardiovascular diseases (CVDs) [74]. To effectively facilitate the analysis of non-stationary ECG signals, the study employed signal preprocessing using the Error Normalized Least Mean Square (ENLMS) algorithm and feature extraction based on the Discrete Wavelet Transform (DWT). The one-dimensional DCNN comprised three convolutional layers, a pooling layer, and two fully connected layers implemented on an FPGA to reduce computational overhead and increase efficiency. Performance evaluation on the MIT-BIH and PTB databases showed classification accuracies of 98.6% and 99.67%, respectively, with a processing time of 15 s—surpassing the accuracy of a multilayer perceptron (MLP) by 0.304% and decision-based classifiers by 0.47% [74].
Yacoub et al. presented a reconfigurable hardware architecture for the K-Nearest Neighbor (KNN) algorithm to address limitations associated with traditional hardware accelerators [75]. Implemented on a Xilinx Genesys 2 FPGA, the design includes Distributed RAM blocks for improved data access and reduced Look-Up Table (LUT) usage. Yacoub et al. proposed a control unit to allow post-engineering modification of parameters such as the K-value, the distance metric, and the FSM value that manages data flow. Supporting both fixed-point and single-precision floating-point formats, the experimental results demonstrated over 90% classification accuracy across several evaluated datasets. In comparison to earlier designs operating at 24 watts and requiring over 135 K LUTs and 207 K FFs at 131 MHz, the proposed floating-point architecture required 27.269 K LUTs and 20.112 K FFs operating at only 40 MHz, with a significantly reduced power consumption of only 0.312 watts. The second proposed architecture (fixed-point) required 22.867 K LUTs and 20.083 K FFs, operating at 109.12 MHz with 0.359 watts. Against another design consuming 9.1 watts with over 120 K LUTs and 88 K FFs, the proposed architecture achieved a 37% reduction in power consumption [75].

4.3. Clustering and Parallel Processing Methods Frameworks

The increasing size and intricacy of healthcare big data also necessitate effective data processing frameworks that can accommodate machine learning velocity and volume challenges. Clustering approaches, along with parallel processing frameworks like Apache Spark and Hadoop, are critical to enhancing the speed and scale of data-centric machine learning systems. Distributing the processing effort across a network of nodes over carefully partitioned data can significantly reduce total processing time. For instance, Apache Spark’s iterative in-memory data processing can substantially accelerate real-time healthcare analytics. Similarly, the MapReduce architecture of the Hadoop framework can facilitate the analysis of large volumes of medical data through efficient batch processing. Through big data parallel computational frameworks, machine learning models can be scaled effectively, allowing for accelerated data processing and fault-tolerant implementations [76,77].
Generally, the three critical challenges facing healthcare data systems relate to data collection, data storage and management, and data analysis. Abdel-Fattah et al. investigated chronic kidney disease prediction using a hybridization of machine learning and big data processing frameworks, specifically the Apache Spark framework [78]. In a comparison of several machine learning models from Spark’s machine learning library (MLlib) combined with distributed feature selection, the Apache Spark framework significantly increased processing speed, and the feature selection technique contributed to improved classification accuracy for CKD [78].
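A minimal PySpark sketch of the MLlib workflow this category relies on is shown below: features are assembled into a vector column, and a distributed classifier is trained and evaluated. The file name, column names, and model choice are hypothetical stand-ins rather than the configuration used by Abdel-Fattah et al.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("ckd-demo").getOrCreate()

# Hypothetical CSV of CKD records with numeric columns and a 0/1 "label".
df = spark.read.csv("ckd.csv", header=True, inferSchema=True)
features = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=features,
                            outputCol="features").transform(df)

train, test = assembled.randomSplit([0.8, 0.2], seed=42)
model = RandomForestClassifier(labelCol="label").fit(train)  # distributed fit
correct = model.transform(test).where("prediction = label").count()
print(f"test accuracy: {correct / test.count():.3f}")
spark.stop()
```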
The early detection and diagnosis of lung cancer rely heavily on effective denoising and segmentation of diseased tissues in Positron Emission Tomography (PET) scans, as well as on the classification accuracy of the applied learning model. Although classification accuracy varies across diseases depending on the selected model, activation filters, and the inherent quality and characteristics of PET scans, the segmentation process remains particularly time-consuming and laborious. To address this challenge, Guan et al. proposed an enhanced framework designed to automate disease detection within a big data analytics environment [79]. The proposed framework addresses several limitations associated with improving PET processing efficiency and reducing workload in big data settings through the use of an improved differential activation filter. Additionally, it enhances PET segmentation performance through a novel Density Peak Clustering (DPC) method. With a classification accuracy of 93.5% and a 2.3% denoising improvement, the proposed framework reduces the overall time cost by 77.29%—from 17.75 s to 4.03 s [79].
Given the enormous volume of data generated daily in healthcare systems for patients with cardiovascular conditions, the early detection and continuous monitoring of patients by machine learning algorithms are likewise challenging. Sukanya et al. developed a Parallel Semi-Naive Bayes with Improved MapReduce (PSNB:IMR) model for early detection of heart disease in a healthcare big data setting. Improved MapReduce (IMR) parallel feature selection was proposed to identify relevant features and reduce data dimensionality. The training dataset was partitioned into ten subsets and processed in parallel, and the top K-ranked features were then selected for classification by the PSNB algorithm. Experimental results showed a substantial reduction in processing time, from 780 s and 300 s for Semi-Naïve Bayes and parallel K-means, respectively, to only 130 s using the PSNB:IMR model [80].
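The partition-then-rank pattern underlying this style of parallel feature selection can be sketched with standard Python tooling: score features independently on each data subset in parallel, aggregate the partial scores, and keep the top K. The sketch below uses mutual information as the scoring function on synthetic data; it illustrates the general pattern only and is not the PSNB:IMR algorithm itself.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.feature_selection import mutual_info_classif

def rank_partition(args):
    """Score every feature against the label on one data partition."""
    X_part, y_part = args
    return mutual_info_classif(X_part, y_part, random_state=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 50))
    y = (X[:, 3] + X[:, 7] > 0).astype(int)  # features 3 and 7 are informative

    # Split into ten subsets, score them in parallel, average the partial
    # scores, and keep the top-K features for the downstream classifier.
    parts = list(zip(np.array_split(X, 10), np.array_split(y, 10)))
    with ProcessPoolExecutor() as ex:
        scores = np.mean(list(ex.map(rank_partition, parts)), axis=0)
    top_k = np.argsort(scores)[::-1][:10]
    print("top features:", sorted(top_k.tolist()))
```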
Feature selection applications in big data analytics play a pivotal role not only in improving classification performance, but also in minimizing the computational overhead of learning algorithms. Identifying relevant features reduces dataset dimensionality and leads to faster training, reduced resource consumption, and improved model interpretability and generalization [81]. In a study by Xing and Bie, the researchers investigated the influence of feature selection algorithms on KNN performance in healthcare big data settings [82]. Xing and Bie focused on addressing the implementation limitations of KNN reported in prior studies in the medical domain and proposed an improved KNN with cluster denoising and density cropping techniques to overcome the limitations of processing large-scale data with the KNN algorithm. The method was validated across ten distinct datasets, on which the overall computational cost exceeded that of the traditional KNN classifier by approximately 20% [82].
Breast cancer remains one of the most pressing global health challenges, predominantly affecting women. However, early diagnosis and effective risk evaluation can greatly increase the survival rate and improve treatment outcomes. The integration of ensemble learning and big data has proven to significantly improve prediction accuracy. Jaiswal et al. proposed an Improved XGBoost Ensemble (I-XGBoost) technique for the diagnosis of breast cancer in a healthcare big data setting [83]. In this context, Apache Spark’s Python API was used to streamline preprocessing of the Wisconsin Breast Cancer dataset, implement parallel feature extraction, and develop the proposed classification model. The I-XGBoost model reduced classification time significantly and achieved 99.84% classification accuracy, outperforming the traditional methods of Decision Trees, Random Forests, Naive Bayes, KNN, SVM, and AdaBoost [83].

5. Discussion and Analysis

The comprehensive literature review conducted in this study shows that the field of healthcare big data analytics is rapidly growing. Despite the growing challenges related to volume and velocity in machine learning and healthcare big data, the reviewed studies demonstrate significant advancements in efficient arithmetic operations, specialized hardware acceleration, and the utilization of big data parallel processing frameworks. Although each category has distinct trade-offs between advantages and disadvantages, application efficiency remains largely dependent on the requirements specific to individual healthcare systems. Understanding the strengths and weaknesses of each technique enables the identification of suitable methods for healthcare big data analytics. Comparing the effectiveness of the three categories reveals that specialized hardware—such as parallel-GPU and FPGA-based architectures—delivers the greatest improvements in real-time performance, making it highly suitable for low-latency medical applications. Nevertheless, hardware-based solutions are often costly and require technical expertise. Parallel computing frameworks such as Apache Spark offer highly scalable and cost-effective solutions, making them ideal for the challenges machine learning algorithms face in processing large volumes of medical data. Lastly, techniques such as quantization and feature selection (e.g., TPOT-FSS and PCA) demonstrated promising performance enhancements for real-time data stream processing, although their impact remains limited in comparison to specialized hardware-accelerated methods. The next subsections present the findings for each category, along with their identified limitations.

5.1. Role Analysis of Efficient Computations and Techniques in Healthcare Big Data

The efficiency of ELM frameworks for medical data stream processing in real-time decision-making tasks, such as organ transplant matching, is validated by Zheng et al. and Sangeetha et al. These works, together with the other efforts discussed, demonstrate the effectiveness of ELM-based frameworks for high-velocity medical data stream processing, especially in real-time decision-making [56,57]. Recently, quantization approaches proposed by Goswami et al. have helped minimize computational overhead with limited accuracy loss [61]. For example, their method achieved roughly a 50% reduction in computational time, with SVM inference time decreasing from 0.0165 s to 0.0032 s at a cost in accuracy of approximately 5–10%. In addition, the scalable TPOT-FSS model developed by Le et al. achieved a roughly 17-fold reduction in execution time, from 18.5 h with standard TPOT to only 65 min, without sacrificing quality in biomedical analytics [55]. Zheng et al. also conducted comparative studies confirming the high classification performance of ELM-based frameworks relative to traditional learning algorithms such as SVM and decision trees [56,57]. While such progress in training and classification speed is promising, the challenges posed by large-scale healthcare datasets remain open.
The reviewed literature also shows a trend toward enhancing disease-specific learning models, such as those for breast cancer, kidney disease, or COVID-19. This raises uncertainty regarding model performance when applied to other disease datasets. Moreover, single-swarm optimization algorithms, such as the Prairie Dog Optimization Algorithm [57], are effective in their respective domains but inefficient for large-scale optimization problems. Computational complexity and energy efficiency are overlooked in the literature, particularly in relation to healthcare big data and large-scale optimization problems. Novel optimization algorithms, such as multi-swarm optimization algorithms, that can ensure scalability, an exploration-exploitation trade-off, and performance efficiency should be considered.
Although the reviewed techniques show a significant reduction in training duration while maintaining optimal classification accuracy, they fail to address concerns related to the scalability of the proposed methods, particularly in relation to big data healthcare systems. Additionally, some of the proposed approaches were evaluated using small medical datasets, in terms of both the number of samples and the feature dimension. Small sample sizes raise concerns about method effectiveness and reliability when applied to real-world, large-scale healthcare scenarios. The literature review also revealed that the computational resources allocated for assessment and evaluation were unreasonably high and unjustified even for modestly sized datasets. This imbalance defeats the purpose of applying efficient methods to the analysis of big data in healthcare. The level of resource expenditure required to achieve these improvements is difficult to sustain when dealing with large healthcare datasets, which suggests limited feasibility for big data medical applications.
To systematically assess these methodological limitations, the reviewed studies were analyzed based on three critical evaluation criteria:
  • Scalability: refers to the capacity of the proposed approach to maintain performance when applied to large-scale healthcare big data environments.
  • Applicability to Healthcare Big Data: evaluates the potential for the approach to be adopted across diverse medical applications, particularly those involving high-volume and high-velocity data streams.
  • Computational Resource Efficiency: assesses the computational cost and hardware requirements associated with the proposed methods, emphasizing whether the techniques truly align with the pursuit of computational efficiency in healthcare big data settings.
The limitations associated with each reviewed study under this category, along with their evaluations against these three criteria, are further detailed in Table 2.

5.2. Role Analysis of Specialized Hardware in Healthcare Big Data

The growing sophistication and volume of healthcare datasets require the deployment of specialized computation acceleration hardware such as GPUs, FPGAs, and ASICs. Hardware-based improvements have been reported to reduce latency while improving energy efficiency and processing speed. For instance, Sakthivel et al. developed a hardware-based ensemble deep-learning model for COVID-19 classification, improving real-time diagnosis by 40% in terms of processing latency and clock cycles [70]. Similarly, Cheng et al. employed Winograd convolution optimization for ECG classification, reducing the number of required multiplications by 30%; the model attained 95.55% accuracy with a runtime of 383.89 microseconds, far surpassing standard CNN architectures [71]. Draelos et al. extended this paradigm further with a multi-GPU CT-Net model, whose parallel pipelined data streams reduced deep-learning training time from 15 days to only a few hours while attaining an AUC of 0.90, an improvement over conventional architectures [73].
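The multiplication savings behind Winograd convolution can be illustrated with its smallest 1D instance, F(2,3), which produces two outputs of a 3-tap filter using 4 data-dependent multiplications instead of the 6 required by direct convolution. The sketch below is a toy illustration of that principle only, not the 2D hardware design used by Cheng et al. [71]:

```python
# Winograd F(2,3): two 1D convolution outputs from a 3-tap filter with
# 4 multiplications (m1..m4) instead of 6 for the direct method.
import numpy as np

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs, 4 multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 0.25])
# Agrees with direct correlation of d against the filter g.
assert np.allclose(winograd_f23(d, g), np.convolve(d, g[::-1], mode="valid"))
```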
As power-efficient devices for real-time medical tasks, FPGAs and ASICs offer a significant advantage over traditional GPUs; however, their medical applications are challenged by high development costs and the need for specialized technical skills. Aruna et al. proposed a Deep Convolutional Neural Network based on an FPGA for ECG classification. The system achieved an accuracy of 98.6% with a power consumption of 0.45 mW, proving more energy efficient than CPU-based implementations [74]. This energy efficiency demonstrates the usefulness of FPGA-based architectures, particularly in resource-limited medical devices such as wearable health monitors and diagnostic systems. GPUs, in contrast, remain preferable in high-performance cloud environments because of their scalable efficiency.
One primary concern arising with the adoption of hardware-accelerated learning is model portability: hardware accelerators often optimize learning models for a specific hardware architecture and vendor, which significantly limits portability across diverse big data platforms. Cost is another major challenge limiting the adoption of ASIC and FPGA-based solutions in medical applications. Despite their outstanding energy-efficiency-to-performance ratio, the highly specialized technical and programming skills they demand hinder their adoption in clinical settings, in contrast to GPUs.
While hardware acceleration techniques have improved processing efficiency, challenges related to scalability, cost, and accessibility remain essential for their application and effective integration in healthcare big data medical applications.
To rigorously evaluate these constraints, the studies under review were examined according to the following three fundamental assessment criteria; the resulting evaluations are presented in Table 3.
  • Scalability: measures whether the proposed hardware-based approach is efficiently scalable in large-scale healthcare big data environments.
  • Cost: measures the financial expenses associated with the deployment and maintenance of hardware-accelerated learning models in big data settings.
  • Technical Difficulties: pertains to the complexity of implementing, configuring, scaling, and integrating the proposed approach in a healthcare big data setting.

5.3. Role Analysis of Clustering and Parallel Processing in Healthcare Big Data

The growth in the number of patient records, medical images, and real-time monitoring data greatly impacts the healthcare sector, and big data analytics continues to face significant challenges related mainly to machine learning scalability. In an attempt to reduce the computation bottleneck, a few research articles considered parallel processing techniques using Apache Spark, Hadoop, and MapReduce to distribute and optimize machine learning workflows. For instance, Abdel-Fattah et al. and Sujitha et al. demonstrated how big data frameworks can significantly enhance machine learning applications on complex and large healthcare datasets [78,79]. Abdel-Fattah et al. examined kidney disease prediction using the Apache Spark framework, reporting a 1000-fold speedup over Hadoop; the model, built with the MLlib machine learning library, demonstrated the applicability of Spark to scalable healthcare analytics (the classification of 200 medical images took only 12 s). Similarly, Xing et al. improved KNN using cluster denoising and density cropping techniques to address KNN's shortcomings on noisy medical datasets; the method proved efficacious for healthcare big data, reducing computational cost by 20% without degrading classification accuracy [82].
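As a hedged illustration of how such frameworks are typically used (not the code of any reviewed study; the file path and column names are hypothetical placeholders), a minimal Spark MLlib pipeline for tabular disease classification might look as follows:

```python
# Minimal Spark MLlib pipeline sketch for distributed disease classification.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("kidney-disease-demo").getOrCreate()

# Hypothetical tabular EHR extract with numeric features and a binary 0/1 label.
df = spark.read.csv("hdfs:///data/ckd_records.csv", header=True, inferSchema=True)

features = [c for c in df.columns if c != "label"]
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=features, outputCol="features"),
    RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=100),
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)          # training is distributed across the cluster
model.transform(test).select("label", "prediction").show(5)
```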
Despite these advantages, scalability remains a challenge. Although Apache Spark and MapReduce offer significant improvements over traditional batch processing, they are not fully optimized for real-time data streaming in time-sensitive medical applications. Additionally, many of the reviewed studies used these platforms to achieve parallelism without assessing higher-level performance optimizations such as adaptive resource allocation or dynamic load balancing, limiting the kind of realistic evaluation that the dynamic nature of healthcare big data demands. In particular, many of the reviewed Spark and MapReduce implementations are tuned for batch processing rather than low-latency real-time streaming, making them less suitable for applications that require instantaneous medical decision-making; a structured-streaming formulation, sketched below, is one way to narrow this gap.
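The following minimal sketch uses Spark Structured Streaming, with the built-in rate source standing in for a real sensor feed and a simple threshold rule standing in for an ML scoring step; all names and parameters are illustrative:

```python
# Sketch of low-latency stream processing with Spark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vitals-stream-demo").getOrCreate()

# Built-in synthetic source emitting (timestamp, value) rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Derive a fake heart-rate signal from the event counter and flag anomalies.
alerts = (stream
          .withColumn("heart_rate", 60 + (F.col("value") % 80))
          .filter(F.col("heart_rate") > 120))

query = (alerts.writeStream
         .outputMode("append")
         .format("console")
         .trigger(processingTime="1 second")
         .start())
query.awaitTermination(10)   # run briefly for the demo, then return
```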
To assess these limitations, the reviewed studies were analyzed based on three critical evaluation criteria: Scalability, Cost, and their applicability to Real-Time Medical Applications in healthcare big data (see Table 4).

6. Research Gaps and Future Trends

In the evolving landscape of healthcare and big data research, the adoption of machine learning and artificial intelligence promises to revolutionize patient diagnosis, personalize treatment, and enhance medical research; however, this integration faces many challenges. Despite significant advancements in healthcare big data analytics, several research gaps remain within each of the three reviewed key categories: efficient computational techniques, specialized hardware acceleration, and parallel processing frameworks. Addressing these gaps is essential, in addition to several other factors detailed by Orlu et al. [84], to fully realize the efficient performance, scalability, and real-time applicability of machine learning models in handling large-scale, high-velocity medical data. This section discusses the identified research gaps and outlines future research directions that could drive innovation in healthcare big data analytics.
The examination of the challenges and limitations associated with the use of machine learning and big data in healthcare, as discussed in the Observations and Analysis section, reveals several significant research gaps related to velocity and volume. The relevance and insights of each gap are detailed in the subsequent subsections, and future research directions for each gap are summarized in Figure 2.

6.1. Research Gaps and Future Directions in Efficient Computational Techniques

The literature extensively explored efficient techniques such as the Extreme Learning Machine, dimensionality reduction, and quantization for reducing processing overhead and enhancing accuracy. A common limitation across the literature is the limited generalization of existing solutions across multi-disease and heterogeneous medical datasets (e.g., COVID-19, breast cancer, and kidney disease); models tuned to a single disease and never evaluated in other scenarios can prove suboptimal for real, complex, and heterogeneous healthcare challenges. A focused effort is needed to develop adaptive learning frameworks integrated with meta-learning or transfer learning for application across several healthcare domains [85,86]. In the same context, combining hyper-parameter tuning and feature selection within multi-objective optimization remains underexplored and rather difficult, especially for large-scale healthcare datasets.
Recent advancements in machine learning have introduced a class of learning methods known as AutoML [87,88,89], which automates the tuning of a model's performance and structural parameters without user intervention. In cases where AutoML is not applicable to specific healthcare applications (particularly those involving high-dimensional, dynamic, and heterogeneous datasets), large-scale optimization algorithms present a more viable alternative [90,91]. In addition, common methods such as RFE (Recursive Feature Elimination), single-swarm optimization algorithms, and PCA are insufficient for resolving the complex feature spaces of medical datasets. Medical datasets usually exhibit high inter-feature dependency, strong temporal correlation, and domain specificity, and therefore require advanced, domain-aware feature selection methods with strong exploration-exploitation performance, such as hybrid multi-swarm optimization and adaptive evolutionary or dynamic feature selection, to preserve important diagnostic features while maintaining model interpretability and reducing processing time [92,93,94,95,96].
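As a brief illustration of the AutoML workflow (a sketch assuming the classic TPOT API that underlies TPOT-FSS [55]; the small generation and population settings are purely for demonstration):

```python
# Illustrative AutoML sketch with TPOT: evolutionary search over pipelines.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20,
                      cv=5, random_state=0, verbosity=2)
tpot.fit(X_train, y_train)                 # evolves pipelines automatically
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")            # exports the winning pipeline as code
```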
Moreover, advanced learning models such as ELM, CNN, and IELM are inherently dependent on matrix operations, particularly matrix multiplications and Moore–Penrose pseudo-inverse computations [97,98,99]. The computational complexity of these operations presents a significant bottleneck, particularly for high-dimensional medical imaging data, genomic sequences, and real-time physiological signal processing. To address this, matrix computation techniques must be tailored to enhance processing speed, reducing both training time and inference latency while maintaining predictive accuracy; candidates include block-wise matrix decompositions, randomized numerical linear algebra methods, and hardware-accelerated tensor computations [100,101,102,103], as sketched below. This direction not only addresses the existing computational challenges but also enhances the practical feasibility of real-time, data-driven decision-making in healthcare. By integrating optimized mathematical frameworks with scalable machine learning models, AI-driven healthcare analytics can remain computationally efficient, clinically interpretable, and seamlessly deployable across large-scale medical data infrastructures. Notably, this area is largely unexplored in the reviewed healthcare big data literature, highlighting a critical research gap whose resolution could significantly advance AI applications in precision medicine, early disease detection, and personalized treatment strategies.
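The following sketch contrasts the textbook ELM output-weight computation via the explicit Moore–Penrose pseudo-inverse with a cheaper regularized normal-equations solve; all dimensions are illustrative and the regularization constant is an assumption:

```python
# ELM output-weight computation: explicit pseudo-inverse vs. a regularized
# L x L solve, which is far cheaper when n >> L and better conditioned.
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 10_000, 50, 500          # samples, input features, hidden neurons
X, T = rng.normal(size=(n, d)), rng.normal(size=(n, 1))

W, b = rng.normal(size=(d, L)), rng.normal(size=L)
H = np.tanh(X @ W + b)             # random hidden-layer feature map

# (1) Textbook ELM: beta = pinv(H) @ T, an O(n * L^2) SVD, memory-hungry.
beta_pinv = np.linalg.pinv(H) @ T

# (2) Regularized normal equations: solve an L x L system instead of an
#     n x L SVD (illustrative lambda).
lam = 1e-3
beta_ridge = np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ T)

print(np.linalg.norm(H @ beta_pinv - T), np.linalg.norm(H @ beta_ridge - T))
```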
Moreover, the integration of Federated Learning (FL) in healthcare can enable privacy-compliant, real-time analytics on dispersed healthcare data [104,105,106,107]. Zeydan et al. and Joshi et al. identified and attempted to solve some major problems, such as centralized versus decentralized architecture, data partitioning, resource constraints, data fragmentation, and patient privacy [32,106]. However, their findings largely overlooked the fundamental gaps associated with the rate of change and quantity of information in healthcare. In federated learning for healthcare, models are trained locally at different healthcare organizations, and the effectiveness of this process depends on the local models' ability to process high-velocity, large-volume medical data; otherwise, the result is delayed convergence, excessive computational overhead, communication bottlenecks, and impractical deployment. While healthcare federated learning holds great promise for collaboratively training models across decentralized healthcare organizations, its practical implementation remains out of reach until the issues of velocity and volume are addressed; the sketch below makes the averaging bottleneck concrete.
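The following minimal FedAvg-style sketch (synthetic clients and a linear model, not any reviewed system) shows why slow sites matter: each round, the global model advances only after every site completes its local pass, so one overloaded hospital stalls training:

```python
# Minimal federated averaging sketch with synthetic hospital datasets.
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, X, y, lr=0.1, epochs=5):
    """A few local gradient steps on linear regression at one hospital."""
    for _ in range(epochs):
        w = w - lr * (2 / len(X)) * X.T @ (X @ w - y)
    return w

clients = [(rng.normal(size=(200, 10)), rng.normal(size=200)) for _ in range(5)]
w_global = np.zeros(10)

for rnd in range(20):
    sizes = [len(X) for X, _ in clients]
    local_ws = [local_update(w_global, X, y) for X, y in clients]
    # Weighted average of local models; the round cannot complete until the
    # slowest (or most data-heavy) site finishes its local pass.
    w_global = np.average(local_ws, axis=0, weights=sizes)
```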

6.2. Research Gaps and Future Directions in Specialized Hardware Acceleration

The use of GPUs, FPGAs, and ASICs has significantly improved the speed at which medical data are processed. Despite recent advancements, several critical gaps remain, particularly around cost-to-performance gain, scalability, and implementation complexity. To streamline integration into the IT infrastructure of hospitals and health centers, one valuable research approach is software abstraction over hybrid computing architectures (CPUs, GPUs, and FPGAs) in which tasks are dynamically assigned to available resources. This approach offers a promising path toward more efficient resource utilization, reduced development complexity, and greater model portability across healthcare applications and settings. Furthermore, recent developments in reconfigurable neuromorphic computing architectures present an opportunity for many healthcare institutions to adopt low-power, application-specific AI accelerators that do not require significant hardware redesign or high-end technical skills [108,109]. In contrast to traditional deep learning accelerators, these architectures are designed around the event-driven parallel processing inherent in biological neural networks, enabling power-efficient AI computation for dynamic healthcare analytics applications [110]. Depending on the workload, the architecture can dynamically scale its processing elements, allowing efficient handling of demanding tasks such as real-time medical image analysis and patient monitoring. In addition to their power efficiency, neuromorphic systems require low memory bandwidth, which makes them well suited to edge AI applications such as wearable healthcare devices, mobile diagnostics, and distributed emergency hospital network devices. This shift towards reconfigurable neuromorphic computing can greatly improve the reach, effectiveness, and sustainability of AI-powered healthcare solutions, especially those embedded in electronic systems and deployed in resource-limited settings.
Another underexplored area identified in the reviewed literature is the absence of studies on hardware optimizations tailored to the processing and analytics of real-time data streams. Most current implementations focus on batch processing of medical images and patient records and lack solutions for continuous, high-rate data streams such as those produced by ICU monitoring systems, ECG sensors, or wearable devices. Further research should therefore focus on hardware optimized for real-time inference in healthcare big data applications.
Energy-efficient machine learning remains a largely unsolved problem. GPUs indeed accelerate deep learning medical applications, but their high energy consumption limits deep learning in low-power mobile medical imaging devices. Shifting the focus to bio-inspired computing paradigms, such as Spiking Neural Networks (SNNs), offers a promising direction, as SNNs consume significantly less power than conventional deep learning algorithms [111,112] (see the sketch below). Additionally, the reviewed methods are primarily designed for centralized high-performance computing systems and lack support for scalable, distributed learning; constrained by the underlying theoretical structure of the learning model itself, scalability has proven difficult and often infeasible for many algorithms, even on hybrid CPU-GPU hardware [113].
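A toy leaky integrate-and-fire neuron, sketched below with illustrative parameters unrelated to any reviewed SNN design, shows the event-driven property that underlies this efficiency: downstream computation is required only at sparse spike events.

```python
# Toy leaky integrate-and-fire (LIF) neuron: work happens only at spikes.
import numpy as np

rng = np.random.default_rng(0)
T, dt, tau, v_th = 1000, 1.0, 20.0, 1.0   # steps, ms, membrane constant, threshold

current = 0.04 + 0.03 * rng.random(T)     # noisy input current (stand-in signal)
v, spikes = 0.0, []

for t in range(T):
    v += dt / tau * (-v + current[t] * tau)  # leaky integration of the input
    if v >= v_th:                            # emit a spike only at threshold
        spikes.append(t)
        v = 0.0                              # reset after the spike

print(f"{len(spikes)} spike events over {T} steps "
      f"({100 * len(spikes) / T:.1f}% of timesteps trigger computation)")
```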
Tensor Processing Units (TPUs) are specialized hardware accelerators offering higher memory bandwidth, lower latency, and superior power efficiency for tensor-based operations compared to CPUs and GPUs [114]. Their high memory bandwidth and specialized matrix multiplication units significantly accelerate training and inference, addressing the computational inefficiencies and scalability issues observed in traditional hardware implementations such as GPUs and FPGAs. In healthcare big data, TPUs can be used efficiently to process EHRs, radiological scans, and genomic sequences [115]. Moreover, seamless integration with the TensorFlow and PyTorch frameworks, together with scalable model training and inference on Google Cloud services, significantly shortens the development and deployment lifecycle of healthcare systems compared to GPU-based solutions (a minimal sketch follows). However, the recent literature offers no comprehensive assessment, nor disease-specific research, on TPUs in healthcare big data or on TPUs as a viable alternative to GPUs and FPGAs.
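A hedged sketch of TPU-backed Keras training via TensorFlow's TPUStrategy is shown below; it assumes an attached Cloud TPU runtime (e.g., on Google Cloud or Colab), and the model and data are placeholders rather than a benchmarked healthcare workload:

```python
# Sketch: distributing Keras training across TPU cores with TPUStrategy.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():                       # variables placed on TPU cores
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Placeholder dataset; batch size chosen as a multiple of the 8 TPU cores.
ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((1024, 64)),
     tf.random.uniform((1024,), maxval=2, dtype=tf.int32))
).batch(128, drop_remainder=True)
model.fit(ds, epochs=2)
```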
Another future research direction is the integration of TPUs with federated learning frameworks, which presents a transformative approach to overcoming numerous challenges and limitations identified in healthcare big data applications. TPUs, with their high-throughput tensor processing capabilities and power-efficient architecture, can enable distributed computations of learning models across multiple healthcare institutions without compromising data privacy or security. This is particularly critical in sensitive medical domains where patient data cannot be directly shared across hospitals due to regulatory constraints such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act).
This synergy between TPUs and federated learning or edge computing [116] can mitigate the limitations of dataset diversity, model generalizability, and real-time processing in healthcare AI, ultimately enabling large-scale, privacy-preserving medical diagnostics, predictive analytics, and personalized treatment recommendations.

6.3. Research Gaps and Future Directions in Parallel and Distributed Processing Frameworks

The scalability of parallel and distributed processing frameworks has greatly advanced healthcare big data analytics. However, the implementations reviewed in this study showed several critical limitations, particularly concerning real-time stream processing, fault tolerance, and resource awareness. First, although Apache Spark and MapReduce are considerably more efficient than traditional batch processing techniques, both frameworks fall short in low-latency healthcare environments that require immediate decisions, such as ICU patient monitoring or predictive diagnosis in emergency response. Second, some of the reviewed literature attempted partial parallelization to speed up learning on large-scale medical data; while effective, these approaches are model-specific and do not guarantee lifecycle scalability. Although data partitioning and distributed processing are technically straightforward with big data frameworks, many machine learning algorithms require access to the entire dataset during training, making such approaches inefficient or even impractical. Lastly, popular distributed processing frameworks such as Spark and Hadoop natively support only a limited set of Apache-tailored learning models; learning models specific to certain healthcare big data applications would require extensive customization and modification.
In this context, distributed training of machine learning models typically involves heavy network communication to synchronize parameters across processing nodes as the model evolves. This introduces considerable network overhead and significantly increases training time due to synchronization delays. Research efforts should therefore address the fundamental theoretical limitations of distributed machine learning to reduce the distribution-efficiency trade-off, for example through parallel matrix operations that require little or no network communication [117,118], improved data partitioning to ensure balanced training [119,120], and enhanced error estimation techniques that limit frequent synchronization.
Recent research has also examined the use of the Apache Spark framework not for distributing ML training itself but for the parallel computation of otherwise undistributed models. Approaches such as STARK and JAMPI [121,122] can significantly mitigate scalability challenges for certain ML models in healthcare big data applications, ultimately facilitating development and implementation and ensuring a scalable lifecycle in healthcare big data systems.
Given that many machine learning models are not inherently scalable, several research efforts have explored the application of ensemble learning in healthcare big data [24,71,83]. With an efficient data partitioning technique, several models can be trained independently in a distributed environment and then combined using ensemble techniques such as bagging, boosting, or stacking. By combining the strengths of multiple distributed classifiers, ensemble methods can partially close the gap left by otherwise undistributed ML algorithms [123,124]; a sketch of this pattern follows below. Ensemble applications in healthcare big data nonetheless face several challenges, including careful technique selection and parameter tuning, time and resource overhead, suitability for real-time applications, and implementation complexity. The clinical and medical interpretability of the inferred output may be the most critical of these: because multiple classifier outputs are combined, an uninformed choice of ensemble technique can reduce interpretability and render decisions harder to explain.
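A compact sketch of this partition-then-ensemble pattern is given below; the synthetic data and the four shards (emulating partitions that would live on separate nodes) are illustrative assumptions:

```python
# Train independent models on disjoint data shards, combine by soft voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=30, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

shards = np.array_split(np.arange(len(X_tr)), 4)   # 4 disjoint partitions
models = [RandomForestClassifier(n_estimators=50, random_state=i)
          .fit(X_tr[idx], y_tr[idx]) for i, idx in enumerate(shards)]

# Soft-voting combination of the shard models.
proba = np.mean([m.predict_proba(X_te) for m in models], axis=0)
print("ensemble accuracy:", (proba.argmax(axis=1) == y_te).mean())
```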
Several studies in the reviewed literature streamlined parallel feature selection and extraction processes (such as recursive elimination, PCA, and K-Rank). The complexity of feature relationships in big data continuously grows over time, requiring algorithms that are not only resource-efficient but also offer strong exploration-exploitation behavior and fast convergence. Feature selection can increase model processing velocity by reducing the computational burden; however, reducing features does not always guarantee high prediction accuracy, because critical features may be lost. Researchers should investigate the development and enhancement of scalable multi-swarm optimization algorithms that emphasize a strong balance between exploration and exploitation; enhanced solution-elimination strategies such as the Nomadic People Optimizer [125] should also be investigated to reduce the risk of convergence to local minima. A toy single-swarm wrapper is sketched below.
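To illustrate the wrapper mechanics that multi-swarm variants extend with several cooperating sub-swarms, the following toy single-swarm binary PSO selects features for a logistic regression on synthetic data; all settings (swarm size, inertia, coefficients) are illustrative assumptions:

```python
# Toy binary PSO wrapper for feature selection (single swarm, sigmoid transfer).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

def fitness(bits):
    """Cross-validated accuracy of a classifier on the selected features."""
    mask = bits.astype(bool)
    if not mask.any():
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=500),
                           X[:, mask], y, cv=3).mean()

P, D = 12, X.shape[1]
pos = (rng.random((P, D)) < 0.5).astype(int)    # particle positions: masks
vel = rng.normal(scale=0.1, size=(P, D))
pbest, pbest_fit = pos.copy(), np.array([fitness(p) for p in pos])
g = pbest[pbest_fit.argmax()].copy()            # global best mask

for _ in range(15):
    r1, r2 = rng.random((P, D)), rng.random((P, D))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
    pos = (rng.random((P, D)) < 1 / (1 + np.exp(-vel))).astype(int)
    fit = np.array([fitness(p) for p in pos])
    improved = fit > pbest_fit
    pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
    g = pbest[pbest_fit.argmax()].copy()

print("selected features:", np.flatnonzero(g), "cv accuracy:", pbest_fit.max())
```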

7. Conclusions

The development of generalized, real-time, and scalable solutions that fuse high-performance computing with intelligent big data frameworks and adaptive machine learning algorithms is essential to healthcare big data analytics. Addressing the velocity and volume challenges inherent in healthcare data requires multi-faceted approaches that leverage efficient computational techniques, advanced hardware acceleration, and parallel processing architectures. This review has highlighted the importance of interdisciplinary innovations in algorithm design, hardware efficiency, and distributed computing frameworks, which collectively contribute to faster, more accurate, and resource-efficient AI-driven healthcare analytics. Although there is no universally optimal solution, the evaluation of the reviewed literature across three primary themes (efficient techniques and arithmetic optimization, advanced processing hardware, and clustering and parallel processing methods) suggests that hybrid strategies offer the most effective path forward. In particular, efficient matrix operations, TPU-based learning, AutoML, large-scale optimization algorithms, federated learning, and parallel matrix operations are promising approaches for enhancing computational speed and reducing the learning complexity facing machine learning in healthcare big data. However, their effectiveness can only be fully realized through complete integration into ML-driven healthcare architectures and applications, which necessitates further research and investigation.
Moreover, the current limitations identified in existing methods highlight the urgent need for further development of scalable and distributed ML frameworks, particularly those tailored for real-time healthcare applications. Key challenges include inefficient computation, weak feature selection methods for large-scale data, low lifecycle scalability, and inadequate model selection and implementation. These gaps may be addressed through techniques such as distributed matrix operations, hybrid data and model parallelism, and distributed multi-swarm optimization for model parameter tuning and wrapper-based feature selection. The strategic adoption and implementation of these techniques can ensure significant improvements in the efficiency and scalability of ML applications in healthcare big data. These improvements stand not only to support and enhance clinical decision-making and healthcare analytics but also to facilitate personalized medicine, predictive diagnostics, and value-based healthcare systems.

Author Contributions

Conceptualization, D.Y.K., A.S.S. and A.A.L.; methodology, D.Y.K., A.S.S., K.S. and A.A.L.; validation, A.S.S., K.S. and A.A.L.; formal analysis, D.Y.K., K.S. and A.A.L.; investigation, D.Y.K., A.S.S. and A.A.L.; resources, D.Y.K. and K.S.; data curation, D.Y.K.; writing—original draft preparation, D.Y.K. and K.S.; writing—review and editing, A.S.S. and A.A.L.; visualization, D.Y.K.; supervision, A.S.S., K.S. and A.A.L.; project administration, A.S.S. and A.A.L.; funding acquisition, Z.C.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by INTI International University (INTI IU Research Grant 2025: INTI-FEQS-01-03-2025).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Baalann, K.P.; Chandrasekar, V.S.A.; Kathirvel, M.; Chakraborty, T.; Vasanthi, R.K.; Ganesh, K.; Bhandari, A.; Prasanna, P.M.; Parthasarathy, S. AI innovations in anaesthesia: A systematic review of clinical application. Indian J. Clin. Anaesth. 2025, 12, 177–189. [Google Scholar] [CrossRef]
  2. Khanra, S.; Dhir, A.; Islam, A.N.; Mäntymäki, M. Big data analytics in healthcare: A systematic literature review. Enterp. Inf. Syst. 2020, 14, 878–912. [Google Scholar] [CrossRef]
  3. L’heureux, A.; Grolinger, K.; Elyamany, H.F.; Capretz, M.A.M. Machine learning with big data: Challenges and approaches. IEEE Access 2017, 5, 7776–7797. [Google Scholar] [CrossRef]
  4. Batko, K.; Ślęzak, A. The use of big data analytics in healthcare. J. Big Data 2022, 9, 3. [Google Scholar] [CrossRef]
  5. National Center for Biotechnology Information (NCBI). Genebank Statistics. Available online: https://www.ncbi.nlm.nih.gov/genbank/statistics/ (accessed on 22 January 2023).
  6. Santosh, K.C.; Ghosh, S. COVID-19 imaging tools: How big data is big? J. Med. Syst. 2021, 45, 71. [Google Scholar] [CrossRef] [PubMed]
  7. Fatima, S. Improving healthcare outcomes through machine learning: Applications and challenges in big data analytics. Int. J. Adv. Res. Eng. Technol. Sci. 2024, 11, 2349–2819. [Google Scholar]
  8. Berros, N.; El Mendili, F.; Filaly, Y.; El Idrissi, Y.E.B.E. Enhancing digital health services with big data analytics. Big Data Cogn. Comput. 2023, 7, 64. [Google Scholar] [CrossRef]
  9. Al-Sai, Z.A.; Husin, M.H.; Syed-Mohamad, S.M.; Abdin, R.M.S.; Damer, N.; Abualigah, L.; Gandomi, A.H. Explore big data analytics applications and opportunities: A review. Big Data Cogn. Comput. 2022, 6, 157. [Google Scholar] [CrossRef]
  10. Miah, S.J.; Camilleri, E.; Vu, H.Q. Big data in healthcare research: A survey study. J. Comput. Inf. Syst. 2022, 62, 480–492. [Google Scholar] [CrossRef]
  11. Khoei, T.T.; Singh, A. Data reduction in big data: A survey of methods, challenges and future directions. Int. J. Data Sci. Anal. 2025, 20, 1643–1682. [Google Scholar] [CrossRef]
  12. Tsai, C.-W.; Lai, C.-F.; Chao, H.-C.; Vasilakos, A.V. Big data analytics: A survey. J. Big Data 2015, 2, 21. [Google Scholar] [CrossRef]
  13. James, R. Out of the box: Big data needs the information profession—The importance of validation. Bus. Inf. Rev. 2014, 31, 118–121. [Google Scholar] [CrossRef]
  14. Miller, H.D. From volume to value: Better ways to pay for health care. Health Aff. 2009, 28, 1418–1428. [Google Scholar] [CrossRef]
  15. Guo, C.; Chen, J. Big data analytics in healthcare. In Knowledge Technology and Systems: Toward Establishing Knowledge Systems Science; Springer Nature: Singapore, 2023; pp. 27–70. [Google Scholar] [CrossRef]
  16. Ohlhorst, F.J. Big Data Analytics: Turning Big Data into Big Money; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
  17. Laney, D. 3D data management: Controlling data volume, velocity, and variety. META Group Res. Note 2001, 6, 1. [Google Scholar]
  18. Malekloo, A.; Ozer, E.; AlHamaydeh, M.; Girolami, M. Machine learning and structural health monitoring overview with emerging technology and high-dimensional data source highlights. Struct. Health Monit. 2022, 21, 1906–1955. [Google Scholar] [CrossRef]
  19. Palanisamy, V.; Thirunavukarasu, R. Implications of big data analytics in developing healthcare frameworks—A review. J. King Saud Univ. Comput. Inf. Sci. 2019, 31, 415–425. [Google Scholar] [CrossRef]
  20. George, M.M.; Rasmi, P.S. Performance comparison of Apache Hadoop and Apache Spark for COVID-19 data sets. In Proceedings of the 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 January 2022; pp. 1659–1665. [Google Scholar] [CrossRef]
  21. Kumari, S.; Muthulakshmi, P. High-performance computation in big data analytics. In International Conference on Intelligent Systems Design and Applications; Springer: Cham, Switzerland, 2022; pp. 543–553. [Google Scholar] [CrossRef]
  22. Alanazi, A. Using machine learning for healthcare challenges and opportunities. Inform. Med. Unlocked 2022, 30, 100924. [Google Scholar] [CrossRef]
  23. Lee, C.H.; Yoon, H.-J. Medical big data: Promise and challenges. Kidney Res. Clin. Pract. 2017, 36, 3–11. [Google Scholar] [CrossRef]
  24. Sagi, O.; Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249. [Google Scholar] [CrossRef]
  25. Alhenawi, E.; Al-Sayyed, R.; Hudaib, A.; Mirjalili, S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput. Biol. Med. 2022, 140, 105051. [Google Scholar] [CrossRef]
  26. Adadi, A. A survey on data-efficient algorithms in the big data era. J. Big Data 2021, 8, 24. [Google Scholar] [CrossRef]
  27. Tchapga, C.T.; Mih, T.A.; Kouanou, A.T.; Fonzin, T.F.; Fogang, P.K.; Mezatio, B.A.; Tchiotsop, D. Biomedical image classification in a big data architecture using machine learning algorithms. J. Healthc. Eng. 2021, 2021, 9998819. [Google Scholar] [CrossRef] [PubMed]
  28. Rehman, A.; Naz, S.; Razzak, I. Leveraging big data analytics in healthcare enhancement: Trends, challenges, and opportunities. Multimed. Syst. 2022, 28, 1339–1371. [Google Scholar] [CrossRef]
  29. An, Q.; Rahman, S.; Zhou, J.; Kang, J.J. A comprehensive review on machine learning in the healthcare industry: Classification, restrictions, opportunities and challenges. Sensors 2023, 23, 4178. [Google Scholar] [CrossRef]
  30. Azmi, J.; Arif, M.; Nafis, M.T.; Alam, M.A.; Tanweer, S.; Wang, G. A systematic review on machine learning approaches for cardiovascular disease prediction using medical big data. Med. Eng. Phys. 2022, 105, 103825. [Google Scholar] [CrossRef]
  31. Altman, M.B.; Wan, W.; Hosseini, A.S.; Nowdeh, S.A.; Alizadeh, M. Machine learning algorithms for FPGA implementation in biomedical engineering applications: A review. Heliyon 2024, 10, 4. [Google Scholar] [CrossRef] [PubMed]
  32. Zeydan, E.; Arslan, S.S.; Liyanage, M. Managing distributed machine learning lifecycle for healthcare data in the cloud. IEEE Access 2024, 12, 115750–115774. [Google Scholar] [CrossRef]
  33. Khalsan, M.; Machado, L.R.; Al-Shamery, E.S.; Ajit, S.; Anthony, K.; Mu, M.; Agyeman, M.O. A survey of machine learning approaches applied to gene expression analysis for cancer prediction. IEEE Access 2022, 10, 27522–27534. [Google Scholar] [CrossRef]
  34. Zhang, X.-D. A Matrix Algebra Approach to Artificial Intelligence; Springer: Singapore, 2020; p. 803. [Google Scholar] [CrossRef]
  35. Ashraf, M.; Gupta, D.; Khanna, A.; Bhattacharyya, S.; Hassanien, A.E.; Anand, S.; Jaiswal, A. Prediction of cardio-vascular disease through cutting-edge deep learning technologies: An empirical study based on TensorFlow, PyTorch and Keras. In International Conference on Innovative Computing and Communications; Gupta, D., Khanna, A., Bhattacharyya, S., Hassanien, A.E., Anand, S., Jaiswal, A., Eds.; Advances in Intelligent Systems and Computing; Springer: Singapore, 2021; Volume 1165, pp. 239–255. [Google Scholar] [CrossRef]
  36. Dai, H.; Peng, X.; Shi, X.; He, L.; Xiong, Q.; Jin, H. Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment. Sci. China Inf. Sci. 2022, 65, 172101. [Google Scholar] [CrossRef]
  37. Kimm, H.; Paik, I.; Kimm, H. Performance comparison of TPU, GPU, CPU on Google Colaboratory over distributed deep learning. In Proceedings of the IEEE 14th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 20–23 December 2021; pp. 312–319. [Google Scholar] [CrossRef]
  38. Nikolić, G.S.; Dimitrijević, B.R.; Nikolić, T.R.; Stojčev, M.K. A survey of three types of processing units: CPU, GPU, and TPU. In Proceedings of the 57th International Scientific Conference on Information, Communication and Energy Systems and Technologies (ICEST), Ohrid, North Macedonia, 16–18 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
  39. Chung, I.-H.; Sainath, T.N.; Ramabhadran, B.; Picheny, M.; Gunnels, J.; Austel, V.; Chauhari, U.; Kingsbury, B. Parallel deep neural network training for big data on Blue Gene/Q. IEEE Trans. Parallel Distrib. Syst. 2016, 28, 1703–1714. [Google Scholar] [CrossRef]
  40. Dhouibi, M.; Ben Salem, A.K.; Saidi, A.; Ben Saoud, S. Accelerating deep neural networks implementation: A survey. IET Comput. Digit. Tech. 2021, 15, 79–96. [Google Scholar] [CrossRef]
  41. Khalilian, M.; Boroujeni, F.Z.; Mustapha, N.; Sulaiman, M.N. K-means divide and conquer clustering. In Proceedings of the International Conference on Computer and Automation Engineering, Bangkok, Thailand, 8–10 March 2009; pp. 306–309. [Google Scholar] [CrossRef]
  42. Imran, S.; Mahmood, T.; Morshed, A.; Sellis, T. Big data analytics in healthcare—A systematic literature review and roadmap for practical implementation. IEEE/CAA J. Autom. Sin. 2020, 8, 1–22. [Google Scholar] [CrossRef]
  43. Qiu, J.; Wu, Q.; Ding, G.; Xu, Y.; Feng, S. A survey of machine learning for big data processing. EURASIP J. Adv. Signal Process. 2016, 2016, 67. [Google Scholar] [CrossRef]
  44. Slavakis, K.; Kim, S.-J.; Mateos, G.; Giannakis, G.B. Stochastic approximation vis-a-vis online learning for big data analytics [lecture notes]. IEEE Signal Process. Mag. 2014, 31, 124–129. [Google Scholar] [CrossRef]
  45. Ta, V.-D.; Liu, C.-M.; Nkabinde, G.W. Big data stream computing in healthcare real-time analytics. In Proceedings of the IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), Chengdu, China, 5–7 July 2016; pp. 37–42. [Google Scholar] [CrossRef]
  46. Shahraki, A.; Abbasi, M.; Taherkordi, A.; Jurcut, A.D. A comparative study on online machine learning techniques for network traffic streams analysis. Comput. Netw. 2022, 207, 108836. [Google Scholar] [CrossRef]
  47. Luo, Y.; Yin, L.; Bai, W.; Mao, K. An appraisal of incremental learning methods. Entropy 2020, 22, 1190. [Google Scholar] [CrossRef]
  48. He, J.; Mao, R.; Shao, Z.; Zhu, F. Incremental learning in online scenario. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 13926–13935. [Google Scholar] [CrossRef]
  49. Senthil, R.; Anand, T.; Somala, C.S.; Saravanan, K.M. Bibliometric analysis of artificial intelligence in healthcare research: Trends and future directions. Future Healthc. J. 2024, 11, 100182. [Google Scholar] [CrossRef]
  50. Ganatra, H.A. Machine learning in pediatric healthcare: Current trends, challenges, and future directions. J. Clin. Med. 2025, 14, 807. [Google Scholar] [CrossRef]
  51. Ahmed, A.; Xi, R.; Hou, M.; Shah, S.A.; Hameed, S. Harnessing big data analytics for healthcare: A comprehensive review of frameworks, implications, applications, and impacts. IEEE Access 2023, 11, 112891–112928. [Google Scholar] [CrossRef]
  52. Domenteanu, A.; Cibu, B.; Delcea, C. Mapping the research landscape of Industry 5.0 from a machine learning and big data analytics perspective: A bibliometric approach. Sustainability 2024, 16, 2764. [Google Scholar] [CrossRef]
  53. Mayer, R.; Jacobsen, H.-A. Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools. ACM Comput. Surv. 2020, 53, 1–37. [Google Scholar] [CrossRef]
  54. Lwakatare, L.E.; Raj, A.; Crnkovic, I.; Bosch, J.; Olsson, H.H. Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions. Inf. Softw. Technol. 2020, 127, 106368. [Google Scholar] [CrossRef]
  55. Le, T.T.; Fu, W.; Moore, J.H. Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 2020, 36, 250–256. [Google Scholar] [CrossRef] [PubMed]
  56. Zheng, X.; Li, P.; Wu, X. Data stream classification based on extreme learning machine: A review. Big Data Res. 2022, 30, 100356. [Google Scholar] [CrossRef]
  57. Sangeetha, G.; Balasubramanian, V. HEL-MCNN: Hybrid extreme learning modified convolutional neural network for allocating suitable donors for patients with minimized waiting time. Expert Syst. Appl. 2023, 232, 120673. [Google Scholar] [CrossRef]
  58. Lahoura, V.; Singh, H.; Aggarwal, A.; Sharma, B.; Mohammed, M.A.; Damaševičius, R.; Kadry, S.; Cengiz, K. Cloud computing-based framework for breast cancer diagnosis using extreme learning machine. Diagnostics 2021, 11, 241. [Google Scholar] [CrossRef]
  59. Malik, H.; Anees, T.; Naeem, A.; Naqvi, R.A.; Loh, W.-K. Blockchain-federated and deep-learning-based ensembling of capsule network with incremental extreme learning machines for classification of COVID-19 using CT scans. Bioengineering 2023, 10, 203. [Google Scholar] [CrossRef]
  60. Rajendran, S.; Khalaf, O.I.; Alotaibi, Y.; Alghamdi, S. MapReduce-based big data classification model using feature subset selection and hyperparameter tuned deep belief network. Sci. Rep. 2021, 11, 24138. [Google Scholar] [CrossRef]
  61. Goswami, M.; Mohanty, S.; Pattnaik, P.K. Optimization of machine learning models through quantization and data bit reduction in healthcare datasets. Frankl. Open 2024, 8, 100136. [Google Scholar] [CrossRef]
  62. Sharada, K.A.; Sushma, K.; Muthukumaran, V.; Mahesh, T.; Swapna, B.; Roopashree, S. High ECG diagnosis rate using novel machine learning techniques with distributed arithmetic (DA)-based gated recurrent units. Microprocess. Microsyst. 2023, 98, 104796. [Google Scholar] [CrossRef]
  63. Rahman, M.M.; Al-Amin, M.; Hossain, J. Machine learning models for chronic kidney disease diagnosis and prediction. Biomed. Signal Process. Control 2024, 87, 105368. [Google Scholar] [CrossRef]
  64. Narwane, S.V.; Sawarkar, S.D. Is handling unbalanced datasets for machine learning uplift system performance? A case of diabetic prediction. Diabetes Metab. Syndr. Clin. Res. Rev. 2022, 16, 102609. [Google Scholar] [CrossRef]
  65. Kumar, V.; Biswas, S.; Rajput, D.S.; Patel, H.; Tiwari, B. PCA-based incremental extreme learning machine (PCA-IELM) for COVID-19 patient diagnosis using chest X-ray images. Comput. Intell. Neurosci. 2022, 2022, 9107430. [Google Scholar] [CrossRef]
  66. Hoozemans, J.; Peltenburg, J.; Nonnemacher, F.; Hadnagy, A.; Al-Ars, Z.; Hofstee, H.P. FPGA acceleration for big data analytics: Challenges and opportunities. IEEE Circuits Syst. Mag. 2021, 21, 30–47. [Google Scholar] [CrossRef]
  67. Wang, L.; Alexander, C.A. Big data analytics in medical engineering and healthcare: Methods, advances, and challenges. J. Med. Eng. Technol. 2020, 44, 267–283. [Google Scholar] [CrossRef]
  68. Sanaullah, A.; Yang, C.; Alexeev, Y.; Yoshii, K.; Herbordt, M.C. Real-time data analysis for medical diagnosis using FPGA-accelerated neural networks. BMC Bioinform. 2018, 19 (Suppl. S19), 19–31. [Google Scholar] [CrossRef] [PubMed]
  69. Sharma, Y.; Tiwari, N.K.; Upadhyay, V.K. EffSVMNet: An efficient hybrid neural network for improved skin disease classification. Smart Health 2024, 34, 100520. [Google Scholar] [CrossRef]
  70. Sakthivel, R.; Thaseen, I.S.; Vanitha, M.; Deepa, M.; Angulakshmi, M.; Mangayarkarasi, R.; Mahendran, A.; Alnumay, W.; Chatterjee, P. An efficient hardware architecture based on an ensemble of deep learning models for COVID-19 prediction. Sustain. Cities Soc. 2022, 80, 103713. [Google Scholar] [CrossRef]
  71. Cheng, X.; Liu, D.; Lu, J.; Wei, L.; Hu, A.; Lei, J.; Zou, Z.; Zou, X.; Jiang, Q. Efficient hardware design of a deep U-net model for pixel-level ECG classification in healthcare devices. Microelectron. J. 2022, 126, 105492. [Google Scholar] [CrossRef]
  72. Soffer, S.; Ben-Cohen, A.; Shimon, O.; Amitai, M.M.; Greenspan, H.; Klang, E. Convolutional neural networks for radiologic images: A radiologist’s guide. Radiology 2019, 290, 590–606. [Google Scholar] [CrossRef] [PubMed]
  73. Draelos, R.L.; Dov, D.; Mazurowski, M.A.; Lo, J.Y.; Henao, R.; Rubin, G.D.; Carin, L. Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes. Med. Image Anal. 2021, 67, 101857. [Google Scholar] [CrossRef]
  74. Aruna, V.B.K.L.; Chitra, E.; Padmaja, M. Accelerating deep convolutional neural network on FPGA for ECG signal classification. Microprocess. Microsyst. 2023, 103, 104939. [Google Scholar] [CrossRef]
  75. Yacoub, M.H.; Ismail, S.M.; Said, L.A.; Madian, A.H.; Radwan, A.G. Reconfigurable hardware implementation of K-nearest neighbor algorithm on FPGA. AEÜ–Int. J. Electron. Commun. 2024, 173, 154999. [Google Scholar] [CrossRef]
  76. Shafqat, S.; Kishwer, S.; Rasool, R.U.; Qadir, J.; Amjad, T.; Ahmad, H.F. Big data analytics enhanced healthcare systems: A review. J. Supercomput. 2020, 76, 1754–1799. [Google Scholar] [CrossRef]
  77. Kumar, S.; Singh, M. Big data analytics for healthcare industry: Impact, applications, and tools. Big Data Min. Anal. 2018, 2, 48–57. [Google Scholar] [CrossRef]
  78. Abdel-Fattah, M.A.; Othman, N.A.; Goher, N. Predicting chronic kidney disease using hybrid machine learning based on Apache Spark. Comput. Intell. Neurosci. 2022, 2022, 9898831. [Google Scholar] [CrossRef] [PubMed]
  79. Guan, P.; Yu, K.; Wei, W.; Tan, Y.; Wu, J. Big data analytics on lung cancer diagnosis framework with deep learning. IEEE/ACM Trans. Comput. Biol. Bioinf. 2023, 21, 757–768. [Google Scholar] [CrossRef]
  80. Sukanya, J.; Gandhi, K.R.; Palanisamy, V. An assessment of machine learning algorithms for healthcare analysis based on improved MapReduce. Adv. Eng. Softw. 2022, 173, 103285. [Google Scholar] [CrossRef]
  81. Albattah, W.; Khan, R.U.; Alsharekh, M.F.; Khasawneh, S.F. Feature selection techniques for big data analytics. Electronics 2022, 11, 3177. [Google Scholar] [CrossRef]
  82. Xing, W.; Bei, Y. Medical health big data classification based on KNN classification algorithm. IEEE Access 2019, 8, 28808–28819. [Google Scholar] [CrossRef]
  83. Jaiswal, V.; Saurabh, P.; Lilhore, U.K.; Pathak, M.; Simaiya, S.; Dalal, S. A breast cancer risk prediction and classification model with ensemble learning and big data fusion. Decis. Anal. J. 2023, 8, 100298. [Google Scholar] [CrossRef]
  84. Orlu, G.U.; Abdullah, R.B.; Zaremohzzabieh, Z.; Jusoh, Y.Y.; Asadi, S.; Qasem, Y.A.M.; Nor, R.N.H.; Mohd Nasir, W.M.H. A Systematic Review of Literature on Sustaining Decision-Making in Healthcare Organizations Amid Imperfect Information in the Big Data Era. Sustainability 2023, 15, 15476. [Google Scholar] [CrossRef]
  85. Vettoruzzo, A.; Bouguelia, M.-R.; Vanschoren, J.; Rögnvaldsson, T.; Santosh, K.C. Advances and challenges in meta-learning: A technical review. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4763–4779. [Google Scholar] [CrossRef]
  86. Rafiei, A.; Moore, R.; Jahromi, S.; Hajati, F.; Kamaleswaran, R. Meta-learning in healthcare: A survey. SN Comput. Sci. 2024, 5, 791. [Google Scholar] [CrossRef]
  87. He, X.; Zhao, K.; Chu, X. AutoML: A survey of the state-of-the-art. Knowl. Based Syst. 2021, 212, 106622. [Google Scholar] [CrossRef]
  88. Barbudo, R.; Ventura, S.; Romero, J.R. Eight years of AutoML: Categorization, review, and trends. Knowl. Inf. Syst. 2023, 65, 5097–5149. [Google Scholar] [CrossRef]
  89. Yuan, H.; Yu, K.; Xie, F.; Liu, M.; Sun, S. Automated machine learning with interpretation: A systematic review of methodologies and applications in healthcare. Med. Adv. 2024, 2, 205–237. [Google Scholar] [CrossRef]
  90. Parimanam, K.; Lakshmanan, L.; Palaniswamy, T. Hybrid optimization-based learning technique for multi-disease analytics from healthcare big data using optimal pre-processing, clustering, and classifier. Concurr. Comput. Pract. Exp. 2022, 34, e6986. [Google Scholar] [CrossRef]
  91. Cao, B.; Zhao, J.; Lv, Z.; Liu, X.; Yang, S.; Kang, X.; Kang, K. Distributed parallel particle swarm optimization for multi-objective and many-objective large-scale optimization. IEEE Access 2017, 5, 8214–8221. [Google Scholar] [CrossRef]
  92. Wang, X.; Wang, F.; He, Q.; Guo, Y. A multi-swarm optimizer with a reinforcement learning mechanism for large-scale optimization. Swarm Evol. Comput. 2024, 86, 101486. [Google Scholar] [CrossRef]
  93. Bhattacharya, M.; Islam, R.; Abawajy, J. Evolutionary optimization: A big data perspective. J. Netw. Comput. Appl. 2016, 59, 416–426. [Google Scholar] [CrossRef]
  94. Yang, T.; Deng, Y.; Yu, B.; Qian, Y.; Dai, J. Local feature selection for large-scale data sets with limited labels. IEEE Trans. Knowl. Data Eng. 2022, 35, 7152–7163. [Google Scholar] [CrossRef]
  95. Liyanage, Y.W.; Zois, D.-S.; Chelmis, C. Dynamic instance-wise joint feature selection and classification. IEEE Trans. Artif. Intell. 2021, 2, 169–184. [Google Scholar] [CrossRef]
  96. Zhu, X.; Song, Y.; Wang, P.; Li, L.; Fu, Z. Data-driven adaptive and stable feature selection method for large-scale industrial systems. Control Eng. Pract. 2024, 153, 106097. [Google Scholar] [CrossRef]
  97. Sakivama, K.; Kato, S.; Ishikawa, Y.; Hori, A.; Monrroy, A. Deep learning on large-scale multicore clusters. In Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Lyon, France, 24–27 September 2018; pp. 314–321. [Google Scholar] [CrossRef]
  98. Ragala, R.; Kumar, G. Rank-based pseudoinverse computation in extreme learning machine for large datasets. Int. J. Innov. Technol. Explor. Eng. 2019, 8, 1341–1346. [Google Scholar] [CrossRef]
  99. Zhao, S.-X.; Wang, X.-Z.; Wang, L.-Y.; Hu, J.-M.; Li, W.-P. Analysis on fast training speed of extreme learning machine and replacement policy. Int. J. Wirel. Mob. Comput. 2017, 13, 314–322. [Google Scholar] [CrossRef]
  100. Yang, B. Application of matrix decomposition in machine learning. In Proceedings of the IEEE International Conference on Computer Science, Electronics and Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 24–26 September 2021; pp. 133–137. [Google Scholar] [CrossRef]
  101. Dereziński, M.; Mahoney, M.W. Recent and upcoming developments in randomized numerical linear algebra for machine learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), Barcelona, Spain, 25–29 August 2024; pp. 6470–6479. [Google Scholar] [CrossRef]
  102. Ahmad, A.; Pasha, M.A. Optimizing hardware-accelerated general matrix–matrix multiplication for CNNs on FPGAs. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 2692–2696. [Google Scholar] [CrossRef]
  103. Robinson, T.; Harkin, J.; Shukla, P. Hardware acceleration of genomics data analysis: Challenges and opportunities. Bioinformatics 2021, 37, 1785–1795. [Google Scholar] [CrossRef] [PubMed]
  104. Antunes, R.S.; da Costa, C.A.; Küderle, A.; Yari, I.A.; Eskofier, B. Federated learning for healthcare: Systematic review and architecture proposal. ACM Trans. Intell. Syst. Technol. 2022, 13, 1–23. [Google Scholar] [CrossRef]
  105. Bharati, S.; Mondal, M.R.H.; Podder, P.; Prasath, V.B.S. Federated learning: Applications, challenges, and future directions. Int. J. Hybrid Intell. Syst. 2022, 18, 19–35. [Google Scholar] [CrossRef]
  106. Joshi, M.; Pal, A.; Sankarasubbu, M. Federated learning for healthcare domain—Pipeline, applications, and challenges. ACM Trans. Comput. Healthc. 2022, 3, 1–36. [Google Scholar] [CrossRef]
  107. Li, H.; Li, C.; Wang, J.; Yang, A.; Ma, Z.; Zhang, Z.; Hua, D. Review on the security of federated learning and its application in healthcare. Future Gener. Comput. Syst. 2023, 144, 271–290. [Google Scholar] [CrossRef]
  108. Tian, F.; Yang, J.; Zhao, S.; Sawan, M. NeuroCARE: A generic neuromorphic edge computing framework for healthcare applications. Front. Neurosci. 2023, 17, 1093865. [Google Scholar] [CrossRef]
  109. Gautam, A.; Sharma, S. Artificial narrow intelligence-inspired neuromorphic computing for logic operations in healthcare appliances. In Proceedings of the 7th International Conference on Circuit Power and Computing Technologies (ICCPCT), Kollam, India, 8–9 August 2024; Volume 1, pp. 714–719. [Google Scholar] [CrossRef]
  110. Goyal, S.R. Neuromorphic system for real-time healthcare applications. In Primer to Neuromorphic Computing; Academic Press: Cambridge, MA, USA, 2025; pp. 83–96. [Google Scholar] [CrossRef]
  111. Cohen, S.; Leve, F.; Trannois, H.; Badreddine, W.; Legendre, F. A decision-making model based on spiking neural network (SNN) for remote patient monitoring. Int. J. Mach. Learn. Comput. 2023, 13, 82–90. [Google Scholar] [CrossRef]
  112. Yamazaki, K.; Vo-Ho, V.-K.; Bulsara, D.; Le, N. Spiking neural networks and their applications: A review. Brain Sci. 2022, 12, 863. [Google Scholar] [CrossRef] [PubMed]
  113. Shahid, A.; Mushtaq, M. A survey comparing specialized hardware and evolution in TPUs for neural networks. In Proceedings of the IEEE 23rd International Multitopic Conference (INMIC), Bahawalpur, Pakistan, 5–7 November 2020; pp. 1–6. [Google Scholar] [CrossRef]
  114. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. Proc. Int. Symp. Comput. Archit. 2017, 45, 1–12. [Google Scholar] [CrossRef]
  115. Azghadi, M.R.; Lammie, C.; Eshraghian, J.K.; Payvand, M.; Donati, E.; Linares-Barranco, B.; Indiveri, G. Hardware implementation of deep network accelerators towards healthcare and biomedical applications. IEEE Trans. Biomed. Circuits Syst. 2020, 14, 1138–1159. [Google Scholar] [CrossRef] [PubMed]
  116. Theodorakopoulos, L.; Theodoropoulou, A.; Stamatiou, Y. A State-of-the-Art Review in Big Data Management Engineering: Real-Life Case Studies, Challenges, and Future Research Directions. Eng 2024, 5, 1266–1297. [Google Scholar] [CrossRef]
  117. Huang, H.; Chow, E. Exploring the design space of distributed parallel sparse matrix–multiple vector multiplication. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 1977–1988. [Google Scholar] [CrossRef]
  118. Kang, H.; Kwon, H.C.; Kim, D. HPMaX: Heterogeneous parallel matrix multiplication using CPUs and GPUs. Computing 2020, 102, 2607–2631. [Google Scholar] [CrossRef]
  119. Liu, J.; Liang, X.; Ruan, W.; Zhang, B. High-performance medical data processing technology based on distributed parallel machine learning algorithm. J. Supercomput. 2022, 78, 5933–5956. [Google Scholar] [CrossRef]
  120. Sharma, S.K.; Dixit, R.J. Applications of parallel data processing for biomedical imaging. In Applications of Parallel Data Processing for Biomedical Imaging; IGI Global: Hershey, PA, USA, 2024; pp. 1–24. [Google Scholar] [CrossRef]
  121. Misra, C.; Bhattacharya, S.; Ghosh, S.K. STARK: Fast and scalable Strassen’s matrix multiplication using Apache Spark. IEEE Trans. Big Data 2020, 8, 699–710. [Google Scholar] [CrossRef]
  122. Foldi, T.; von Csefalvay, C.; Perez, N.A. JAMPI: Efficient matrix multiplication in Spark using barrier execution mode. Big Data Cogn. Comput. 2020, 4, 32. [Google Scholar] [CrossRef]
  123. Mishra, R. Parallel computing techniques for accelerating machine learning algorithms on big data. In Proceedings of the International Conference on Power, Energy, Environment and Intelligent Control (PEEIC), Greater Noida, India, 19–23 December 2023; pp. 669–672. [Google Scholar] [CrossRef]
  124. Jini, S.; Indra, N.C. Understanding the impact of data parallelism on neural network classification. Opt. Mem. Neural Netw. 2022, 31, 107–121. [Google Scholar] [CrossRef]
  125. Salih, S.Q.; Alsewari, A.A. A new algorithm for normal and large-scale optimization problems: Nomadic people optimizer. Neural Comput. Appl. 2020, 32, 10359–10386. [Google Scholar] [CrossRef]
Figure 1. Review outline.
Figure 2. Research directions: (a) Efficient techniques, arithmetic operations, and dimensionality reduction. (b) Advanced and specialized processing. (c) Clustering and parallel processing methods/frameworks.
Table 1. Summary of the recent healthcare big data review articles.
Reference | Aim | Findings
[25] | Investigate healthcare feature selection methods with a focus on high-dimensional microarray gene expression datasets of cancer disease. The study evaluated 132 publications from nine different research directions. | Based on an assessment from six key perspectives, the study highlights that no single feature selection algorithm is universally effective. It also emphasized that further research is required to address the computational intensity and time consumption of feature selection methods for high-dimensional datasets.
[26] | Explore how applications of machine learning algorithms in healthcare depend on the availability of medical datasets, as the process of data acquisition faces many challenges. | The study identified several potential research avenues that call for more research on data-efficient machine learning algorithms. The review suggested several solutions for the effective classification of small medical datasets and further suggested ML pipelines to address the challenges of large volumes of healthcare datasets.
[27] | Review, investigate, and evaluate the role of ML in the classification of biomedical image datasets, with emphasis on healthcare big data applications. The review also investigated the performance efficiency of ML using big data frameworks such as Apache Spark. | Efficient and streamlined diagnosis of a substantial volume of biomedical images requires the integration of ML and big data technologies. The proposed workflow (deep learning and Apache Spark) highlights several advantages, such as rapid medical data querying, improved training speed, and efficient management of vast healthcare datasets (both structured and unstructured).
[28] | Explore and investigate the applications of big data analytics techniques in healthcare, emphasizing the early detection, prediction, and prevention of diseases. The study examined five healthcare sub-disciplines: medical signal analytics, bioinformatics, image analysis and informatics, and public and clinical informatics. | The review highlighted several key applications of big data in healthcare, including personalized medicine, clinical decision support, operational optimization, and cost-effectiveness analysis. It demonstrates how big data analytics facilitates early patient identification for timely intervention, thereby enhancing clinical outcomes across medical domains. The study also examined several key challenges in adopting big data analytics in healthcare, including privacy concerns, efficient data processing, and data heterogeneity, while emphasizing the need for further research and exploration.
[29] | Review and examine the effectiveness of supervised and unsupervised machine learning on time series healthcare datasets, such as heart rate data. The review also analyzed the advantages and disadvantages of unsupervised methods when dataset labels are unavailable. | While these methods are effective for predictive analytics and diagnosis in several scenarios, the authors stressed the need for a collaborative approach between machine learning and data analytics to ensure the effective integration of ML algorithms into healthcare practices.
[30] | Review and analyze the performance of 41 research articles aimed at the prediction of cardiovascular diseases (CVD) using medical big data. | The authors stressed that the performance of the reviewed ML algorithms degrades notably when applied to larger sample sizes. In addition, many studies lacked consistent clinical relevance, rendering the application of the proposed methods in real-world healthcare scenarios more challenging. Based on several issues noted in the reviewed literature, the authors also suggested the need for efficient selection and hyper-parameter tuning techniques to enhance CVD prediction accuracy.
[31] | Examine a range of ML algorithms implemented on Field Programmable Gate Arrays and hybrid system-on-a-chip (SoC) platforms for the real-time classification required by high-velocity healthcare applications. The study primarily focused on tackling scalability challenges posed by embedded systems for biomedical applications, such as power, limited memory, and network sizes and topologies. | The study emphasized the significant real-time performance gains, with lower energy consumption, of the nine reviewed ML algorithms compared to traditional implementations. However, besides requiring low-level programming skills, many existing embedded systems are not designed for general scalable purposes and lack the flexibility to accommodate healthcare big data applications; new design patterns for ML algorithms and embedded systems (such as flexible FPGA architectures) are required.
[32]Explore the role of cloud infrastructure and machine learning algorithms in the lifecycle management of ML in healthcare and biomedical data. The study comprehensively reviewed the state-of-the-art architectural decisions necessary to ensure data privacy, security, and efficient management of AI-driven healthcare systems.The study highlighted several critical roles in realizing data-driven decision-making in healthcare big data. Among key findings is the distributed learning across decentralized structures for efficient high-velocity processing, data pipelines to ensure effective learning of large volumes of medical data and mitigation of ML bottlenecks, as well as integration of federated learning as a key aspect in enhancing medical collaboration, privacy, lifecycle management, and efficient AI-driven healthcare applications.
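To make the Spark-based workflow described in [27] and [32] concrete, the following is a minimal sketch of a distributed training pipeline using Spark MLlib. The file path, column names, and choice of classifier are illustrative assumptions, not details taken from the reviewed studies; any Spark 3.x installation should run it.

```python
# Minimal sketch (assumptions labeled): a distributed ML pipeline in Spark
# MLlib over a hypothetical CSV of tabular patient records with a binary
# "label" column.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("healthcare-ml-sketch").getOrCreate()

# Spark partitions the file across executors, so I/O and training scale
# with the cluster rather than with a single machine.
df = spark.read.csv("hdfs:///data/ehr_records.csv",  # hypothetical path
                    header=True, inferSchema=True)

feature_cols = [c for c in df.columns if c != "label"]  # assumed schema
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=feature_cols, outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)                 # distributed model fitting
model.transform(test).select("label", "prediction").show(5)
```

The same Pipeline object can be swapped to tree ensembles or wrapped in a cross-validator from spark.ml without changing the surrounding code, which is broadly the property such workflows rely on.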
Table 2. Limitations and assessment of efficient computations and techniques studies.

[54] Focus: Volume. Scalability: ●●●●●●. Healthcare Big Data Applications: ●●●●●●. Computational Resources Requirement: ●●●●●●.
  • The evaluated dataset is <20 K samples in size.
  • The presented results were obtained from a single-node evaluation; no multi-node or cluster evaluation was presented. A multi-node implementation can suffer severely from network overhead due to tree-based classifiers.

[57] Focus: Velocity. Scalability: ●○○○○○. Healthcare Big Data Applications: ●●○○○○. Computational Resources Requirement: ●●●○○○.
  • It is not clear whether the achieved velocity was due to the proposed feature selection approach or to the implementation of the ELM classifier.
  • The evaluation dataset has 80 features; high-dimensional medical datasets were not considered.
  • The proposed approach is not scalable due to the convolutional operations of the CNN architecture.

[58] Focus: Velocity. Scalability: ●●●●●●. Healthcare Big Data Applications: ●●●○○○. Computational Resources Requirement: ●●●●●●.
  • The proposed cloud-based ELM was evaluated with only 100 neurons; higher node counts should be tested, since ELM learning accuracy is substantially affected by the number of hidden nodes.
  • The WBCD dataset is small in size (569 × 30).

[59] Focus: Velocity. Scalability: ●●○○○○. Healthcare Big Data Applications: ●●○○○○. Computational Resources Requirement: ●●○○○○.
  • The study did not investigate learning rate and regularization techniques to guarantee future-proof performance of Incremental-ELM classifiers.
  • Model parallelism of IELM requires efficient communication: in addition to hidden layer weights, each node might build a different set of hidden nodes.

[60] Focus: Volume. Scalability: ●●●●●●. Healthcare Big Data Applications: ●●○○○○. Computational Resources Requirement: ●●○○○○.
  • The dataset's feature dimension is small (2000 features), and high-dimensional medical datasets were not considered.
  • Hadoop framework specifics were not given, such as the number of parallel mappers. The CPIO and HHO optimization algorithm parameters, such as the number of iterations, were also not clearly defined.
  • Scalability and performance consistency on high-dimensional datasets were not assessed, limiting the understanding of the approach's practical applicability in big data environments.

[61] Focus: Velocity. Scalability: ●●●●●●. Healthcare Big Data Applications: ●●●●●●. Computational Resources Requirement: ●○○○○○.
  • The performance gain in ML velocity from applying the proposed quantization technique on small datasets cannot be generalized to large-scale datasets.
  • The proposed approach can be parallelized using a multi-core CPU or multiple GPUs; however, the time assessment was performed using a single CPU only.

[62] Focus: Velocity. Scalability: ●○○○○○. Healthcare Big Data Applications: ●○○○○○. Computational Resources Requirement: ●●○○○○.
  • Models such as Gated Recurrent Units (GRUs) typically require high computational resources, which limits the proposed approach for real-time healthcare applications. Although an FPGA-accelerated GRU (FPGA-GRU) was implemented, its performance remains tied to the dimensionality of the input features. Moreover, while FPGA acceleration improves processing speed, it introduces scalability limitations when accommodating larger datasets or more complex healthcare big data environments.

[63] Focus: Volume. Scalability: ●●○○○○. Healthcare Big Data Applications: ●●○○○○. Computational Resources Requirement: ●●●○○○.
  • Assessment on high-dimensional datasets is required; the CKD dataset is small in size (400 × 24).
  • Recursive feature selection and the XGBoost classifier are both computationally intensive and hard to scale across multi-node settings.

[64] Focus: Volume. Scalability: ●○○○○○. Healthcare Big Data Applications: ●○○○○○. Computational Resources Requirement: ●●○○○○.
  • The proposed approach, which combines PCA, SVM, CDR, and KNN, is not suitable for big data applications.
  • The computational overhead can increase substantially when applied to bigger datasets.
  • Applications of parallel PCA or distributed KNN were not investigated in the study.

[65] Focus: Velocity. Scalability: ●●●●●●. Healthcare Big Data Applications: ●●●○○○. Computational Resources Requirement: ●●●●●●.
  • The iterative optimization of IELM to determine an optimal number of nodes and weights is rather slow for medical big data applications.
  • PCA's elimination of low-variance components can cause a loss of information important to medical classification. In addition, PCA cannot adequately capture important features with the non-linear relationships that exist in medical datasets (a minimal illustration follows this table).
  • The evaluated WBCD dataset is small in size (569 × 30).

●●●●●● Very high, ●●●○○○ Moderate, ●●○○○○ Low, ●○○○○○ Very low. Levels are assigned according to Very High (0.85–1.00), High (0.70–0.85), Moderately High (0.55–0.70), Moderate (0.40–0.55), Low (0.20–0.40), and Very Low (0.00–0.20).
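The PCA limitation noted for [65] can be demonstrated with a small, self-contained experiment: on data whose classes are separable only non-linearly, reducing to a single linear principal component destroys the class structure, while a kernelized variant preserves it. The dataset, kernel, and parameters below are illustrative stand-ins, not those of the reviewed study.

```python
# Minimal sketch: linear PCA versus RBF Kernel PCA on concentric circles,
# a classic case of purely non-linear class separation. Illustrative only.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)

for name, reducer in [("linear PCA", PCA(n_components=1)),
                      ("kernel PCA", KernelPCA(n_components=1,
                                               kernel="rbf", gamma=10.0))]:
    Z = reducer.fit_transform(X)
    acc = cross_val_score(LogisticRegression(), Z, y, cv=5).mean()
    # The linear projection typically lands near chance; the kernel one does not.
    print(f"{name}: mean CV accuracy on 1 component = {acc:.2f}")
```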
Table 3. Limitations and assessment of specialized hardware studies.

[69] Focus: Velocity. Scalability: ●●●●●●. Cost: ●●●●○○. Technical Difficulties: ●●●●○○.
  • Multi-node scalability is challenging and requires high-speed, efficient network bandwidth for the synchronization of neural gradient aggregation.
  • Although the proposed hybrid model reduced the number of trainable parameters compared to traditional deep learning models, the computational resource requirements of CNN and SVM training remain high, especially on large medical datasets.
  • The study did not investigate performance scalability on larger patient imaging datasets; the evaluation was performed on only one dataset of 5000 images.

[70] Focus: Velocity. Scalability: ●○○○○○. Cost: ●○○○○○. Technical Difficulties: ●●●●●●.
  • While the proposed single-node hybrid approach, which combines a GPU- and FPGA-accelerated deep learning model, appears effective in increasing training velocity by 40%, the single-node design constrains the system's adaptability to large-scale, high-volume healthcare big datasets.
  • The study reported an additional time consumption of ~9 s when processing a relatively small ensemble of four models. The multi-node computational and technical overhead of the proposed ensemble approach would offset the efficiency gains obtained through hardware acceleration.

[71] Focus: Velocity. Scalability: ●●●●○○. Cost: ●○○○○○. Technical Difficulties: ●●●○○○.
  • The MIT-BIH arrhythmia dataset, with only 48 records, is relatively small compared to the continuous multi-patient ECG streams generated by wearable telemetry heart monitoring devices.
  • Signal preprocessing was not integrated into the proposed hardware pipeline, potentially creating preprocessing bottlenecks when working with multiple parallel ECG streams.
  • The proposed serial-to-parallel I/O buffer system (UART and AXI bus) can introduce data flow limitations, especially when scaled to real-time parallel ECG data streams.

[73] Focus: Volume. Scalability: ●●●●●●. Cost: ●●●●○○. Technical Difficulties: ●●●●○○.
  • Scaling GPU-based Convolutional Neural Networks (CNNs) efficiently is challenging, requiring high-speed network bandwidth for synchronization and gradient aggregation, which elevates costs and system complexity. Additionally, distributed GPU systems lack fault tolerance, as a single-node failure often necessitates retraining the entire model.

[74] Focus: Velocity. Scalability: ●○○○○○. Cost: ●○○○○○. Technical Difficulties: ●●●○○○.
  • The classification process of the proposed FPGA-accelerated DCNN took approximately 15 s on the 48-record MIT-BIH dataset. This is relatively high, especially for applications of real-time parallel cardiac anomaly detection.

[75] Focus: Velocity. Scalability: ●○○○○○. Cost: ●○○○○○. Technical Difficulties: ●●●○○○.
  • The classification accuracy of KNN relies heavily on the chosen distance metric; this dependency can significantly impact the results of the proposed approach in specific healthcare big data applications (a minimal illustration follows this table).
  • KNN's inherent computational complexity scales poorly as the volume of healthcare medical data increases. In large-scale EHRs, genomics datasets, or real-time ECG streams, KNN's linear search becomes increasingly inefficient.

●●●●●● Very high, ●●●●○○ Moderately high, ●●●○○○ Moderate, ●○○○○○ Very low. Levels are assigned according to Very High (0.85–1.00), High (0.70–0.85), Moderately High (0.55–0.70), Moderate (0.40–0.55), Low (0.20–0.40), and Very Low (0.00–0.20).
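The distance-metric dependency noted for [75] is easy to observe directly: the same KNN classifier, on the same data, shifts in accuracy purely as a function of the metric. The sketch below uses scikit-learn's copy of the WBCD dataset (569 × 30) as a convenient public stand-in, not the data used in the reviewed study.

```python
# Minimal sketch: identical KNN setups differing only in distance metric.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

for metric in ("euclidean", "manhattan", "chebyshev", "cosine"):
    clf = make_pipeline(StandardScaler(),  # scaling matters for distances too
                        KNeighborsClassifier(n_neighbors=5, metric=metric))
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{metric:>10}: mean CV accuracy = {acc:.3f}")
```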
Table 4. Limitations and assessment of clustering and parallel processing studies.

[78] Focus: Volume. Scalability: ●●●●●●. Cost: ●○○○○○. Technical Difficulties: ●●○○○○.
  • It is unclear whether the performance increase is attributable to the feature selection technique or to the utilization of Apache Spark.
  • No theoretical contribution was made; all evaluated algorithms are based on the Spark MLlib library. Thus, applying the proposed approach with different learning algorithms requires reimplementation for Spark.
  • The study did not provide a scalability assessment across different configuration scenarios on large-scale datasets.

[79] Focus: Velocity. Scalability: ●●●●●●. Cost: ●●●○○○. Technical Difficulties: ●●●●○○.
  • The study did not assess scalability, an important requirement for big data analytics. The reported results are based on a single GPU and a dataset of only 200 images.
  • The limited validation procedure restricts real-world generalizability. In big data analytics contexts, data are typically multimodal (PET, CT, MRI), yet the proposed framework was evaluated only on PET scans. Inclusion of multimodal imaging might degrade framework accuracy and segmentation reliability.

[80] Focus: Velocity. Scalability: ●●●●●○. Cost: ●○○○○○. Technical Difficulties: ●○○○○○.
  • The rationale for setting the number of clusters to 10 in the improved MapReduce K-means algorithm was not given. Since the optimal number of clusters is data-dependent, an explanation such as elbow-method analysis or silhouette scores is necessary. Moreover, DBSCAN can be used in conjunction with K-means, given that DBSCAN automatically determines the cluster structure from the data density in the feature space. DBSCAN often outperforms K-means where clusters have irregular shapes or varying density; this is particularly valuable for healthcare big data applications, where medical datasets frequently exhibit class imbalance and complex feature spaces, and where methods based on linear or purely statistical relationships may not adequately capture the underlying patterns.
  • Due to its superior initialization strategy and fast convergence, K-means++ is often recommended over the standard K-means algorithm (a minimal illustration of both points follows this table).

[81] Focus: Volume. Scalability: ●●●●●●. Cost: ●○○○○○. Technical Difficulties: ●●○○○○.
  • The study exclusively evaluated the WBCD dataset, which is relatively small and lacks diversity in patient demographics.
  • The study lacks a clear discussion of the feature selection method used. Additionally, it claims to leverage Apache Spark for breast cancer detection, yet no clear implementation or assessment was given.
  • The reported 99.84% accuracy may indicate overfitting of the XGBoost model and necessitates further evaluation on different datasets and/or the use of cross-validation techniques such as k-fold.
  • The computational efficiency gained from Spark in terms of training/testing time was not adequately addressed, and the cluster/node configuration was not detailed.

[82] Focus: Velocity. Scalability: ●●●●○○. Cost: ●○○○○○. Technical Difficulties: ●○○○○○.
  • The KNN classifier is computationally expensive and not suitable for big data applications.
  • KNN classification accuracy and time efficiency are directly influenced by the density threshold and the cropping ratio of the proposed clustering technique. The study observed that beyond a certain percentage, the time efficiency of KNN cannot be improved without a notable decline in classification accuracy. This trade-off between computational efficiency and accuracy raises concerns regarding the suitability of the proposed approach for healthcare big data applications.

●●●●●● Very high, ●●●●●○ High, ●●●●○○ Moderately high, ●●●○○○ Moderate, ●●○○○○ Low, ●○○○○○ Very low. Levels are assigned according to Very High (0.85–1.00), High (0.70–0.85), Moderately High (0.55–0.70), Moderate (0.40–0.55), Low (0.20–0.40), and Very Low (0.00–0.20).
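The clustering points raised for [80] can be demonstrated in a few lines: K-means requires k to be fixed in advance (k-means++ only improves centroid initialization and convergence), whereas DBSCAN infers the number of clusters from density and handles irregularly shaped clusters. The synthetic two-moons data and the eps/min_samples values below are illustrative choices, not taken from the reviewed study.

```python
# Minimal sketch: k-means++ initialization versus density-based DBSCAN.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=1000, noise=0.06, random_state=0)

# K-means must be told k; k-means++ only improves the starting centroids.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0).fit(X)
print("k-means silhouette:", round(silhouette_score(X, km.labels_), 3))

# DBSCAN discovers the two crescent-shaped clusters without being told k;
# label -1 marks points treated as noise.
db = DBSCAN(eps=0.12, min_samples=10).fit(X)
n_found = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print("DBSCAN clusters found (k not supplied):", n_found)
```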