Article

LizAI XT—AI-Accelerated Management Platform for Complex Healthcare Data at Scale, Beyond EMR/EHR and Dashboards

by Trung Tin Nguyen 1,*,† and David Raphael Elmaleh 1,2,*,†
1 LizAI Inc., Newton, MA 02459, USA
2 Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(11), 275; https://doi.org/10.3390/bdcc9110275 (registering DOI)
Submission received: 19 May 2025 / Revised: 13 October 2025 / Accepted: 27 October 2025 / Published: 1 November 2025

Abstract

In this study, we present LizAI XT, an AI-powered platform designed to automate the structuring, anonymization, and semantic integration of large-scale healthcare data from diverse sources into one comprehensive table or any designated form, based on diseases, clinical variables, and/or other defined parameters, going beyond the creation of a dashboard or visualization. We evaluate the platform’s performance on a cluster of 4× NVIDIA A30 GPUs (24 GB each) with 16 diseases, ranging from deadly conditions such as cancer and COPD to common ones such as ear infections, covering a total of 16,000 patients, ∼115,000 medical files, and ∼800 clinical variables. LizAI XT structures data from thousands of files into sets of variables for each disease in one file, achieving >95.0% overall accuracy, with particularly high accuracy in complicated cases of cancers (99.1%), COPD (98.89%), and asthma (98.12%), without model overfitting. Data retrieval takes less than one second per variable per patient on this minimal GPU setup and can be improved significantly on more powerful GPUs. LizAI XT uniquely enables fully client-controlled data, complying with the strict data security and privacy regulations of each region/nation. Our advances complement the existing EMR/EHR, AWS HealthLake, and Google Vertex AI platforms for healthcare data management and AI development, and scale to any level, including HMOs, clinics, pharma, and government.

1. Introduction

Artificial intelligence (AI) in healthcare aims to accelerate scientific and technological innovation, improve diagnostic accuracy, reduce physician burnout, enhance precision treatment, and improve management at every level, from clinics and insurance companies to government. Despite the recent hype and technical advances in AI applications, some of which come from OpenAI [1], Google [2], Amazon [3], and Microsoft [4], the reliability and quality of their performance remain controversial for healthcare applications, mainly because of data accuracy. AI in healthcare, which directly concerns people’s well-being and lives, requires the highest accuracy, mostly higher than 80% for diagnosis and ranging from 90% to as high as 99% in clinical data retrieval [5,6,7,8]. The attributes critical to AI’s accuracy and success are the strength of the machine learning model, the quantity of data inputs, and, most importantly, a foundation of structured, high-quality data that adheres to five key criteria: completeness, correctness, concordance, plausibility, and currency [9,10,11].
Healthcare data comprises any information about a patient’s health and medical history, including but not limited to personal information, medical records, clinical data, and treatment plans. Clinical data, in turn, covers each individual’s lab results, X-rays, diagnostic imaging, and more, and exists in various formats, such as .txt, .doc, .xls, .pdf, and DICOM. In most cases, the data is stored and fragmented across multiple sources, such as devices, departments, and even different hospitals, which poses significant challenges to doctors when accessing and gaining insights from all the data to reach the best-fit treatment decision [12,13,14,15,16]. AI technologies that enable precise treatment and data assessment are in high demand to save lives, especially from life-threatening and/or chronic diseases such as cancers and COPD. Furthermore, AI technologies with real-time and wide integration of all individuals’ healthcare data at the department, hospital, and even regional and national levels will strengthen disease prediction, intervention efficiency, and cost-effectiveness.
Despite this potential and excitement, the applications of AI in healthcare are still in their early phases, mostly at small scale and in trials, particularly because of legal, ethical, and regulatory considerations around data security and privacy. Key regulatory principles include explicit consent, data minimization, access and rectification rights, strong security, and breach notification. Unauthorized sharing, commercial use, and manipulation are prohibited, with strict cross-border transfer controls. Thus, processing sensitive medical information requires explicit user consent and strict compliance with the laws governing data sharing between entities in each country, such as HIPAA in the USA [17] and GDPR in the EEA [18].
The existing technologies—EHRs and EMRs (Electronic Health Records, and Electronic Medical Records)—have been deployed in clinical practice, enabling clinical data collection, management, and sharing. Facile access to clinical data in these applications sparks increased interest in leveraging data for AI development. However, employing data from EHRs and EMRs in AI training has been hindered by various factors, mostly concerning their data quality and suitability. Burnum [19] reported that the adoption of EHRs has not enhanced data quality but instead resulted in a huge accumulation of poor-quality data. Overall, hospitals generate approximately 50 petabytes of data annually, 97% of which, however, remains unused due to its inaccessibility and unstructured nature, leading to siloed data [20]. Thus, valuable insights remain trapped within individual systems, further limiting the ability to make informed decisions based on a comprehensive view of patient health.
Major industry players, such as AWS HealthLake [3] and Google Vertex AI [2], offer platforms for data processing and AI development, partly addressing the gaps in EMRs and EHRs. To our knowledge, HealthLake focuses on structured clinical data such as FHIR, while Vertex AI supports general data, both structured and unstructured, yet is not specifically tailored for healthcare. Notably, using these platforms could raise regulatory compliance concerns when sensitive data is processed with cloud-based AI models, posing significant risks of data breaches, unauthorized access, and limited control for clients/governments. Although legal clarity is still awaited at the time of this report, the GDPR compliance inquiry opened by Ireland’s Data Protection Commission into Google in 2024, regarding the development of PaLM 2, is a notable example [21]. Previous efforts in healthcare data management have focused primarily on digital record systems such as Electronic Health Records (EHRs) and Electronic Medical Records (EMRs), as well as commercial solutions including AWS HealthLake and Google Vertex AI. While these tools enable storage and partial integration of structured data, they lack advanced mechanisms for multimodal data harmonization, semantic understanding, and client-side privacy control. Our approach distinguishes itself by combining large language models (LLMs), ontology-based frameworks, and multimodal AI pipelines in a fully client-controlled environment that ensures regulatory compliance and scalability.
To address these challenges, we have developed the AI-powered platform LizAI XT (as shown in Figure 1), which is engineered with natural language processing (NLP), image processing, large language models (LLMs), and advanced retrieval and data insights. LizAI XT is capable of mega-structuring a largely fragmented clinical database into a comprehensive table of all anonymized patients and variables per disease. The clinical data mega-structure can be designed in forms beyond a table, such as charts, graphs, knowledge interaction, and more, which further helps maximize accuracy for clinical data semantic search and management across various sources. Our technology supports both on-premises and cloud-based servers, which enables fully client/government-controlled healthcare data and meets diverse security and infrastructure needs by adhering to the strict data security, privacy, and regulatory standards, such as HIPAA and GDPR, of each region/country. We note that LizAI XT employs certified de-identification standards aligned with HIPAA §164.514(b)(2) and GDPR Recital 26, ensuring that no combination of structured or unstructured data can reasonably re-identify individuals. Anonymization is executed on-premises before data transfer or aggregation, minimizing cross-border data exposure. To ensure sustainability, LizAI XT integrates modular object storage with automated indexing and incremental update pipelines. These enable continuous ingestion and restructuring of new patient data without full reprocessing of existing datasets, ensuring the mega-structure remains current and computationally efficient.
In this report, we design the evaluation of LizAI XT with 16 diseases categorized into seven groups: Oncology, Respiratory conditions, Immunological and Allergic disorders, Neurological and Psychiatric conditions, Infectious and Inflammatory diseases, Reproductive Health, and Endocrine and Metabolic disorders. Our clinically relevant database consists of a total of 16,000 patients, ∼115,000 medical files, and ∼800 clinical variables, and is prepared based on real-life clinical events, guided by experts’ inputs and real-world statistics from sources such as the CDC and NIH. The overall accuracy of data structuring in LizAI XT is >95.0% across all diseases, and accuracy is particularly high in complicated diseases, namely colorectal cancer (99.12% ± 0.049%), prostate cancer (99.03% ± 0.08%), COPD (98.89% ± 0.076%), and asthma (98.12% ± 0.172%), without model overfitting. The data retrieval speed for a variable per patient is sub-second with a minimal cluster of 4× NVIDIA A30 GPUs (24 GB each), which can be improved substantially on more powerful GPUs. Beyond the creation of a dashboard or visualization tool, the primary objective of this study is to design and evaluate a comprehensive AI-powered framework, LizAI XT, that automates clinical data mega-structuring, de-identification, and semantic retrieval across multimodal healthcare datasets. The platform’s core function is to transform fragmented patient data from multiple systems into coherent, analyzable, and privacy-preserving datasets that support precision medicine, research, and decision-making.
These attributes support LizAI XT’s potential for deployment in the majority of clinics, with their various levels of IT infrastructure, and for the expected nationwide scalability. With the addition of a user-friendly interface, LizAI XT will complement existing platforms, such as EMR/EHR, AWS HealthLake, and Google Vertex AI, for healthcare data management and AI development.

2. Materials and Methods

2.1. Database Preparation

To create highly realistic patient data for evaluating LizAI XT, we leverage the Synthea™ Patient Generator (version 3.3.0, MITRE Corporation, McLean, VA, USA) [22], a well-established framework for producing synthetic healthcare data. Synthea’s Generic Module Framework (GMF) enables the simulation of diverse diseases and conditions, generating complete medical histories for synthetic patients from birth to the present. Each module replicates real-world clinical events, incorporating expert knowledge and statistical data from sources like the CDC [23] and NIH [24]. Additionally, the generated database adheres to standardized coding systems for laboratory results, clinical diagnoses, and medications, including LOINC [25], SNOMED-CT [26], RxNorm [27], and ICD-10 [28].
As this paper focuses on evaluating LizAI XT’s capability to structure unstructured medical data of various types and formats into datasets, we enhance the Synthea-generated data by developing a component that processes its output (e.g., JSON and FHIR) using a large language model (Qwen2.5-32B-Instruct, Alibaba Cloud, Hangzhou, China [29], deployed on our on-premises server) to enrich the clinical context. The enriched data is then transformed into multiple widely used healthcare formats, including FHIR, HL7, CSV, PDF, and TXT, as well as unstructured formats such as free-text clinical notes and imaging reports. Figure 2 shows an example of a generated medical report for a patient. While this study employs high-fidelity synthetic data (Synthea) to validate the platform’s technical performance, future work will include real-world clinical datasets obtained through institutional partnerships and IRB-approved data-use agreements. The inclusion of real data will allow benchmarking of clinical outcomes and imaging-based model evaluation, further strengthening the platform’s translational applicability.
LizAI XT’s multimodal integration—spanning text, imaging, and speech—provides a foundation for comprehensive patient-level analysis. By aligning all modalities under unified ontologies and knowledge graphs, the system ensures semantic consistency and interoperability across disparate healthcare information systems. Ongoing development focuses on extending this architecture for personalized analytics, federated learning, and incremental updates to maintain scalability as healthcare data volume continues to grow.
The number of files and variables is randomized to improve data generalization and ensure a diverse representation of clinical scenarios. Overall, we created a clinically relevant database containing records from 16,000 patients across 16 diseases. In total, the database comprises 112,711 medical files in various formats, including FHIR, HL7, .csv, .pdf, and .txt, as well as unstructured formats such as free-text clinical notes and imaging reports. It includes 781 clinical variables, with definitions provided in Supplementary Information—File S1. The diversity of data formats contributes to a comprehensive patient representation, which is essential for assessing LizAI XT’s performance.

2.2. Clinical Data Mega-Structure by LizAI XT

LizAI XT is designed to automate the structuring of medical data from multiple healthcare systems, handling various data types and formats, as shown in Figure 3. In particular, patient data is fragmented across multiple sources, including medical devices, departments, and even different hospitals [12,13,14,15,16]. This fragmentation poses significant challenges for doctors in accessing and synthesizing all relevant information to make the best treatment decisions. Once integrated into the healthcare infrastructure, LizAI XT automatically collects these fragmented data and structures them into clinical mega-structured datasets.
First, the system performs personal data anonymization (1), ensuring compliance with country/region data protection regulations, such as GDPR [18], HIPAA [17], and institutional policies. This process de-identifies sensitive patient information, reducing the risk of unauthorized access or data breaches. In hospital settings, where data can be shared across departments and facilities, anonymization enables secure access to lab results, imaging scans, and treatment histories while maintaining confidentiality. For example, it allows seamless collaboration between radiology, cardiology, and oncology without exposing personal details. This feature is optional and configurable, allowing institutions to tailor privacy measures to their specific regulatory and operational needs.
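The anonymization step (1) can be illustrated with a minimal sketch. This is not LizAI XT’s procedure and not a complete HIPAA or GDPR de-identification method; it only shows one common building block, replacing a patient identifier with a keyed pseudonym and dropping direct identifier fields. The field names, the `DIRECT_IDENTIFIERS` set, and the key are hypothetical.

```python
import hashlib
import hmac

# Fields treated as direct identifiers in this toy example (an assumption,
# not the list used by LizAI XT or required by any regulation).
DIRECT_IDENTIFIERS = {"name", "address", "phone"}


def pseudonym(patient_id: str, secret_key: bytes) -> str:
    """Derive a stable pseudonym with HMAC-SHA256 so records stay linkable."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize(record: dict, secret_key: bytes) -> dict:
    """Drop direct identifiers and replace the patient ID with its pseudonym."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["patient_id"] = pseudonym(record["patient_id"], secret_key)
    return cleaned


key = b"demo-key-kept-on-premises"  # placeholder; never hard-code a real key
record = {
    "patient_id": "MRN-0001",
    "name": "Jane Doe",
    "phone": "555-0100",
    "diagnosis": "COPD",
}
safe = anonymize(record, key)
print(safe)
```

Because the pseudonym is keyed, the same patient maps to the same token across departments, while the mapping cannot be recomputed without the key held by the institution.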
Then, the anonymized data is stored in a high-performance object storage system (2), which provides scalability, durability, and the ability to manage diverse medical data types such as imaging files, EHR records, and clinical notes. Object storage supports efficient indexing, metadata tagging, and integration with AI models and interoperability standards (FHIR and HL7). Depending on the institution’s IT and security requirements, it can be deployed on-premises for full control or in the cloud for greater scalability and accessibility.
Following secure storage, the system intelligently routes the data to specialized processing components based on its format and type (3). These components include natural language processing (NLP) for structured and unstructured text-based data, such as physician notes, discharge summaries, and pathology reports; computer vision for analyzing medical imaging, including X-rays, MRIs, and CT scans; speech processing for transcribing and interpreting audio-based clinical records, such as doctor–patient interactions and dictated reports; and multimodal processing for integrating complex, multi-source medical data streams (4). For example, a radiology report containing both free-text descriptions and associated DICOM images can be processed using a combination of NLP and computer vision to extract clinical insights. Additionally, LizAI XT ensures interoperability by supporting standardized healthcare data formats like HL7, FHIR, and DICOM, enabling seamless integration across hospital information systems (HIS), electronic health records (EHRs), and picture archiving and communication systems (PACS). This structured routing enhances the accuracy of AI-driven analytics, ensuring that each data type is processed optimally for downstream clinical applications.
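The routing in step (3) can be sketched as a simple dispatch on file format. The mapping below, the component names, and the `route` and `needs_multimodal` helpers are assumptions made for illustration; the paper does not publish the actual routing rules.

```python
from pathlib import Path

# Hypothetical mapping from file extension to processing component.
ROUTES = {
    ".txt": "nlp",
    ".pdf": "nlp",
    ".csv": "nlp",
    ".json": "nlp",    # e.g., FHIR resources serialized as JSON
    ".hl7": "nlp",
    ".dcm": "vision",  # DICOM images
    ".wav": "speech",
}


def route(filenames):
    """Group files by the component assumed to handle their format."""
    batches = {}
    for name in filenames:
        ext = Path(name).suffix.lower()
        component = ROUTES.get(ext, "unsupported")
        batches.setdefault(component, []).append(name)
    return batches


def needs_multimodal(batches):
    """A record set mixing text and images would go to multimodal processing."""
    return "nlp" in batches and "vision" in batches


batches = route(["report_001.txt", "scan_001.DCM", "dictation_001.wav", "labs.xyz"])
print(batches)
print(needs_multimodal(batches))
```

In this sketch the radiology case from the text, a free-text report plus a DICOM image, lands in both the `nlp` and `vision` batches and is flagged for multimodal processing.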
To enhance the accuracy and contextual relevance of structuring clinical variables, LizAI XT incorporates ontology-based frameworks and knowledge graphs (5), enriching its understanding of medical terminologies, relationships between clinical entities, and disease-specific variations. By leveraging standardized ontologies like LOINC [25], SNOMED-CT [26], RxNorm [27], and ICD-10 [28], the system ensures interoperability across EHRs and clinical databases while improving data consistency. This approach mitigates AI hallucinations by constraining outputs within validated medical knowledge, reducing errors and misinterpretations. For example, in oncology, it differentiates between similar terms like “neoplasm” and “benign lesion” ensuring precise clinical insights. Additionally, LizAI’s knowledge graphs enable inferential reasoning, helping identify related conditions, drug interactions, and disease progression patterns. This enhances clinical accuracy, supports decision-making, and ensures standardized data representation across diverse healthcare environments.
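One way to picture how an ontology constrains outputs is a lookup against a controlled vocabulary: an extracted term is accepted only if it resolves to a known concept. The sketch below is an assumption about the general technique, not LizAI XT’s implementation, and the codes are placeholders rather than real SNOMED-CT or ICD-10 identifiers.

```python
# Toy controlled vocabulary with placeholder codes. A real deployment would
# load a licensed terminology instead of this hand-written dictionary.
VOCABULARY = {
    "DEMO-001": {"label": "neoplasm", "synonyms": {"tumor", "tumour"}},
    "DEMO-002": {"label": "benign lesion", "synonyms": {"benign growth"}},
}


def normalize(term):
    """Lowercase and collapse whitespace so lookups are format-insensitive."""
    return " ".join(term.lower().split())


def build_index(vocabulary):
    """Map every label and synonym to its concept code."""
    index = {}
    for code, entry in vocabulary.items():
        index[normalize(entry["label"])] = code
        for synonym in entry["synonyms"]:
            index[normalize(synonym)] = code
    return index


def constrain(term, index):
    """Return the concept code for a known term, or None to flag it for review."""
    return index.get(normalize(term))


INDEX = build_index(VOCABULARY)
print(constrain("Tumour", INDEX))
print(constrain("  Benign   Lesion ", INDEX))
print(constrain("unicorn fever", INDEX))
```

Returning `None` for an unknown term, rather than guessing the nearest concept, is the design choice that keeps model output inside the validated vocabulary.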
Once the data has undergone its processing pipeline, it is further refined and transformed through an advanced embedding model (6), which converts complex medical information into structured representations optimized for downstream analytics and predictive modeling. This process enhances pattern recognition, enabling more accurate disease classification, risk stratification, and treatment response predictions. The refined data is then systematically stored in a structured format, ensuring efficient retrieval for clinical interpretation, decision support, and integration with AI-driven applications, such as automated diagnostics and personalized treatment recommendations.
In the final stage, LizAI XT automatically extracts disease-relevant clinical variables (7) and structures them into comprehensive, condition-specific datasets (8), ensuring standardized and interpretable data for clinical use. These curated datasets enable precise and efficient assessments by providing a consolidated view of patient health, supporting differential diagnosis, treatment planning, and outcome prediction. Additionally, they facilitate large-scale medical research, improve predictive analytics by enhancing AI model training with high-quality inputs, and integrate seamlessly with decision support systems for real-time clinical guidance. By transforming fragmented data into structured, actionable insights, LizAI XT enhances healthcare intelligence, optimizes operational workflows, and strengthens evidence-based medical decision-making across diverse healthcare settings.
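Steps (7) and (8) amount to pivoting extracted facts into one row per patient and one column per disease-relevant variable. A minimal sketch follows, assuming the upstream pipeline yields `(patient_id, variable, value)` triples; the triple format and the variable names are hypothetical.

```python
import csv
import io


def build_table(facts, variables):
    """Pivot (patient_id, variable, value) triples into one row per patient."""
    rows = {}
    for patient_id, variable, value in facts:
        if variable in variables:  # keep only disease-relevant variables
            rows.setdefault(patient_id, {})[variable] = value
    return [
        {"patient_id": pid, **{v: rows[pid].get(v, "") for v in variables}}
        for pid in sorted(rows)
    ]


def to_csv(table, variables):
    """Serialize the table as CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["patient_id", *variables], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table)
    return buffer.getvalue()


variables = ["psa_ng_ml", "gleason_score"]  # illustrative variable names
facts = [
    ("P002", "psa_ng_ml", "6.1"),
    ("P001", "psa_ng_ml", "4.2"),
    ("P001", "gleason_score", "7"),
    ("P001", "favorite_color", "blue"),  # not disease-relevant, dropped
]
table = build_table(facts, variables)
print(to_csv(table, variables))
```

Missing values are left as empty cells so that every patient row has the same columns, which is what makes the result usable as a single consolidated table.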

2.3. Accuracy Assessment of Structuring Clinical Variables

The performance of LizAI XT is assessed primarily based on the accuracy of the data structuring process for all patients across diseases. In this study, we use exact match accuracy to evaluate correctness, which measures the proportion of structured data that fully aligns with the ground truth. This metric is crucial for tasks requiring strict precision, such as clinical or regulatory data structuring, in which minor errors can impact outcomes [30,31,32,33,34,35]. The overall accuracy of LizAI XT’s performance across all 16 diseases is assessed as indicated in formula (1):
$$\text{Overall LizAI XT Accuracy} = \frac{\sum_{i=1}^{16} \text{Accuracy}_{\text{disease}_i}}{16} \times 100 \tag{1}$$
where $\sum_{i=1}^{16} \text{Accuracy}_{\text{disease}_i}$ is the sum of the accuracy values of the individual diseases, each of which is assessed as indicated in formula (2):
$$\text{Accuracy}_{\text{disease}} = \frac{\sum_{i=1}^{N} \text{Accuracy}_{\text{variable}_i}}{N} \times 100 \tag{2}$$
where $N$ is the total number of clinical variables per disease, and $\sum_{i=1}^{N} \text{Accuracy}_{\text{variable}_i}$ is the sum of the accuracy values of the clinical variables of that disease. Specifically, for each disease, we calculate the accuracy of each variable using formula (3):
$$\text{Accuracy}_{\text{variable}} = \frac{\sum_{i=1}^{M} \mathbb{1}(\hat{X}_i = X_i)}{M} \times 100 \tag{3}$$
where $X_i$ is the ground-truth value of the clinical variable for patient $i$, $\hat{X}_i$ is the value extracted by LizAI XT for patient $i$, $\mathbb{1}(\hat{X}_i = X_i)$ equals 1 if the value is correctly extracted and 0 otherwise, and $M$ is the total number of patients.
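Formulas (1) to (3) can be sketched in a few lines of Python. The sketch assumes the percentage scaling is applied once, at the variable level, so that the disease-level and overall values are plain means of percentages; the function names and the toy data are illustrative.

```python
def variable_accuracy(extracted, truth):
    """Formula (3): percent of patients whose extracted value equals the truth."""
    if len(extracted) != len(truth) or not truth:
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(1 for e, t in zip(extracted, truth) if e == t)
    return 100.0 * matches / len(truth)


def disease_accuracy(variables):
    """Formula (2): mean of per-variable accuracies for one disease.

    `variables` maps a variable name to an (extracted, truth) pair.
    """
    scores = [variable_accuracy(e, t) for e, t in variables.values()]
    return sum(scores) / len(scores)


def overall_accuracy(diseases):
    """Formula (1): unweighted mean of per-disease accuracies."""
    scores = [disease_accuracy(v) for v in diseases.values()]
    return sum(scores) / len(scores)


# Toy data: two diseases, values compared by exact match.
diseases = {
    "disease_a": {
        "var_1": (["x", "y", "z", "w"], ["x", "y", "z", "q"]),  # 3 of 4 correct
        "var_2": (["1", "2", "3", "4"], ["1", "2", "3", "4"]),  # 4 of 4 correct
    },
    "disease_b": {
        "var_1": (["a", "b"], ["a", "c"]),                      # 1 of 2 correct
    },
}
print(disease_accuracy(diseases["disease_a"]))  # 87.5
print(overall_accuracy(diseases))               # 68.75
```

Note that the overall figure is an unweighted mean over diseases, so a disease with few variables counts as much as one with many.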

Standard Error of Accuracy

We assess the reliability of the accuracy measurement with the standard error (SE) of accuracy, which represents the possible variation in accuracy across different samples. The standard error is calculated as indicated in formula (4):
$$SE = \sqrt{\frac{p(1-p)}{N}} \tag{4}$$
where $p$ is the accuracy proportion (e.g., 0.95 for 95% accuracy) and $N$ is the total number of samples; SE quantifies the uncertainty in the accuracy estimate. For instance, if LizAI XT achieves 95% accuracy (0.95) across 1000 patients, the standard error is 0.69%. This means the true accuracy is expected to fall within about ±0.69 percentage points (one standard error) of the reported value, indicating a high level of confidence in the measurement.
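The worked example for formula (4) can be checked with a short script; the function name is illustrative.

```python
import math


def standard_error(p, n):
    """Formula (4): standard error of a proportion p estimated from n samples."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must be in [0, 1] and n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


se = standard_error(0.95, 1000)
print(f"{se * 100:.2f}%")  # 0.69%
```

Since SE shrinks with the square root of N, quadrupling the number of patients would halve the standard error.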

3. Results

3.1. Preparation of Clinically Relevant Database for LizAI XT Performance Evaluation

We generated a clinically relevant database consisting of the records of a total of 16,000 patients across 16 diseases. Table 1 lists all studied diseases, each of which has 1000 patients, thousands of medical files, and numerous clinical variables. These variables model real-life clinical events, guided by experts’ inputs and real-world statistics from health organizations such as the CDC [23] and NIH [24]. Furthermore, this database follows standardized coding for laboratory results, clinical diagnoses, and medications, such as LOINC [25], SNOMED-CT [26], RxNorm [27], and ICD-10 [28].
The numbers of files and variables are randomized to ensure a realistic representation of clinical scenarios. Notably, the database of 112,711 medical files includes multiple types and formats, such as FHIR, HL7, .csv, .pdf, and .txt, as well as unstructured formats such as free-text clinical notes and imaging reports. There are also 781 clinical variables, whose definitions are given in Supplementary Information, File S1. The variety of data formats yields a comprehensive patient representation, which is crucial for assessing LizAI XT’s performance, while the variables differ across diseases based on complexity, data availability, and diagnostic requirements, enabling condition-specific evaluations.
In this study, we employ this database to evaluate different technical aspects of LizAI XT, including performance assessment in different diseases, overall accuracy, cross-checking various clinical variable groups, and identifying outliers to understand challenges that can help to improve data processing and enhance model robustness. Additionally, we categorized 781 clinical variables into ten sub-groups, as presented in Table 2, to investigate the adaptability of LizAI XT across diseases and variable types.

3.2. Clinical Data Mega-Structure by LizAI XT—A Case of Prostate Cancer

LizAI XT efficiently mega-structures all clinical data per disease into datasets by relevant variables (as reported in Table 1 and Supplementary Information, File S1) from the vastly fragmented database of 16,000 patients, ∼115,000 medical files, and ∼800 clinical variables. Figure 4 illustrates the data processing in LizAI XT and the output data-tables of some representative diseases. Our platform also includes a full anonymization procedure, which supports compliance with data privacy policies. Additionally, it is important to note that the outputs can be designed in other forms, such as graphs and knowledge relationships.
We additionally present a simplified data-table example of all prostate cancer patients mega-structured by some selected variables in Table 3.

4. LizAI XT Performance Evaluation

4.1. Overall LizAI XT Performance Accuracy

In our latest assessment, LizAI XT achieves an overall accuracy of 95.79% ± 5.69% when structuring a database of 16,000 patients, ∼115,000 medical files, and 781 clinical variables into datasets per disease by relevant variables (Figure 5). This result demonstrates the platform’s effectiveness across multiple diseases and a fragmented database. Notably, LizAI XT reaches its highest accuracy in complicated cases of colorectal cancer (99.12% ± 0.049%), prostate cancer (99.03% ± 0.08%), COPD (98.89% ± 0.076%), and contraceptives (98.28% ± 0.12%), each of which involves a complex clinical database and a varying number of variables. Only ear infections and bronchitis show accuracy below 90%, at 87.92% ± 0.176% and 87.57% ± 0.431%, respectively, which might be caused by outlier variables with accuracy below 85% arising from variability in clinical variables, such as broad symptomatology and overlapping diagnostic criteria (lists of variables in each disease are provided in Supplementary Information, File S1). Examples of broad symptomatology include bronchitis symptom variables, such as cough and shortness of breath, which may overlap with other respiratory conditions like asthma or pneumonia. Ear infection records, in turn, are documented inconsistently, for example as otitis media, ear pain, or effusion.

4.2. Analysis of Outliers in Accuracy and Their Impacts on the Overall LizAI XT Performance

We define variables with accuracy below 85% as outliers, a threshold selected based on the SE; these outliers affect the performance of LizAI XT in some diseases, such as bronchitis and ear infections. Notably, all variables in the colorectal cancer, prostate cancer, ADD, and COPD populations were structured at accuracy higher than 85% and are thus not included in the following analysis. Figure 6A reports the total number of outliers and their proportion in each disease, with the majority below 10% (lists of outliers in each disease are provided in Supplementary Information, File S2). Although their presence is minor, as the 45 outliers across 16 diseases account for only 5.76% of the nearly 800 variables (Figure 6B), we further investigate their impact in order to improve the platform’s performance. We thus examine these 45 outliers by category and identify the variables that contribute the most to the outlier group. Interestingly, names (medical), conditions, observations, care plans, and device variables are not among the outliers, indicating highly accurate performance in these categories (Figure 6C). The three largest contributors among the outliers are symptoms, medications, and immunizations, at 40%, 26.47%, and 24.44%, respectively.
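The outlier rule and the reported share can be sketched as follows. The 85% threshold and the counts 45 and 781 come from the text; the per-variable accuracy values and variable names are made up for illustration.

```python
THRESHOLD = 85.0  # accuracy (%) below which a variable counts as an outlier


def find_outliers(accuracy_by_variable, threshold=THRESHOLD):
    """Return the names of variables whose accuracy is below the threshold."""
    return sorted(
        name for name, acc in accuracy_by_variable.items() if acc < threshold
    )


def outlier_share(n_outliers, n_variables):
    """Percentage of variables that are outliers."""
    return 100.0 * n_outliers / n_variables


# Hypothetical per-variable accuracies for one disease.
accuracy_by_variable = {
    "symptom_cough": 71.4,
    "medication_dose": 84.9,
    "condition_copd": 99.2,
    "immunization_flu": 85.0,  # exactly 85% is not below the threshold
}
print(find_outliers(accuracy_by_variable))
print(f"{outlier_share(45, 781):.2f}%")  # 5.76%
```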
We further examine the average accuracy of the outlier variable categories to assess their potential impact and quantify its extent. Although these outliers fall below the maximum accuracy of 99%, their performance remains within an acceptable range (∼63% to ∼85%), as shown in Figure 6D, indicating that LizAI XT’s AI-powered model maintains a reasonable level of reliability. Statistical tests such as the t-test [36], the Mann–Whitney U test [37], and bootstrapping [38] indicate a significant difference between the outliers (scores below 85%) and the rest of the data; however, the practical impact is minimal. Cohen’s d of −0.26 falls within the small effect range (|d| < 0.3), meaning the shift in mean accuracy is minor and does not substantially affect LizAI XT’s overall performance [39], as shown in Figure 7. Figure 6E, in turn, illustrates the impact of the outliers on different diseases. The immunization and symptom categories appear as outliers in a larger number of diseases, while the others affect only a few.
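For readers unfamiliar with the effect size used here, Cohen’s d with a pooled standard deviation can be computed as in the sketch below. The paper does not state which two groups or which variant of d produced −0.26, so the pooled form and the toy numbers are assumptions and do not reproduce the reported value.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Toy data: means 4 and 6, both sample variances 4, so pooled SD is 2.
d = cohens_d([2, 4, 6], [4, 6, 8])
print(d)  # -1.0
```

A negative d simply means the first group has the lower mean; its magnitude is what is compared against the small-effect range.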

4.3. Speed of LizAI XT in Data Mega-Structure

We note that the investigation was performed on a 4× NVIDIA A30 (24 GB) GPU setup. In our latest assessment, LizAI XT achieves sub-second processing speed even with this minimal cluster. The data retrieval speed is measured as the inference time per clinical variable per patient; we continuously track GPU utilization using NVIDIA-SMI and measure comparison time to evaluate how quickly LizAI XT structures and matches clinical variables to the ground truth. These measurements demonstrate the platform’s capability for real-time clinical data structuring.

5. Discussion and Conclusions

Healthcare data fragmentation at all levels, from functional departments, clinics/hospitals, health organizations like CDC, to government, poses a significant challenge for AI-driven healthcare innovations, as machine learning models require structured and standardized data for meaningful insights [12,13,14,15,16,40,41,42,43]. AI technologies, which enable real-time, accurate, and efficient data integration of all individual’s healthcare data, are urgently needed. In this report, we introduce an innovative platform LizAI XT which mega-structures all fragmented databases from different sources into datasets and unlock clinically relevant information, thereby enabling advanced analytics, clinical decision support, and precision medicine. Furthermore, structured datasets per disease can accelerate scientific and technological innovations, improves diagnostic accuracy, reduces physician burnout, enhances precision treatment, and optimizes management across all levels of healthcare, from clinics and insurance companies to government.
LizAI XT’s clinical data mega-structure was accessed on a fragmented database of 16,000 patients, ∼800 clinical variables, and 115,000 medical files in different types and formats. Overall, LizAI XT is a robust and reliable platform consistently achieving average 99% accuracy, especially in complicated and chronic diseases, including colorectal cancer, prostate cancer, and COPD. Importantly, the database prepared for the LizAI XT performance assessment was completely blinded and has never been exposed to the platform’s AI-powered model. Thus, the accuracy values truly reflect LizAI XT applicability in diseases beyond the studied list without relying on an established memorization, confirming our data reliability for real-world clinical applications without overfitting. Sub-ninety accuracy values are recorded in some diseases, such as ear infections and bronchitis, due to the contribution of outlier variables—below 85%. The performance in these outlier variables is likely impacted by the technical challenges, such as ambiguity, overlap, inconsistent formatting, and broad symptomatology. Medical codes like “Encounter for problem”, for example, are broad and can apply to multiple conditions, making precise classification challenging. Similarly, names such as “Encounter Module Scheduled Wellness” may contain multiple components that vary across documentation systems, leading to inconsistencies. Additionally, some terms have multiple meanings depending on the context, and variations in healthcare documentation further complicate standardization. These factors make it harder for the model to distinguish between similar entities, requiring improved context-aware processing, entity linking, and standardization techniques to enhance accuracy. Improving specificity and standardization in these areas could further enhance LizAI XT’s performance.
LizAI XT is optimized for speed and real-time processing on a minimal server setup with 4x NVIDIA A30 24GB GPUs. Even in this streamlined environment, the platform processes data at sub-second speed per variable per patient. On more powerful infrastructure, such as dual Intel Xeon Platinum processors, 1 TB RAM, and NVMe SSD storage, LizAI XT scales to hospital networks, research institutions, and national healthcare databases. Data structuring speed can increase threefold with A100 GPUs, while upgrading to H100 GPUs can further enhance the performance sixfold, enabling near-instantaneous large-scale medical data processing. Together with the demonstrated high accuracy, this supports the deployment of LizAI XT on various IT infrastructures and its nationwide scalability.
The increasing digitalization of healthcare has led to widespread adoption of EHRs/EMRs worldwide, yet a significant portion of clinical data remains fragmented, unstructured, and underutilized. The AI-powered data mega-structure of LizAI XT complements EMR/EHR systems by addressing their limitations in data processing, especially the lack of deep AI integration for contextualizing and structuring free-text medical records [44,45,46,47]. For example, the clinical data of a cancer patient can be stored in the EHRs of different hospitals and originate from different sources. Important details, such as tumor diagnosis and progression, prior treatments, and doctors’ notes, are trapped in various file types and formats, including healthcare system notes, .pdf files, handwritten notes, DICOM images, and .xls files. Oncologists need all of these data to plan treatment but spend a significant amount of time manually searching and reading, even within EHRs. This reality delays decision-making despite the availability of EHRs. LizAI XT structures these data into a single consolidated data-table, or any other designated format, that includes all meaningful data of not only one but all patients. Oncologists can then make the best-fit treatment decision in a timely manner instead of going through thousands of medical files at the risk of missing important information.
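The consolidation idea can be sketched in a few lines. This is an illustration only, not the platform's implementation: it assumes the variables have already been extracted from each file, and the record layout, function name, patient IDs, and file names are all hypothetical. The "unknown" placeholder mirrors the missing values shown in Table 3.

```python
from collections import defaultdict

MISSING = "unknown"  # placeholder for variables with no extracted value


def consolidate(records, columns):
    """Merge per-file variable records into one row per patient."""
    values = defaultdict(dict)
    provenance = {}  # (patient, variable) -> source file, kept for validation
    for rec in records:
        values[rec["patient"]][rec["variable"]] = rec["value"]
        provenance[(rec["patient"], rec["variable"])] = rec["source"]
    table = [
        {"Anonymized ID": pid, **{col: found.get(col, MISSING) for col in columns}}
        for pid, found in sorted(values.items())
    ]
    return table, provenance


# Hypothetical records, as if already extracted from separate files.
records = [
    {"patient": "P_A", "variable": "iPSA", "value": "8.8 ng/mL", "source": "labs.pdf"},
    {"patient": "P_A", "variable": "Date of Biopsy", "value": "2020-05-16", "source": "notes.xls"},
    {"patient": "P_B", "variable": "iPSA", "value": "42 ng/mL", "source": "report.pdf"},
]
table, provenance = consolidate(records, ["iPSA", "Date of Biopsy"])
for row in table:
    print(row)
```

Keeping a separate provenance map is one simple way to let each cell be traced back to its source file; other designs, such as storing the source next to each value, would serve the same purpose.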
Importantly, LizAI XT enables fully client- or government-controlled healthcare data. It can be installed on both on-premises and cloud-based servers and meets diverse security and infrastructure needs by adhering to the strict data security, privacy, and regulatory standards of each region/country. In addition to its clinical specificity, LizAI XT’s features could outperform conventional AI for healthcare applications and the current approaches of big players such as Amazon and Google, and our technology can fulfill the needs of AI development.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/bdcc1010000/s1; File S1: Lists of clinical variables for each disease; File S2: Lists of outliers in each disease.

Author Contributions

T.T.N. and D.R.E. contributed equally to this article and in all the following categories: conceptualization; methodology; software; validation; formal analysis; investigation; resources; data curation; writing—original draft preparation; writing—review and editing; visualization; supervision; project administration; and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All data are included in the manuscript and Supplementary Materials. Additional data and information are available upon request.

Acknowledgments

The authors would like to acknowledge the contributions of all those who provided expertise and guidance in the development of LizAI XT, and thank Salomon M. Stemmer, Institute of Oncology, Rabin Medical Center and Tel Aviv University, Israel, for his valuable insights and support.

Conflicts of Interest

Authors Trung Tin Nguyen and David Raphael Elmaleh were co-founders of LizAI Inc. Authors Trung Tin Nguyen and David Raphael Elmaleh are inventors and have filed the US patent application number 19/087,980, which claims intellectual property for the platform published herein.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
EHR: Electronic Health Record
EMR: Electronic Medical Record
NLP: Natural Language Processing
LLM: Large Language Model
COPD: Chronic Obstructive Pulmonary Disease
HIPAA: Health Insurance Portability and Accountability Act
GDPR: General Data Protection Regulation
FHIR: Fast Healthcare Interoperability Resources
HL7: Health Level Seven
DICOM: Digital Imaging and Communications in Medicine
LOINC: Logical Observation Identifiers Names and Codes
SNOMED-CT: Systematized Nomenclature of Medicine Clinical Terms
ICD-10: International Classification of Diseases, 10th Revision
GPU: Graphics Processing Unit
AWS: Amazon Web Services
CDC: Centers for Disease Control and Prevention
NIH: National Institutes of Health

References

  1. OpenAI. ChatGPT. 2025. Available online: https://chatgpt.com/ (accessed on 15 January 2025).
  2. Google Cloud. Vertex AI: Machine Learning on Google Cloud. 2023. Available online: https://cloud.google.com/vertex-ai (accessed on 19 February 2023).
  3. Amazon Web Services. AWS HealthLake: Transforming Healthcare Data with AI. 2023. Available online: https://aws.amazon.com/healthlake/ (accessed on 19 February 2023).
  4. Microsoft. Copilot. 2025. Available online: https://copilot.microsoft.com/ (accessed on 15 January 2025).
  5. Liu, X.; Faes, L.; Kale, A.U.; Wagner, S.K.; Fu, D.J.; Bruynseels, A.; Mahendiran, T.; Moraes, G.; Shamdas, M.; Kern, C.; et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis. Lancet Digit. Health 2019, 1, e271–e297.
  6. Kalra, S.; Tizhoosh, H.R.; Shah, S.; Choi, C.; Damaskinos, S.; Safarpoor, A.; Shafiei, S.; Babaie, M.; Diamandis, P.; Campbell, C.J.; et al. Pan-cancer diagnostic consensus through searching archival histopathology images using artificial intelligence. npj Digit. Med. 2020, 3, 31.
  7. Rao, A.; Pang, M.; Kim, J.; Kamineni, M.; Li, W.; Prasad, A.K.; Landman, A.; Dreyer, K.J.; Succi, M.D. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J. Med. Internet Res. 2023, 25, e48659.
  8. University of North Carolina School of Medicine; IBM Watson Health. IBM’s Watson AI Recommends Same Treatment as Doctors in 99% of Cancer Cases. Futurism (29 November 2016) News Article. Dom Galeon (Reporter). Available online: https://futurism.com/ibms-watson-ai-recommends-same-treatment-as-doctors-in-99-of-cancer-cases (accessed on 15 January 2025).
  9. Weiskopf, N.G.; Weng, C. Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research. J. Am. Med. Inform. Assoc. 2013, 20, 144–151.
  10. The Role of Data in AI. 2025. Available online: https://www.gpai.ai/projects/data-governance/role-of-data-in-ai.pdf (accessed on 15 January 2025).
  11. Zha, D.; Bhat, Z.P.; Lai, K.H.; Yang, F.; Hu, X. Data-centric ai: Perspectives and challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM); SIAM: Philadelphia, PA, USA, 2023; pp. 945–948.
  12. Pirmani, A.; Oldenhof, M.; Peeters, L.M.; De Brouwer, E.; Moreau, Y. Accessible ecosystem for clinical research (federated learning for everyone): Development and usability study. JMIR Form. Res. 2024, 8, e55496.
  13. van de Sande, D.; van Genderen, M.E.; Huiskens, J.; Veen, R.E.R.; Meijerink, Y.; Gommers, D.; van Bommel, J. Generating insights in uncharted territories: Real-time learning from data in critically ill patients—An implementer report. BMJ Health Care Inform. 2021, 28, e100447.
  14. Sampson, R.; Shapiro, S.D.; He, W.; Denmark, S.; Kirchoff, K.; Hutson, K.; Paranal, R.; Forney, L.; McGhee, K.; Harvey, J. An integrated approach to improve clinical trial efficiency: Linking a clinical trial management system into the research integrated network of systems. J. Clin. Transl. Sci. 2022, 6, e63.
  15. Kern, L.M.; Ringel, J.B.; Rajan, M.; Casalino, L.P.; Pesko, M.F.; Pinheiro, L.C.; Colantonio, L.D.; Safford, M.M. Ambulatory care fragmentation and total health care costs. Med. Care 2024, 62, 277–284.
  16. Sainsbury, D.; Butterworth, S.; Fell, M.R.; Humphries, K.; Mehendale, F.V.; Richard, B. Towards breaking down cleft data silos to improve clinical research and patient outcomes. BMJ 2022, 378, o1799.
  17. U.S. Department of Health and Human Services. Health Insurance Portability and Accountability Act of 1996 (HIPAA), Public Law 104–191. 1996. Available online: https://aspe.hhs.gov/reports/health-insurance-portability-accountability-act-1996 (accessed on 6 March 2025).
  18. General Data Protection Regulation. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46. Off. J. Eur. Union 2016, L 119, 1–88.
  19. Burnum, J.F. The misinformation era: The fall of the medical record. Ann. Intern. Med. 1989, 110, 482–484.
  20. World Economic Forum. 4 Ways Data Is Improving Healthcare. 2025. Available online: https://www.weforum.org/ (accessed on 17 January 2025).
  21. Data Protection Commission. Data Protection Commission Launches Inquiry into Google AI Model. 2025. Available online: https://www.dataprotection.ie/en/news-media/press-releases/data-protection-commission-launches-inquiry-google-ai-model (accessed on 19 February 2025).
  22. Walonoski, J.; Kramer, M.; Nichols, J.; Quina, A.; Moesel, C.; Hall, D.; Duffett, C.; Blais, R.; Swain, A.; Clive, J.; et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J. Am. Med. Inform. Assoc. 2018, 25, 230–238.
  23. Centers for Disease Control and Prevention. Centers for Disease Control and Prevention Website. 2025. Available online: https://www.cdc.gov/ (accessed on 6 March 2025).
  24. National Institutes of Health. National Institutes of Health Website. 2025. Available online: https://www.nih.gov/ (accessed on 6 March 2025).
  25. Regenstrief Institute. Logical Observation Identifiers Names and Codes (LOINC). 2025. Available online: https://loinc.org/ (accessed on 6 March 2025).
  26. SNOMED International. Systematized Nomenclature of Medicine—Clinical Terms (SNOMED CT). 2025. Available online: https://www.snomed.org/ (accessed on 6 March 2025).
  27. National Library of Medicine. RxNorm: Standardized Nomenclature for Clinical Drugs. 2025. Available online: https://www.nlm.nih.gov/research/umls/rxnorm/ (accessed on 6 March 2025).
  28. World Health Organization. International Classification of Diseases, 10th Revision (ICD-10). 2025. Available online: https://www.who.int/classifications/icd/en/ (accessed on 6 March 2025).
  29. Alibaba Cloud. Qwen2.5: A Large Language Model. 2024. Available online: https://qwenlm.github.io/ (accessed on 7 March 2024).
  30. Tabei, Y.; Kiryu, H.; Kin, T.; Asai, K. A fast structural multiple alignment method for long RNA sequences. BMC Bioinform. 2008, 9, 33.
  31. Kim, D.; Langmead, B.; Salzberg, S.L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 2015, 12, 357–360.
  32. Shatsky, M.; Nussinov, R.; Wolfson, H.J. Optimization of multiple-sequence alignment based on multiple-structure alignment. Proteins Struct. Funct. Bioinform. 2005, 60, 751–767.
  33. Kehr, B.; Trappe, K.; Holtgrewe, M.; Reinert, K. Genome alignment with graph data structures: A comparison. BMC Bioinform. 2014, 15, 99.
  34. Chen, Y.; Hong, J.; Cui, W.; Zaneveld, J.; Wang, W.; Gibbs, R.; Xiao, Y.; Chen, R. CGAP-Align: A High Performance DNA Short Read Alignment Tool. PLoS ONE 2013, 8, e61033.
  35. Menke, M.; Berger, B.; Cowen, L. Matt: Local Flexibility Aids Protein Multiple Structure Alignment. PLoS Comput. Biol. 2008, 4, e1000100.
  36. Gosset, W.S. The probable error of a mean. Biometrika 1908, 6, 1–25.
  37. Mann, H.B.; Whitney, D.R. On a Test of Whether One of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Stat. 1947, 18, 50–60.
  38. Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26.
  39. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Routledge: New York, NY, USA, 1988.
  40. Reddy, S.; Rogers, W.; Makinen, V.P.; Coiera, E.; Brown, P.; Wenzel, M.; Weicken, E.; Ansari, S.; Mathur, P.; Casey, A.; et al. Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health Care Inform. 2021, 28, e100444.
  41. Gilvaz, A.; Abraham, G.M.; Radhakrishnan, S.; Hasmath, Z. From Admission to Discharge, How Artificial Intelligence Could Optimize Patient Care: A Brief Review. Am. J. Hosp. Med. 2019, 3, 2019.016.
  42. Khan, M.; Zhao, X. Drawbacks of Artificial Intelligence and Their Potential Solutions in the Healthcare Sector. Biomed. Mater. Devices 2023, 1, 731–738.
  43. Ogaga, M.; Zhao, X. The Rise of Artificial Intelligence and Machine Learning in HealthCare Industry. Int. J. Res. Innov. Appl. Sci. 2023, VIII, 250–253.
  44. Ramakrishnaiah, Y.; Macesic, N.; Webb, G.; Peleg, A.; Tyagi, S. EHR-ML: A generalisable pipeline for reproducible clinical outcomes using electronic health records. medRxiv 2024.
  45. Alanazi, A. Clinicians’ views on using artificial intelligence in healthcare: Opportunities, challenges, and beyond. Cureus 2023, 15, e45255.
  46. Deliberato, R.O.; Celi, L.A.; Stone, D.J. Clinical note creation, binning, and artificial intelligence. JMIR Med. Inform. 2017, 5, e7627.
  47. Ronquillo, C.E.; Mitchell, J.; Alhuwail, D.; Peltonen, L.M.; Topaz, M.; Block, L.J. The Untapped Potential of Nursing and Allied Health Data for Improved Representation of Social Determinants of Health and Intersectionality in Artificial Intelligence Applications: A Rapid Review. Yearb. Med. Inform. 2022, 31, 094–099.
Figure 1. Overview of the AI-powered LizAI XT platform for client-controlled healthcare data management, illustrating the complete workflow from data collection to insight generation. The data collection layer integrates heterogeneous healthcare information systems, such as HIS, EMR/EHR, DUR, PACS/RIS, LIS/Labs, and CRM aggregating both structured and unstructured data types, including FHIR, HL7, PDFs, free-text notes, and DICOM images. In the data management layer, fragmented medical data are anonymized, extracted, and transformed into unified clinical megastructures organized by diseases or variables, ensuring compliance with HIPAA and GDPR standards. Advanced AI components, including natural language processing, computer vision, and semantic indexing, harmonize terminology, eliminate redundancy, and establish interoperability across multiple healthcare systems. The platform supports deployment on both on-premises and cloud-based infrastructures, providing scalability and data sovereignty for institutions and governments. In the data insights layer, users can perform semantic search, analytics, and predictive modeling to derive diagnostic, therapeutic, and cost-effectiveness insights from the consolidated datasets. Together, these integrated processes demonstrate LizAI XT’s ability to convert fragmented healthcare data into secure, interpretable, and actionable intelligence for precision medicine and clinical decision support.
Figure 2. Some examples of our generated medical files for a patient: (A) a spreadsheet containing key clinical notes from doctors, (B) a scanned PDF of a checklist, (C) a printed PDF report summarizing the patient’s complete medical history, and (D) a PDF report of laboratory test results, covering macroscopic, microscopic, and chemical examinations.
Figure 3. System diagram of the clinical data mega-structure by LizAI XT, showing the automated workflow for data collection, anonymization, processing, and structuring across multimodal healthcare sources. The process begins with personal data anonymization (Step 1) to remove identifiable information and ensure compliance with HIPAA, GDPR, and institutional privacy standards. The high-performance object storage system (Step 2) securely houses anonymized data and enables scalable access to structured and unstructured formats, including clinical notes, EHRs, imaging files, and audio records. Depending on data type, the system then routes information through specialized processing modules (Steps 3 and 4):
• Text data (e.g., clinical notes and lab reports) are processed using NLP for entity extraction and normalization.
• Medical imaging data, including DICOM-formatted PET-CT, MRI, and CT scans, are analyzed using computer vision models that extract metadata and link imaging findings to textual reports.
• Speech data, such as physician dictations or consultations, are transcribed via speech-to-text algorithms and analyzed through NLP pipelines for concept identification.
Ontology and knowledge graph modules (Step 5) enrich contextual understanding by mapping data to medical standards such as LOINC, SNOMED-CT, RxNorm, and ICD-10, enhancing consistency and semantic accuracy. The embedding and refinement stage (Step 6) converts processed data into structured representations optimized for analytics and predictive modeling. Finally, Steps 7 and 8 extract disease-relevant clinical variables and compile them into comprehensive, condition-specific datasets that support research, diagnostics, and real-time decision support across diverse healthcare systems.
Figure 4. Illustration of data mega-structure by LizAI XT. Thousands of files and information of 16,000 patients in this study are fragmented in different types and formats, which can efficiently be structured into one data-table per disease by relevant variables.
Figure 5. Assessment of LizAI XT’s performance based on accuracy of data mega-structure. The accuracy is calculated for each disease, and the overall performance is calculated for structuring the entire database.
Figure 6. Assessment of 45 outliers (accuracy below 85%) in LizAI XT’s performance across 16 diseases. (A) Numbers and proportions (in percentage) of outlier variables in each disease. For the 45 outlier variables, we show (B) their share of the total 781 clinical variables in all diseases; (C) the contribution of each variable category (in percentage); and (D) the average accuracy score of each variable category. (E) Impact of each outlier category on the group of 16 diseases.
Figure 7. Visualizing the distributions of mean scores before and after removing outliers shows overlapping curves, reinforcing that their exclusion does not meaningfully alter overall accuracy. A truly significant impact would require Cohen’s d > 0.5 (medium) or 0.8 (large), which is not observed here.
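The effect-size check mentioned in the Figure 7 caption can be sketched as follows. This is a minimal illustration that assumes the standard pooled-standard-deviation form of Cohen's d and treats the two score sets as independent samples, which is a simplification since one set is derived from the other; the function names and all scores are hypothetical and are not the study's data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(before, after):
    """Cohen's d between two samples, using the pooled standard deviation."""
    n1, n2 = len(before), len(after)
    pooled_var = ((n1 - 1) * variance(before) + (n2 - 1) * variance(after)) / (n1 + n2 - 2)
    return (mean(after) - mean(before)) / sqrt(pooled_var)


def effect_label(d):
    """Label |d| with the 0.5 (medium) and 0.8 (large) cutoffs from the caption."""
    size = abs(d)
    if size > 0.8:
        return "large"
    if size > 0.5:
        return "medium"
    return "below medium"


# Hypothetical per-disease mean accuracies (percent), not the study's data.
with_outliers = [99.0, 98.0, 97.0, 95.0, 96.0]
without_outliers = [99.2, 98.1, 97.3, 95.4, 96.2]
d = cohens_d(with_outliers, without_outliers)
print(round(d, 3), effect_label(d))
```

With overlapping distributions like these, the shift in means is small relative to the spread, so d stays well under the 0.5 cutoff.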
Table 1. Summary of the diseases, with the number of patients, medical files, and clinical variables for each disease in our database. This clinically relevant database is a random mix of various clinical data types and formats, such as FHIR, HL7, .csv, .pdf, .txt, free-text clinical notes, and imaging reports, generated under the guidance of expert input and real-world statistics from health organizations such as the CDC and NIH. The database is used to evaluate LizAI XT’s data mega-structure performance in this report. The lists of clinical variables for each disease are given in Supplementary Information, File S1.
| Disease | Short Description | Number of Patients | Number of Medical Files | Number of Clinical Variables |
|---|---|---|---|---|
| Colorectal Cancer | Cancer affecting the colon or rectum. | 1000 | 5317 | 105 |
| Prostate Cancer | A common male cancer in the prostate gland. | 1000 | 22,781 | 50 |
| Contraceptives | Medications or devices used for birth control. | 1000 | 5718 | 67 |
| Female Reproduction | Conditions related to women’s reproductive health. | 1000 | 5102 | 25 |
| Gout | Arthritic condition caused by uric acid crystal buildup in joints. | 1000 | 1492 | 41 |
| Attention Deficit Disorder (ADD) | Neurodevelopmental disorder affecting focus and impulse control. | 1000 | 5549 | 41 |
| Epilepsy | Neurological disorder causing recurrent seizures. | 1000 | 6279 | 36 |
| COPD | Progressive lung disease causing breathing difficulties. | 1000 | 5327 | 75 |
| Asthma | Chronic condition causing airway inflammation and difficulty breathing. | 1000 | 8360 | 67 |
| Allergic Rhinitis | Inflammation of nasal passages due to allergens. | 1000 | 5397 | 42 |
| Bronchitis | Inflammation of bronchial tubes, leading to coughing and mucus production. | 1000 | 11,991 | 51 |
| Dermatitis | Inflammation of the skin causing redness and itching. | 1000 | 5229 | 42 |
| Atopy | Genetic tendency to develop allergic conditions. | 1000 | 4996 | 25 |
| Food Allergies | Immune response triggered by certain foods. | 1000 | 5480 | 35 |
| Appendicitis | Inflammation of the appendix, often requiring surgery. | 1000 | 5322 | 43 |
| Ear Infections | Infections of the middle ear, causing pain and fluid buildup. | 1000 | 8371 | 36 |
| Total | | 16,000 | 112,711 | 781 |
Table 2. The ∼800 clinical variables across all 16 diseases are categorized into ten groups, which supports the assessment of LizAI XT’s adaptability across diseases and variable types.
| Variable Category | Description | Examples |
|---|---|---|
| Immunizations | Administered vaccines of any kind. | DTaP, Influenza Vaccine |
| Codes (medical) | Medical encounter and/or procedure identifiers. | Encounter for Check-Up, Death Certification |
| Names (medical) | Titles of medical encounters. | Chemotherapy Encounter, Routine Colonoscopy Encounter |
| Medications (treatments) | Prescribed drugs or treatments. | Oxaliplatin Injection, Leucovorin Injection |
| Symptoms | Reported health complaints. | Abdominal Pain, Fatigue |
| Conditions | Diagnosed diseases and/or disorders. | Anemia (Disorder), Malignant Tumor of Colon |
| Observations | Recorded health measurements. | Hemoglobin Level, Pain Severity Score |
| Care plans | Structured treatment or health plans. | Cancer Care Plan, Healthy Diet |
| Procedures | Medical interventions or diagnostics. | Colonoscopy, Biopsy of Colon |
| Devices (methods) | Medical equipment for patient use. | Oxygen Concentrator (Physical Object), Wheelchair Accessory (Physical Object) |
Table 3. A sample mega-structured dataset for prostate cancer, showing how multiple variables (e.g., PSA level, biopsy date, imaging modality, and treatment details) are merged into one anonymized structure per patient. The data-table is simplified to selected variables for all prostate cancer patients. It allows semantic search and insights across all clinical data from vastly fragmented sources. This outcome can be designed in any form, such as graphs, knowledge relationships, and more. All data can be linked to the original sources for validation, and all clinical images can be shown on one page. Abbreviations: iPSA, initial prostate-specific antigen level; ISUP, International Society of Urological Pathology grade; ADT, androgen deprivation therapy.
| Anonymized ID and iPSA | ISUP Score in Biopsy Specimen | Date of Biopsy | Imaging for Primary Staging | ADT Duration | Other Systemic Therapy Primary Treatment | Radiation Prostate | Number of Pelvic Lymph Nodes in Imaging | Type of Local Salvage Treatment |
|---|---|---|---|---|---|---|---|---|
| P_84444 ng/mL | 8 (3 + 5) | 2017-07-31 | PSMA-PET/CT or PET/MR | None | Enzalutamide | None | 5 | Radiotherapy of the thoracic segment of the spinal column |
| P_3330 ng/mL | 10 (5 + 5) | 2015-12-11 | MRI of the pelvis–prostate | None | None | Yes | None | HDR |
| P_2728.8 ng/mL | 4 | 2020-05-16 | MRI of the pelvis–prostate | 9 months | Enzalutamide | Yes | None | None |
| P_22942 ng/mL | 5 | unknown | PET/CT scan | 5 months | Enzalutamide | Yes | None | Conventional fractionation IMRT combined with HDR |
| P_47847 ng/mL | 7 (3 + 4) | unknown | PET/CT imaging | None | None | Yes | 2 | SBRT plus HDR |
| P_3218 ng/mL | 3 | 2019-09-27 | PET/CT | 9 months | Enzalutamide | Yes | None | None |
| P_44138 ng/mL | 3 | unknown | MRI of the pelvis–prostate | None | None | Yes | None | SBRT plus HDR for 2 months |
| P_2218.8 ng/mL | None | 02.11.2022 | MRI of the pelvis–prostate | 4 years | None | Yes | None | Brachytherapy |
| P_84444 ng/mL | 8 (3 + 5) | 2017-07-31 | PSMA-PET/CT or PET/MR imaging | None | Enzalutamide | None | 5 | Radiotherapy of the thoracic segment of the spinal column |
| P_3330 ng/mL | 10 (5 + 5) | 2015-12-11 | MRI of the pelvis–prostate | None | None | Yes | None | HDR |
| P_22728 ng/mL | None | 2017-02-15 | PET/CT scan | None | None | Yes | 2 | IMRT (Intensity-Modulated Radiation Therapy) |
| P_42441 ng/mL | None | 2002-06-22 | MRI of the pelvis–prostate | None | Enzalutamide | None | None | STRING: Brachytherapy |
| P_6738.8 ng/mL | None | 2003-09-09 | PSMA-PET/CT or PET/MR | None | None | Yes | None | Brachytherapy monotherapy |
| P_2728.8 ng/mL | 4 | 2020-05-16 | MRI of the pelvis–prostate | 9 months | Enzalutamide | Yes | None | None |
| P_22942 ng/mL | 5 | unknown | PET/CT scan | 5 months | Enzalutamide | Yes | None | Conventional fractionation IMRT combined with HDR |
| P_47847 ng/mL | 7 (3 + 4) | unknown | PET/CT imaging | None | None | Yes | 2 | SBRT plus HDR |
| P_3218 ng/mL | 3 | 2019-09-27 | PET/CT | 9 months | Enzalutamide | Yes | None | None |
| P_44138 ng/mL | 3 | unknown | MRI of the pelvis–prostate | None | None | Yes | None | SBRT plus HDR for 2 months |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nguyen, T.T.; Elmaleh, D.R. LizAI XT—AI-Accelerated Management Platform for Complex Healthcare Data at Scale, Beyond EMR/EHR and Dashboards. Big Data Cogn. Comput. 2025, 9, 275. https://doi.org/10.3390/bdcc9110275
