Article

Collecting and Analyzing IBD Clinical Data for Machine-Learning: Insights from an Italian Cohort

by Aldo Marzullo 1,*,†, Victor Savevski 1,†, Maddalena Menini 1,2,†, Alessandro Schilirò 1, Gianluca Franchellucci 1,2, Arianna Dal Buono 1,2, Cristina Bezzio 1,2, Roberto Gabbiadini 1, Cesare Hassan 1,2, Alessandro Repici 1,2 and Alessandro Armuzzi 1,2,*
1 IRCCS Humanitas Research Hospital, Via Manzoni 56, Rozzano, 20089 Milan, Italy
2 Department of Biomedical Sciences, Humanitas University, Via Rita Levi Montalcini 4, Pieve Emanuele, 20072 Milan, Italy
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Data 2025, 10(7), 100; https://doi.org/10.3390/data10070100
Submission received: 17 February 2025 / Revised: 18 June 2025 / Accepted: 19 June 2025 / Published: 24 June 2025

Abstract

Research of Inflammatory Bowel Disease (IBD) involves integrating diverse and heterogeneous data sources, from clinical records to imaging and laboratory results, which presents significant challenges in data harmonization and exploration. These challenges are also reflected in the development of machine-learning applications, where inconsistencies in data quality, missing information, and variability in data formats can adversely affect the performance and generalizability of models. In this study, we describe the collection and curation of a comprehensive dataset focused on IBD. In addition, we present a dedicated research platform. We focus on ethical standards, data protection, and seamless integration of different data types. We also discuss the challenges encountered, as well as the insights gained during its implementation.

1. Introduction

Inflammatory Bowel Disease (IBD) research spans a wide range of disciplines, encompassing epidemiology, basic research, molecular bioscience, translational research, and the analysis of patient data to evaluate and compare therapeutic interventions. IBD care similarly integrates multiple fields, including ’omics’ sciences like genomics, to support comprehensive diagnostic and therapeutic approaches [1,2]. Within this realm, the structured collection and management of IBD data are crucial for advancing both research efforts and the development of machine-learning applications [2,3,4,5]. A well-organized repository of IBD data allows for training machine-learning algorithms, enabling the creation of sophisticated diagnostic tools and predictive models [6]. Moreover, curated datasets form the foundation for rigorous research studies, providing insights into disease mechanisms, treatment efficacy, and patient outcomes [2]. However, in a fully operational hospital setting, IBD data originate from diverse and often heterogeneous sources, creating challenges in harmonizing and standardizing this extensive information for research purposes. IBD patient records include a variety of data types—clinical records, imaging, pathology reports, and laboratory results—each with unique formats and coding standards. Integrating these disparate sources into a unified repository requires advanced data management strategies and interoperability solutions [3]. Additionally, IBD data frequently reside in silos across various departments and hospital systems, further complicating data organization efforts. Researchers often face fragmented data landscapes, making it challenging to access relevant information and slowing the pace of research. Establishing strong data governance frameworks and interoperability standards is essential to streamline data integration and promote collaborative research [7]. 
Despite these complexities, the benefits of a centralized IBD data repository are substantial. A standardized dataset improves data accessibility, promotes collaboration, and accelerates scientific discovery and clinical innovation. Organized IBD data enable researchers to uncover new insights into disease pathology, assess treatment outcomes, and understand patient responses, ultimately enhancing the quality of care of individuals affected by IBD.
This paper introduces the Humanendo-IBD dataset, a structured collection of heterogeneous data sources focused on IBD. The data set includes diverse types of information, from clinical records, imaging, and pathology reports to laboratory results, organized to ensure consistency, precision, and reliability. The dataset is designed to support both clinical and research applications, promote interoperability, and enable cross-institutional studies in IBD research. It can support several research questions, including: (i) which biomarkers predict response to biological therapy over time? (ii) Can baseline clinical or histological features identify patient subgroups with differing prognoses? (iii) How do clinical and laboratory variables evolve at follow-up in responders versus non-responders? (iv) Can histopathology images be used to discover features associated with inflammation severity or therapeutic resistance? In addition, we discuss the challenges encountered and the lessons learned throughout the data collection and curation process, addressing practical considerations for building a robust and privacy-compliant IBD dataset. While the Humanendo-IBD dataset is not currently publicly available due to patient privacy concerns and institutional restrictions, access can be granted for collaborative research upon reasonable request and following ethical approval. We acknowledge that open access would enhance reproducibility and are exploring GDPR-compliant solutions such as data enclaves or federated analysis platforms.
The paper is organized as follows: in Section 2, we describe the types of data collected and the technical details of the platform implementation with a focus on the user experience and privacy and security issues. In Section 3, we describe our dataset using basic statistics. Finally, in Section 4, we discuss our results, and we draw our conclusions in Section 5.

2. Materials and Methods

IBD activities generate a large set of heterogeneous data from different sources. Organizing and linking such data efficiently is critical for enabling research activities. Our dataset currently aggregates the following data types:
  • Clinical and demographic data sourced from clinical care systems, including endoscopic and treatment information
  • Digital pathology data, including histopathology images acquired at baseline
Data are collected over time so that the evolution of each patient’s clinical situation can be tracked. The dataset comprises clinical and biological information from patients diagnosed with IBD, specifically Crohn’s disease (CD) and ulcerative colitis (UC). Data are organized into two main components: baseline demographic and clinical data, and treatment information, with a follow-up recorded one year after baseline (±3 months). Table 1 provides a summary of the data structure. Adult patients (>18 years) with active luminal UC or CD were eligible. Inclusion required treatment with anti-TNF-α agents (biosimilars included) or Vedolizumab as first-line therapy, or Ustekinumab/Tofacitinib (UC) or Ustekinumab/Vedolizumab (CD) as second-line therapy. Patients must have undergone baseline endoscopy/biopsies under biologics, regardless of treatment start date. Exclusion criteria included lack of follow-up and prior colectomy for UC.
The baseline data include demographic variables such as gender, date of birth, and smoking status, alongside the initial diagnosis and diagnosis date. Family history of autoimmune conditions and specific clinical metrics are recorded, including weight, height, and disease-specific scores like the Mayo score for UC and the Harvey–Bradshaw Index for CD. Symptom data are also available, detailing indicators such as abdominal pain, rectal bleeding, and extra-intestinal manifestations (e.g., ocular involvement and skin conditions). The dataset further incorporates detailed laboratory and endoscopic data at baseline, including C-reactive protein (CRP) levels, fecal calprotectin, hemoglobin, serum creatinine, and liver enzymes. Endoscopic findings specify lesion characteristics such as the presence of ulcers and stenosis level. Additionally, histopathological data include digitized hematoxylin and eosin (H&E) stained images from biopsy samples collected at baseline, supporting the examination of tissue morphology and potential biomarker studies.
The treatment-data component provides comprehensive information on the administration of medications commonly prescribed for IBD management. This includes records on the use of aminosalicylates, antibiotics, corticosteroids, immunomodulators, and biologics (e.g., anti-TNF, anti-IL, and JAK inhibitors). For each therapeutic category, details such as the specific drug type, start and end dates, dosages, and any treatment interruptions or secondary loss of response are documented. These entries facilitate the longitudinal assessment of therapeutic strategies, patient adherence, and medication efficacy.
At follow-up, the dataset tracks clinical outcomes and updates on patient health status, including changes in clinical scores, symptom progression, and updated laboratory values. Each patient has one complete follow-up record at approximately one year (±3 months).
By capturing data at both baseline and follow-up, the dataset enables a robust evaluation of disease progression, therapeutic impact, and the identification of predictive biomarkers.
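The eligibility criteria and the follow-up window described above can be read as a simple cohort filter. The Python sketch below is illustrative only and is not the authors' implementation: the field names, the lowercase drug labels, the treatment of a follow-up outside the window as ineligible, and the choice to express the ±3-month tolerance as 92 days around a 365-day target are all assumptions. The "active luminal disease" and "baseline endoscopy under biologics" criteria are not modeled.

```python
from datetime import date

# Therapies allowed by treatment line, following the inclusion criteria in the text.
# The lowercase labels are hypothetical encodings, not the dataset's actual values.
FIRST_LINE = {"anti-tnf", "vedolizumab"}
SECOND_LINE = {
    "UC": {"ustekinumab", "tofacitinib"},
    "CD": {"ustekinumab", "vedolizumab"},
}


def follow_up_in_window(baseline, follow_up, target_days=365, tolerance_days=92):
    """Return True if follow-up falls about one year (+/- 3 months) after baseline.

    Assumption: "3 months" is approximated as 92 days.
    """
    elapsed = (follow_up - baseline).days
    return abs(elapsed - target_days) <= tolerance_days


def is_eligible(age, diagnosis, drug, line, prior_colectomy, baseline, follow_up):
    """Apply the age, therapy-line, colectomy, and follow-up criteria to one patient."""
    if age <= 18 or diagnosis not in SECOND_LINE:
        return False
    if diagnosis == "UC" and prior_colectomy:
        return False
    if follow_up is None or not follow_up_in_window(baseline, follow_up):
        return False
    if line == 1:
        allowed = FIRST_LINE
    elif line == 2:
        allowed = SECOND_LINE[diagnosis]
    else:
        allowed = set()
    return drug.lower() in allowed
```

For example, under these assumptions `is_eligible(34, "UC", "Tofacitinib", 2, False, date(2022, 1, 10), date(2023, 1, 5))` is accepted, whereas the same patient with a CD diagnosis is rejected because Tofacitinib is listed as second-line therapy for UC only.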

2.1. Data Source Harmonization

Data harmonization begins with a systematic survey of the existing data sources. Each group methodically explores its retrospective datasets and documents their attributes, including the composition of categorical data, variable parameters, patient cohort size, and temporal extent. Next, a comprehensive data dictionary is compiled, with the goal of delineating a core of information shared across units. Another important step is to ensure the congruence of data sources, which demands a substantial technical investment, as it requires an evaluation of the various clinical data collection tools and institutional databases. Critical to this effort is the creation of a universal patient identifier, which serves as the pivot for integration: by leveraging this identifier, the diverse data streams are unified. Missing data were not explicitly handled; instead, we retained all entries and tracked missingness patterns across variables. It is worth noting that such patterns mainly reflect systematic gaps due to free-text entries and non-standard documentation of clinical history, particularly for smoking status, symptom reports, and some laboratory values not relevant for the diagnosis. A visual summary of the pipeline steps is shown in Figure 1.
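To make the role of the universal patient identifier and of missingness tracking concrete, the following Python sketch joins records from two sources on a shared key and reports the fraction of patients lacking each variable. It is a minimal illustration under stated assumptions: the key name `patient_id`, the variable names, and the sample values are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def merge_by_patient_id(*sources):
    """Outer-join lists of record dicts on the shared 'patient_id' key."""
    merged = defaultdict(dict)
    for records in sources:
        for record in records:
            merged[record["patient_id"]].update(record)
    return dict(merged)


def missingness(merged, variables):
    """Fraction of patients for whom each variable is absent or None.

    Nothing is imputed or dropped; the gaps are only counted.
    """
    n = len(merged)
    if n == 0:
        return {var: 0.0 for var in variables}
    return {
        var: sum(1 for rec in merged.values() if rec.get(var) is None) / n
        for var in variables
    }


# Hypothetical example records from two source systems.
clinical = [
    {"patient_id": "P001", "diagnosis": "UC", "smoking_status": None},
    {"patient_id": "P002", "diagnosis": "CD", "smoking_status": "never"},
]
laboratory = [
    {"patient_id": "P001", "crp_mg_l": 12.4},
]

merged = merge_by_patient_id(clinical, laboratory)
report = missingness(merged, ["diagnosis", "smoking_status", "crp_mg_l"])
```

Keeping every patient in the join, even when a source has no matching record, mirrors the decision described above to retain all entries and track the gaps rather than discard incomplete cases.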

2.2. Software Components and Technical Details

The platform uses Microsoft Azure as the core infrastructure, leveraging its Data Lake capabilities as the DBMS for scalable and secure data storage. Azure services enable efficient handling of large, heterogeneous datasets. Python (https://python.org) was chosen for its rich ecosystem and strong compatibility with Azure tools for data processing and automation. These technologies provided an optimal balance between interoperability, scalability, and ease of integration with AI workflows. An Extract/Transform/Load (ETL) pipeline is set up for each data source. Oracle SQL serves as the primary query language for accessing and retrieving data from the hospital database, ensuring efficient and reliable data retrieval. Concerning digital pathology, the Philips Digital Pathology suite (https://www.philips.it/healthcare/solutions/pathology, accessed on 18 June 2025) is used for the management and analysis of digital pathology data, leveraging its advanced features for handling high-resolution images and extracting valuable insights from pathological specimens.
The Oracle SQL database was used as it is the standard DBMS integrated with our hospital’s electronic systems, ensuring secure, stable, and performant querying for large-scale clinical data.
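The extract, transform, and load stages mentioned above can be sketched in a few lines of Python. This is only a schematic illustration: an in-memory SQLite database stands in for the hospital's Oracle database, a JSON Lines string stands in for the Azure Data Lake target, and the table name, column names, and values are hypothetical.

```python
import json
import sqlite3


def extract(conn):
    """Extract: run a SQL query against the source database and return row dicts."""
    cur = conn.execute(
        "SELECT patient_id, crp, sampled_on FROM lab_results ORDER BY patient_id"
    )
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur]


def transform(rows):
    """Transform: rename fields to the shared data dictionary and coerce types."""
    return [
        {
            "patient_id": row["patient_id"],
            "crp_mg_l": float(row["crp"]),
            "sampled_on": row["sampled_on"],
        }
        for row in rows
    ]


def load(rows):
    """Load: serialize to JSON Lines (stand-in for writing to the data lake)."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Stand-in source database with two hypothetical laboratory results.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lab_results (patient_id TEXT, crp TEXT, sampled_on TEXT)")
conn.executemany(
    "INSERT INTO lab_results VALUES (?, ?, ?)",
    [("P001", "12.4", "2023-01-15"), ("P002", "3.1", "2023-02-02")],
)

output = load(transform(extract(conn)))
```

Keeping the three stages as separate functions makes it straightforward to define one such pipeline per data source, as described in the text, while sharing the transform rules derived from the data dictionary.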

2.3. Technical Details and Privacy

The platform runs on a set of secure Linux-based cloud virtual machines: one for the database and one for the platform back-end. The machines are equipped with 8 cores and 16 GB RAM. Data retrieval and harmonization procedures are implemented on premises. Pseudonymization is performed on the harmonized data before they are uploaded to the platform database, according to the European Data Protection Board guidelines [8]: for each patient, all directly identifying data are removed. Dates are shifted by a random number of days between −365 and 365, and only the month and year are retained. De-identification information is never shared outside the hospital’s servers.
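The date handling described above can be illustrated with a short Python sketch. It is not the platform's code: the list of direct identifiers and the field names are hypothetical, and the sketch assumes that a single random offset is drawn per patient (so that intervals between a patient's dates are roughly preserved) and that the shift is applied before truncation to month and year. The text does not specify either point.

```python
import random
from datetime import date, timedelta

# Hypothetical names of directly identifying fields to be dropped.
DIRECT_IDENTIFIERS = {"name", "surname", "tax_code"}


def shift_and_truncate(value, offset_days):
    """Shift a date by offset_days, then keep only year and month (YYYY-MM)."""
    return (value + timedelta(days=offset_days)).strftime("%Y-%m")


def pseudonymize(record, rng):
    """Drop direct identifiers and coarsen all dates of one patient record.

    Assumption: one offset in [-365, 365] is drawn per patient, not per date.
    """
    offset_days = rng.randint(-365, 365)  # randint is inclusive on both ends
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue
        if isinstance(value, date):
            cleaned[field] = shift_and_truncate(value, offset_days)
        else:
            cleaned[field] = value
    return cleaned


record = {
    "name": "Jane Doe",
    "diagnosis": "UC",
    "diagnosis_date": date(2019, 3, 14),
    "baseline_date": date(2023, 1, 15),
}
pseudonymized = pseudonymize(record, random.Random(42))
```

In a real deployment the offset (or the seed that generates it) would be part of the de-identification information that must stay on the hospital's servers, since it allows the original dates to be partially recovered.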

2.4. Ethics, Legal, and Data-Property Issues

The study was conducted in accordance with the Declaration of Helsinki. Ethical approval was obtained from the ethics committee at IRCCS Humanitas Research Hospital. Patients participating in the proof-of-concept study gave informed consent and were informed of their right to withdraw at any point. Platform users, including clinicians and researchers, are required to accept a user agreement outlining basic ethical principles. It is important to note that data access is restricted to users who have been authorized by the hospital.

3. Results

In this section, we summarize the dataset through its basic statistics. Table 2 and Table 3 report demographic distributions along with the Mayo score (for UC) and Harvey–Bradshaw Index (for CD) at baseline and follow-up. Table 4 and Table 5 capture the overall patient status, including smoking status, stool frequency, family history, and autoimmune associations, as well as the overall presence of IBD symptoms at baseline and follow-up. Finally, Table 6 and Table 7 describe the biological and treatment characteristics of the UC and CD cohorts at baseline and follow-up.
To assess potential redundancy among the clinical and laboratory variables, we performed a correlation analysis across key variables at baseline and follow-up. Pearson correlation was used for continuous variables and Cramér’s V for categorical variables. Preliminary results show that most biomarkers are not strongly collinear, suggesting that the dataset contains a sufficient range of non-redundant variables. Full correlation matrices are provided in the Supplementary Materials.
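For readers unfamiliar with the two measures, the sketch below computes the Pearson correlation coefficient for two continuous variables and Cramér's V for two categorical variables using only the Python standard library. It is a minimal illustration, not the analysis code used for the Supplementary Materials: it uses the plain (uncorrected) form of Cramér's V, and whether the authors applied a bias correction is not stated.

```python
import math
from collections import Counter


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cramers_v(a, b):
    """Uncorrected Cramer's V between two equal-length categorical sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    row_totals, col_totals = Counter(a), Counter(b)
    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for r, row_n in row_totals.items():
        for c, col_n in col_totals.items():
            expected = row_n * col_n / n
            chi2 += (joint[(r, c)] - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return float("nan")  # undefined when one variable is constant
    return math.sqrt(chi2 / (n * k))
```

Both measures are bounded (Pearson's r in [-1, 1], Cramér's V in [0, 1]), which makes them convenient for screening pairs of variables for redundancy; note that a low value indicates weak linear or categorical association, not statistical independence.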

4. Discussion

The collection of comprehensive clinical data for both ulcerative colitis and Crohn’s disease, including histopathology images, may provide valuable insights into disease progression and treatment outcomes. However, the process of collecting such data poses several technological and infrastructural challenges. This study highlights key obstacles faced during data collection and offers practical insights that could guide future research endeavors and help healthcare institutions optimize their data collection strategies.

4.1. Contribution

IBD research has significantly benefited from datasets focusing on various modalities such as clinical records, genomics, and imaging. For instance, the TriNetX IBD dataset [9] provides longitudinal EHR data filtered for IBD-relevant covariates, while the Gut Reaction project [10] integrates large-scale genomic and demographic data. Image-centric datasets like KVASIR [6] focus on gastrointestinal endoscopy, and the CHOC dataset offers histopathology slides for pediatric IBD [11]. However, these datasets are modality-specific, population-limited, or lack longitudinal biomarker tracking, restricting their potential for comprehensive disease modeling. This paper introduces a novel dataset that combines multimodal data, including clinical variables, longitudinal biomarker measurements, and histopathology images from IBD patients. Unlike existing resources, this dataset bridges the gap between imaging and clinical domains, enabling a holistic approach to understanding disease progression and treatment response. Compared to TriNetX, which focuses on EHR data, and Gut Reaction, which emphasizes genomic and demographic data, the Humanendo-IBD dataset integrates clinical, imaging, laboratory, and pathology information with temporal follow-up. Unlike KVASIR, which is limited to endoscopic image data, our dataset offers a multimodal perspective that supports comprehensive modeling of disease trajectories and treatment outcomes. It is worth noting that the limited sample size represents an important limitation of this study. This constraint arises from the fact that the cohort is highly uniform, includes a follow-up period, and specifically consists of individuals undergoing a biological treatment, which significantly reduced the initial IBD population available in our center. Furthermore, the dataset is derived from a single-center Italian cohort and may not capture the full demographic or geographic variability of the broader IBD population in Italy.
Therefore, we acknowledge that some categories of patients may be underrepresented. This limits the external generalizability of findings derived from the dataset. However, the platform we have developed may help overcome this limitation in the long term by enabling continuous updates and gradual enlargement of the sample size, thereby allowing for more comprehensive and generalizable insights in the future. Indeed, the modular architecture of our platform supports the addition of new patients and data types, such as multi-omics data or new imaging modalities. Using our ETL pipelines and standardized data models, the platform is designed to scale across institutions, paving the way for multicenter data federation. Finally, although the dataset itself is not openly released, the innovation of this work lies in demonstrating the feasibility of harmonizing multimodal, longitudinal data from a real-world hospital setting. The architecture, data schema, and integration strategies shared in this manuscript serve as a template for other institutions seeking to build similar datasets, offering direct utility to the research community even in the absence of full data release. Nevertheless, data are available upon specific request.

4.2. Technological and Infrastructural Challenges

The process of our data collection presents several technological and infrastructural challenges. One of the first challenges encountered was the technological infrastructure required to manage and store vast amounts of clinical, laboratory, and imaging data. It is important to note that the final population was the result of an extensive screening process that demanded considerable efforts in resource organization and management. Hospitals and research institutions often face limitations with existing data management systems [12], which may not be equipped to handle the diverse data types involved in IBD research. In this study, data from multiple sources—including clinical records, laboratory test results, treatment regimens, and histopathological images—needed to be integrated into a single research database. The lack of seamless interoperability between different healthcare systems was a major barrier. Electronic Health Record (EHR) systems often lack the flexibility required for easy data extraction and integration with other platforms, necessitating significant manual work in harmonizing the data. Moving forward, hospitals should adopt integrated systems that support cross-platform data exchange, which would reduce the time spent on data preparation and facilitate a smoother research process [13]. To this aim, collaborative data collection tools, such as cloud-based platforms, are suggested for facilitating coordination across different institutions and research teams. These platforms offer substantial advantages in terms of data accessibility and sharing. However, they also introduce challenges related to patient privacy and security [14,15,16]. Dedicated agreements with cloud providers are required to ensure secure communication and to allow the safe use of AI-based tools integrated within these platforms. Furthermore, the effective use of these tools presents its own set of challenges. 
Research teams—including clinical staff, pathologists, and data scientists—must be adequately trained in how to use these platforms, which can be resource-intensive. Additionally, a standardized approach to data entry is essential for maintaining consistency across all contributors. During data collection, inconsistencies in how data were recorded, particularly with free-text fields, required multiple revision steps to standardize the dataset. For example, smoking status and patient histories were inconsistently documented across different operators, leading to missing or incomplete data. To mitigate such issues, future research should implement standardized data entry templates and ensure comprehensive training for all research staff [17,18]. Large language models could serve this purpose. Automated data validation checks should also be integrated to flag inconsistencies or errors during the data entry process, thus reducing the need for manual revisions. In retrospect, employing digital pathology tools with standardized annotation features and AI-assisted data harmonization (such as locally deployed large language models [19]) could have streamlined this process, improving data quality and reducing human error. Finally, managing high-resolution whole-slide images presented storage and processing challenges. We addressed this by keeping pathology images on our internal storage systems, linking them with their corresponding patient using structured metadata. We utilized efficient image formats and pyramidal tiling to reduce memory usage and allow for multi-resolution access. However, more advanced technical solutions could be explored, such as the integration of image compression algorithms based on deep learning to optimize storage without compromising diagnostic fidelity.
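As a concrete illustration of the standardized data entry and automated validation checks recommended above, the following Python sketch maps free-text smoking status onto a small controlled vocabulary and flags records that need manual review. The vocabulary, the field names, and the plausibility rule for CRP are hypothetical examples chosen for illustration; they are not the rules used to curate the dataset.

```python
# Hypothetical mapping from free-text entries to a controlled vocabulary.
SMOKING_VOCABULARY = {
    "never": "never",
    "non-smoker": "never",
    "former": "former",
    "ex-smoker": "former",
    "current": "current",
    "smoker": "current",
}


def normalize_smoking(raw):
    """Return the controlled term for a free-text entry, or None if unrecognized."""
    if raw is None:
        return None
    return SMOKING_VOCABULARY.get(raw.strip().lower())


def validate_record(record):
    """Return a list of human-readable issues found in one patient record."""
    issues = []
    if normalize_smoking(record.get("smoking_status")) is None:
        issues.append("smoking_status: missing or unrecognized value")
    crp = record.get("crp_mg_l")
    if crp is not None and crp < 0:
        issues.append("crp_mg_l: negative value")
    return issues
```

For example, `normalize_smoking(" Ex-Smoker ")` yields `"former"`, while a record with `smoking_status` set to `"sometimes"` is flagged for review rather than silently coerced, so that the reason for the gap can still be documented.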

4.3. Data Governance and Quality Control

Beyond technological improvements, several insights can guide hospitals and research teams in collecting similar data in the future. Establishing a comprehensive data governance framework is essential to ensure data quality and consistency [12,20]. This framework should define protocols for data collection, standardization, and validation while assigning clear roles and responsibilities for data management. It is also crucial to establish a feedback loop between clinical and research teams to address issues that arise during data collection promptly. Hospitals should also prioritize the creation of systems to facilitate long-term data collection. The value of longitudinal data, particularly in chronic conditions like UC and CD, cannot be overstated. By implementing automated follow-up reminders and patient tracking systems, institutions can reduce the loss of follow-up and ensure consistent collection of valuable longitudinal data. Additionally, protocols should be developed to document the reasons for missing or incomplete data, thus allowing for more robust and actionable results. Finally, fostering greater collaboration and communication between researchers, clinicians, and administrative staff is essential to streamline data collection and maintain efficiency.

4.4. Informed Consent Process

A critical insight from this study is the importance of preparing hospitals for the informed consent process well in advance of data collection. Obtaining informed consent is a crucial step in any clinical research study, especially when handling sensitive health data. However, challenges were encountered during patient recruitment, as the informed consent process was occasionally delayed or inconsistent. These delays hindered the timely initiation of data collection and added unnecessary complexity to the workflow. To avoid such issues in future studies, hospitals should ensure that the informed consent process is streamlined and organized before data collection begins. A clear, standardized consent form covering all aspects of the study—such as data usage, patient privacy, and the inclusion of biological samples—will help prevent delays and ensure that patients are well-informed about their participation from the outset.

4.5. Biological Sample Handling and Digital Pathology

Finally, issues related to the storage and handling of biological samples, specifically histopathology specimens, presented additional challenges. While histopathology images were not analyzed in this study, difficulties arose in the storage process of biological samples. Ensuring the integrity and reliability of histopathology data depends on proper preservation of samples. In this case, inconsistencies in sample storage—such as temperature control and processing delays—led to complications in ensuring that all samples were suitable for future analysis. To address this issue, hospitals should digitalize histopathology images within the clinical workflow. Implementing digital pathology systems that capture high-resolution images at the point of care will not only streamline the data collection process but also reduce the risk of sample degradation or mishandling. This shift would help ensure that high-quality, analyzable images are consistently available for future research, reducing potential errors in sample management.

5. Conclusions

In this paper, a novel dataset for machine-learning-ready IBD research was presented. We integrated diverse and heterogeneous data sources, ranging from clinical records to imaging and laboratory results. The dataset is structured for machine-learning applications, including supervised classification of treatment response, regression models to predict biomarker levels, time-series modeling of disease evolution, and multimodal fusion approaches combining clinical, lab, and image features. Clinically, this dataset can inform the development of AI tools that predict early response to therapy, identify high-risk patients, and tailor treatment plans. It supports personalized medicine approaches by enabling patient stratification based on objective biomarkers, endoscopic features, and clinical profiles. We believe that the structure and depth of this dataset can serve as a model for similar data collection efforts in other centers, promoting multicenter collaborations and larger-scale studies. Such efforts could accelerate the validation and generalizability of predictive models, ultimately contributing to more precise and effective patient care across diverse clinical settings.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/data10070100/s1, Figure S1: Cramér’s V correlation matrices (categorical variables) for Ulcerative Colitis (UC) and Crohn’s Disease (CD) at baseline and follow-up; Figure S2: Pearson correlation matrices (continuous variables) for Ulcerative Colitis (UC) and Crohn’s Disease (CD) at baseline and follow-up.

Author Contributions

Conceptualization, A.M., V.S., C.H., A.R. and A.A.; Validation, A.D.B., C.B., R.G., C.H. and A.A.; Investigation, A.M.; Resources, V.S., M.M., A.S., A.D.B., C.B., R.G., C.H., A.R. and A.A.; Data curation, A.M., M.M., A.S. and G.F.; Writing—original draft, A.M.; Supervision, C.H., A.R. and A.A.; Project administration, A.M., V.S. and C.H.; Funding acquisition, V.S. and A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the 5 × 1000 research funding program awarded to IRCCS Humanitas for scientific research (A.F. 2019). The funder had no influence on any aspect of the study, including its design, data collection, analysis and interpretation, manuscript preparation, or the decision to publish.

Institutional Review Board Statement

The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Institutional Ethics Committee of IRCCS Humanitas Research Hospital (Prot. Nr. CE Humanitas ex D.M. 8/2/2013 82/23 and Prot. Nr. CET Lombardia 5 87/23).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

Author A.A. has received consulting and advisory board fees from AbbVie, Abivax, Alfa Sigma, AstraZeneca, Biogen, Boehringer Ingelheim, Bristol-Myers Squibb, Celltrion, Eli Lilly, Ferring, Galapagos, Gilead, Giuliani, Janssen, Lionhealth, Merck, Nestlé, Pfizer, Protagonist Therapeutics, Roche, Sanofi, Samsung Bioepis, Sandoz, Takeda, and Tillots Pharma. Additionally, A.A. has received speaker’s fees from AbbVie, Abivax, AG Pharma, Alfa Sigma, Biogen, Bristol-Myers Squibb, Celltrion, Eli Lilly, Ferring, Galapagos, Gilead, Janssen, Lionhealth, Merck, Novartis, Pfizer, Roche, Samsung Bioepis, Sandoz, Takeda, and Teva Pharmaceuticals, as well as research grants from Biogen, Merck, Pfizer, and Takeda. Author R.G. has received speaker’s fees from Pfizer, MSD, Ferring, Eli Lilly, and Celltrion, as well as consultant fees from Pfizer and AbbVie. Author A.D.B. has received speaker’s fees from AbbVie, Galapagos, Eli Lilly, and Celltrion, and consulting fees from Ferring. The authors declare no other conflicts of interest related to this manuscript.

References

  1. Kozarek, R. Basic research in endoscopy. Ital. J. Gastroenterol. Hepatol. 1999, 31, 743–748.
  2. Tabib, N.S.S.; Madgwick, M.; Sudhakar, P.; Verstockt, B.; Korcsmaros, T.; Vermeire, S. Big data in IBD: Big progress for clinical practice. Gut 2020, 69, 1520–1532.
  3. Rance, B.; Canuel, V.; Countouris, H.; Laurent-Puig, P.; Burgun, A. Integrating heterogeneous biomedical data for cancer research: The CARPEM infrastructure. Appl. Clin. Inform. 2016, 7, 260–274.
  4. Marzullo, A.; Moccia, S.; Calimeri, F.; De Momi, E. AIM in Endoscopy Procedures. In Artificial Intelligence in Medicine; Lidströmer, N., Ashrafian, H., Eds.; Springer: Cham, Switzerland, 2021; pp. 1–11.
  5. Gubatan, J.; Levitte, S.; Patel, A.; Balabanis, T.; Wei, M.T.; Sinha, S.R. Artificial intelligence applications in inflammatory bowel disease: Emerging technologies and future directions. World J. Gastroenterol. 2021, 27, 1920.
  6. Pogorelov, K.; Randel, K.R.; Griwodz, C.; Eskeland, S.L.; de Lange, T.; Johansen, D.; Spampinato, C.; Dang-Nguyen, D.T.; Lux, M.; Schmidt, P.T.; et al. Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In Proceedings of the 8th ACM on Multimedia Systems Conference, Taipei, Taiwan, 20–23 June 2017; pp. 164–169.
  7. Parasa, S.; Berzin, T.; Leggett, C.; Gross, S.; Repici, A.; Ahmad, O.F.; Chiang, A.; Coelho-Prabhu, N.; Cohen, J.; Dekker, E.; et al. Consensus statements on the current landscape of artificial intelligence applications in endoscopy, addressing roadblocks, and advancing artificial intelligence in gastroenterology. Gastrointest. Endosc. 2024, 101, 2–9.e1.
  8. Regulation, P. Regulation (EU) 2016/679 of the European Parliament and of the Council. Regulation 2016, 679, 2016.
  9. Real-World Data for the Life Sciences and Healthcare | TriNetX—trinetx.com. Available online: https://trinetx.com/ (accessed on 7 January 2025).
  10. Gut Reaction. Available online: https://bioresource.nihr.ac.uk/centres-programmes/ibd-bioresource/gut-reaction/ (accessed on 7 January 2025).
  11. Martin-King, C.; Nael, A.; Ehwerhemuepha, L.; Calvo, B.; Gates, Q.; Janchoi, J.; Ornelas, E.; Perez, M.; Venderby, A.; Miklavcic, J.; et al. Histopathology imaging and clinical data including remission status in pediatric inflammatory bowel disease. Sci. Data 2024, 11, 761. [Google Scholar] [CrossRef] [PubMed]
  12. Klumpp, M.; Hintze, M.; Immonen, M.; Ródenas-Rigla, F.; Pilati, F.; Aparicio-Martínez, F.; Çelebi, D.; Liebig, T.; Jirstrand, M.; Urbann, O.; et al. Artificial intelligence for hospital health care: Application cases and answers to challenges in European hospitals. Healthcare 2021, 9, 961. [Google Scholar] [CrossRef] [PubMed]
  13. Rajaei, O.; Khayami, S.R.; Rezaei, M.S. Smart hospital definition: Academic and industrial perspective. Int. J. Med. Inform. 2024, 182, 105304. [Google Scholar] [CrossRef] [PubMed]
  14. Khalid, N.; Qayyum, A.; Bilal, M.; Al-Fuqaha, A.; Qadir, J. Privacy-preserving artificial intelligence in healthcare: Techniques and applications. Comput. Biol. Med. 2023, 158, 106848. [Google Scholar] [CrossRef] [PubMed]
  15. Williamson, S.M.; Prybutok, V. Balancing privacy and progress: A review of privacy challenges, systemic oversight, and patient perceptions in AI-driven healthcare. Appl. Sci. 2024, 14, 675. [Google Scholar] [CrossRef]
  16. Bartoletti, I. AI in healthcare: Ethical and privacy challenges. In Proceedings of the Artificial Intelligence in Medicine: 17th Conference on Artificial Intelligence in Medicine, AIME 2019, Poznan, Poland, 26–29 June 2019; Proceedings 17; Springer: Berlin/Heidelberg, Germany, 2019; pp. 7–10. [Google Scholar]
  17. Misra, R.; Keane, P.A.; Hogg, H.D.J. How should we train clinicians for artificial intelligence in healthcare? Future Healthc. J. 2024, 11, 100162. [Google Scholar] [CrossRef] [PubMed]
  18. Patel, M.R.; Balu, S.; Pencina, M.J. Translating AI for the Clinician. JAMA 2024, 332, 1701–1702. [Google Scholar] [CrossRef] [PubMed]
  19. Wiest, I.C.; Ferber, D.; Zhu, J.; van Treeck, M.; Meyer, S.K.; Juglan, R.; Carrero, Z.I.; Paech, D.; Kleesiek, J.; Ebert, M.P.; et al. Privacy-preserving large language models for structured medical information retrieval. NPJ Digit. Med. 2024, 7, 257. [Google Scholar] [CrossRef] [PubMed]
  20. Compagnucci, M.C.; Wilson, M.L.; Fenwick, M.; Forgó, N.; Bärnighausen, T. AI in EHealth: Human Autonomy, Data Governance and Privacy in Healthcare; Cambridge University Press: Cambridge, UK, 2022. [Google Scholar]
Figure 1. Process implemented to create the final harmonized database. Step 1: Patient data meeting inclusion criteria were extracted from hospital databases. Step 2: Data were structured, cleaned, and harmonized to enable linking across different data sources and formats. Step 4: The harmonized data were securely uploaded to the Azure cloud platform for final storage and access.
Table 1. Overview of the Humanendo-IBD dataset structure.

| Data Component | Variable | Description |
|---|---|---|
| Clinical Data | gender | Patient gender |
| | patient_weight | Patient weight at baseline (kg) |
| | patient_height | Patient height (cm) |
| | date_of_birth | Date of birth |
| | baseline_diagnosis | Diagnosis (CD or UC) |
| | baseline_date_of_diagnosis | Date of initial diagnosis |
| | baseline_age_at_diagnosis | Age at initial diagnosis (years) |
| | baseline_history_of_autoimmune | History of autoimmune diseases (Yes/No) |
| | baseline_family_history | Family history of IBD (Yes/No) |
| | smoking_status | Smoking status (Yes/No) |
| | number_of_stools_per_day | Number of stools per day |
| | abdominal_pain | Presence of abdominal pain (Yes/No) |
| | abdominal_mass | Presence of abdominal mass (Yes/No) |
| | rectal_bleeding | Presence of rectal bleeding (Yes/No) |
| | arthralgia | Presence of joint pain (Yes/No) |
| | ocular_involvement | Ocular involvement (Yes/No) |
| | pyodema_gangrenosum | Presence of pyoderma gangrenosum (Yes/No) |
| | erythema_nodosum | Presence of erythema nodosum (Yes/No) |
| | new_fistula | Presence of a new fistula (Yes/No) |
| | harvey_bradshaw_index | Harvey–Bradshaw Index for CD patients |
| | mayo_score | Mayo score for UC patients |
| Laboratory Data | c_reactive_protein_crp | CRP level (mg/dL) |
| | fecal_calprotectin_fc | Fecal calprotectin level (µg/g of stool) |
| | biology_hemoglobin | Hemoglobin level (g/dL) |
| | biology_platelet_count | Platelet count (× 10⁹/L) |
| | biology_serum_creatinine_value | Serum creatinine value (mg/dL) |
| | biology_ferritin_level_value | Ferritin level (ng/mL) |
| | ast_aspartate_amino_value | Aspartate Aminotransferase level (U/L) |
| | alt_alanine_trans_value | Alanine Aminotransferase level (U/L) |
| | ggt_gamma_glutamyl_transf_value | Gamma-Glutamyl Transferase level (U/L) |
| | alp_alkaline_phosphatase_value | Alkaline Phosphatase level (U/L) |
| | biology_albumin_value | Albumin level (g/dL) |
| | biology_vitamin_b12_value | Vitamin B12 level (pg/mL) |
| | biology_vitamin_d_value | Vitamin D level (ng/mL) |
| Endoscopy Data | endoscopy_type | Type of endoscopy performed |
| | visit_date_endoscopy | Date of endoscopy |
| | aphthous | Presence of aphthous ulcers (Yes/No) |
| | ulcerations | Presence of ulcerations (Yes/No) |
| | stenosis_level | Level of stenosis if present |
| | stenosis_class | Classification of stenosis (e.g., inflammatory, fibrotic) |
| | uc_extension | Disease extension in UC patients |
| Virology Data | viral_sierology_hbv | Hepatitis B virus serology (Positive/Negative) |
| | viral_sierology_hcv | Hepatitis C virus serology (Positive/Negative) |
| | viral_sierology_ebv | Epstein-Barr virus serology (Positive/Negative) |
| Pathology and Imaging Data | pathology_digitised_he | Digitized H&E stained histopathological images |
| | biopsy_location | Biopsy location |
| Treatment Data | aminosalicylates_admin | Aminosalicylate administration (Yes/No) |
| | aminosalicylates_type | Type of aminosalicylate administered |
| | aminosalicylates_start_date | Start date of aminosalicylate treatment |
| | aminosalicylates_end_date | End date of aminosalicylate treatment |
| | aminosalicylates_dose | Dosage of aminosalicylate |
| | aminosalicylates_interruption | Treatment interruption (Yes/No) |
| | antibiotics_admin | Antibiotic administration (Yes/No) |
| | antibiotics_start_date | Start date of antibiotic treatment |
| | antibiotics_end_date | End date of antibiotic treatment |
| | antibiotics_dose | Dosage of antibiotics |
| | steroids_admin | Steroid administration (Yes/No) |
| | steroids_type | Type of steroid administered |
| | steroids_start_date | Start date of steroid treatment |
| | steroids_end_date | End date of steroid treatment |
| | steroids_dose | Dosage of steroids |
| | steroids_interruption | Steroid treatment interruption (Yes/No) |
| | immunomodulator_admin | Immunomodulator administration (Yes/No) |
| | immunomodulator_type | Type of immunomodulator used |
| | immunomodulator_start_date | Start date of immunomodulator treatment |
| | immunomodulator_end_date | End date of immunomodulator treatment |
| | anti_tnf_admin | Anti-TNF therapy administration (Yes/No) |
| | anti_tnf_type | Type of anti-TNF therapy |
| | anti_tnf_secondary_loss | Secondary loss of response (Yes/No) |
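
To illustrate how the variable definitions in Table 1 could be applied when preparing records for machine learning, the following minimal sketch checks one record against a handful of the listed fields. The `SCHEMA` subset, the `validate_record` helper, and the encoding of Yes/No fields as strings are assumptions made for illustration; they are not the platform's actual implementation.

```python
from datetime import date

YES_NO = {"Yes", "No"}


def _positive_number(value):
    """True for a strictly positive int or float (booleans are rejected)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool) and value > 0


# Hypothetical subset of the Table 1 variables, each mapped to a simple check.
SCHEMA = {
    "baseline_diagnosis": lambda v: v in {"CD", "UC"},
    "smoking_status": lambda v: v in YES_NO,
    "rectal_bleeding": lambda v: v in YES_NO,
    "patient_weight": _positive_number,  # kg
    "patient_height": _positive_number,  # cm
    "date_of_birth": lambda v: isinstance(v, date),
}


def validate_record(record):
    """Return a sorted list of problems found in one patient record."""
    problems = []
    for name, check in SCHEMA.items():
        if record.get(name) is None:
            problems.append(f"{name}: missing")
        elif not check(record[name]):
            problems.append(f"{name}: invalid value {record[name]!r}")
    return sorted(problems)


# Illustrative record only, not taken from the dataset.
example = {
    "baseline_diagnosis": "UC",
    "smoking_status": "No",
    "rectal_bleeding": "Yes",
    "patient_weight": 70.0,
    "patient_height": 172.0,
    "date_of_birth": date(1985, 3, 14),
}
print(validate_record(example))
```

Returning a list of problems rather than raising on the first one lets missing and malformed fields be counted per variable before any record is discarded.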
Table 2. Ulcerative colitis (UC) patient summary of baseline and follow-up data.

| Category | Female (F), Baseline | Male (M), Baseline | Female (F), Follow-Up | Male (M), Follow-Up |
|---|---|---|---|---|
| Gender Distribution | 35 (36%) | 63 (64%) | | |
| Average Age | 35.92 years | 30.83 years | | |
| Age Groups | 18–30 (21), 31–50 (42), 51–70 (21), 71+ (14) | | | |
| Height Groups | <150 cm (2), 150–170 cm (46), >170 cm (50) | | | |
| Patient Weight (kg) | Mean: 59.37, Range: 43.0–88.0 | Mean: 73.35, Range: 49.0–99.0 | Mean: 59.37, Range: 43.0–88.0 | Mean: 73.60, Range: 49.0–99.0 |
| Patient Height (cm) | Mean: 163.47, Range: 148.0–182.0 | Mean: 174.75, Range: 160.0–192.0 | | |
| Age at Diagnosis (years) | Median: 35.92, Range: 11–74 | Median: 30.83, Range: 0–77 | | |
| Mayo Score | Median: 3.0, Max: 3.0, Min: 2.0 | Median: 3.0, Max: 3.0, Min: 2.0 | Median: 1.0, Max: 3.0, Min: 0.0 | Median: 1.5, Max: 3.0, Min: 0.0 |
Table 3. Crohn’s disease (CD) patient summary of baseline and follow-up data.

| Category | Female (F), Baseline | Male (M), Baseline | Female (F), Follow-Up | Male (M), Follow-Up |
|---|---|---|---|---|
| Gender Distribution | 41 (36%) | 61 (62%) | | |
| Average Age | 32.17 years | 33.38 years | | |
| Age Groups | 18–30 (21), 31–50 (42), 51–70 (21), 71+ (14) | | | |
| Height Groups | <150 cm (2), 150–170 cm (46), >170 cm (50) | | | |
| Patient Weight (kg) | Mean: 63.61, Range: 41.0–104.0 | Mean: 74.38, Range: 45.0–105.0 | Mean: 59.37, Range: 43.0–88.0 | Mean: 73.60, Range: 49.0–99.0 |
| Patient Height (cm) | Mean: 162.78, Range: 150.0–180.0 | Mean: 175.52, Range: 160.0–198.0 | | |
| Age at Diagnosis (years) | Median: 32.0, Range: 11–54 | Median: 33.0, Range: 11–64 | | |
| Harvey–Bradshaw Index | Median: 5.5, Max: 17.0, Min: 0.0 | Median: 3.0, Max: 13.0, Min: 0.0 | Median: 3.0, Max: 14.0, Min: 0.0 | Median: 2.0, Max: 12.0, Min: 0.0 |
Table 4. Ulcerative colitis (UC) status summary (baseline and follow-up).

| Category | Status at Baseline | Status at Follow-Up |
|---|---|---|
| Smoking Status | Non-Smoker (46), Former Smoker (34), Active Smoker (16), Unknown (3) | Non-Smoker (43), Former Smoker (1), Active Smoker (14), Unknown (4) |
| Number of Stools per Day (Median) | 3.0 | 2.0 |
| Number of Stools per Day (Min–Max) | 2–7 | 1–10 |
| Family History | 13 | |
| History of Autoimmune Disease | 19 | |
| Pyoderma Gangrenosum | 0 | 0 |
| Erythema Nodosum | 0 | 0 |
| New Fistula | 0 | 0 |
| Patients Suffering Aphthous | 2 | 0 |
| Patients Suffering Ulcerations | 19 | 9 |
| Patients Suffering Abdominal Pain | 66 | 37 |
| Patients Suffering Rectal Bleeding | 64 | 26 |
| Patients Suffering Arthralgia | 57 | 47 |
| Patients Suffering Ocular Involvement | 0 | 0 |
| Patients Suffering Pyoderma Gangrenosum | 0 | 0 |
| Patients Suffering Erythema Nodosum | 0 | 0 |
| Patients Suffering New Fistula | 99 | 0 |
Table 5. Crohn’s disease (CD) status summary (baseline and follow-up).

| Category | Status at Baseline | Status at Follow-Up |
|---|---|---|
| Smoking Status | Non-Smoker (38), Former Smoker (34), Active Smoker (31), Unknown (3) | Non-Smoker (38), Former Smoker (30), Active Smoker (34), Unknown (3) |
| Number of Stools per Day (Median) | 2.0 | 2.0 |
| Number of Stools per Day (Min–Max) | 1–20 | 1–10 |
| Family History | 15 | |
| History of Autoimmune Disease | 11 | |
| Pyoderma Gangrenosum | 0 | 0 |
| Erythema Nodosum | 0 | 0 |
| New Fistula | 0 | 0 |
| Patients Suffering Aphthous | 0 | 0 |
| Patients Suffering Ulcerations | 0 | 0 |
| Patients Suffering Abdominal Pain | 66 | 37 |
| Patients Suffering Rectal Bleeding | 64 | 26 |
| Patients Suffering Arthralgia | 57 | 47 |
| Patients Suffering Ocular Involvement | 0 | 0 |
| Patients Suffering Pyoderma Gangrenosum | 0 | 0 |
| Patients Suffering Erythema Nodosum | 0 | 0 |
| Patients Suffering New Fistula | 0 | 0 |
Table 6. Summary of biological and treatment information for ulcerative colitis (UC) patients, including baseline and follow-up measurements of various biomarkers and treatment types.

| Category | Patients | Mean (Std) | Range |
|---|---|---|---|
| TREATMENT INFO | | | |
| Aminosalicylates | 19 | 3331.0 (2799.21) | 59.0–8135.0 |
| Antibiotics | 1 | 304.0 (0) | 304.0–304.0 |
| Steroids | 72 | 500.75 (1220.79) | 2.0–6962.0 |
| Immunomodulator | 49 | 837.14 (1352.2) | 1.0–6848.0 |
| Anti-TNF | 30 | 259.97 (304.8) | 18.0–1569.0 |
| Anti-IL | 1 | 166.0 (0) | 166.0–166.0 |
| Anti-Integrin | 8 | 321.88 (212.34) | 80.0–762.0 |
| JAK Inhibitor | 0 | | |
| BIOLOGICAL STATUS | | | |
| CRP Level at Baseline | 65 | 2.17 (4.1) | 0.02–21.79 |
| CRP Level at Follow-up | 90 | 0.7 (1.4) | 0.006–7.67 |
| Fecal Calprotectin at Baseline | 51 | 700.4 (1106.44) | 5.0–5583.0 |
| Fecal Calprotectin at Follow-up | 87 | 422.02 (776.86) | 5.0–6000.0 |
| Hemoglobin at Baseline | 84 | 13.62 (2.12) | 7.6–18.1 |
| Hemoglobin at Follow-up | 96 | 14.24 (1.43) | 10.1–17.0 |
| Platelet Count at Baseline | 79 | 307.57 (96.89) | 150.0–651.0 |
| Platelet Count at Follow-up | 94 | 267.6 (71.5) | 141.0–490.0 |
| Serum Creatinine at Baseline | 46 | 1.0 (0.43) | 0.2–2.58 |
| Serum Creatinine at Follow-up | 69 | 0.98 (0.25) | 0.54–2.38 |
| Ferritin Level at Baseline | 43 | 111.19 (166.65) | 2.61–1027.0 |
| Ferritin Level at Follow-up | 77 | 97.88 (80.0) | 4.1–430.0 |
| Aspartate Aminotransferase (AST) Baseline | 31 | 22.84 (9.22) | 12.0–44.0 |
| Aspartate Aminotransferase (AST) Follow-up | 37 | 24.41 (13.43) | 11.0–76.0 |
| Alanine Transaminase (ALT) Baseline | 32 | 33.91 (19.87) | 12.0–92.0 |
| Alanine Transaminase (ALT) Follow-up | 42 | 31.1 (19.81) | 6.0–110.0 |
| Gamma-Glutamyl Transferase (GGT) Baseline | 29 | 43.81 (48.47) | 10.0–208.0 |
| Gamma-Glutamyl Transferase (GGT) Follow-up | 45 | 31.04 (32.5) | 6.0–183.0 |
| Alkaline Phosphatase (ALP) Baseline | 25 | 88.94 (33.64) | 52.0–183.0 |
| Alkaline Phosphatase (ALP) Follow-up | 38 | 85.6 (29.63) | 38.0–209.0 |
| Albumin Level at Baseline | 12 | 23.05 (18.27) | 0.599–44.5 |
| Albumin Level at Follow-up | 5 | 3.58 (1.73) | 0.509–4.56 |
| Vitamin B12 Level at Baseline | 24 | 455.24 (165.51) | 168.0–771.0 |
| Vitamin B12 Level at Follow-up | 48 | 492.88 (128.48) | 113.0–694.0 |
| Vitamin D Level at Baseline | 17 | 30.54 (16.72) | 13.0–65.0 |
| Vitamin D Level at Follow-up | 53 | 33.95 (17.99) | 8.8–65.9 |
Table 7. Summary of biological and treatment information for Crohn’s disease (CD) patients, including baseline and follow-up measurements of various biomarkers and treatment types.

| Category | Patients | Mean (Std) | Range |
|---|---|---|---|
| TREATMENT INFO | | | |
| Aminosalicylates | 51 | 2073.61 (2597.28) | 4.0–8716.0 |
| Antibiotics | 6 | 35.17 (28.96) | 6.0–89.0 |
| Steroids | 64 | 526.77 (1611.94) | 1.0–8491.0 |
| Immunomodulator | 27 | 2558.7 (7858.55) | 29.0–41,086.0 |
| Anti-TNF | 42 | 1024.57 (955.05) | 79.0–3956.0 |
| Anti-IL | 11 | 323.91 (230.16) | 45.0–936.0 |
| Anti-Integrin | 5 | 768.0 (1085.09) | 54.0–2588.0 |
| JAK Inhibitor | 1 | 220.0 | 220.0–220.0 |
| BIOLOGICAL STATUS | | | |
| CRP Level at Baseline | 76 | 2.35 (3.81) | 0.0–19.7 |
| CRP Level at Follow-up | 103 | 1.43 (3.11) | 0.0–21.4 |
| Fecal Calprotectin at Baseline | 67 | 494.33 (902.57) | 12.5–5893.0 |
| Fecal Calprotectin at Follow-up | 89 | 198.73 (343.96) | 1.9–2000.0 |
| Hemoglobin at Baseline | 81 | 15.22 (14.69) | 10.2–145.2 |
| Hemoglobin at Follow-up | 100 | 13.99 (1.38) | 8.9–16.7 |
| Platelet Count at Baseline | 75 | 2176.65 (16,246.77) | 188.0–141,000.0 |
| Platelet Count at Follow-up | 98 | 292.56 (292.16) | 135.0–3058.0 |
| Serum Creatinine at Baseline | 46 | 0.89 (0.22) | 0.063–1.54 |
| Serum Creatinine at Follow-up | 72 | 0.96 (0.14) | 0.68–1.31 |
| Ferritin Level at Baseline | 49 | 84.57 (81.69) | 7.0–414.0 |
| Ferritin Level at Follow-up | 68 | 90.22 (72.95) | 9.0–424.0 |
| Aspartate Aminotransferase (AST) Baseline | 20 | 22.1 (8.97) | 11.0–52.0 |
| Aspartate Aminotransferase (AST) Follow-up | 41 | 25.45 (19.75) | 14.0–135.0 |
| Alanine Transaminase (ALT) Baseline | 23 | 29.3 (13.77) | 12.0–65.0 |
| Alanine Transaminase (ALT) Follow-up | 44 | 34.26 (21.04) | 13.0–105.0 |
| Gamma-Glutamyl Transferase (GGT) Baseline | 30 | 34.52 (46.73) | 9.0–247.0 |
| Gamma-Glutamyl Transferase (GGT) Follow-up | 45 | 37.94 (77.59) | 12.0–531.0 |
| Alkaline Phosphatase (ALP) Baseline | 18 | 85.06 (29.35) | 41.0–184.0 |
| Alkaline Phosphatase (ALP) Follow-up | 40 | 77.92 (10.02) | 54.0–108.0 |
| Albumin Level at Baseline | 8 | 26.14 (23.25) | 0.584–58.0 |
| Albumin Level at Follow-up | 6 | 81.91 (153.03) | 4.0–354.0 |
| Vitamin B12 Level at Baseline | 24 | 381.25 (190.03) | 125.0–781.0 |
| Vitamin B12 Level at Follow-up | 55 | 410.27 (167.8) | 86.0–550.0 |
| Vitamin D Level at Baseline | 19 | 35.11 (20.66) | 8.0–65.0 |
| Vitamin D Level at Follow-up | 56 | 30.92 (18.3) | 8.0–65.0 |
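
The Patients, Mean (Std), and Range columns of Tables 6 and 7 can be reproduced with a few lines of code. The sketch below is one possible way to do so, not the authors' analysis script: the `summarize` helper is hypothetical, missing measurements are assumed to be stored as `None`, and the sample standard deviation is assumed (reported as 0 when only one patient has a value, as in the single-patient rows of the tables).

```python
import statistics


def summarize(values):
    """Summarize one variable as Patients / Mean (Std) / Range.

    Missing measurements (None) are excluded from the patient count.
    Assumption: sample standard deviation, set to 0.0 for a single value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return {"patients": 0, "mean": None, "std": None, "range": None}
    std = statistics.stdev(observed) if len(observed) > 1 else 0.0
    return {
        "patients": len(observed),
        "mean": round(statistics.fmean(observed), 2),
        "std": round(std, 2),
        "range": (min(observed), max(observed)),
    }


# Illustrative values only, not taken from the dataset.
crp = [0.5, None, 2.0, 3.5, None]
print(summarize(crp))
```

Reporting the range next to the mean also makes implausible entries easy to spot; for example, the baseline hemoglobin maximum of 145.2 in Table 7 may reflect a unit mismatch rather than a true measurement in g/dL.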
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
