Data-Centric AI Manifesto: How Data Quality Drives Modern AI
Abstract
1. Introduction
1.1. Context: From Model-Centric AI to Data-Centric AI
1.2. Data Quality Indicators
- Data Quality is defined as the degree to which a dataset is fit for its intended use in a DCAI pipeline, assessed with respect to two primary indicators: free-of-error and relevancy. The free-of-error indicator captures data correctness (formally measured as the ratio of error-free data units to total data units) and directly conditions model robustness, because training on incorrect records propagates systematic prediction errors. The relevancy indicator ensures that only data pertinent to the target task is retained, preventing noise-induced overfitting and improving both the stability and generalizability of learned representations.
- Semantic Consistency is defined as the extent to which a dataset conveys meaning in a coherent and complete manner across its schema and instances. It is assessed through understandability and completeness. The understandability indicator reflects how clearly the data’s structure and semantics can be interpreted by human stakeholders and automated pipelines alike, directly increasing the interpretability of model outputs. Completeness—operationalized across schema, columns, and population levels via ratio-based metrics [4]—ensures that no systematic gaps exist across demographic or feature subgroups, thereby conditioning fairness outcomes: incomplete coverage of a protected group introduces representational bias into model training.
- Representativeness is defined as the degree to which a dataset faithfully reflects the target population and is accessible for analytical use, assessed through concise representation, consistent representation, and ease of manipulation. The concise representation indicator ensures that no redundant or irrelevant attributes inflate the feature space, improving model interpretability. The consistent representation indicator (measured as one minus the ratio of constraint violations, e.g., referential integrity breaches, to total consistency checks) prevents conflicting encodings of the same real-world entity, which would otherwise pose a direct threat to robustness. The ease-of-manipulation indicator reflects how easily stakeholders can process and transform the data for downstream uses. It is operationally linked to fairness, as datasets that are difficult to audit or rebalance resist bias correction interventions.
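The ratio-based indicators described above can be illustrated with a minimal sketch. The function names, the toy records, and the two validation rules are hypothetical and chosen only for illustration. The consistency score is taken here as one minus the violation ratio so that higher values mean better quality, which is an assumption of this sketch.

```python
from typing import Callable, Optional, Sequence

Record = dict


def free_of_error_ratio(records: Sequence[Record],
                        is_valid: Callable[[Record], bool]) -> float:
    """Error-free data units divided by total data units."""
    if not records:
        return 1.0
    return sum(1 for r in records if is_valid(r)) / len(records)


def column_completeness(records: Sequence[Record], column: str) -> float:
    """Share of records holding a non-missing value in `column`."""
    if not records:
        return 1.0
    return sum(1 for r in records if r.get(column) is not None) / len(records)


def consistency_ratio(records: Sequence[Record],
                      checks: Sequence[Callable[[Record], bool]]) -> float:
    """One minus (violated checks / total checks performed)."""
    total = len(records) * len(checks)
    if total == 0:
        return 1.0
    violations = sum(1 for r in records for check in checks if not check(r))
    return 1.0 - violations / total


# Hypothetical toy data: one invalid age, one missing age,
# and one country/currency mismatch.
records = [
    {"age": 34, "country": "IT", "currency": "EUR"},
    {"age": -5, "country": "IT", "currency": "EUR"},
    {"age": 51, "country": "IT", "currency": "USD"},
    {"age": None, "country": "FR", "currency": "EUR"},
]


def valid_age(r: Record) -> bool:
    age: Optional[int] = r["age"]
    return age is not None and 0 <= age <= 120


def currency_matches_country(r: Record) -> bool:
    expected = {"IT": "EUR", "FR": "EUR"}
    return r["currency"] == expected[r["country"]]


print(free_of_error_ratio(records, valid_age))
print(column_completeness(records, "age"))
print(consistency_ratio(records, [currency_matches_country]))
```

In this sketch a missing value counts as an error for the free-of-error ratio. A real pipeline might instead score missingness only under completeness.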
1.3. DCAI Formalization
- $x$ is the feature vector (input/covariates);
- $y$ is the label (output/target variable).
- The model parameters, such as connection weights, bias terms, or thresholds, which we denote as $\theta$;
- The model hyperparameters, such as the number of hidden layers, the number of neurons per layer, or the activation functions, which we denote as $\lambda$;
- The loss function evaluated on the dataset $D$ under parameters $\theta$ and hyperparameters $\lambda$, which we denote as $\mathcal{L}(D; \theta, \lambda)$.
- $D_{\text{train}}$, which is used to fit model parameters during the model training;
- $D_{\text{val}}$, which is used to validate model training in an outer optimization loop;
- $D_{\text{test}}$, which is held out for final, unbiased evaluation.
- : The free-of-error ratio;
- : Relevancy;
- : Completeness;
- : The consistency ratio.
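The partition of a dataset into training, validation, and held-out test subsets described in this subsection can be sketched as follows. The function name, the default fractions, and the seed are hypothetical. A plain random shuffle is assumed, whereas real projects often need stratified or time-aware splits.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(dataset: Sequence[T],
                    val_fraction: float = 0.25,
                    test_fraction: float = 0.25,
                    seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once, then cut into disjoint train/validation/test subsets."""
    if val_fraction < 0 or test_fraction < 0 or val_fraction + test_fraction >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = int(len(dataset) * test_fraction)
    n_val = int(len(dataset) * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return ([dataset[i] for i in train_idx],
            [dataset[i] for i in val_idx],
            [dataset[i] for i in test_idx])


train, val, test = three_way_split(list(range(200)))
print(len(train), len(val), len(test))
```

Keeping the test subset untouched until the very end is what makes the final evaluation unbiased. Any data-quality intervention tuned against it would leak information.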
1.4. Why a Paradigm Shift Is Necessary
- Label inconsistency and noise: Systematic labeling errors are among the most pervasive threats to the reliability and performance of AI predictive systems [5]. If a training dataset contains inconsistent or incorrect labels, the decision model may learn to reproduce the annotation noise rather than the underlying ground-truth patterns.
- Data distribution problems: Beyond labeling issues, real-world datasets often exhibit distributional problems that may limit the generalization and robustness of decision models in deployment. Common challenges include class imbalance and feature distribution skew. Class imbalance (the under-representation of minority classes in a training dataset) may lead to biased decision boundaries and poor recall for rare classes [6], while feature distribution skew (a shift in the data distribution between training data and real-world deployment data) may cause a significant drop in the accuracy of the decision model in the deployment environment [7].
- Technical debt arising from data issues: Poor data quality may create cascading technical debt throughout the AI lifecycle [8]. When foundational data problems remain unresolved, they multiply across model learning iterations and infrastructure layers. Key manifestations include training instability and reduced interpretability. Training instability is commonly caused by high levels of label noise, which may hinder convergence and exacerbate overfitting to spurious correlations, while inconsistent or biased patterns in the data may reduce the reliability of both feature importance analysis and explainability.
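To make the class-imbalance point above concrete, the following minimal sketch computes an imbalance ratio and per-class recall. The function names and the toy labels ("ok", "fault") are hypothetical. The example shows how a model that always predicts the majority class can look accurate while missing every rare case.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def imbalance_ratio(labels: Sequence[Hashable]) -> float:
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def per_class_recall(y_true: Sequence[Hashable],
                     y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Recall per class: correctly predicted members / all members of the class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical data: nine majority examples, one minority example.
y_true = ["ok"] * 9 + ["fault"]
y_pred = ["ok"] * 10  # a degenerate model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(imbalance_ratio(y_true))
print(accuracy)
print(per_class_recall(y_true, y_pred))
```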
1.5. Expected Impacts on Research, Industry, and Society
2. Data-Centric AI and Generative AI: Convergences and Synergies
2.1. GenAI as a Catalyst for DCAI
2.2. DCAI as a Foundation for Responsible GenAI
2.3. GenAI in Service of DCAI
3. Data Lifecycle
3.1. Training-Data Development
- Data collection [31]—This entails identifying the most relevant and useful datasets from the available data sources, such as data lakes and marketplaces, often requiring data integration.
- Data labeling [32]—This involves assigning one or more labels to data samples to enable supervised learning. As labeling is time-consuming and resource-intensive, techniques like crowdsourcing, consensus learning, and semi-supervised or active learning are often employed to enhance efficiency and reduce costs.
- Data reduction [35]—This task involves simplifying datasets while preserving their relevant information, by reducing either the feature dimensions (dimensionality reduction) or the sample size (sampling). For sample size reduction, instance selection built on density-based clustering such as DBSCAN [36] is among the most widely used approaches. DBSCAN groups data into clusters based on their density in space, and representative points are then drawn from the innermost core of each cluster and from well-connected points on its borders. This approach is essentially based on the shape of the data. Alternatively, to preserve the information content of the data rather than the dataset’s structure, the authors of [37] present an instance selection method that exploits the concept of entropy to cluster together points that have the same informative content. Then, using a convex hull approach, representative elements are drawn at the boundary of the clusters to also ensure diversity among the selected points. The chosen points are thus both representative and diverse, which are desirable properties for downstream tasks. On the feature dimension side, recent selection methods increasingly rely on deep learning. For example, ref. [38] proposes DeepFS, an approach that extracts low-dimensional representations from ultra-high-dimensional, low-sample-size data and then performs feature screening using multivariate rank distance correlation, which enables precise identification of significant features.
- Data transformation and enrichment [39]—This entails extracting smart data from raw data, where the former denotes a refined, semantically enriched representation obtained through systematic transformation and curation operations—such as feature extraction, semantic annotation, and multimodal integration—that increase relevancy and understandability (as defined in Section 1.2) and expose latent patterns suitable for downstream learning tasks. This is achieved by obtaining more explainable objects and handling rich information (e.g., multi-view data [40] and multimodal data [41]) in different formats (e.g., vectors, sequences, images, and graphs).
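The consensus step mentioned under data labeling can be sketched with a simple majority vote over crowd annotations. This is only one of many possible aggregation rules, and the function name and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def majority_vote(votes: Sequence[Hashable],
                  min_agreement: float = 0.5
                  ) -> Tuple[Optional[Hashable], float]:
    """Aggregate annotator votes for one sample.

    Returns (label, agreement). The label is None when the most frequent
    vote does not strictly exceed `min_agreement`, which signals that the
    sample should go back for adjudication.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, top = Counter(votes).most_common(1)[0]
    agreement = top / len(votes)
    return (label if agreement > min_agreement else None), agreement


print(majority_vote(["cat", "cat", "dog"]))
print(majority_vote(["cat", "dog"]))
```

Returning the agreement score alongside the label keeps ambiguous samples visible, so they can be routed to experts instead of being silently resolved.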
3.2. Inference Data Development
- In-distribution evaluation [44]—This process produces samples that are similar to the training data. This helps identify underrepresented groups, prevent bias, reveal decision boundaries, and examine ethical considerations.
- Out-of-distribution evaluation [45]—This process generates samples that differ significantly from the training data. Practical examples include developing inference data in dynamic contexts, such as business process executions, where data can change over time [46,47]. In such dynamic settings, the evaluation should account for the ability to detect and handle concept drift and to obtain decision models that maintain high performance over time. For instance, ref. [48] demonstrates how predictive models can remain effective and adaptive in dynamic Industry 4.0 settings.
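One simple way to check whether inference data has moved away from the training distribution is to compare category frequencies between a reference window and a current window. The sketch below uses the Population Stability Index (PSI) as a stand-in. PSI is not a technique named in the text above, and the function name, the `eps` smoothing value, and the toy activity labels are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def population_stability_index(reference: Sequence[Hashable],
                               current: Sequence[Hashable],
                               eps: float = 1e-6) -> float:
    """PSI between two categorical samples; 0 means identical frequencies."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref_counts, cur_counts = Counter(reference), Counter(current)
    psi = 0.0
    for category in set(reference) | set(current):
        # Clamp frequencies away from zero so the logarithm stays finite.
        p = max(ref_counts[category] / len(reference), eps)
        q = max(cur_counts[category] / len(current), eps)
        psi += (q - p) * math.log(q / p)
    return psi


# Hypothetical activity labels from a process log, before and after a change.
reference = ["approve"] * 80 + ["reject"] * 20
current = ["approve"] * 40 + ["reject"] * 60

print(population_stability_index(reference, reference))
print(population_stability_index(reference, current))
```

A score like this only flags a change in the input distribution. Deciding whether the decision model must be updated still requires evaluating it on the drifted data.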
3.3. Data Maintenance
- Data comprehension [49]—Techniques like visual summarization, clustering, or statistics are used to organize complex data and produce human-readable insights.
- Data quality assurance [4]—Data are monitored and improved continuously. The relevant quality metrics include objective measures (accuracy, timeliness, consistency, and completeness) and subjective assessments from a human perspective. This step also involves addressing the explainability of decision models with respect to both local and global data.
- Data storage and retrieval [50]—This involves managing growing datasets through resource allocation strategies to optimize throughput and latency in data systems.
- Data representativeness [51]—Five metrics are introduced to evaluate whether an example is representative or an outlier in a dataset, and practical methods for measuring representativeness are proposed. These metrics are given below:
  - Adversarial Robustness: This measures how difficult it is to modify an input enough such that its classification is changed. If a large perturbation is required, then the example is near the center of its class and therefore more representative.
  - Holdout Retraining: A model is compared with and without a given example. If predictions change little, the example is well supported by other data and therefore representative.
  - Ensemble Agreement: This involves training multiple models and measuring their agreement on a given example. If they produce similar predictions, the example is typical and easy to learn, and therefore representative.
  - Model Confidence: This metric examines how confident models are in their predictions (i.e., high probability for a class). Greater confidence indicates clearer, more representative examples.
  - Privacy-preserving Training: This involves training models with noise (differential privacy). Representative examples remain correctly classified even with noise, while outliers are more likely to be misclassified.
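Two of the representativeness metrics listed above, ensemble agreement and model confidence, can be approximated from the class-probability vectors that several models assign to the same example. The sketch below is a simplified proxy and not the exact formulation of [51]. The function names and the toy probabilities are hypothetical.

```python
from typing import Sequence

ProbVector = Sequence[float]


def model_confidence(prob_vectors: Sequence[ProbVector]) -> float:
    """Mean of the top class probability across all models."""
    return sum(max(p) for p in prob_vectors) / len(prob_vectors)


def ensemble_agreement(prob_vectors: Sequence[ProbVector]) -> float:
    """Share of models whose top class matches the most common top class."""
    top_classes = [max(range(len(p)), key=p.__getitem__) for p in prob_vectors]
    most_common = max(set(top_classes), key=top_classes.count)
    return top_classes.count(most_common) / len(top_classes)


# Hypothetical outputs of three models for two examples (binary task).
typical_example = [[0.90, 0.10], [0.80, 0.20], [0.95, 0.05]]
atypical_example = [[0.60, 0.40], [0.45, 0.55], [0.50, 0.50]]

for name, probs in [("typical", typical_example), ("atypical", atypical_example)]:
    print(name, ensemble_agreement(probs), model_confidence(probs))
```

Examples with low agreement and low confidence are natural candidates for manual review, relabeling, or targeted data collection.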
4. Techniques and Tools for DCAI
4.1. Techniques for Data Cleaning and Selection
4.2. Techniques for Handling Noise and Incorrect Labels
4.3. Techniques for Smart-Data Extraction and Data Enrichment
4.4. Techniques for Semantic Data Preparation
4.5. Active Learning and Data Augmentation
4.6. Transfer Learning and Fine-Tuning
4.7. Libraries and Tools
4.7.1. Data Profiling
- YData Profiling—YData Profiling supports both Pandas and Spark DataFrames and provides a fast, straightforward visual overview of the data.
- SweetViz—SweetViz is an open-source Python library that generates high-density visualizations to kickstart Exploratory Data Analysis (EDA) with just two lines of code.
- DataPrep.EDA—DataPrep.EDA is an EDA tool implemented in Python. It allows developers to understand a Pandas/Dask DataFrame with a few lines of code in seconds.
- Pycol—Pycol implements 29 overlap measures designed to capture class overlap in imbalanced real-world scenarios [86].
4.7.2. Synthetic Data
- YData Synthetic—YData Synthetic uses Generative Adversarial Networks for synthetic generation of tabular and time-series data.
- Synthetic Data Vault (SDV)—SDV is an ecosystem of synthetic data generation libraries that lets users learn patterns from single-table, multi-table, and time-series datasets. The learned patterns can then be used to generate new synthetic data with the same format and statistical properties as the original dataset.
- Pomegranate—Pomegranate is a package for building probabilistic models in Python. Earlier releases were implemented in Cython for speed, while releases from version 1.0 onward are built on PyTorch. Most of the integrated models can sample data.
4.7.3. Data Labeling
- Label Studio—Label Studio is an open-source data-labeling tool. It allows users to label data such as audio, text, images, videos, and time series through a simple UI and to export the labeled data in various model formats.
- Annotation Lab—Annotation Lab is a Natural Language Processing annotation tool included in spark-nlp.
5. Methodological Evolution
- Business Understanding—Definition of objectives, constraints, success metrics, and the business context in which data mining will operate.
- Data Understanding—Initial data collection, exploratory analysis, identification of data quality issues, detection of biases, and preliminary assessment of data suitability.
- Data Preparation—Construction of the final dataset through cleaning, transformation, integration, and feature selection.
- Modeling—Selection and configuration of modeling techniques, hyperparameter tuning, and model construction.
- Evaluation—Assessment of the model’s ability to meet business and technical objectives.
- Deployment—Integration of the model into operational processes, monitoring, documentation, and maintenance.
5.1. CRISP-DM Limitations in DCAI
- Explicit data-centric cycles;
- Systematic dataset versioning and documentation;
- Continuous evaluation of data quality;
- Integration with MLOps and emerging GenAIOps practices;
- Reproducible processes for synthetic data generation, data labeling, and data governance.
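Systematic dataset versioning, one of the requirements listed above, needs a stable identifier for each dataset state. A minimal sketch is a content fingerprint computed from the records themselves. The function name is hypothetical, and the sketch assumes JSON-serializable records and treats row order as irrelevant.

```python
import hashlib
import json
from typing import Any, Mapping, Sequence


def dataset_fingerprint(records: Sequence[Mapping[str, Any]]) -> str:
    """Order-independent SHA-256 fingerprint of a list of records."""
    row_hashes = sorted(
        hashlib.sha256(
            json.dumps(record, sort_keys=True).encode("utf-8")
        ).hexdigest()
        for record in records
    )
    # Hash the sorted row hashes so that reordering rows changes nothing.
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


v1 = [{"id": 1, "label": "spam"}, {"id": 2, "label": "ham"}]
v1_reordered = [{"label": "ham", "id": 2}, {"id": 1, "label": "spam"}]
v2 = [{"id": 1, "label": "ham"}, {"id": 2, "label": "ham"}]  # one label corrected

print(dataset_fingerprint(v1) == dataset_fingerprint(v1_reordered))
print(dataset_fingerprint(v1) == dataset_fingerprint(v2))
```

Storing such a fingerprint next to every trained model makes it possible to trace a performance change back to a specific data revision.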
5.2. Revisiting CRISP-DM
- Business goals are not abstract—they are instantiated through data. A business hypothesis, such as “supplier quality affects process adherence,” is meaningful only insofar as it can be operationalized via available or collectible data. In this sense, business comprehension must be embedded within data comprehension.
- Data exploration is not neutral—it is guided by business semantics. Data profiling is never performed in the abstract; it is hypothesis-driven and shaped by domain expectations. The selection of sources, the assessment of data quality, and the interpretation of distributions all depend on business meaning.
- Source identification and acquisition: Locate and access the data sources specified in the business–data requirements phase, ensuring compliance with licensing, privacy, and ethical constraints.
- Operationalization of labeling schema: Set up annotation protocols and prepare annotators or automated tools to ensure semantic consistency and domain validity.
- Annotation execution: Conduct annotation with trained annotators or domain experts, and validate outputs iteratively.
- Labeling-quality assurance: Perform inter-annotator agreement checks, validate against gold standards or domain knowledge, and refine ambiguous cases.
- Active-learning integration: Use model feedback to prioritize labeling of uncertain or high-value samples, iteratively expanding the labeled dataset with maximum efficiency.
- Documentation and governance: Record collection protocols, annotation guidelines, and quality metrics, ensuring compliance with ethical and legal standards and providing transparent documentation to ensure reproducibility.
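The inter-annotator agreement check mentioned in the labeling-quality assurance step is often quantified with Cohen's kappa for two annotators. The sketch below is a plain-Python version under that assumption. The function name and the toy annotations are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable],
                labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_1 = ["pos", "pos", "neg", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_1, annotator_2))
```

Low kappa values on a pilot batch usually point to ambiguous guidelines, which is cheaper to fix before large-scale annotation starts.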
- Model selection based on data semantics: Rather than choosing models solely based on statistical properties, selection is guided by the structure and meaning of the data. For example, curriculum-learning strategies may favor models that can adapt to ordered training sequences.
- Integration with active learning loops: Modeling is intertwined with active learning—the model identifies uncertain or high-impact samples, which are then prioritized for labeling or refinement. This creates a feedback loop between model and data.
- Sensitivity to annotation quality: Models are evaluated not only on predictive accuracy but also on their robustness to label noise, annotation bias, and semantic ambiguity. This requires diagnostic tools that assess model behavior in relation to data quality.
- Alignment with business–data requirements: The model must not only perform well statistically but also produce outputs that are both interpretable to stakeholders and actionable within the specific domain context.
- Transparent documentation of data–model interactions: All modeling decisions—including feature selection, training dynamics, and performance metrics—are documented with reference to the curated dataset. This ensures reproducibility and supports governance.
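The feedback loop between model and data described above can be illustrated with least-confidence sampling, one common uncertainty-based query strategy. The text does not prescribe a specific strategy, so this choice, the function name, and the toy probabilities are assumptions of the sketch.

```python
from typing import List, Sequence


def least_confident(pool_probs: Sequence[Sequence[float]],
                    budget: int) -> List[int]:
    """Indices of the `budget` unlabeled samples with the lowest top probability."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: max(pool_probs[i]))
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled samples (binary task).
pool = [
    [0.90, 0.10],
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],
]
print(least_confident(pool, budget=2))  # samples to send to annotators next
```

After the selected samples are labeled, the model is retrained and the pool is scored again, so each round concentrates annotation effort where the model is least certain.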
- Cleaning mislabeled or inconsistent examples;
- Adding representative samples;
- Augmenting with synthetic or contextual data;
- Harmonizing semantics across sources;
- Structuring training sequences (curriculum learning).
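The first curation action above, cleaning mislabeled examples, can be sketched with a rule loosely inspired by the per-class thresholding idea of confident learning [5]. This is a simplified illustration and not the published algorithm or any library's implementation. The function name is hypothetical, and the predicted probabilities are assumed to come from out-of-sample (for example, cross-validated) predictions.

```python
from typing import List, Sequence


def flag_label_issues(labels: Sequence[int],
                      probs: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of examples whose given label looks suspicious.

    For each class, the threshold is the mean predicted probability of that
    class over the examples labeled with it. An example is flagged when the
    most probable class among those reaching their threshold differs from
    the given label.
    """
    n_classes = len(probs[0])
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for y, p in zip(labels, probs) if y == k]
        thresholds.append(sum(own) / len(own) if own else float("inf"))

    flagged = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            if likely != y:
                flagged.append(i)
    return flagged


# Hypothetical data: the third example is labeled 0 but strongly predicted as 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
         [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(flag_label_issues(labels, probs))
```

Flagged examples are candidates for review rather than automatic deletion, since a confident model can also be confidently wrong on rare but valid cases.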
- Business achievement assessment: This process corresponds to the traditional CRISP-DM objective of determining whether model outputs effectively support decision-making, align with strategic goals, and generate tangible value in the target application domain. This includes verifying that the model’s behavior is consistent with operational constraints, performance thresholds, regulatory requirements, and organizational priorities.
- Data achievement assessment: This stage involves evaluating whether the iterative data curation actions have been successfully implemented, stabilized, and made sustainable. The central purpose of this assessment is to determine whether—and to what extent—improvements in data quality are demonstrably linked to measurable gains in model robustness, interpretability, reliability, and trustworthiness.
5.3. Axes of Methodological Differentiation from CRISP-DM
- Axis 1: Process Temporality and Versioning
- Axis 2: Optimization Target Inversion
- Axis 3: Model-Guided Data Remediation via Active Learning
5.4. Implications for Industrial Adoption and Links to MLOps and GenAIOps
6. Use Case: Data-Centric Improvement of Text Classification via Confident Learning
7. DCAI in Real-Life Applications
8. Infrastructure for Intensive DCAI: Features and Solutions
9. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zha, D.; Bhat, Z.P.; Lai, K.H.; Yang, F.; Hu, X. Data-centric AI: Perspectives and Challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), Minneapolis, MN, USA, 27–29 April 2023; pp. 945–948.
- Ng, A. Unbiggen AI. IEEE Spectrum, 9 February 2022.
- Miller, R.; Whelan, H.; Chrubasik, M.; Whittaker, D.; Duncan, P.; Gregório, J. A Framework for Current and New Data Quality Dimensions: An Overview. Data 2024, 9, 151.
- Pipino, L.L.; Lee, Y.W.; Wang, R.Y. Data quality assessment. Commun. ACM 2002, 45, 211–218.
- Northcutt, C.; Jiang, L.; Chuang, I. Confident Learning: Estimating Uncertainty in Dataset Labels. J. Artif. Int. Res. 2021, 70, 1373–1411.
- He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
- Quiñonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; Lawrence, N.D. Dataset Shift in Machine Learning; MIT Press: Cambridge, MA, USA, 2022.
- Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.F.; Dennison, D. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; MIT Press: Cambridge, MA, USA, 2015; Available online: https://dl.acm.org/doi/10.5555/2969442.2969519 (accessed on 25 February 2026).
- Majeed, A.; Hwang, S.O. A Data-Centric AI Paradigm for Socio-Industrial and Global Challenges. Electronics 2024, 13, 2156.
- Luley, P.P.; Deriu, J.M.; Yan, P.; Schatte, G.A.; Stadelmann, T. From Concept to Implementation: The Data-Centric Development Process for AI in Industry. In Proceedings of the 2023 10th IEEE Swiss Conference on Data Science (SDS), Zurich, Switzerland, 22–23 June 2023; pp. 73–76.
- Xu, X.; Wu, Z.; Qiao, R.; Verma, A.; Shu, Y.; Wang, J.; Niu, X.; He, Z.; Chen, J.; Zhou, Z.; et al. Position Paper: Data-Centric AI in the Age of Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11895–11913.
- Pasquadibisceglie, V.; Appice, A.; Malerba, D.; Fiameni, G. Leveraging a large language model (LLM) to predict hospital admissions of emergency department patients. Expert Syst. Appl. 2025, 240, 128224.
- Pasquadibisceglie, V.; Recchia, V.; Appice, A.; Malerba, D.; Fiameni, G. GANDALF: A LLM-based approach to map bark beetle outbreaks in semantic stories of Sentinel-2 images. In Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, Catania, Italy, 31 March–4 April 2025; SAC ’25, pp. 1074–1081.
- Casciani, A.; Bernardi, M.L.; Cimitile, M.; Marrella, A. Enhancing next activity prediction in process mining with Retrieval-Augmented Generation. Inf. Syst. 2026, 137, 102642.
- Umer, F.; Adnan, N. Generative artificial intelligence: Synthetic datasets in dentistry. BDJ Open 2024, 10, 13.
- Nieberl, M.; Zeiser, A.; Timinger, H.; Friedrich, B. Enhancing the Performance of Computer Vision Systems in Industry: A Comparative Evaluation Between Data-Centric and Model-Centric Artificial Intelligence. Electronics 2025, 14, 4366.
- Chen, Y.; Yan, Z.; Zhu, Y. A comprehensive survey for generative data augmentation. Neurocomputing 2024, 600, 128167.
- Bhuyan, S.S.; Sateesh, V.; Mukul, N.; Galvankar, A.; Mahmood, A.; Nauman, M.; Rai, A.; Bordoloi, K.; Basu, U.; Samuel, J. Generative artificial intelligence use in healthcare: Opportunities for clinical excellence and administrative efficiency. J. Med. Syst. 2025, 49, 10.
- Desai, A.P.; Ravi, T.; Luqman, M.; Mallya, G.; Kota, N.; Yadav, P. Opportunities and Challenges of Generative-AI in Finance. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 4913–4920.
- Chen, I.Y.; Joshi, S.; Ghassemi, M. Treating health disparities with artificial intelligence. Nat. Med. 2020, 26, 16–17.
- Giuffrè, M.; Shung, D.L. Harnessing the power of synthetic data in healthcare: Innovation, application, and privacy. npj Digit. Med. 2023, 6, 186.
- Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Papernot, N.; Anderson, R.; Gal, Y. AI models collapse when trained on recursively generated data. Nature 2024, 631, 755–759.
- Aragon, C.; Guha, S.; Kogan, M.; Muller, M.; Neff, G. Human-Centered Data Science: An Introduction; MIT Press: Cambridge, MA, USA, 2022.
- Dellermann, D.; Calma, A.; Lipusch, N.; Weber, T.; Weigel, S.; Ebel, P. The future of human-AI collaboration: A taxonomy of design knowledge for hybrid intelligence systems. arXiv 2021, arXiv:2105.03354.
- Tan, Z.; Li, D.; Wang, S.; Beigi, A.; Jiang, B.; Bhattacharjee, A.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large Language Models for Data Annotation and Synthesis: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 930–957.
- Huang, T.H.; Cao, C.; Bhargava, V.; Sala, F. The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2024; pp. 62648–62672.
- Mei, Y.; Song, S.; Fang, C.; Yang, H.; Fang, J.; Long, J. Capturing Semantics for Imputation with Pre-trained Language Models. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 61–72.
- Vito, G.; Starace, L.L.L.; Martino, S.; Ferrucci, F.; Palomba, F. Large language models in software engineering: A focus on issue report classification and user acceptance test generation. In Proceedings of the Ital-IA Intelligenza Artificiale-Thematic Workshops co-located with the 4th CINI National Lab AIIS Conference on Artificial Intelligence (Ital-IA 2024), Naples, Italy, 29–30 May 2024; pp. 48–53.
- Jaimovitch-López, G.; Ferri, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez-Quintana, M.J. Can language models automate data wrangling? Mach. Learn. 2022, 112, 2053–2082.
- Alviano, M.; Macrì, P.; Reiners, L.A.R. ASP Chef Chats with Large Language Models. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2025, Montreal, QC, Canada, 16–22 August 2025; pp. 10989–10993.
- Stonebraker, M.; Ilyas, I.F. Data integration: The current status and the way forward. IEEE Data Eng. Bull. 2018, 41, 3–9.
- Gilyazev, R.A.; Turdakov, D.Y. Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling. Program. Comput. Softw. 2018, 44, 476–491.
- Wan, M.; Zha, D.; Liu, N.; Zou, N. In-Processing Modeling Techniques for Machine Learning Fairness: A Survey. ACM Trans. Knowl. Discov. Data 2023, 17, 35.
- Pereira, R.C.; Abreu, P.H.; Rodrigues, P.P.; Figueiredo, M.A. Imputation of data Missing Not at Random: Artificial generation and benchmark analysis. Expert Syst. Appl. 2024, 249, 123654.
- Ciavotta, M.; Cutrona, V.; De Paoli, F.; Nikolov, N.; Palmonari, M.; Roman, D. Supporting Semantic Data Enrichment at Scale. In Technologies and Applications for Big Data Value; Curry, E., Auer, S., Berre, A.J., Metzger, A., Perez, M.S., Zillner, S., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 19–39.
- Deng, D. DBSCAN Clustering Algorithm Based on Density. In Proceedings of the 2020 7th International Forum on Electrical Engineering and Automation (IFEEA), Hefei, China, 25–27 September 2020; pp. 949–953.
- Riccio, D.; Tortora, G.; Sangiovanni, M. RAZOR: Refining Accuracy by Zeroing Out Redundancies. arXiv 2024, arXiv:2410.14254.
- Li, K.; Wang, F.; Yang, L.; Liu, R. Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. Neurocomputing 2023, 538, 126186.
- García-Gil, D.; Luque-Sánchez, F.; Luengo, J.; García, S.; Herrera, F. From Big to Smart Data: Iterative ensemble filter for noise filtering in Big Data classification. Int. J. Intell. Syst. 2019, 34, 3260–3274.
- Aversano, L.; Bernardi, M.L.; Cimitile, M.; Iammarino, M.; Verdone, C. A data-aware explainable deep learning approach for next activity prediction. Eng. Appl. Artif. Intell. 2023, 126, 106758.
- Pasquadibisceglie, V.; Donadello, I.; Appice, A.; Lanz, O.; Maggi, F.M.; Fiameni, G.; Malerba, D. Multimodal predictive process monitoring and its application to explainable clinical pathways. Inf. Syst. 2026, 139, 102698.
- Wang, Z.; Wang, P.; Liu, K.; Wang, P.; Fu, Y.; Lu, C.T.; Aggarwal, C.C.; Pei, J.; Zhou, Y. A Comprehensive Survey on Data Augmentation. IEEE Trans. Knowl. Data Eng. 2025, 38, 47–66.
- Pasquadibisceglie, V.; Appice, A.; Castellano, G.; Malerba, D. JARVIS: Joining Adversarial Training With Vision Transformers in Next-Activity Prediction. IEEE Trans. Serv. Comput. 2024, 17, 1593–1606.
- Chung, Y.; Kraska, T.; Polyzotis, N.; Tae, K.H.; Whang, S.E. Slice Finder: Automated Data Slicing for Model Validation. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1550–1553.
- Liu, J.; Shen, Z.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; Cui, P. Towards Out-Of-Distribution Generalization: A Survey. arXiv 2023, arXiv:2108.13624.
- Pauwels, S.; Calders, T. Incremental Predictive Process Monitoring: The Next Activity Case. In Proceedings of the Business Process Management: 19th International Conference, BPM 2021, Rome, Italy, 6–10 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 123–140.
- Pasquadibisceglie, V. Handling concept drifts with traditional process discovery algorithms. J. Intell. Inf. Syst. 2025, 64, 179–213.
- Kumar, D.; Addula, S.R.; Lind, M.; Brown, S.; Odion, S. AI-Driven Hybrid Deep Learning and Swarm Intelligence for Predictive Maintenance of Smart Manufacturing Robots in Industry 4.0. Electronics 2026, 15, 715.
- Burch, M.; Weiskopf, D. On the Benefits and Drawbacks of Radial Diagrams. In Handbook of Human Centric Visualization; Huang, W., Ed.; Springer: New York, NY, USA, 2014; pp. 429–451.
- Van Aken, D.; Pavlo, A.; Gordon, G.J.; Zhang, B. Automatic Database Management System Tuning Through Large-scale Machine Learning. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; SIGMOD ’17, pp. 1009–1024.
- Carlini, N.; Erlingsson, Ú.; Papernot, N. Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications. arXiv 2019, arXiv:1910.13427.
- Stonebraker, M.; Rezig, E.K. Machine Learning and Big Data: What is Important? IEEE Data Eng. Bull. 2019, 42, 3–7.
- Lakshminarayan, K.; Harp, S.; Goldman, R.; Samad, T. Imputation of missing data using machine learning techniques. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; Simoudis, E., Han, J., Fayyad, U., Eds.; AAAI Press: Washington, DC, USA, 1996; pp. 140–145.
- Jiang, Z.; Zhou, K.; Liu, Z.; Li, L.; Chen, R.; Choi, S.H.; Hu, X. An information fusion approach to learning with instance-dependent label noise. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
- Čreslovnik, D.; Košmerlj, A.; Ciavotta, M. Using historical and weather data for marketing and category management in ecommerce: The experience of EW-shopp. In Proceedings of the 12th European Conference on Software Architecture: Companion Proceedings, ECSA ’18, Madrid, Spain, 24–28 September 2018.
- Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature Selection: A Data Perspective. ACM Comput. Surv. 2017, 50, 94. [Google Scholar] [CrossRef]
- Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems; Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
- Liu, Y.P.; Xu, N.; Zhang, Y.; Geng, X. Label Distribution for Learning with Noisy Labels. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20, Yokohama, Japan, 7–15 January 2021; Bessiere, C., Ed.; ACM: New York, NY, USA, 2021; pp. 2568–2574. [Google Scholar] [CrossRef]
- Andresini, G.; Appice, A.; Ienco, D.; Recchia, V. DIAMANTE: A data-centric semantic segmentation approach to map tree dieback induced by bark beetle infestations via satellite images. J. Intell. Inf. Syst. 2024, 62, 1531–1558. [Google Scholar] [CrossRef]
- Recchia, V.; Andresini, G.; Appice, A.; Fontana, G.; Malerba, D. An Attention-Based CNN Approach to Detect Forest Tree Dieback Caused by Insect Outbreak in Sentinel-2 Images. In Proceedings of the Discovery Science; Pedreschi, D., Monreale, A., Guidotti, R., Pellungrini, R., Naretto, F., Eds.; Springer: Cham, Switzerland, 2025; pp. 183–199. [Google Scholar] [CrossRef]
- Putrama, I.M.; Martinek, P. Heterogeneous data integration: Challenges and opportunities. Data Brief 2024, 56, 110853. [Google Scholar] [CrossRef] [PubMed]
- Malerba, D.; Pasquadibisceglie, V. Data-Centric AI. J. Intell. Inf. Syst. 2024, 62, 1493–1502.
- Oved, A.; Shlomov, S.; Zeltyn, S.; Mashkif, N.; Yaeli, A. Snap: Semantic stories for next activity prediction. Proc. AAAI Conf. Artif. Intell. 2025, 39, 28871–28877.
- Antoniadi, A.M.; Du, Y.; Guendouz, Y.; Wei, L.; Mazo, C.; Becker, B.A.; Mooney, C. Current challenges and future opportunities for XAI in machine learning-based clinical decision support systems: A systematic review. Appl. Sci. 2021, 11, 5088.
- Hogan, A.; Blomqvist, E.; Cochez, M.; D’amato, C.; Melo, G.D.; Gutierrez, C.; Kirrane, S.; Gayo, J.E.L.; Navigli, R.; Neumaier, S.; et al. Knowledge Graphs. ACM Comput. Surv. 2021, 54, 71.
- Cimiano, P.; Paulheim, H. Knowledge graph refinement: A survey of approaches and evaluation methods. Semant. Web 2017, 8, 489–508.
- Xue, B.; Zou, L. Knowledge Graph Quality Management: A Comprehensive Survey. IEEE Trans. Knowl. Data Eng. 2023, 35, 4969–4988.
- Peng, C.; Xia, F.; Naseriparsa, M.; Osborne, F. Knowledge graphs: Opportunities and challenges. Artif. Intell. Rev. 2023, 56, 13071–13102.
- Masmoudi, M.; Ben Abdallah Ben Lamine, S.; Karray, M.H.; Archimede, B.; Baazaoui Zghal, H. Semantic Data Integration and Querying: A Survey and Challenges. ACM Comput. Surv. 2024, 56, 209.
- Futia, G.; Vetrò, A. On the integration of knowledge graphs into deep learning models for a more comprehensible AI—Three challenges for future research. Information 2020, 11, 122.
- Zhang, J.; Chen, B.; Zhang, L.; Ke, X.; Ding, H. Neural, symbolic and neural-symbolic reasoning on knowledge graphs. AI Open 2021, 2, 14–35.
- Angles, R.; Gutierrez, C. Survey of graph database models. ACM Comput. Surv. 2008, 40, 1.
- Angles, R.; Arenas, M.; Barceló, P.; Hogan, A.; Reutter, J.; Vrgoč, D. Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv. 2017, 50, 68.
- Baader, F.; Horrocks, I.; Lutz, C.; Sattler, U. An Introduction to Description Logic, 1st ed.; Cambridge University Press: Cambridge, MA, USA, 2017.
- Krötzsch, M. Ontologies for Knowledge Graphs? In Proceedings of the 30th International Workshop on Description Logics, Montpellier, France, 18–21 July 2017; Artale, A., Glimm, B., Kontchakov, R., Eds.; CEUR Workshop Proceedings; RWTH Aachen University: Aachen, Germany, 2017; Volume 1879.
- Lenzerini, M.; Lepore, L.; Poggi, A. Metamodeling and metaquerying in OWL2QL. Artif. Intell. 2021, 292, 103432.
- Brickley, D.; Guha, R.V. RDF Schema 1.1. W3C Recommendation, World Wide Web Consortium (W3C): 2014. Available online: https://www.bibsonomy.org/bibtex/9ff17d493abee300f9fff3bff7d2a339 (accessed on 24 February 2026).
- ter Horst, H.J. Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Web Semant. 2005, 3, 79–115.
- Franconi, E.; Gutierrez, C.; Mosca, A.; Pirrò, G.; Rosati, R. The Logic of Extensional RDFS. In Proceedings of the 12th International Semantic Web Conference–Part I; ISWC ’13; Springer: Berlin/Heidelberg, Germany, 2013; pp. 101–116.
- de Bruijn, J.; Heymans, S. Logical Foundations of (e)RDF(S): Complexity and Reasoning. In The Semantic Web; Aberer, K., Choi, K.S., Noy, N., Allemang, D., Lee, K.I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 86–99.
- Delfino, R.M.; Lenzerini, M.; Poggi, A. RDFS Knowledge Graphs Through the Lens of Logic: Semantics and Query Answering. In Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025); Frontiers in Artificial Intelligence and Applications; IOS Press: Amsterdam, The Netherlands, 2025; Volume 413, pp. 1511–1518.
- Tharwat, A.; Schenck, W. A survey on active learning: State-of-the-art, practical challenges and research directions. Mathematics 2023, 11, 820.
- Moles, L.; Andres, A.; Echegaray, G.; Boto, F. Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets. Mathematics 2024, 12, 1898.
- Yang, J.; Wang, H.; Wu, S.; Chen, G.; Zhao, J. Towards Controlled Data Augmentations for Active Learning. In Proceedings of the 40th International Conference on Machine Learning; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; JMLR.org: Norfolk, MA, USA, 2023; Volume 202, pp. 39524–39542.
- Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.M.; Chen, W.; et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell. 2023, 5, 220–235.
- Apóstolo, D.; Santos, M.S.; Lorena, A.C.; Henriques Abreu, P. Pycol: A Python package for dataset complexity measures. Neurocomputing 2025, 640, 130311.
- Chapman, P.; Clinton, J.; Kerber, R.; Khabaza, T.; Reinartz, T.; Shearer, C.; Wirth, R. CRISP-DM 1.0: Step-by-Step Data Mining Guide; SPSS Inc.: Chicago, IL, USA, 2000; Volume 9, pp. 1–73.
- Wirth, R.; Hipp, J. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Manchester, UK, 11–13 April 2000; Volume 1, pp. 29–39.
- Martínez-Plumed, F.; Contreras-Ochando, L.; Ferri, C.; Hernández-Orallo, J.; Kull, M.; Lachiche, N.; Ramírez-Quintana, M.J.; Flach, P. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Trans. Knowl. Data Eng. 2021, 33, 3048–3061.
- Barrak, A.; Eghan, E.E.; Adams, B. On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects. In Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 9–12 March 2021; pp. 422–433.
- Ratner, A.; Bach, S.H.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid training data creation with weak supervision. VLDB J. 2020, 29, 709–730.
- Testi, M.; Ballabio, M.; Frontoni, E.; Iannello, G.; Moccia, S.; Soda, P.; Vessio, G. MLOps: A Taxonomy and a Methodology. IEEE Access 2022, 10, 63606–63618.
- Nadǎş, M.; Dioşan, L.; Tomescu, A. Synthetic Data Generation Using Large Language Models: Advances in Text and Code. IEEE Access 2025, 13, 134615–134633.
- Azeta, J.; Omeche, T.T.; Daniyan, I.; Abiola, J.O.; Daniyan, L.; Phuluwa, H.S.; Muvunzi, R. Artificial intelligence and robotics in predictive maintenance: A comprehensive review. Front. Mech. Eng. 2026, 11, 2025.
- Grandi, C.; Bettoni, D.; Boccali, T.; Carlino, G.; Cesini, D.; dell’Agnello, L.; Donvito, G.; Salomoni, D.; Zoccoli, A. ICSC: The Italian National Research Centre on HPC, Big Data and Quantum computing. EPJ Web Conf. 2024, 295, 10003.
- Retico, A.; Avanzo, M.; Boccali, T.; Bonacorsi, D.; Botta, F.; Cuttone, G.; Martelli, B.; Salomoni, D.; Spiga, D.; Trianni, A.; et al. Enhancing the impact of Artificial Intelligence in Medicine: A joint AIFM-INFN Italian initiative for a dedicated cloud-based computing infrastructure. Phys. Medica 2021, 91, 140–150.
- Salomoni, D.; Campos, I.; Gaido, L.; de Lucas, J.M.; Solagna, P.; Gomes, J.; Matyska, L.; Fuhrman, P.; Hardt, M.; Donvito, G.; et al. INDIGO-DataCloud: A Platform to Facilitate Seamless Access to E-Infrastructures. J. Grid Comput. 2018, 16, 381–408.
- Ceccanti, A.; Hardt, M.; Wegh, B.; Millar, A.P.; Caberletti, M.; Vianello, E.; Licehammer, S. The INDIGO-Datacloud Authentication and Authorization Infrastructure. J. Phys. Conf. Ser. 2017, 898, 102016.
- Gagliardi, F.; Jones, B.; Reale, M.; Burke, S. European DataGrid Project: Experiences of Deploying a Large Scale Testbed for E-science Applications. Lect. Notes Comput. Sci. 2002, 2459, 480–499.
- Antonacci, M.; Costantini, A.; Donvito, G.; Giommi, L.; Grandi, C.; Martelli, B.; Spiga, D.; Serra, E.; Savarese, G.; Vianello, E. The evolution of INFN’s Cloud Platform: Improvements in Orchestration and User Experience. EPJ Web Conf. 2025, 337, 01113.
- Savarese, G.; Giommi, L.; Calanducci, A.; Casale, A.; Costantini, A.; Donvito, G.; Fanzago, F.; Gasparetto, J.; Grandi, C.; Martelli, B.; et al. Plan for a renewed PaaS Orchestration solution in the DataCloud Project at INFN. In Proceedings of the International Symposium on Grids and Clouds (ISGC2025); Academia Sinica Grid Computing Centre (ASGC): Taipei, Taiwan, 2025.
- Barisits, M.; Beermann, T.; Berghaus, F.; Bockelman, B.; Bogado, J.; Cameron, D.; Christidis, D.; Ciangottini, D.; Dimitrov, G.; Elsing, M.; et al. Rucio: Scientific Data Management. Comput. Softw. Big Sci. 2019, 3, 11.
- Kiryanov, A.; Álvarez Ayllón, A.; Salichos, M.; Keeble, O. FTS3-A File Transfer Service for Grids, HPCs and Clouds. In International Symposium on Grids and Clouds (ISGC) 2015; Academia Sinica: Taipei, Taiwan, 2015.
- Anderlini, L.; Barbetti, M.; Bianchini, G.; Ciangottini, D.; Pra, S.D.; Michelotto, D.; Spiga, D. Developing Artificial Intelligence in the Cloud: The AI Infn Platform. Comput. Sci. 2025, 26, 20.
- Camerlingo, M.T. ML-based classification in an open-source framework for the ALICE heavy-flavour analysis. EPJ Web Conf. 2025, 337, 01049.
- Rossi, F.; Battaglieri, M.; Gavalian, G.; Ragusa, E.; Gastaldo, P. Real Time implementation of Artificial Intelligence compression algorithm for High-Speed Streaming Readout signals. EPJ Web Conf. 2025, 337, 01135.
- Ciangottini, D.; Spiga, D.; Memon, A.S.; Manzi, A.; Filipcic, A.; Troja, A.; Fanzago, F.; Bianchini, G.; Sgaravatto, M.; Prica, T.; et al. Unlocking the compute continuum: Scaling out from cloud to HPC and HTC resources. EPJ Web Conf. 2025, 337, 01296.



| Phase | Main Tasks |
|---|---|
| Business Understanding | - Determine Business Objectives: background, business success criteria |
| | - Assess Situation: inventory of resources, requirements, assumptions, constraints, risks, and contingencies |
| | - Determine Data Mining Goals: goals, success criteria |
| | - Produce Project Plan: project plan, initial assessment of tools and techniques |
| Data Understanding | - Collect Initial Data: initial data collection report |
| | - Describe Data: data description report |
| | - Explore Data: data exploration report |
| | - Verify Data Quality: data quality report |
| Data Preparation | - Data-Set Description |
| | - Select Data: rationale for inclusion/exclusion |
| | - Clean Data: data-cleaning report |
| | - Construct Data: derived attributes, generated records |
| | - Integrate Data: merged data |
| | - Format Data: reformatted data |
| Modeling | - Select Modeling Technique: technique, assumptions |
| | - Generate Test Design |
| | - Build Model: parameter settings, models, model description |
| | - Assess Model: assessment, revised parameters |
| Evaluation | - Evaluate Results: alignment with business success criteria |
| | - Approve Models: review of process |
| | - Determine Next Steps: possible actions, decisions |
| Deployment | - Plan Deployment |
| | - Plan Monitoring and Maintenance |
| | - Produce Final Report: report, presentation |
| | - Review Project: experience documentation |

| Original CRISP-DM Task | Extended DCAI Task |
|---|---|
| Select Data | Move beyond simple selection to prioritize representative examples and order training data strategically, improving learning efficiency (e.g., via curriculum learning). |
| Clean Data | Extend cleaning to include systematic outlier detection and refinement of labels to ensure ongoing semantic reliability. |
| Construct Data | Extend feature construction to introduce synthetic examples and external contextual features to enhance generalization and expressiveness. |
| Integrate Data | Extend beyond technical merging to include semantic harmonization. This ensures that when heterogeneous sources are combined, the resulting units of analysis are not only technically coherent but also conceptually valid. |
| Format Data | Move beyond reshaping datasets for modeling, ensuring that transformations are well documented, transparent, traceable, and aligned with ethical and regulatory requirements. |
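
The "Clean Data" row above mentions systematic outlier detection without prescribing a method. As one possible illustration, the Python sketch below flags numeric outliers with Tukey's interquartile-range rule. The function name `iqr_outlier_indices`, the multiplier `k=1.5`, and the sample readings are assumptions made for this example and do not come from the manifesto.

```python
import statistics
from typing import List, Sequence


def iqr_outlier_indices(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Return the positions of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles by linear interpolation over the observed data (inclusive method).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# Hypothetical sensor readings; the last one looks like a recording error.
readings = [10, 11, 12, 11, 10, 12, 11, 95]
flagged = iqr_outlier_indices(readings)
print(flagged)  # [7]
```

In a curation workflow, flagged records would more plausibly be routed to review or relabeling than silently dropped, which is closer to the label-refinement goal stated in the same row.
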
| CRISP-DM Phase | Revised DCAI Phase | Main Conceptual Difference |
|---|---|---|
| Business Understanding | Understanding Business and Data Requirements | Business goals are inherently data-dependent. |
| Data Understanding | Data Collection and Labeling | The focus shifts from exploratory data inspection to the active construction of the dataset (data acquisition and labeling governance). |
| Data Preparation | Data Curation | Data preparation is reinterpreted as a continuous, dataset-versioned, and auditable curation process (e.g., bias correction, label refinement, and enrichment) instead of a one-off pre-modeling activity. |
| Modeling | Model Training | Training is performed on stabilized, versioned datasets to enable causal attribution of performance changes to data interventions rather than to algorithmic exploration. The focus shifts from model search to controlled learning on curated data. |
| Evaluation | Evaluation (Business and Data Achievements) | Evaluation no longer assesses only business performance. Instead, continuous quality reporting is performed to explicitly verify whether data-centric interventions causally improve robustness, interpretability, reliability, and trustworthiness, alongside business KPIs. |
| Deployment | Deployment and Continuous Data-Centric Operations | Deployment is extended to include governance logs for automated data quality monitoring, drift detection, post-deployment data curation, and feedback loops between operational data and dataset evolution. The system remains “data-alive” after release. |
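
The Deployment row above lists drift detection among the continuous data-centric operations but does not name a metric. The following is a minimal sketch, assuming the Population Stability Index (PSI) as the drift score for a single numeric feature. The function name `population_stability_index`, the ten equal-width bins, the smoothing constant `eps`, and the toy samples are illustrative choices rather than anything specified in the manifesto; both samples are assumed to be non-empty.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    live: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a live sample of one numeric feature."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to width 1.0
    # when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            # Values outside the reference range are clamped to the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(sample), eps) for c in counts]

    p, q = proportions(reference), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Toy data: the live sample is the reference shifted upward by 50 units.
reference = [float(x) for x in range(100)]
shifted = [x + 50.0 for x in reference]
print(population_stability_index(reference, reference))
print(population_stability_index(reference, shifted))
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but any alerting threshold would need to be calibrated on the data at hand.
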
| Three DCAI Stages | Corresponding Revised DCAI Phases (from Table 3) |
|---|---|
| Training Data Development | Understanding Business and Data Requirements → Data Collection and Labeling → Data Curation |
| Inference Data Development | Model Training → Evaluation (Business and Data Achievements) |
| Data Maintenance | Deployment and Continuous Data-Centric Operations |

| Action | Tools | Comment | Reference/URL |
|---|---|---|---|
| Authentication/Authorization | INDIGO-IAM | Industry-grade authentication and authorization mechanism, including the definition of roles and groups. | [97,98] |
| Cloud Orchestration | INDIGO-PaaS | Coordination of the provisioning of virtualized computation resources on both private and public cloud management frameworks. | [99,100,101] |
| Storage Services | StoRM, S3, Rados, Pandora, Sync&Share | Distributed, redundant storage at the petabyte scale, with optional ISO 27001 certification. Fully integrated with authentication and authorization. | StoRM (https://italiangrid.github.io/storm/, accessed on 25 February 2026), Amazon S3 (https://aws.amazon.com/s3, accessed on 25 February 2026), Rados documentation (https://docs.ceph.com/en/reef/man/8/rados/, accessed on 25 February 2026), INFN Pandora (https://pandora.infn.it/, accessed on 25 February 2026) |
| Data Management and Transfer | Rucio, FTS | Data management and transfer tools that scale to the exabyte level. | [102,103] |
| Remote Streaming | Xrootd, WebDAV, Kafka | High-speed remote data access, fully integrated with the authorization and authentication service. | xrootd (http://www.xrootd.org/, accessed on 25 February 2026), WebDAV (http://www.webdav.org/, accessed on 25 February 2026), Kafka (http://kafka.apache.org/, accessed on 25 February 2026) |
| Machine Learning processing environment (suited for DCAI R&D) | AI_INFN | Multi-user environment providing access to specialized hardware (GPU and NVMe) in a scalable and transparent manner. It supports both interactive and distributed processing. | AI_INFN Documentation (https://ai-infn.baltig-pages.infn.it/wp-1/docs/, accessed on 25 February 2026) [104] |
| Multi-purpose analysis processing environments (suited for DCAI R&D) | JupyterHub PaaS | Scalable, transparent, multi-user environment providing access to resources within Data-Cloud, based on Virtual Machines and Docker. It also supports both interactive and distributed processing. | JupyterHub PaaS Documentation (https://guides.cloud.infn.it/users_guides/sysadmin/compute/jh_with_persistence/, accessed on 25 February 2026) |
| | JupyterHub SaaS | Local service provided by ReCaS (one of the INFN Cloud sites), based on Kubernetes. | INFN ReCaS SaaS (https://www.recas-bari.it/index.php/it/recas-bari-i-servizi-it/recas-bari-i-servizi/cloud-recas-software-as-a-service, accessed on 25 February 2026) [105], Kubernetes (https://kubernetes.io/, accessed on 25 February 2026) |
| Anonymization Service | DICOM ToolKit libraries and DicomAnonymizer Python package | Required for sensitive data, such as medical datasets. | DICOM ToolKit (https://www.dcmtk.org/, accessed on 25 February 2026) |
| DCAI | Heterogeneous resources | Different hardware implementations allowing one to find the best platform with which to achieve real-time performance for specific applications, such as the AI compression algorithm for High-Speed Streaming Readout signals. | [106] |
| Offloading | interLink, Virtual-kubelet | Transparent offloading of containerized payloads using Virtual-Kubelet and the interLink extension, which creates a common cloud-native interface for accessing any number of external hardware machines and any type of backend. | interLink (https://intertwin-eu.github.io/interLink/, accessed on 25 February 2026) [107] |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Malerba, D.; Poggi, A.; Alviano, M.; Boccali, T.; Camerlingo, M.T.; Delfino, R.M.; Diacono, D.; Elia, D.; Pasquadibisceglie, V.; Sangiovanni, M.; et al. Data-Centric AI Manifesto: How Data Quality Drives Modern AI. Electronics 2026, 15, 1913. https://doi.org/10.3390/electronics15091913
APA Style: Malerba, D., Poggi, A., Alviano, M., Boccali, T., Camerlingo, M. T., Delfino, R. M., Diacono, D., Elia, D., Pasquadibisceglie, V., Sangiovanni, M., Spinoso, V., & Vino, G. (2026). Data-Centric AI Manifesto: How Data Quality Drives Modern AI. Electronics, 15(9), 1913. https://doi.org/10.3390/electronics15091913

