Data-Centric AI Manifesto: How Data Quality Drives Modern AI

Malerba, Donato; Poggi, Antonella; Alviano, Mario; Boccali, Tommaso; Camerlingo, Maria Teresa; Delfino, Roberto Maria; Diacono, Domenico; Elia, Domenico; Pasquadibisceglie, Vincenzo; Sangiovanni, Mara; Spinoso, Vincenzo; Vino, Gioacchino

doi:10.3390/electronics15091913


by Donato Malerba ¹,*, Antonella Poggi ²,*, Mario Alviano ³, Tommaso Boccali ⁴, Maria Teresa Camerlingo ⁵, Roberto Maria Delfino ², Domenico Diacono ⁵, Domenico Elia ⁵, Vincenzo Pasquadibisceglie ¹, Mara Sangiovanni ⁶, Vincenzo Spinoso ⁵ and Gioacchino Vino ⁵

¹ Department of Computer Science, University of Bari Aldo Moro, 70125 Bari, Italy
² Department of Computer, Control and Management Engineering, Sapienza University of Rome, 00185 Rome, Italy
³ Department of Mathematics and Computer Science, University of Calabria, 70125 Bari, Italy
⁴ INFN Pisa, National Institute for Nuclear Physics, 56127 Pisa, Italy
⁵ INFN Bari, National Institute for Nuclear Physics, 70125 Bari, Italy
⁶ Department of Electrical Engineering and Information Technology, University of Naples Federico II, 80125 Naples, Italy
* Authors to whom correspondence should be addressed.

Electronics 2026, 15(9), 1913; https://doi.org/10.3390/electronics15091913

Submission received: 25 February 2026 / Revised: 10 April 2026 / Accepted: 28 April 2026 / Published: 1 May 2026

(This article belongs to the Special Issue Data-Centric Artificial Intelligence: New Methods for Data Processing, 2nd Edition)


Abstract

Artificial Intelligence (AI) has traditionally been developed according to a model-centric paradigm, in which progress is driven by increasingly sophisticated learning architectures applied to largely fixed datasets. However, this paradigm exhibits well-known limitations, including sensitivity to label noise, distribution shifts, adversarial perturbations, and limited transparency and reproducibility. These issues indicate that many of the current bottlenecks of AI systems arise from deficiencies in data rather than from model design. In this paper, we adopt and formalize the Data-Centric Artificial Intelligence (DCAI) paradigm, which places data quality, semantic consistency, and representativeness at the core of the AI lifecycle. From this perspective, performance, robustness, interpretability, and regulatory compliance are primarily achieved through systematic data engineering, including data curation, enrichment, validation, and continuous monitoring, rather than through repeated model re-engineering. The contributions of this work are threefold. First, a conceptual framework is provided to clarify the epistemic and methodological foundations of DCAI and distinguish it from traditional model-centric approaches. Second, a data-centric lifecycle is presented, covering training data development, inference data design, and data maintenance and integrating techniques such as semantic data representation, active learning, synthetic data generation, and drift-aware quality control. Third, the role of DCAI in the context of Generative AI is analyzed, showing how data-centric practices are essential to ensure robustness, accountability, and responsible deployment of large-scale generative models. Overall, this work positions DCAI as a coherent methodological and technological framework for the development of trustworthy, resilient, and sustainable AI systems, making a research contribution and providing a reference model for industrial and regulatory contexts.

Keywords: data-centric AI; methodology; synthetic data generation; data maintenance

1. Introduction

Over the past decade, the rise of Artificial Intelligence (AI), particularly in Computer Vision and Natural Language Processing, has been repeatedly documented in the literature. AI methods based on Deep Learning (DL) have brought advanced decision-making capabilities to multiple application domains, including healthcare, biology, and finance.

This progress has been largely driven by a Model-Centric AI paradigm, in which increasingly sophisticated learning algorithms are applied to largely fixed datasets. However, the dominance of this paradigm has exposed three interrelated classes of failure. First, technical failures such as adversarial vulnerability, distribution shift, and label noise reveal that model sophistication alone cannot compensate for data deficiencies. Second, operational and governance requirements imposed by emerging regulatory frameworks (e.g., the EU AI Act) require data quality, traceability, and accountability to be treated as design-time constraints, not afterthoughts. Third, risks specific to Generative AI, including hallucination, bias amplification, and provenance gaps, are qualitatively distinct from classical brittleness and stem directly from pre-training corpus composition and curation failures. Collectively, these drivers establish data as the primary bottleneck in the AI development chain and motivate the Data-Centric AI (DCAI) paradigm shift: improving data quality, robustness, and governance to achieve accuracy, resilience, and transferability levels unattainable with model-centric approaches alone.

Based on these premises, in this paper, we adopt and operationalize the DCAI paradigm as a coherent methodological and technological response to the limitations of model-centric approaches. This study was designed to be a perspective and conceptual contribution: rather than reporting results from a single experimental pipeline, it synthesizes converging evidence from the literature, formalizes the foundations of DCAI, and proposes a structured methodological framework for its adoption. While sharing the three-stage lifecycle taxonomy introduced in [1]—training data development, inference data development, and data maintenance—this work makes three distinct contributions. First, it provides a formal mathematical characterization of the DCAI paradigm as a bi-level optimization problem, enabling causal attribution of performance gains to data-centric interventions. Second, it proposes a concrete methodological revision of CRISP-DM that operationalizes DCAI principles into a structured, industrially adoptable process. Third, it presents a case study on the infrastructure requirements for intensive data-centric workloads, illustrating how a federated cloud architecture supports the data management and curation activities central to a DCAI pipeline, offering a concrete scenario to ground the discussion in practice.

1.1. Context: From Model-Centric AI to Data-Centric AI

The European Data Act (https://digital-strategy.ec.europa.eu/en/policies/data-act, accessed on 25 February 2026), which came into effect in January 2024, was designed to position the European Union (EU) as a leader in a data-driven society. In particular, it formulates guidelines for establishing a unified, free market for EU data and for improving the accessibility of data for use in AI. Data has always been central to AI, but conventional model-centric AI handles data as a static resource: it focuses on optimizing the hyperparameters of decision models on specific inference datasets and on learning decision models for specific tasks, often overlooking the actual quality of the inference data considered. This makes it challenging to transfer decision models across tasks or datasets, even within the same problem domain. In contrast, the emerging DCAI paradigm [2] focuses on systematically and algorithmically engineering the data fed to AI systems. Specifically, the DCAI paradigm is guided by the principle that data has a greater effect on AI performance than the design of the decision model. Accordingly, DCAI systematically employs automated engineering principles and techniques to iteratively enhance data quality and, hence, AI performance, by extracting smart data from raw data.

This conceptual shift is illustrated in Figure 1, which contrasts the model-centric pipeline—where feedback loops prioritize model refinement—with the data-centric pipeline, in which feedback from users is directed toward improving data quality. The figure encapsulates the epistemic transition from treating data as passive input to recognizing it as a dynamic asset central to AI optimization.

1.2. Data Quality Indicators

Data quality has been widely discussed in the recent literature [3]. For instance, ref. [4] introduces subjective and objective assessments of data quality and presents three functional approaches for developing objective data quality indicators. Specifically, we report the quality indicators introduced in [4] for the DCAI properties of data quality, semantic consistency, and representativeness. Below, we illustrate the rationale for each mapping between DCAI properties and indicators and describe how each indicator influences robustness, fairness, and interpretability outcomes.

Data Quality is defined as the degree to which a dataset is fit for its intended use in a DCAI pipeline, assessed with respect to two primary indicators: free-of-error and relevancy. The free-of-error indicator captures data correctness, formally measured as the ratio of error-free data units to total data units, and directly conditions model robustness because training on incorrect records propagates systematic prediction errors. The relevancy indicator ensures that only data pertinent to the target task is retained, preventing noise-induced overfitting and improving both the stability and generalizability of learned representations.
Semantic Consistency is defined as the extent to which a dataset conveys meaning in a coherent and complete manner across its schema and instances. It is assessed through understandability and completeness. The understandability indicator reflects how clearly the data's structure and semantics can be interpreted by human stakeholders and automated pipelines alike, directly increasing the interpretability of model outputs. Completeness, operationalized across schema, columns, and population levels via ratio-based metrics [4], ensures that no systematic gaps exist across demographic or feature subgroups, thereby conditioning fairness outcomes: incomplete coverage of a protected group introduces representational bias into model training.
Representativeness is defined as the degree to which a dataset faithfully reflects the target population and is accessible for analytical use, assessed through concise representation, consistent representation, and ease of manipulation. The concise representation indicator ensures that no redundant or irrelevant attributes inflate the feature space, improving model interpretability. The consistent representation indicator, derived from the ratio of constraint violations (e.g., referential integrity breaches) to total consistency checks (the lower this ratio, the higher the consistency), prevents conflicting encodings of the same real-world entity, which would otherwise pose a direct threat to robustness. The ease-of-manipulation indicator reflects how easily stakeholders can process and transform the data for downstream uses. It is operationally linked to fairness, as datasets that are difficult to audit or rebalance resist bias correction interventions.

Together, these mappings provide examples of measurable and reproducible criteria for each property based on a well-established metric framework. They also clarify how each indicator influences the robustness, fairness, and interpretability of DCAI systems.
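As an illustration of how such ratio-based indicators might be computed, the following minimal Python sketch evaluates a free-of-error ratio, a column-level completeness ratio, and a consistency ratio on a toy table. The record layout, the correctness rule, and the constraints are hypothetical choices made for this example and are not taken from [4]; the consistency ratio is computed here as one minus the share of violated checks, so that higher values mean better quality.

```python
from typing import Callable, Sequence

Record = dict  # hypothetical record type: column name -> value (None = missing)


def free_of_error_ratio(records: Sequence[Record],
                        is_correct: Callable[[Record], bool]) -> float:
    """Share of records that pass a user-supplied correctness check."""
    if not records:
        return 1.0
    return sum(1 for r in records if is_correct(r)) / len(records)


def column_completeness(records: Sequence[Record], column: str) -> float:
    """Share of records whose value in `column` is present (not None)."""
    if not records:
        return 1.0
    return sum(1 for r in records if r.get(column) is not None) / len(records)


def consistency_ratio(records: Sequence[Record],
                      constraints: Sequence[Callable[[Record], bool]]) -> float:
    """One minus the share of violated (record, constraint) checks."""
    checks = len(records) * len(constraints)
    if checks == 0:
        return 1.0
    violations = sum(1 for r in records for c in constraints if not c(r))
    return 1.0 - violations / checks


# Toy data, invented for illustration only.
patients = [
    {"age": 34, "sex": "F", "unit": "mmHg"},
    {"age": -2, "sex": "M", "unit": "mmHg"},  # impossible age
    {"age": 51, "sex": None, "unit": "kPa"},  # missing sex, unexpected unit
    {"age": 47, "sex": "F", "unit": "mmHg"},
]

e = free_of_error_ratio(patients, lambda r: r["age"] is not None and 0 <= r["age"] <= 120)
c = column_completeness(patients, "sex")
rho = consistency_ratio(patients, [lambda r: r["unit"] == "mmHg",
                                   lambda r: r["sex"] in ("F", "M", None)])
print(e, c, rho)
```

In practice, the correctness and consistency rules would come from domain experts or from schema constraints rather than from inline lambdas.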

These quality indicators serve as operational criteria throughout the DCAI lifecycle described in Section 3. In particular, the free-of-error and completeness indicators inform the data curation and labeling phases (Section 3.1), while consistency and ease of manipulation are directly relevant to the monitoring and remediation activities of data maintenance (Section 3.3). The same indicators reappear in the revised CRISP-DM methodology (Section 5.2) as measurable targets for the data achievement assessment introduced in the evaluation phase.

1.3. DCAI Formalization

Before discussing why the AI field has shifted from a model-centric to a data-centric paradigm, we formalize both paradigms. In particular, we make what is usually implicit in machine learning pipelines explicit—namely, what is being optimized (model vs. data), and under which assumptions. This formalization is instrumental in enabling causal attribution of performance improvements. Without loss of generality, we focus on predictive tasks.

Let each example in a dataset be a pair $(x, y)$, where:

$x \in X$ is the feature vector (input/covariates);
$y \in Y$ is the label (output/target variable).

A dataset D is a finite sample from a joint probability distribution $p(x, y)$ that captures both the data generation process and the labeling mechanism:

$$D = \{(x_{i}, y_{i})\}_{i = 1}^{n} \sim p(x, y).$$

Three main ingredients characterize model learning from D:

The model parameters, such as connection weights, bias terms, or thresholds, which we denote as $θ \in Θ$ ;
The model hyperparameters, such as the number of hidden layers, the number of neurons per layer, or the activation functions, which we denote as $λ \in Λ$ ;
The loss function evaluated on the dataset D under parameters $θ$ and hyperparameters $λ$ , which we denote as $L (θ, λ; D)$ .

Therefore, a model is fully defined by the pair $(θ, λ)$. Model selection corresponds to fixing $λ$, which encodes design choices made by practitioners (e.g., a neural architecture). Once the model is selected, model fitting (or model training) is possible, which corresponds to finding the $θ$ that minimizes the loss function $L$ for a dataset D.

Typically, the dataset D is partitioned into three non-overlapping subsets:

$D_{train}$ , which is used to fit model parameters $θ$ during the model training;
$D_{val}$ , which is used to validate model training in an outer optimization loop;
$D_{test}$ , which is held out for final, unbiased evaluation.
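A minimal sketch of such a partition is given below. The split fractions, the seed, and the function name `three_way_split` are assumptions made for illustration; real pipelines often need stratified or time-aware splits instead of a plain shuffle.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(data: Sequence[T], val_frac: float = 0.15,
                    test_frac: float = 0.15, seed: int = 0
                    ) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into disjoint train/val/test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_frac)
    n_test = round(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


d_train, d_val, d_test = three_way_split(range(100))
print(len(d_train), len(d_val), len(d_test))
```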

The main difference between the model-centric and data-centric paradigms is in the outer optimization loop, as clarified below.

Model-Centric AI

In the model-centric paradigm, performance improvements are pursued by modifying the model $(θ, λ)$, while D is treated as static. The objective takes the form of a bi-level optimization problem:

$$λ^{*} = \underset{λ \in Λ}{\arg\min} \; L(θ^{*}, λ; D_{val}) \tag{1}$$

$$\text{s.t.} \quad θ^{*} = \underset{θ \in Θ}{\arg\min} \; L(θ, λ; D_{train}). \tag{2}$$

In the outer loop (1), a model is selected by searching the hyperparameter space $Λ$, while the inner loop (2) involves searching for the best model parameter $θ^{*}$ on the basis of both $λ$ and the training set $D_{train}$. The dataset D is entirely passive: it is treated as a fixed constant at both levels. This is the static dataset assumption that characterizes the model-centric approach. Therefore, performance improvements are entirely attributable to changes in $(θ, λ)$, and any confounding effect of data variation is structurally excluded by construction.
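The bi-level structure of (1) and (2) can be made concrete with a deliberately small Python sketch. It assumes a one-parameter linear model $y \approx θx$ and takes $λ$ to be a ridge penalty, so the inner problem has a closed-form solution; as a simplification, the inner loss includes the penalty while the outer loss is the plain mean squared error on the validation set. The data, the grid, and all function names are illustrative assumptions, not part of the formalization above.

```python
from typing import Iterable, Sequence, Tuple

Dataset = Sequence[Tuple[float, float]]  # (x, y) pairs


def fit_theta(lam: float, d_train: Dataset) -> float:
    """Inner loop (2): closed-form ridge fit of y ~ theta * x for a given lambda."""
    sxy = sum(x * y for x, y in d_train)
    sxx = sum(x * x for x, _ in d_train)
    return sxy / (sxx + lam)


def mse(theta: float, data: Dataset) -> float:
    """Mean squared error of the fitted line on a dataset."""
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)


def model_centric_search(grid: Iterable[float], d_train: Dataset,
                         d_val: Dataset) -> Tuple[float, float, float]:
    """Outer loop (1): pick the lambda whose fitted theta has the lowest validation loss."""
    best = None
    for lam in grid:
        theta = fit_theta(lam, d_train)
        loss = mse(theta, d_val)
        if best is None or loss < best[2]:
            best = (lam, theta, loss)
    return best


# Both datasets stay fixed during the search (static dataset assumption).
d_train = [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)]  # last label is noisy
d_val = [(1.0, 2.0), (2.0, 4.0)]

best_lam, best_theta, best_loss = model_centric_search([0.0, 1.0, 4.5, 10.0], d_train, d_val)
print(best_lam, best_theta, best_loss)
```

Note that the search compensates for the noisy training label only indirectly, by shrinking $θ$; the label itself is never touched, which is exactly what the static dataset assumption implies.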

Data-Centric AI

In the data-centric paradigm, the perspective is inverted, since the optimization focus shifts from the model to the data. The selected model is static (static model assumption); that is, the hyperparameters $λ$ are fixed a priori (based on domain knowledge or prior experience), and improvements are sought by modifying the training dataset. The object of optimization is no longer $λ$ but rather the training dataset itself: we search for an improved dataset $D'$ that, when used to train the selected model, yields better generalization with respect to $D_{val}$.

Formally, let $Q(D_{train})$ be the set of datasets of higher quality than $D_{train}$ with respect to a quality indicator $q : \mathcal{D} \to \mathbb{R}$:

$$Q(D_{train}) = \{D' \in \mathcal{D} \mid q(D') > q(D_{train})\}.$$

The function q depends on the basic data quality indicators, such as those introduced in Section 1.2. A concretely computable scalar function q is a weighted linear combination of the following indicators:

$e (D)$ : The free-of-error ratio;
$r (D)$ : Relevancy;
$c (D)$ : Completeness;
$ρ (D)$ : The consistency ratio.

Here, the weights $w = (w_{e}, w_{r}, w_{c}, w_{ρ}) \in \mathbb{R}_{\geq 0}^{4}$, with $\sum_{i} w_{i} = 1$, are domain-specific hyperparameters of the data-engineering process, selected prior to any dataset intervention and held fixed across all iterations. Provided that each indicator takes values in $[0, 1]$, this normalization ensures that $q(D) \in [0, 1]$, making the quality score directly interpretable and comparable across dataset versions. Therefore, $Q(D_{train})$ does not represent arbitrary datasets; it only represents those obtainable through admissible data-centric operations that improve q with respect to $D_{train}$, such as cleaning and relabeling, which improve $e(D)$; augmentation, which improves $c(D)$; and dataset enriching, which improves $r(D)$ (see Section 4).
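The weighted quality score and the membership test for $Q(D_{train})$ can be sketched in a few lines of Python. The dictionary keys, the example weights, and the indicator values below are invented for illustration; the only properties carried over from the text are non-negative weights summing to one and indicators assumed to lie in $[0, 1]$.

```python
from typing import Dict

INDICATORS = ("e", "r", "c", "rho")  # free-of-error, relevancy, completeness, consistency


def quality_score(indicators: Dict[str, float], weights: Dict[str, float]) -> float:
    """q(D) = w_e*e(D) + w_r*r(D) + w_c*c(D) + w_rho*rho(D)."""
    if set(indicators) != set(INDICATORS) or set(weights) != set(INDICATORS):
        raise ValueError("expected exactly the keys e, r, c, rho")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if any(not 0.0 <= v <= 1.0 for v in indicators.values()):
        raise ValueError("indicators must lie in [0, 1]")
    return sum(weights[k] * indicators[k] for k in INDICATORS)


def improves_quality(candidate: Dict[str, float], baseline: Dict[str, float],
                     weights: Dict[str, float]) -> bool:
    """Membership test for Q(D_train): q(D') > q(D_train)."""
    return quality_score(candidate, weights) > quality_score(baseline, weights)


# Hypothetical weights, fixed before any dataset intervention.
weights = {"e": 0.4, "r": 0.2, "c": 0.2, "rho": 0.2}
baseline = {"e": 0.80, "r": 0.90, "c": 0.70, "rho": 0.95}
relabeled = {"e": 0.95, "r": 0.90, "c": 0.70, "rho": 0.95}  # relabeling raises e(D)

q_base = quality_score(baseline, weights)
q_new = quality_score(relabeled, weights)
print(q_base, q_new, improves_quality(relabeled, baseline, weights))
```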

The data-centric objective is then

$$D_{train}^{*} = \underset{D' \in Q(D_{train})}{\arg\min} \; L(θ^{*}, λ; D_{val}) \tag{3}$$

$$\text{s.t.} \quad θ^{*} = \underset{θ \in Θ}{\arg\min} \; L(θ, λ; D'). \tag{4}$$

In the outer loop (3), a training set of better quality is selected via searching $Q(D_{train})$, while the inner loop (4) involves searching for the best model parameter $θ^{*}$ based on both $λ$ and the better-quality dataset $D'$.

Note that $D_{val}$ is held constant in both paradigms. However, in the data-centric case, since $λ$, $D_{val}$, and the training procedure are all fixed, any reduction in $L(θ^{*}, λ; D_{val})$ can be causally attributed to the transition $D_{train} \to D'$ and not to any algorithmic variation.
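The data-centric objective in (3) and (4) can be illustrated with a toy Python sketch in which the model and its hyperparameter are frozen and only the training set varies. It assumes a one-parameter ridge model, a hypothetical plausibility rule standing in for the quality score q, and three hand-made candidate datasets; none of these choices comes from the formalization itself.

```python
from typing import Dict, List, Optional, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs
LAM = 1.0  # lambda is frozen (static model assumption)


def fit_theta(d_train: Dataset, lam: float = LAM) -> float:
    """Inner loop (4): closed-form ridge fit of y ~ theta * x."""
    sxy = sum(x * y for x, y in d_train)
    sxx = sum(x * x for x, _ in d_train)
    return sxy / (sxx + lam)


def val_loss(d_train: Dataset, d_val: Dataset) -> float:
    """Validation MSE of the model trained on d_train."""
    theta = fit_theta(d_train)
    return sum((y - theta * x) ** 2 for x, y in d_val) / len(d_val)


def q(data: Dataset) -> float:
    """Hypothetical quality score: share of records passing a plausibility rule."""
    return sum(1 for x, y in data if y <= 2.5 * x) / len(data)


def data_centric_search(d_train: Dataset, candidates: Dict[str, Dataset],
                        d_val: Dataset) -> Optional[Tuple[str, float]]:
    """Outer loop (3): among candidates with q(D') > q(D_train), minimize validation loss."""
    admissible = {name: d for name, d in candidates.items() if q(d) > q(d_train)}
    if not admissible:
        return None
    best = min(admissible, key=lambda name: val_loss(admissible[name], d_val))
    return best, val_loss(admissible[best], d_val)


d_train = [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0), (4.0, 8.0)]  # (3, 9) is mislabeled
d_val = [(1.0, 2.0), (2.0, 4.0), (5.0, 10.0)]               # held constant
candidates = {
    "drop_suspect": [(x, y) for x, y in d_train if y <= 2.5 * x],
    "relabel": [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)],
    "duplicate": d_train + d_train,  # does not raise q, so it is not admissible
}

baseline = val_loss(d_train, d_val)
result = data_centric_search(d_train, candidates, d_val)
print(baseline, result)
```

Because the hyperparameter, the validation set, and the fitting routine are identical for every candidate, the difference between `baseline` and the selected loss reflects only the change of training data.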

In summary, the two paradigms define complementary bi-level optimization problems over disjoint search spaces:

$$\text{model-centric AI}: \; Θ \times Λ, \quad D \text{ fixed};$$

$$\text{data-centric AI}: \; Q(D_{train}) \times Θ, \quad λ \text{ fixed}.$$

The two spaces are not structurally equivalent: the shift from $Θ \times Λ$ to $Q(D_{train}) \times Θ$ is not a trivial notational substitution but rather a qualitative change; it can even be a change in the nature of the optimization problem. This constitutes a structural shift in what is treated as the object of an engineering effort versus what is treated as given. It formalizes DCAI evaluation as a dataset ablation study with a frozen model, structurally analogous to the architectural ablation studies standard in deep learning research but applied to data rather than model components. It also provides the epistemological foundation for the claim, central to the DCAI paradigm, that data quality, rather than model complexity, is the primary driver of performance gains. The question that naturally arises, then, is why such a shift is necessary in the first place, as discussed in the following section.

1.4. Why a Paradigm Shift Is Necessary

The shift from a model-centric perspective to a data-centric perspective fundamentally changes how AI pipelines are designed and developed. Specifically, DCAI encompasses new approaches to handling data, which is no longer regarded as a fixed constraint within a static environment but as a controllable and influential element of AI systems. This shift is necessary because real-world data quality is often compromised by several issues, including the following:

Label inconsistency and noise: Systematic labeling errors are among the most pervasive threats to the reliability and performance of AI predictive systems [5]. If a training dataset contains inconsistent or incorrect labels, then the decision model may learn to reproduce the annotation noise rather than the underlying ground-truth predictive patterns.
Data distribution problems: Beyond labeling issues, real-world datasets often exhibit distributional problems that may limit decision models' generalization and robustness in deployment. Common challenges include class imbalance and feature distribution skew. In particular, class imbalance (under-representation of minority classes in a training dataset) may lead to biased decision boundaries and poor recall for rare classes [6], while feature distribution skew (a shift in the data distribution from the training data to real-world deployment data) may cause a significant decrease in the accuracy of the decision model when used in the deployment environment [7].
Technical debt arising from data issues: Poor data quality may create cascading technical debt throughout the AI lifecycle [8]. When foundational data problems remain unresolved, they multiply across model learning iterations and infrastructure layers. Key manifestations of this issue include training instability and reduced interpretability. Training instability is commonly caused by high levels of label noise, which may hinder convergence and exacerbate model overfitting to spurious correlations, while the presence of inconsistent or biased patterns in data may reduce the reliability of both feature importance analysis and explainability.
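The first two issues listed above lend themselves to simple diagnostics. The sketch below, written under illustrative assumptions, flags identical feature vectors that carry conflicting labels, computes a majority-to-minority class ratio, and measures the shift in a feature's mean between training and deployment data in units of the training standard deviation. These are crude proxies chosen for brevity, not the methods of [5], [6], or [7].

```python
import statistics
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Set, Tuple


def conflicting_labels(records: Sequence[Tuple[Hashable, str]]) -> Dict[Hashable, Set[str]]:
    """Feature vectors that appear with more than one distinct label."""
    seen = defaultdict(set)
    for features, label in records:
        seen[features].add(label)
    return {f: labels for f, labels in seen.items() if len(labels) > 1}


def imbalance_ratio(labels: Sequence[str]) -> float:
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def standardized_mean_shift(train: Sequence[float], deploy: Sequence[float]) -> float:
    """Absolute difference of means, in units of the training standard deviation."""
    spread = statistics.pstdev(train)
    return abs(statistics.fmean(deploy) - statistics.fmean(train)) / spread


# Toy inputs, invented for illustration only.
records = [((1, 0), "ok"), ((1, 0), "fault"), ((0, 1), "ok")]
labels = ["ok"] * 8 + ["fault"] * 2

conflicts = conflicting_labels(records)
ratio = imbalance_ratio(labels)
shift = standardized_mean_shift([1, 2, 3, 4, 5], [4, 5, 6, 7, 8])
print(conflicts, ratio, shift)
```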

1.5. Expected Impacts on Research, Industry, and Society

The birth of the DCAI paradigm can be traced back to 2021, when Andrew Ng publicly launched the concept of "Data-Centric AI" during the Data-Centric AI Workshop (https://www.datacentricai.org/neurips21/, accessed on 25 February 2026) co-located with the NeurIPS 2021 conference. The academic and industrial communities responded promptly to the paradigm shift advocated for the AI lifecycle, organizing various events and scientific activities to raise awareness of DCAI in both academia and industry.

For example, in 2024, several major international venues hosted workshops dedicated to DCAI research, showing the growing prominence of this paradigm within the AI community. These events included the "DMLR: Data-Centric Machine Learning Research Workshop" (DMLR 2024) (https://dmlr.ai/, accessed on 25 February 2026), co-located with ICML 2024, and the "4th International Workshop on Data-Centric AI" (DCAI24) (https://data-centric-ai-dev.github.io/CIKM2024/, accessed on 25 February 2026), co-located with CIKM 2024. In the same period, some members of the FAIR Transversal Project TP7 (Data-Centric AI and Infrastructure) organized two editions of the "International Workshop & Tutorial on Data-Centric AI" (DEARING 2024 and DEARING 2025) (https://dearing-workshop.github.io, accessed on 25 February 2026), which were co-located with ECML-PKDD in 2024 and 2025. These events highlighted the sustained and expanding interest in the DCAI paradigm within the machine learning and data-mining research community. In addition, lectures illustrating the DCAI research conducted within the FAIR Transversal Project TP7 were organized in the 2024 (https://essai2024.di.uoa.gr/ESSAI-courses.html, accessed on 25 February 2026) and 2025 (https://essai2025.eu/courses/, accessed on 25 February 2026) editions of the European Summer School on Artificial Intelligence (ESSAI 2024 and ESSAI 2025). A tutorial on research advancements helping to transform raw event data into smart predictive knowledge under the umbrella of the DCAI paradigm was also held at ITADATA 2023 (https://www.itadata.it/2023/, accessed on 25 February 2026).

The surge in DCAI has raised awareness not only in academia but also across industries. In 2025, a Gartner report (https://www.gartner.com/en/articles/data-centric-approach-to-ai, accessed on 25 February 2026) clarified that business objectives can be successfully achieved with AI models by accounting for the four pillars of DCAI [9]: exploratory data analysis and preparation, feature engineering, data labeling and annotation, and data augmentation. Following this recommendation, the authors of [10] designed a concrete and flexible approach to implementing a DCAI development process in the mechanical and manufacturing sectors.

2. Data-Centric AI and Generative AI: Convergences and Synergies

The remarkable generative capabilities of recent Large Language Models (LLMs) are well known in both academia and industry. The main factor contributing to this success is the vast amount of data that has been recently made available, in various forms, scales, and usages, for training the LLMs. Despite the extraordinary enthusiasm behind LLM research, most efforts have been focused on improving LLMs, often neglecting the key role of data in the development and inference stages [11]. Based on these premises, in the following sections, we discuss the role of DCAI in the Generative AI (GenAI) ecosystem and how GenAI models can contribute to the development of the DCAI pipeline. There are several recent practical examples of the use of LLMs in DCAI pipelines in multiple domains, such as healthcare [12], remote sensing [13] and business processes [14].

2.1. GenAI as a Catalyst for DCAI

GenAI techniques are undoubtedly a key enabler in the production of synthetic datasets capable of overcoming the problems that plague traditionally acquired datasets [15]. The transition from traditional data augmentation techniques to generative data augmentation techniques represents a significant advance in synthetic data creation [16]. In particular, synthetic data generation enables the use of GenAI models to create artificial data, either by leveraging models pre-trained on large-scale datasets or by training models from scratch on target datasets [17]. In this regard, numerous studies have recently been published in various application domains, including healthcare [18] and finance [19].

Nevertheless, the use of these GenAI techniques should be integrated within a DCAI pipeline to mitigate the risk of overfitting to synthetic samples and of amplifying biases inherent in the training data. As highlighted in [20], synthetic datasets derived from biased or distorted inputs can reinforce existing disparities, particularly in safety-critical settings such as clinical decision support systems. Therefore, in accordance with DCAI principles, rigorous auditing, validation, and bias assessment procedures are imperative to ensure the responsible and trustworthy deployment of GenAI systems [21].

2.2. DCAI as a Foundation for Responsible GenAI

GenAI cannot be considered aligned with the principles of responsible AI if its development and use neglect key DCAI principles, such as upholding data quality, transparency, and governance. One of the main problems identified in the recent literature is recursive-data contamination [22]. Also called model collapse or recursive pollution, this is a critical AI failure mode in which generative models are trained on data produced by earlier generations of themselves. As AI-generated content (synthetic data) increasingly populates the internet and is used to train new models, this feedback loop causes models to drift far away from the true underlying data distribution. This leads to quality degradation, loss of diversity, and amplification of errors and biases.
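The feedback loop described above can be reproduced in miniature. The following sketch is a toy simulation, not the analysis of [22]: each "generation" fits a Gaussian by maximum likelihood to a small sample drawn from the previous generation's fitted Gaussian. The sample size, the number of generations, and the seed are arbitrary assumptions; with small samples the fitted variance tends to shrink over generations, which mirrors the loss of diversity mentioned in the text.

```python
import random
import statistics
from typing import List


def simulate_recursive_training(generations: int = 300, n: int = 10,
                                seed: int = 0) -> List[float]:
    """Variance of a Gaussian refitted, generation after generation, on its own samples."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0          # generation 0: the "true" data distribution
    variances = [sigma ** 2]
    for _ in range(generations):
        # Draw synthetic data from the current model ...
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        # ... and fit the next model on that synthetic data only (maximum likelihood).
        mu = statistics.fmean(sample)
        sigma = statistics.pstdev(sample)
        variances.append(sigma ** 2)
    return variances


variances = simulate_recursive_training()
print(variances[0], variances[-1])
```

Mixing a share of original data back into every generation is one commonly discussed mitigation; it is not modeled in this sketch.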

Based on these premises, systematic adoption of DCAI principles in GenAI may provide an effective opportunity to strengthen the notion of responsible GenAI by integrating principles from human-centered data science. In fact, this discipline challenges the incorrect assumptions that “having enough data is sufficient” and that “the human element can be removed”. Instead, it emphasizes the importance of considering that “data is a product of human choices, shaped by ethical and social contexts” [23].

Accordingly, all human activities that shape data (e.g., collection, design, curation, and refinement) should be guided by both formal methodologies and adaptive, context-specific practices. Hence, despite the recent progress achieved in regard to AI, particularly in GenAI, real-world tasks cannot be solved by machines alone. This underscores the need for socio-technological ensembles that integrate humans and AI to collectively achieve superior results and learn from one another. Such co-intelligence has paved the way for a symbiotic approach to GenAI, ensuring that human values and contextual judgment remain central to responsible and sustainable GenAI development [24].

2.3. GenAI in Service of DCAI

GenAI can be integrated into DCAI pipelines to improve, curate, and annotate data. For example, as reported in [25], LLMs (e.g., GPT-4, PaLM, and Claude) can support annotation automation for tasks such as data labeling, classification, and summarization. In [26], the authors highlight how GenAI models can be used as annotators, helping to replace or augment crowdworkers and enabling the distillation of general-purpose models into smaller, task-specific models. Another interesting direction involves the use of LLMs in missing-data imputation, as proposed, for instance, in [27]. Unlike classical methods such as mean imputation, k-nearest neighbors, MICE, or MissForest, LLMs can exploit semantic relations across heterogeneous columns, domain knowledge (for example, codes, units, and temporal logic), and weak textual signals present in notes or categorical labels, often yielding more realistic values. However, to avoid hallucinations and preserve consistency, prompts should embed hard constraints (such as ranges, regular expressions, or enumerations), and outputs should be post-validated with programmatic checks. Varying the temperature parameter might enable multiple imputation, allowing uncertainty to be quantified through the variance across draws.

Along these lines, an ongoing study (currently being revised) that builds upon [28] addresses the problem of correctly annotating a large dataset. Starting from a small, manually verified subset, LLMs were leveraged to generate additional information following a structured prompt design to maintain coherence and relevance. Two main problems were considered: the presence of very large chunks of textual data (a common issue in text-based LLM processing) and the reproducibility of the answers to specific prompts. To improve reliability, multiple prompt variations and different LLMs were used, and outputs were aggregated through a consensus mechanism. A portion of the generated content was manually reviewed to ensure quality. Afterward, LLMs were again used, this time in an augmentation setting, to generate new data using the curated data as prompt examples in a few-shot learning setting.
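The post-validation and consensus steps mentioned above can be sketched without committing to any particular LLM API. In the Python example below, the candidate strings stand in for answers returned by several prompts or models (the calls themselves are not shown), and the validators, the majority threshold, and all names are hypothetical. This is one possible consensus rule, not the mechanism used in the cited studies.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

Validator = Callable[[str], bool]


def in_range(lo: float, hi: float) -> Validator:
    """Accept strings that parse as a number within [lo, hi]."""
    def check(text: str) -> bool:
        try:
            return lo <= float(text) <= hi
        except ValueError:
            return False
    return check


def matches(pattern: str) -> Validator:
    """Accept strings that fully match a regular expression."""
    compiled = re.compile(pattern)
    return lambda text: compiled.fullmatch(text) is not None


def one_of(*allowed: str) -> Validator:
    """Accept strings that belong to a closed enumeration."""
    return lambda text: text in allowed


def consensus(candidates: List[str], validator: Validator,
              min_share: float = 0.5) -> Optional[str]:
    """Return the most frequent valid answer if it exceeds min_share of all candidates."""
    valid = [c.strip() for c in candidates if validator(c.strip())]
    if not valid:
        return None
    value, count = Counter(valid).most_common(1)[0]
    return value if count / len(candidates) > min_share else None


# Candidate answers as they might come back from several prompts or models.
bp = consensus(["120", "118", "120", "twelve", "120"], in_range(60, 250))
blood = consensus(["A+", "O-", "AB+"], one_of("A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"))
code = consensus(["I10", "I10 ", "hypertension"], matches(r"[A-Z]\d{2}"))
print(bp, blood, code)
```

Cells for which `consensus` returns `None` would be left missing or routed to manual review rather than filled with an unverified value.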

In addition, GenAI can play a significant role in data curation and cleaning. In this regard, the authors of [29] explore the use of LLMs, such as GPT, for data-wrangling tasks, that is, for the cleaning and transformation of “messy” data. These tasks are complex since they typically can only count on a few concrete examples but require substantial domain knowledge. LLMs, by contrast, can learn from short natural-language instructions while integrating extensive domain knowledge. A recent line of work shows how GenAI can be embedded into symbolic, data-centric pipelines to combine natural-language capabilities with auditable reasoning. In particular, ASP Chef was extended with dedicated LLM operations, enabling recipes where answer sets are transformed into prompts, the LLM responses are post-processed, and the extracted facts are reinjected into subsequent steps of the pipeline [30]. This integration supports data-centric tasks such as interactive data exploration, semi-automatic annotation, and iterative data cleaning while keeping the full process grounded in explicit constraints and traceable transformations expressed in ASP. From a DCAI perspective, ASP Chef provides a controlled orchestration layer where GenAI outputs can be validated, filtered, or corrected before affecting downstream models, thus mitigating hallucinations and bias propagation and improving the transparency and reproducibility of GenAI-assisted data workflows [30]. In general, GenAI and DCAI represent two parallel trajectories converging towards a single evolution: the former provides generative capabilities, while the latter ensures rigor and responsibility.

3. Data Lifecycle

The shift from the model-centric to the data-centric pipeline has brought important changes to the data lifecycle in AI. Specifically, the DCAI pipeline can be divided into three primary, closely interconnected stages: training-data development, inference data development, and data maintenance [1]. Each stage contributes to designing a robust DCAI pipeline instance.

3.1. Training-Data Development

The primary objective of the training-data development stage is to gather and produce high-quality, comprehensive datasets that support the training of decision models. Acquiring high-quality data in large quantities involves two key activities: data creation and data processing. Data creation focuses on encoding human intentions into datasets, while data processing prepares data for the learning stage. Specifically, training-data development encompasses the following steps:

Data collection [31]—This entails identifying the most relevant and useful datasets from the available data sources, such as data lakes and marketplaces, often requiring data integration.
Data labeling [32]—This involves assigning one or more labels to data samples to enable supervised learning. As labeling is time-consuming and resource-intensive, techniques like crowdsourcing, consensus learning, and semi-supervised or active learning are often employed to enhance efficiency and reduce costs.
Data preparation [33]—This involves handling noise, inconsistencies, and irrelevant information to prevent the acquisition of biased or inaccurate results. Examples of techniques used in this step include missing-data handling [34] and data correlation removal.
Data reduction [35]—This task involves simplifying datasets while preserving their relevant information. This is done by reducing either the feature dimensions (dimensionality reduction) or sample size (sampling). Among the many possible approaches to performing sample size reduction, instance selection methods such as DBSCAN [36] are some of the most widely used. This algorithm groups data into clusters based on their density in space, with representative points drawn from the innermost core of each cluster and from well-connected points on the borders. This approach is essentially based on the shape of the data. Alternatively, to preserve the information content of the data rather than the dataset’s structure, the authors of [37] present an instance selection method that exploits the concept of entropy to cluster together points that have the same informative content. Then, using a convex hull approach, representative elements are drawn at the boundary of the clusters to also ensure data diversity among the selected points. Thus, the chosen points are both representative and diverse—desirable properties for downstream tasks. On the feature dimension side, recent selection methods increasingly rely on deep learning solutions. For example, ref. [38] proposes DeepFS, an approach that extracts low-dimensional representations from ultra high-dimensional, low-sample-size data and then performs feature screening using multivariate rank distance correlation. This enables precise identification of significant features.
Data transformation and enrichment [39]—This entails extracting smart data from raw data, where the former denotes a refined, semantically enriched representation obtained through systematic transformation and curation operations—such as feature extraction, semantic annotation, and multimodal integration—that increase relevancy and understandability (as defined in Section 1.2) and expose latent patterns suitable for downstream learning tasks. This is achieved by obtaining more explainable objects and handling rich information (e.g., multi-view data [40] and multimodal data [41]) in different formats (e.g., vectors, sequences, images, and graphs).
Data augmentation [42]—This technique increases the size and diversity of the dataset. For example, in [43], the authors investigate the effect of including adversarial samples in the training set to reduce overfitting and improve both the accuracy and robustness of predictive process monitoring.
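To make the density-based instance selection mentioned under data reduction more concrete, the following is a minimal pure-Python sketch of DBSCAN followed by a simple selection rule. As a simplifying assumption, it keeps only the core points of each cluster and discards border and noise points; the function names and the parameters `eps` and `min_pts` are illustrative, and the quadratic neighbor search is only suitable for small datasets.

```python
import math
from typing import List, Sequence, Tuple


def dbscan(
    points: Sequence[Sequence[float]], eps: float, min_pts: int
) -> Tuple[List[int], List[bool]]:
    """Return (cluster label per point, core-point flag per point); -1 marks noise."""
    n = len(points)
    # Each neighborhood includes the point itself, as in the usual definition.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for seed in range(n):
        if labels[seed] != -1 or not is_core[seed]:
            continue  # already assigned, or not dense enough to start a cluster
        labels[seed] = cluster
        frontier = [seed]
        while frontier:
            p = frontier.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    frontier.append(q)
        cluster += 1
    return labels, is_core


def select_core_instances(
    points: Sequence[Sequence[float]], eps: float, min_pts: int
) -> List[int]:
    """Indices of the points kept after reduction (core points only)."""
    _, is_core = dbscan(points, eps, min_pts)
    return [i for i, core in enumerate(is_core) if core]
```

A variant closer to the description in the list would also retain well-connected border points, for example those with several core neighbors.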

3.2. Inference Data Development

The model-centric AI paradigm evaluates decision models primarily through performance metrics (e.g., accuracy, computation time, and memory usage). However, this evaluation approach can overlook critical aspects such as model resilience, robustness, adaptability, and reasoning. The main objective of inference data development is to generate evaluation datasets that provide deeper insights into the decision model’s behavior or activate specific decision model capabilities. Key sub-goals of this stage include the following:

In-distribution evaluation [44]—This process produces samples that are similar to the training data. This helps identify underrepresented groups, prevent bias, reveal decision boundaries, and examine ethical considerations.
Out-of-distribution evaluation [45]—This process generates samples that differ significantly from the training data. Practical examples include developing inference data in dynamic contexts, such as business process executions, where data can change over time [46,47]. In such dynamic settings, the evaluation should account for the ability to detect and handle concept drifts and to obtain decision models that maintain high performance over time. For instance, [48] demonstrates how predictive models can be effective and adaptive in dynamic Industry 4.0 settings.

3.3. Data Maintenance

In real-world applications, data must be continuously updated and curated rather than created each time from scratch. The data maintenance stage ensures the quality and reliability of datasets in dynamic environments. This stage involves the following:

Data comprehension [49]—Techniques like visual summarization, clustering, or statistics are used to organize complex data and produce human-readable insights.
Data quality assurance [4]—Data are monitored and improved continuously. The relevant quality metrics include objective measures (accuracy, timeliness, consistency, and completeness) and subjective assessments from a human perspective. This step also involves addressing the explainability of decision models with respect to both local and global data.
Data storage and retrieval [50]—This involves managing growing datasets through resource allocation strategies to optimize throughput and latency in data systems.
Data representativeness [51]—Five metrics are introduced to evaluate whether an example is representative or an outlier in a dataset, and practical methods for measuring representativeness are proposed. These metrics are given below:
− Adversarial Robustness: This measures how difficult it is to modify an input enough that its classification changes. If a large perturbation is required, then the example is near the center of its class and therefore more representative.
− Holdout Retraining: A model trained with a given example is compared with one trained without it. If predictions change little, the example is well supported by other data and therefore representative.
− Ensemble Agreement: This involves training multiple models and measuring their agreement on a given example. If they produce similar predictions, the example is typical and easy to learn, and therefore representative.
− Model Confidence: This metric examines how confident models are in their predictions (i.e., whether they assign a high probability to a class). Greater confidence indicates clearer, more representative examples.
− Privacy-preserving Training: This involves training models with noise (differential privacy). Representative examples remain correctly classified even with noise, while outliers are more likely to be misclassified.
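Two of the metrics above, ensemble agreement and model confidence, reduce to simple computations once model outputs are available. The sketch below uses simplified proxies rather than the exact definitions of [51]; the function names and the convention that higher scores mean "more representative" are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def ensemble_agreement(predictions: Sequence[str]) -> float:
    """Share of ensemble members that vote for the most common label."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    top_votes = Counter(predictions).most_common(1)[0][1]
    return top_votes / len(predictions)


def model_confidence(class_probabilities: Sequence[float]) -> float:
    """Probability assigned to the predicted (highest-scoring) class."""
    return max(class_probabilities)


def rank_examples(scores: Dict[str, float]) -> List[str]:
    """Example identifiers ordered from most to least representative."""
    return sorted(scores, key=lambda example_id: scores[example_id], reverse=True)
```

Examples at the bottom of such a ranking are candidates for inspection as outliers or possible labeling errors.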

The risk of recursive-data contamination discussed in Section 2.2—wherein generative models trained on AI-produced data progressively drift from the true underlying distribution—leads us to argue that continuous data quality assurance and representativeness monitoring become structurally necessary in any DCAI pipeline that incorporates GenAI components. In this case, complementary tools are required, such as distribution divergence measures relative to a reference human-generated dataset, internal diversity metrics, and data-provenance-tracking mechanisms.
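As one possible instance of a distribution divergence measure relative to a reference dataset, the sketch below computes the Jensen-Shannon divergence between the empirical distributions of a categorical attribute in a reference (human-generated) sample and in a current sample. The choice of this particular divergence and the function names are assumptions; the text above does not prescribe a specific measure.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable


def empirical_distribution(samples: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Relative frequency of each distinct value in the sample."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def js_divergence(p: Dict[Hashable, float], q: Dict[Hashable, float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical distributions,
    1 for distributions with disjoint support."""
    support = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in support}

    def kl_to_mixture(dist: Dict[Hashable, float]) -> float:
        return sum(
            prob * math.log2(prob / mixture[k]) for k, prob in dist.items() if prob > 0
        )

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)
```

In a monitoring loop, the divergence of each new data batch from the reference could be logged alongside provenance metadata, with an alert raised when it exceeds a project-specific threshold.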

4. Techniques and Tools for DCAI

In this section, we illustrate some relevant DCAI techniques for improving data quality in terms of features and labels and highlight useful libraries. In addition, we provide an overview of some tools that may be used within a DCAI pipeline.

4.1. Techniques for Data Cleaning and Selection

Although data collection and data quality assurance are crucial for AI development, AI research has largely focused on learning algorithms rather than the data itself. According to [52], a common concern in industry is that research institutions spend 90% of their efforts on learning algorithms and only 10% on data preparation. Instead, based on the actual time required, the ratio should be reversed, with 90% devoted to data and 10% devoted to learning algorithms.

As a consequence of this research trend, several learning-based approaches have been developed for data curation, e.g., predictive models trained for missing-value estimation [53] or labeling-error correction [54]. An additional challenge is the need to provide a broader context for data. The role of the data enrichment step is to expand an existing dataset with additional information from external sources by recognizing and reconciling values between primary and external sources [35]. This practice is essential in analytics-based projects where the decision model can benefit from external features (i.e., features not present in the original dataset). An example is the analysis of marketing campaign performance in relation to weather conditions [55]. Finally, with the growing collection of big data, data reduction has gained importance as a means to increase training efficiency. Reducing the number of samples makes a training dataset simpler and more representative, lowering memory and computation requirements and helping developers to rebalance datasets dominated by majority classes [35]. Additionally, reducing the number of features by eliminating irrelevant or redundant ones may help mitigate the risk of overfitting [56].

4.2. Techniques for Handling Noise and Incorrect Labels

Developing supervised learning algorithms capable of handling incorrect labels and noise in training datasets is a challenge with considerable practical relevance [57]. With regard to incorrect labels, various mitigation approaches have been described in the recent literature. A pioneering contribution is [5], whose authors introduce the concept of Confident Learning, which focuses on the quality of labels in training datasets rather than on the predictions of learned decision models. Their study was inspired by the principles of noise identification and removal, probabilistic noise estimation, and example ranking for safe training, and it combines these concepts into a general framework based on the assumption of a class-conditional noise process. A further contribution is described in [58], where the authors propose a Label-Distribution-Based Confidence Estimation (LDCE) method that uses the label distribution to estimate label confidence in training datasets. This allows developers to clearly distinguish between clean and noisy labels. Notably, integrating LDCE with existing learning algorithms enables the training of more robust deep neural networks. Turning to noisy input data, the authors of [59] describe how noise, defects, and clouds may compromise the quality of a remote sensing model developed with satellite data. Specifically, they describe a DCAI pipeline that prepares satellite data obtained with Sentinel-1 and Sentinel-2 and show how the two sources can be combined to improve accuracy in forest-disturbance-mapping tasks. The same pipeline is also used in [60], where explanations are additionally used to monitor how the certainty of semantic segmentation decisions relates to the satellite data.
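The core thresholding idea behind Confident Learning can be sketched in a few lines. This is a simplified illustration, not the full method of [5]: it assumes that `probs` holds out-of-sample predicted class probabilities (e.g., obtained by cross-validation), uses the mean self-confidence of each class as its threshold, and flags an example when the class it confidently belongs to differs from its given label. The function name is hypothetical.

```python
from typing import List, Sequence


def find_label_issues(labels: Sequence[int], probs: Sequence[Sequence[float]]) -> List[int]:
    """Indices of examples whose given label is likely wrong."""
    n_classes = len(probs[0])

    # Per-class threshold: average predicted probability of class j over the
    # examples that carry the given label j.
    thresholds = []
    for j in range(n_classes):
        own = [p[j] for y, p in zip(labels, probs) if y == j]
        thresholds.append(sum(own) / len(own) if own else float("inf"))

    issues = []
    for i, (given, p) in enumerate(zip(labels, probs)):
        # Classes the model is confident about for this example.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # no confident class: the example is left untouched
        likely = max(confident, key=lambda j: p[j])
        if likely != given:
            issues.append(i)
    return issues
```

Flagged examples can then be removed, relabeled, or sent to a human reviewer before retraining.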

4.3. Techniques for Smart-Data Extraction and Data Enrichment

The big data era is characterized by a massive amount of heterogeneous data that contains significant latent value [61]. For this reason, one of the key objectives of the DCAI paradigm, as introduced in Section 3.1, is to shed new light on raw data by transforming it into smart data [62]. Consider, for example, text data, which arise in many problems. Thanks to the widespread availability of LLMs, contextual embedding representations of text corpora (e.g., representations obtained with BERT) can be leveraged to obtain informative and explainable knowledge that is useful for both supervised and unsupervised tasks. Recent studies have explored this direction in the field of predictive process monitoring, where raw process instances are converted into semantic narratives suitable for inference by LLMs [12,63]. Elsewhere, smart-data-extraction techniques have been used to transform raw process instances into images [43]. Notably, both images and graphs are able to capture the complex relationships present across the various perspectives that typically characterize a business process. These are also explainable objects that facilitate the use of XAI techniques to equip decision models with explainable decision-making abilities and effectively support stakeholders in their decision-making processes [64].

4.4. Techniques for Semantic Data Preparation

One of the principles of DCAI is that models should be trained and evaluated on data with clear and explicit meaning. This allows organizations to fully leverage data management practices, ensuring greater compliance with data-quality principles and enhancing system performance through formal semantic techniques that enable several automated reasoning tasks, such as consistency checking, inference of new knowledge, and query answering. Several of the limitations of AI systems ultimately stem from semantic issues: inadequate data modeling, ambiguous or inconsistent relationships, and domain assumptions that are not explicitly provided with the data. Such issues can be addressed by equipping data with machine-interpretable formal semantic characterizations, typically in the form of knowledge graphs or ontologies.

Knowledge graphs (KGs) [65] are particularly suited to this role, as they allow one to encapsulate both intensional and extensional knowledge within a single representation. Their dual nature makes KGs a useful tool for data-centric pipelines in at least two scenarios: (i) data preparation, where KGs help in cleaning and enriching data [66,67] and integrating heterogeneous data sources [35,68,69], and (ii) neuro-symbolic AI, where symbolic and sub-symbolic components are integrated into a single decision process [70,71]. In both cases, well-founded semantics are essential. Despite the widespread adoption of KGs in several applications, they are often equipped with inadequate semantics, which severely limits the benefits that KGs could provide. In particular, KGs are sometimes treated simply as plain directed edge-labeled graphs or property graphs [72,73]. In such a context, KGs do not fully leverage the power of proper and rich semantics, as they are effectively restricted to the closed-world assumption; i.e., a KG has a single model corresponding to the graph instance itself. Richer semantics for KGs are based on Description Logics [74,75], which fall under the open-world assumption; i.e., they allow several possible interpretations of a KG to exist. While Description Logics are characterized by well-founded semantics grounded in formal logic, they also, apart from a few exceptions [76], present limitations in modeling capability in that they do not go beyond classical first-order logic interpretations. In other words, they do not allow for the more complex modeling designs that are sometimes required by real KGs and that would benefit from so-called metamodeling capabilities (also known as multilevel modeling), e.g., those allowing for the representation of entities that simultaneously play different roles in the domain represented by the KG.
One of the most relevant and widely adopted formalisms for the management of KGs is RDFS [77], which falls under the open-world assumption, like Description Logics, and also naturally offers metamodeling capabilities. It has been shown that, despite its widespread use, RDFS presents serious drawbacks concerning its standard semantics and its ability to properly manage incomplete information [78,79,80], since it is not grounded in classical logic. This might lead to unexpected or undesired results when performing reasoning tasks over RDFS KGs. In [81], the authors provide a semantic characterization of RDFS KGs that is grounded in classical logic and supports metamodeling capabilities. The same work also presents a thorough analysis of the computational complexity of query answering over KGs under the proposed metamodeling semantics and establishes a theoretical foundation for practical implementations. In particular, regarding the computational viability of the proposed tools compared to traditional data-processing methods, their findings show that, in common circumstances, curating KGs is no more computationally expensive than curating traditional relational data.
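To give a flavor of inference over an RDFS KG, the sketch below forward-chains two standard RDFS entailment rules over a set of triples: transitivity of `rdfs:subClassOf` (rule rdfs11) and propagation of `rdf:type` along `rdfs:subClassOf` (rule rdfs9). It covers only this small fragment and does not implement the metamodeling semantics of [81]; the toy triples and the function name are assumptions. The entity `Eagle` appears both as a class (it has instances) and as an individual (it is an instance of `Species`), which is the kind of dual role that metamodeling is meant to capture.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples: Iterable[Triple]) -> Set[Triple]:
    """Apply rules rdfs9 and rdfs11 until no new triple can be derived."""
    facts = set(triples)
    while True:
        subclass_edges = [(s, o) for s, p, o in facts if p == SUBCLASS]
        derived = set()
        for s, p, o in facts:
            for sub, sup in subclass_edges:
                if sub != o:
                    continue
                if p == SUBCLASS:
                    derived.add((s, SUBCLASS, sup))  # rdfs11: transitivity
                elif p == TYPE:
                    derived.add((s, TYPE, sup))  # rdfs9: type inheritance
        if derived <= facts:
            return facts  # fixpoint reached
        facts |= derived


kg = {
    ("harry", TYPE, "Eagle"),
    ("Eagle", SUBCLASS, "Bird"),
    ("Bird", SUBCLASS, "Animal"),
    ("Eagle", TYPE, "Species"),  # Eagle used as an individual
}
closure = rdfs_closure(kg)
```

Under the open-world assumption, a triple that is absent from the closure is unknown rather than false, so a check such as "`harry` is not an `Animal`" cannot be answered by looking for a missing triple.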

Before concluding this section, a discussion of the scalability–latency trade-off inherent to KG-based semantic data preparation is in order. In general, the computational expense of formal semantic reasoning and graph traversal can be significant, which is seemingly at odds with the low-latency requirements of real-time AI pipelines. However, this apparent conflict is resolved by strictly decoupling the data-engineering lifecycle from the model execution environments. In our envisioned data-centric AI architecture, KG-based operations aimed at improving the quality of data are inherently offline or asynchronous processes. By confining these computationally intensive tasks to the upstream data preparation and data maintenance stages, one can ensure the latency of the real-time inference pipeline remains completely unaffected. This architectural separation ensures that deployed models can operate at peak efficiency while still benefiting from highly curated, semantically rich training data. Consequently, leveraging RDFS-based KGs for large-scale semantic data preparation in data-centric AI pipelines is not merely a theoretical ideal but rather a useful, highly viable strategy that increases data richness and quality without compromising the operational speed of the downstream AI ecosystem.

4.5. Active Learning and Data Augmentation

Active learning is an iterative process that selects the most informative unlabeled examples to reduce the amount of labeled data needed to train supervised machine learning models [82]. Although effective in several areas, active learning has limitations, particularly with imbalanced datasets. Indeed, in imbalanced problems, the predominance of certain classes can introduce bias in sample selection, worsening the performance of minority classes [83]. However, the combination of active learning and data augmentation represents a promising strategy. In particular, generating high-quality synthetic data can reduce dependence on real data, balance the dataset, and improve performance on minority samples, making the selection of new samples to be labeled more efficient and fair. An example of this integration is illustrated in [84], where the authors introduced CAMPAL (Controllable Augmentation Manipulator for Active Learning). This approach integrates data augmentation in a targeted and controlled manner, applying separate policies to labeled and unlabeled data, controlling the intensity of transformations, and ensuring flexibility for most active learning methods. Another example of data augmentation was recently described in [43], where data augmentation is done with adversarial samples produced to fuel an adversarial training procedure. In this case, adversarial training is done to improve the robustness of the decision model to out-of-distribution data.
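The selection step of active learning can be illustrated with entropy-based uncertainty sampling, one common query strategy. This is a generic sketch, not the CAMPAL method of [84]; the function names and the representation of the unlabeled pool as a mapping from example identifiers to predicted class probabilities are assumptions.

```python
import math
from typing import Dict, List, Sequence


def prediction_entropy(probabilities: Sequence[float]) -> float:
    """Shannon entropy (in bits) of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def select_for_labeling(pool: Dict[str, Sequence[float]], budget: int) -> List[str]:
    """Identifiers of the `budget` unlabeled examples the model is least sure about."""
    ranked = sorted(pool, key=lambda ex: prediction_entropy(pool[ex]), reverse=True)
    return ranked[:budget]
```

On imbalanced data, this ranking inherits the bias discussed above, which is one reason to combine it with augmentation or with per-class selection quotas.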

4.6. Transfer Learning and Fine-Tuning

The recent surge in the availability of foundation models, particularly in the fields of Natural Language Processing and Computer Vision, has led the scientific community to increasingly reuse foundation models instead of training new decision models from scratch. This requires the ability to transfer or adapt the knowledge of general foundation models to specific downstream tasks. This idea aligns well with the principles of DCAI, which emphasize the value of data quality over model optimization.

Based on these premises, the authors of several recent studies have applied fine-tuning strategies to adapt foundation models—originally trained on large-scale datasets—to specific domains such as remote sensing [13] and healthcare [12].

Moreover, fine-tuning strategies have also been explored in the context of business processes [46] as a means of adapting predictive deep neural models to the dynamic nature of drifting data. To mitigate catastrophic forgetting and maintain accurate decision models even in the presence of concept drifts, the authors fine-tune the predictive model whenever a concept drift is detected in the streamed data.

Additionally, Parameter-Efficient Fine-Tuning (PEFT) has emerged as a promising alternative to full fine-tuning [85]. By updating only a small subset of foundation model parameters, PEFT techniques significantly reduce computational costs and memory requirements while often retaining accuracy close to that of full fine-tuning. This also makes PEFT techniques particularly suitable for dynamic settings, where frequent model updates may be required.
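The scale of the savings can be illustrated with simple arithmetic for a low-rank adapter, one representative PEFT technique (the text above does not single out a specific method, so this choice is an assumption). Instead of updating a full weight matrix of size d_out × d_in, a low-rank adapter trains two factors of sizes d_out × r and r × d_in. The layer size and rank below are illustrative values.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def low_rank_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for two low-rank factors (d_out x r and r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative layer: a 4096 x 4096 projection adapted with rank 8.
full = full_update_params(4096, 4096)
adapter = low_rank_update_params(4096, 4096, rank=8)
trainable_fraction = adapter / full
```

For this layer, the adapter trains 65,536 parameters instead of 16,777,216, i.e., about 0.39% of the full update.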

4.7. Libraries and Tools

In the last few years, the scientific community has witnessed the proliferation of numerous libraries and tools that support the principles of DCAI. In the following, we present a non-exhaustive list of useful tools and libraries that may be integrated into DCAI-aware projects.

4.7.1. Data Profiling

YData Profiling—YData Profiling supports both Pandas and Spark DataFrames, providing fast and straightforward visual summaries of the data.
SweetViz—SweetViz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code.
DataPrep.EDA—DataPrep.EDA is an EDA (Exploratory Data Analysis) tool implemented in Python. It allows developers to understand a Pandas/Dask DataFrame with a few lines of code in seconds.
Pycol—Pycol implements 29 overlap measures designed to capture class overlap in imbalanced real-world scenarios [86].

4.7.2. Synthetic Data

YData Synthetic—YData Synthetic uses Generative Adversarial Networks for synthetic generation of tabular and time-series data.
Synthetic Data Vault (SDV)—SDV is an ecosystem of synthetic data generation libraries that allows users to easily learn patterns from single-table, multi-table, and time-series datasets. Learned patterns can subsequently be used to generate new synthetic data that has the same format and statistical properties as the original dataset.
Pomegranate—Pomegranate is a package for building probabilistic models in Python. It is implemented in Cython for speed. Most of the integrated models can sample data.

4.7.3. Data Labeling

Label Studio—Label Studio is an open-source data-labeling tool. It allows users to label data types like audio, text, images, videos, and time series with a simple and straightforward UI and export labeled data into various model formats.
Annotation Lab—Annotation Lab is a Natural Language Processing annotation tool included in spark-nlp.

5. Methodological Evolution

A transition toward DCAI requires not only new techniques and tools but also a rigorous and industrial-grade methodology capable of structuring the entire lifecycles of data and models. Methodologies play a critical role in industrial AI projects: they standardize processes, promote reproducibility, ensure traceability of decisions, reduce operational risks, and support cross-functional collaboration among data engineers, domain experts, and decision-makers. In data-intensive contexts, where datasets evolve continuously, quality issues propagate across pipelines, and model behavior is tightly coupled with the dynamic properties of data, a well-established methodology becomes indispensable to ensure robustness, accountability, and sustained performance.

Although several organizations already rely on best practices from software engineering and MLOps (Machine Learning Operations), these frameworks alone do not fully address the specific challenges introduced by a data-centric paradigm, such as iterative data debugging, systematic data quality assurance, dataset versioning, and the orchestration of data-focused development loops. For this reason, the evolution of methodological frameworks becomes a necessary complement to the technological advances discussed in previous sections.

Any attempt to define a modern methodology for DCAI must acknowledge the influence of Cross-Industry Standard Process for Data Mining (CRISP-DM) [87,88], which remains the most widely adopted structured process model for analytics and early AI projects. Developed in 1996–2000 through a major European initiative involving Daimler-Benz, SPSS, and academic partners, CRISP-DM was explicitly designed to provide an industry-neutral, tool-agnostic, and repeatable process for data-driven projects at a time when organizations lacked formal guidelines for transforming raw data into actionable insights.

CRISP-DM’s enduring success stems from its conceptual clarity and the recognition, already present in its original formulation, that data analysis is inherently cyclical and exploratory. The methodology is organized into six phases represented in a circular process model that emphasizes iteration and feedback:

Business Understanding—Definition of objectives, constraints, success metrics, and the business context in which data mining will operate.
Data Understanding—Initial data collection, exploratory analysis, identification of data quality issues, detection of biases, and preliminary assessment of data suitability.
Data Preparation—Construction of the final dataset through cleaning, transformation, integration, and feature selection.
Modeling—Selection and configuration of modeling techniques, hyperparameter tuning, and model construction.
Evaluation—Assessment of the model’s ability to meet business and technical objectives.
Deployment—Integration of the model into operational processes, monitoring, documentation, and maintenance.

The iconic CRISP-DM process diagram is shown in Figure 2. It is commonly depicted with arrows that illustrate the typical dependencies among phases, together with an outer circle emphasizing that the process does not conclude at the deployment stage. Insights derived from a deployed model can, in fact, generate new questions and objectives, thereby initiating a fresh cycle. Within this inherently cyclical framework, bidirectional connections denote stronger interdependence between specific phases.

Beyond the six canonical phases, the CRISP-DM framework is articulated into a series of generic and specialized tasks that provide operational guidance in each stage of the process (Table 1). Each task is accompanied by documentation and reporting activities. This layered structure underscores the methodological richness of CRISP-DM: it is not merely a linear sequence of phases but rather a comprehensive life cycle in which tasks and deliverables are systematically organized to ensure transparency, reproducibility, and alignment with business goals.

Although CRISP-DM remains widely adopted and continues to serve as a solid reference model, it is important to situate it within its historical context in the late 1990s. As noted by Martinez-Plumed et al. [89], the framework performs well in structured, goal-driven projects but becomes less adequate when data science work takes more exploratory or non-linear forms. This observation is directly relevant for DCAI: modern data-centric workflows require flexible methodological trajectories and continuous attention to data quality, aspects that call for an updated examination of CRISP-DM’s limitations with respect to contemporary data-centric AI.

5.1. CRISP-DM Limitations in DCAI

CRISP-DM is not a data-centric AI methodology in the modern sense, but it incorporates several elements that resonate with data-centric principles. Although developed in the late 1990s for classical data mining, its strong emphasis on data comprehension and data preparation anticipated today’s recognition that data quality is critical for model performance. CRISP-DM also includes iterative feedback loops that allow practitioners to revisit earlier phases, including data preparation, thus acknowledging—albeit implicitly—the importance of refining data rather than exclusively focusing on model optimization.

However, the CRISP-DM framework was conceived in a technological landscape that preceded large-scale machine learning pipelines, continuous data streams, Generative AI, and the operational frameworks that underpin modern AI deployments (such as MLOps and GenAIOps). Industrial adoption of DCAI requires methodological refinements and extensions that incorporate the following:

Explicit data-centric cycles;
Systematic dataset versioning and documentation;
Continuous evaluation of data quality;
Integration with MLOps and emerging GenAIOps practices;
Reproducible processes for synthetic data generation, data labeling, and data governance.
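As a small illustration of what systematic dataset versioning can rest on, the sketch below derives a content-based fingerprint for a tabular dataset and wraps it in a version record. It assumes JSON-serializable records; the function names and the fields of the version record are hypothetical, and production setups would typically rely on dedicated data-versioning tooling.

```python
import hashlib
import json
from typing import Any, Dict, List


def dataset_fingerprint(records: List[Dict[str, Any]]) -> str:
    """SHA-256 over a canonical serialization of the records.

    Records are serialized with sorted keys and then sorted themselves, so the
    fingerprint ignores row order and key order but changes with any value."""
    canonical = sorted(
        json.dumps(record, sort_keys=True, separators=(",", ":")) for record in records
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def version_record(records: List[Dict[str, Any]], note: str) -> Dict[str, Any]:
    """Minimal documentation entry to store next to a trained model."""
    return {
        "fingerprint": dataset_fingerprint(records),
        "n_records": len(records),
        "note": note,
    }
```

Storing the fingerprint with each trained model makes it possible to tell later whether two models were trained on the same data, even if the rows were reordered in between.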

These requirements pave the way for a critical examination of the original CRISP-DM framework in relation to DCAI.

5.2. Revisiting CRISP-DM

In our revised data-centric methodology, the initial phase assumes an even more critical role. We designate this stage as “Understanding Business and Data Requirements.” The traditional separation in CRISP-DM between Business Understanding and Data Understanding—where the former precedes the latter—reflects a legacy of model-centric thinking in which business goals are articulated before any systematic analysis of the data. This separation also mirrors historical organizational roles: business analysts define objectives, while data scientists assess feasibility later on. Yet this division often produces a misalignment between ambition and empirical possibility.

The emergence of DCAI challenges this assumption, highlighting that business objectives and data realities are deeply interdependent and should be addressed jointly from the outset:

Business goals are not abstract—they are instantiated through data. A business hypothesis, such as “supplier quality affects process adherence,” is meaningful only insofar as it can be operationalized via available or collectible data. In this sense, business comprehension must be embedded within data comprehension.
Data exploration is not neutral—it is guided by business semantics. Data profiling is never performed in the abstract; it is hypothesis-driven and shaped by domain expectations. The selection of sources, the assessment of data quality, and the interpretation of distributions all depend on business meaning.

By unifying these two phases, the methodology encourages continuous collaboration among stakeholders. Business experts and data scientists co-design data representations, validate assumptions, and refine objectives iteratively. A unified phase reduces documentation redundancy, accelerates feedback loops, and ensures that strategic intent remains continuously aligned with empirical reality.

Beyond clarifying the strategic objectives of the project, this phase demands a systematic elicitation of stakeholder needs and a precise articulation of the data requirements necessary to support them. A key innovation introduced by the data-centric perspective is the early identification of the data modalities—such as text, images, video, multimodal streams, or structured records—and the assessment of their suitability for the intended goal. This phase calls for a preliminary evaluation focused on data availability, representativeness, integrity, and potential biases, including whether critical subgroups are sufficiently captured. These diagnostic activities, traditionally postponed until data preparation or even after model development, must instead occur at the outset to ensure there is a robust and accountable data foundation for the entire AI lifecycle.

A gap in the CRISP-DM framework is the absence of a dedicated phase for data collection and labeling. While CRISP-DM emphasizes data preparation and understanding, it does not explicitly recognize the centrality of data acquisition and annotation as a methodological step. In DCAI, however, the quality of data—not merely the sophistication of models—defines the success of the system. This calls for a revision of CRISP-DM that elevates Data Collection and Labeling to a formal phase.

Data collection itself is no longer just about pulling from a single source. Data now comes from a diverse array of sources, ranging from the web, sensors, and databases to archives. We need to understand where our data is coming from and ensure that it is representative of the problem we are trying to solve. This is especially true in applications like remote sensing or healthcare, where data can be highly heterogeneous.

Once all the data has been collected, the challenge of data labeling arises. Labeling represents the foundation of supervised learning, as each data point must be annotated accurately. Manual annotation has traditionally been used, but it is often labor-intensive and prone to error. Automated annotation can provide support, provided that it is accompanied by careful validation. Ensuring high quality and consistency across annotations remains a crucial aspect of the process. To improve accuracy, domain experts—whether from healthcare, finance, or geospatial applications—are frequently involved. Their expertise ensures that the labels are not only correct but also meaningful.

Finally, the use of active learning should be considered. This technique enables the selection of the most informative examples within a large dataset, avoiding the allocation of resources to data that would not significantly enhance model performance. As a result, costs can be reduced, since the model guides the labeling process and directs human effort only to where it is most valuable.
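As an illustration of how the most informative examples might be chosen, the following minimal sketch ranks unlabeled samples by the entropy of the model's predicted class distribution (uncertainty sampling, which is only one of several possible query strategies). The function names, the `budget` parameter, and the probability values are hypothetical and serve only to make the idea concrete.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(pred_probs, budget):
    """Return the indices of the `budget` samples with the highest entropy."""
    ranked = sorted(
        range(len(pred_probs)),
        key=lambda i: entropy(pred_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled samples and three classes.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, little to gain from a label
    [0.34, 0.33, 0.33],  # nearly uniform, most uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.49, 0.01],
]

# Indices of the samples that would be sent to annotators first.
queried = select_for_labeling(pool_probs, budget=2)
```

In a full loop, the queried samples would be labeled, added to the training set, and the model retrained before the next selection round.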

In summary, the data collection and labeling stage is not just about gathering data; it is about ensuring that data is of the highest quality and consistency. It is a foundational step in building robust AI models, and with practices such as expert annotation and active learning, we are well-positioned to overcome some of the traditional barriers in data-centric projects.

To formalize this phase within a revised CRISP-DM, we propose the following tasks:

Source identification and acquisition: Locate and access the data sources specified in the business–data requirements phase, ensuring compliance with licensing, privacy, and ethical constraints.
Operationalization of labeling schema: Set up annotation protocols and prepare annotators or automated tools to ensure semantic consistency and domain validity.
Annotation execution: Conduct annotation with trained annotators or domain experts, and validate outputs iteratively.
Labeling-quality assurance: Perform inter-annotator agreement checks, validate against gold standards or domain knowledge, and refine ambiguous cases.
Active-learning integration: Use model feedback to prioritize labeling of uncertain or high-value samples, iteratively expanding the labeled dataset with maximum efficiency.
Documentation and governance: Record collection protocols, annotation guidelines, and quality metrics, ensuring compliance with ethical and legal standards and providing transparent documentation to ensure reproducibility.
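The inter-annotator agreement check listed under labeling-quality assurance can be sketched with Cohen's kappa, a standard chance-corrected agreement statistic for two annotators. The choice of kappa is an assumption made for illustration (other agreement measures may be more appropriate for more than two annotators), and the labels below are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six items by two annotators.
annotator_1 = ["churn", "stay", "stay", "churn", "stay", "stay"]
annotator_2 = ["churn", "stay", "churn", "churn", "stay", "stay"]

kappa = cohen_kappa(annotator_1, annotator_2)
```

Items on which the annotators disagree (here, the third one) are natural candidates for the refinement of ambiguous cases mentioned in the same task.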

In the original CRISP-DM methodology, the data preparation phase was conceived as a technical step focused on selecting, cleaning, integrating, and formatting data for modeling. While effective in a model-centric paradigm, this conception is insufficient for DCAI, where the dataset itself becomes the primary object of optimization. To reflect this epistemological shift, we propose renaming and expanding the phase to data curation. This term emphasizes active stewardship of data quality, semantic validity, and learning-aware structuring, positioning the phase as a central determinant of model robustness and trustworthiness.

Table 2 illustrates how the original CRISP-DM tasks can be extended from a DCAI perspective. It clarifies that what was once a narrowly technical phase in CRISP-DM has now become a strategic and epistemic intervention in DCAI. Each original task is preserved but extended: selection evolves into prioritization and curriculum structuring; cleaning expands to include outlier detection and ongoing label refinement; construction incorporates enrichment and augmentation; integration requires semantic harmonization to ensure conceptual validity; and formatting is reframed as a governance-oriented activity that guarantees transparency and reproducibility.

In this way, data curation is not simply a rebranding of data preparation but a substantive redefinition. It positions the dataset as the primary object of optimization, ensuring that the information fed into models is not only technically sound but also semantically coherent, strategically structured, and ethically accountable. This extension underscores the centrality of data quality and meaning in contemporary AI practices, and it provides the methodological bridge between raw acquisition and modeling in the revised CRISP-DM framework.

In CRISP-DM, the modeling phase consists of selecting analytical techniques and applying them to the prepared dataset. It includes tasks such as choosing modeling algorithms, building models, and assessing models. This phase traditionally emphasized the technical configuration of models and their statistical performance (e.g., accuracy, recall, precision, and mean square error). However, in the context of DCAI, the role of modeling is reframed. The model is no longer the locus of innovation; its effectiveness is conditioned by the quality, structure, and semantic integrity of the curated data.

This shift entails several methodological changes:

Model selection based on data semantics: Rather than choosing models solely based on statistical properties, selection is guided by the structure and meaning of the data. For example, curriculum-learning strategies may favor models that can adapt to ordered training sequences.
Integration with active learning loops: Modeling is intertwined with active learning—the model identifies uncertain or high-impact samples, which are then prioritized for labeling or refinement. This creates a feedback loop between model and data.
Sensitivity to annotation quality: Models are evaluated not only on predictive accuracy but also on their robustness to label noise, annotation bias, and semantic ambiguity. This requires diagnostic tools that assess model behavior in relation to data quality.
Alignment with business–data requirements: The model must not only perform well statistically but also produce outputs that are both interpretable to stakeholders and actionable within the specific domain context.
Transparent documentation of data–model interactions: All modeling decisions—including feature selection, training dynamics, and performance metrics—are documented with reference to the curated dataset. This ensures reproducibility and supports governance.

CRISP-DM encourages the common practice of experimenting with different algorithms (Random Forest, SVM, Neural Networks, etc.) and tuning hyperparameters until performance improves, while the dataset is treated as relatively static. DCAI reverses this emphasis: once a model is chosen (say, Random Forest, because of valuable properties such as robustness to noise and interpretable feature importance), it is fixed, and iteration is performed on the dataset instead. The dataset is the “variable” because it can be improved in many ways:

Cleaning mislabeled or inconsistent examples;
Adding representative samples;
Augmenting with synthetic or contextual data;
Harmonizing semantics across sources;
Structuring training sequences (curriculum learning).

The key idea is that performance gains come more from improving the dataset than from swapping algorithms. For this reason, the traditional CRISP-DM term modeling is better reframed as Model training, since the emphasis is no longer on experimenting with alternative algorithms but on controlled learning on curated data.
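The idea of fixing the learner and iterating on the dataset can be sketched as follows. The example uses a deliberately trivial nearest-centroid classifier on one-dimensional toy data rather than the Random Forest mentioned above; the data points, the `corrections` mapping, and all function names are hypothetical. Only the training labels change between the two dataset versions, while the learning procedure and the held-out set stay the same.

```python
def fit_centroids(dataset):
    """Fixed learner: one mean feature value per class."""
    sums, counts = {}, {}
    for x, y in dataset:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def accuracy(centroids, dataset):
    return sum(predict(centroids, x) == y for x, y in dataset) / len(dataset)


# Version 1 of a toy training set; indices 3 to 5 carry wrong labels.
dataset_v1 = [
    (0.8, 0), (1.0, 0), (1.2, 0),
    (1.1, 1), (0.9, 1), (1.4, 1),   # mislabeled: these belong to class 0
    (5.0, 1), (5.2, 1), (4.8, 1),
]

# Version 2: the same records after a relabeling pass (index -> corrected label).
corrections = {3: 0, 4: 0, 5: 0}
dataset_v2 = [(x, corrections.get(i, y)) for i, (x, y) in enumerate(dataset_v1)]

# The held-out set is shared by both runs so that results are comparable.
held_out = [(1.0, 0), (2.2, 0), (2.5, 0), (2.9, 0), (3.5, 1), (4.5, 1), (5.5, 1)]

acc_v1 = accuracy(fit_centroids(dataset_v1), held_out)
acc_v2 = accuracy(fit_centroids(dataset_v2), held_out)
```

Because nothing but the labels differs between the two runs, the difference between `acc_v1` and `acc_v2` in this toy setting reflects the data intervention alone.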

In the CRISP-DM framework, the Evaluation phase does not primarily focus on the intrinsic technical performance of the model—such as accuracy, generalization, or overfitting—which is already addressed in the modeling phase. Instead, evaluation is explicitly oriented toward assessing results with respect to the original business objectives and success criteria. Its purpose is to determine whether the discovered knowledge and deployed models effectively support decision-making and deliver measurable business value. The expected outcome is therefore an overall assessment of the data-mining project with respect to business success, including a final judgment on whether the initial objectives have been achieved.

In DCAI, business objectives and data are intrinsically intertwined: the achievement of strategic goals depends directly on the integrity, representativeness, and semantic validity of the curated dataset. Models do not generate business value in isolation; rather, their utility emerges from the quality, structure, and governance of the data on which they are built. Consequently, the evaluation phase cannot be confined to verifying whether deployed models deliver measurable business outcomes. It must also explicitly assess whether the data-centric interventions themselves have been effective.

This dual dependency reframes the original CRISP-DM “result evaluation” task as a two-fold validation process:

Business achievement assessment: This process corresponds to the traditional CRISP-DM objective of determining whether model outputs effectively support decision-making, align with strategic goals, and generate tangible value in the target application domain. This includes verifying that the model’s behavior is consistent with operational constraints, performance thresholds, regulatory requirements, and organizational priorities.
Data achievement assessment: This stage involves evaluating whether the iterative data curation actions have been successfully implemented, stabilized, and made sustainable. The central purpose of this assessment is to determine whether—and to what extent—improvements in data quality are demonstrably linked to measurable gains in model robustness, interpretability, reliability, and trustworthiness.

For instance, in the industrial use case of customer churn prediction, where the strategic objective is to identify customers at risk of defection in order to trigger timely and cost-effective retention actions, the business achievement assessment focuses on business impact, such as churn reduction rate, uplift in customer lifetime value, and the return on investment of retention campaigns. The model is thus assessed not simply as a predictor but as a decision-support component embedded in a broader socio-technical process.

This assessment is inseparable from a systematic data achievement assessment, which verifies whether a dataset resulting from iterative curation is epistemically sound and fit for supporting a given business objective. This is operationalized by systematically comparing a baseline dataset with successive curated versions obtained through controlled interventions such as relabeling, targeted enrichment of high-risk customer segments, correction of missing or inconsistent usage records, and refinement of semantic feature representations. By keeping the learning algorithm fixed across dataset versions, any observed improvements in out-of-sample stability across time windows and customer subgroups, calibration errors, and explanation consistencies can be causally attributed to data-centered modifications rather than to model tuning. For instance, increases in inter-annotator agreement on churn labels and improved segment-level coverage must be shown to correspond to statistically significant gains in subgroup robustness and explanation stability. Sustainability is finally assessed by verifying that these improvements persist under post-deployment data drift through continuous monitoring of data quality and model behavior. In this way, the data achievement assessment provides empirical evidence that curated data, rather than model complexity, is the primary driver of durable performance gains in churn prediction.
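One of the quantities compared across dataset versions above is calibration error. A minimal sketch of how it could be measured is given below, using the expected calibration error (ECE) with equal-width confidence bins. ECE is assumed here as one common calibration metric, and the confidence values and correctness flags are hypothetical; in practice, they would come from the same fixed model trained on the baseline and on the curated dataset.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - acc)
    return ece


# Hypothetical held-out predictions of the same model on two dataset versions.
# Each prediction has confidence 0.9; `1` marks a correct prediction.
ece_baseline = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
ece_curated = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

The same function can be applied per customer subgroup or per time window to obtain the segment-level comparisons described above.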

Finally, from a DCAI perspective, the Deployment phase of CRISP-DM must be significantly extended beyond model release and reporting. It is no longer a terminal step; it is the entry point of a continuous data–model co-evolution process. Four extensions are, in our view, methodologically central. First, automated data quality and drift monitoring becomes a foundational requirement in order to continuously detect distribution shifts, label degradation, and loss of representativeness. Concretely, the representativeness metrics introduced in Section 3.3 provide a principled and model-agnostic toolkit for this monitoring layer. Second, dataset versioning and data lineage governance are introduced to ensure the reproducibility, traceability, and auditability of all deployed models. Third, data-driven continuous learning mechanisms enable retraining and remediation to be triggered by data-centric signals rather than by performance decay alone. Finally, explainability and trust monitoring in production ensure that model decision rationales remain stable, meaningful, and aligned with business and regulatory expectations over time.
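A drift-monitoring check of the kind described in the first extension might look like the following sketch. It uses the Population Stability Index (PSI) as a stand-in statistic; this is an assumption made for illustration and is not necessarily one of the representativeness metrics of Section 3.3. The bin edges, the smoothing constant `eps`, and the data are hypothetical.

```python
import math


def population_stability_index(reference, current, edges, eps=1e-6):
    """PSI between two samples of one numeric feature, given sorted bin edges."""

    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges that v reaches or exceeds.
            counts[sum(v >= e for e in edges)] += 1
        # Empty bins are floored at eps so that the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values at training time and in production.
training_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
production_values = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
edges = [3.5, 6.5]

psi_same = population_stability_index(training_values, training_values, edges)
psi_shift = population_stability_index(training_values, production_values, edges)

# A frequently quoted rule of thumb treats PSI above 0.25 as a substantial shift;
# the threshold is an assumption and should be tuned per application.
needs_review = psi_shift > 0.25
```

A signal such as `needs_review` is an example of the data-centric trigger mentioned in the third extension, since it can fire before any drop in model performance is observed.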

Table 3 summarizes the mapping between the canonical CRISP-DM phases and their reinterpreted counterparts from a DCAI perspective, highlighting the principal conceptual differences introduced by the revision.

As discussed in Section 3, the DCAI pipeline can be articulated into three macro-stages—training-data development, inference data development, and data maintenance—each governing a distinct but tightly interconnected segment of the data life cycle. These stages describe what happens to the data throughout the lifetime of an AI system, from its initial construction to its operational evolution.

At the same time, the revision of CRISP-DM proposed within the DCAI framework provides a finer-grained, process-oriented view of how AI systems are developed when data is considered the primary driver of performance and reliability. The six revised phases—ranging from comprehension of business and data requirements to continuous data-centric operations—capture the methodological steps that guide practitioners in building and maintaining trustworthy AI systems.

Conceptual alignment between these two perspectives is essential. The training-data-development stage spans the early and central phases of the revised DCAI process, encompassing analysis of business and data requirements, data collection and labeling, and ongoing data curation. Conversely, inference data development aligns with the modeling and evaluation phases, as it focuses on constructing and managing the data required for in-distribution, out-of-distribution, and drift-aware assessments. Finally, data maintenance corresponds to Deployment and Continuous Data-Centric Operations, emphasizing the post-deployment evolution of datasets through monitoring, drift detection, feedback loops, and ongoing data quality assurance.

To clarify the relationship between the two perspectives, Table 4 maps the three high-level DCAI stages onto the revised CRISP-DM phases, highlighting how each stage contributes to designing a robust, data-alive pipeline.

5.3. Axes of Methodological Differentiation from CRISP-DM

The proposed revision constitutes a clear departure from CRISP-DM, which can be articulated along three relevant orthogonal axes.

Axis 1: Process Temporality and Versioning

In the original CRISP-DM, data preparation is a one-shot, pre-modeling activity: the dataset is treated as a fixed input, cleaned, and formatted once before being handed off to the modeling phase. In the DCAI-revised methodology, data curation is continuous, iterative, and versioned. This requires adopting dataset-versioning systems, which are analogous to source-code version control and operationalized by tools such as DVC [90]. In these systems, every transformation applied to the dataset (relabeling, enrichment, augmentation, and semantic harmonization) is recorded as a reproducible, auditable commit. This produces a dataset changelog that can be inspected, rolled back, and compared across versions—a capability entirely absent from the original CRISP-DM workflow and from conventional preprocessing pipelines.
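The notion of a dataset changelog can be illustrated with a small in-memory sketch in which every transformation produces a new content-addressed snapshot that can later be checked out again. This is not how DVC works internally and does not use its API; the class `DatasetChangelog`, its methods, and the example records are hypothetical and only mimic the commit and rollback behavior described above.

```python
import hashlib
import json


def fingerprint(records):
    """Short content hash of a dataset snapshot (sensitive to record order)."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class DatasetChangelog:
    """Keeps every dataset version together with a note describing the change."""

    def __init__(self, records):
        self._snapshots = {}
        self.history = []  # ordered list of (version_id, note)
        self._store(records, "initial import")

    def _store(self, records, note):
        snapshot = [dict(r) for r in records]  # copy to keep versions immutable
        version_id = fingerprint(snapshot)
        self._snapshots[version_id] = snapshot
        self.history.append((version_id, note))
        return version_id

    def commit(self, transform, note):
        """Apply `transform` to the latest version and record the result."""
        latest = self.checkout(self.history[-1][0])
        return self._store(transform(latest), note)

    def checkout(self, version_id):
        """Return a copy of the records stored under `version_id`."""
        return [dict(r) for r in self._snapshots[version_id]]


def relabel_record_2(records):
    for r in records:
        if r["id"] == 2:
            r["label"] = "card_lost"
    return records


log = DatasetChangelog([
    {"id": 1, "text": "my card is missing", "label": "card_lost"},
    {"id": 2, "text": "I lost my card", "label": "balance"},
])
v0 = log.history[0][0]
v1 = log.commit(relabel_record_2, "relabel id=2 after expert review")
```

Checking out `v0` returns the original labels, so any curation step can be inspected, compared with its successor, or rolled back.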

Axis 2: Optimization Target Inversion

The traditional data-engineering and original CRISP-DM modeling phase treat the algorithm as the primary variable under optimization: practitioners iterate over alternative models (Random Forest, SVM, Neural Networks, etc.) and tune hyperparameters while keeping the dataset relatively static. The DCAI methodology inverts this relationship, in accordance with the formal framework introduced in Section 1.3. As expressed by Equations (3) and (4), the hyperparameters $\lambda$ are fixed, and the training dataset $D'$ becomes the object of optimization. Once a model has been selected on the basis of domain-informed criteria (such as robustness to label noise, interpretability of feature importance, or alignment with regulatory constraints), it is fixed, and iteration is performed exclusively on the dataset. This inversion enables causal attribution of performance changes: as the learning algorithm and its configuration are held constant across dataset versions, any observed variation in out-of-sample performance, subgroup robustness, or calibration error can be unambiguously attributed to specific data-centric interventions rather than confounding algorithmic choices. This protocol is structurally equivalent to a dataset ablation study (analogous to the architectural ablation studies standard in deep learning research) and provides a level of empirical accountability that model-centric workflows cannot offer.

Axis 3: Model-Guided Data Remediation via Active Learning

In standard preprocessing pipelines, data quality assessment is performed before model training, using heuristic or statistical criteria (e.g., missing-value rates, class imbalance ratios, and inter-annotator agreement scores). The DCAI methodology structurally integrates active learning as a feedback mechanism between the model and the dataset: the model, once trained on a curated version of the data, identifies samples associated with high predictive uncertainty, high loss, or inconsistent explanations, which are then prioritized for targeted remediation—relabeling, enrichment, or removal. This creates a closed loop between model training and data curation, which does not exist in conventional preprocessing workflows or in the original CRISP-DM. The result is that the dataset evolves in response to model behavior rather than being constructed independently of it. Empirical evidence from recent work on data-centric benchmarks [1] and programmatic labeling frameworks such as Snorkel [91] confirms that this feedback-driven approach to dataset construction yields measurable gains in model robustness and label consistency that cannot be achieved by algorithm selection alone.

These differences are operationalized through concrete tooling and experimental protocols, and they produce qualitatively different empirical outputs—dataset changelogs, ablation evidence, and active learning convergence curves—that are not artifacts of conventional preprocessing or data-engineering practices but rather the result of a fundamentally different methodological paradigm centered on data as the primary object of optimization.

5.4. Implications for Industrial Adoption and Links to MLOps and GenAIOps

The revision of CRISP-DM in a DCAI perspective has direct and profound implications for industrial adoption. By extending the deployment phase into a regime of continuous data-centric operations, the methodology naturally converges with the principles of MLOps and, more recently, GenAIOps. MLOps can be understood as a set of practices intended to operationalize machine learning systems through automation, reproducibility, continuous integration, monitoring, and lifecycle governance of models [92]. From a DCAI standpoint, however, models are no longer the sole—or even the primary—operational artifacts. Instead, datasets, labeling processes, quality metrics, and data transformations become first-class, versioned, and continuously monitored assets. This makes MLOps insufficient in isolation and calls for its explicit extension toward continuous data operations, including automated data quality monitoring, drift detection, dataset lineage tracking, and data-triggered retraining pipelines. In this sense, DCAI shifts the center of gravity of MLOps from model orchestration to data orchestration.

The recent emergence of GenAIOps further amplifies this need (https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/genaiops-for-mlops, accessed on 25 February 2026). Generative AI systems are inherently data-hungry, continuously evolving, and highly sensitive to shifts in data distributions, semantic drift, and prompt–data interactions. As a result, GenAIOps extends traditional MLOps by introducing automated management of prompts, synthetic data generation [93], continuous curation of human feedback, and governance mechanisms for data provenance, hallucination risk, and safety constraints. These mechanisms align naturally with the DCAI principles introduced in the revised CRISP-DM, particularly with respect to automated data quality monitoring, iterative data curation, and post-deployment data-driven feedback loops.

This integration is also decisive from a governance and compliance perspective. Regulatory frameworks such as the EU AI Act, together with established data protection and auditability requirements, impose strict obligations on transparency, traceability, dataset documentation, and continuous risk monitoring. The revised CRISP-DM provides the methodological governance layer for satisfying these requirements, while MLOps and GenAIOps supply the technical infrastructure for enforcing them at scale through automated logging, dataset version control, explainability pipelines, and continuous compliance checks.

From an industrial standpoint, the convergence of DCAI-oriented CRISP-DM with MLOps and GenAIOps enables organizations to move beyond fragile, one-shot AI deployments toward sustainable, continuously learning socio-technical systems. The methodology ensures conceptual coherence and accountability, while the operational frameworks ensure scalability, robustness, and regulatory alignment. Their synergy guarantees that business objectives, data evolution, and model behavior remain continuously aligned, auditable, and responsive to real-world dynamics—thus suggesting that DCAI is not only a scientific paradigm but also a potentially viable and governable industrial practice.

6. Use Case: Data-Centric Improvement of Text Classification via Confident Learning

In many real-world Natural Language Processing (NLP) applications, the quality of labeled data is a critical bottleneck. Widely used benchmark datasets may contain mislabeled or ambiguous instances due to human annotation errors or unclear textual content. This issue is especially relevant in text classification tasks, where semantic overlap between classes can lead to systematic labeling inconsistencies.

To illustrate, as a focused and demonstrative example, how one specific DCAI technique can contribute to addressing this challenge, we examine a practical use case proposed by Cleanlab (https://docs.cleanlab.ai/v2.6.3/tutorials/text.html, accessed on 25 February 2026) based on a subset of the Banking77-OOS Dataset (https://arxiv.org/abs/2106.04564, accessed on 25 February 2026) containing 1000 customer service requests. These requests can be classified into 10 categories corresponding to the intent of each request. Specifically, in this use case, a DCAI approach is adopted by leveraging the confident-learning framework, implemented through Cleanlab, to detect and mitigate label errors automatically. Rather than improving model architecture, in this case study, the dataset’s quality is enhanced, and the impact of this dataset intervention on downstream model performance is assessed.

To this end, a baseline text classifier was trained by using a pre-trained foundation model, ELECTRA (https://huggingface.co/google/electra-small-discriminator, accessed on 25 February 2026), to encode examples and a Logistic Regression for classification. The trained model generates out-of-sample predicted class probabilities for each example. These probabilities serve as the key input for confident learning. Confident learning identifies inconsistencies between the observed labels and the model’s predicted label distributions. It estimates the joint distribution between noisy (observed) and latent (true) labels. This enables detection of samples whose assigned labels are unlikely given the model’s confidence, yielding a ranked list of potential label issues and a quality score for each example. To evaluate the practical impact of these errors, a cleaned version of the dataset is created by removing the lowest-quality samples identified by confident learning, and the same classification model is retrained on this filtered dataset. Keeping the model fixed ensures that any performance differences can be attributed solely to the data intervention. The results show that this DCAI intervention leads to consistent increases in classification accuracy. Notably, these gains were achieved without modifying the model architecture or tuning hyperparameters. This highlights the effectiveness of label error detection as a standalone optimization strategy. This illustrative example suggests that confident learning may provide a scalable and model-agnostic approach for improving dataset quality in NLP tasks. While this single use case does not constitute a comprehensive validation of the broader DCAI framework, it demonstrates the concrete potential of targeted data-centric interventions in a controlled setting.
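The core step of confident learning can be sketched in a few lines. The version below is a simplified illustration and not Cleanlab's implementation or API: it computes one confidence threshold per class (the mean predicted probability of that class among the examples carrying that label), flags an example when a different class is the most probable one among those exceeding their thresholds, and ranks flagged examples by the probability assigned to their given label. The function names and the probability values are hypothetical.

```python
def class_thresholds(labels, pred_probs, n_classes):
    """Mean predicted probability of class k over examples labeled k."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, pred_probs):
        sums[y] += p[y]
        counts[y] += 1
    return [sums[k] / counts[k] if counts[k] else 1.0 for k in range(n_classes)]


def flag_label_issues(labels, pred_probs):
    """Return (index, suggested_class, quality_score), lowest quality first."""
    n_classes = len(pred_probs[0])
    thresholds = class_thresholds(labels, pred_probs, n_classes)
    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # the model is not confident about any class
        likely = max(confident, key=lambda k: p[k])
        if likely != y:
            issues.append((p[y], i, likely))  # p[y] acts as the quality score
    issues.sort()
    return [(i, likely, score) for score, i, likely in issues]


# Hypothetical out-of-sample probabilities for six examples and two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1], [0.8, 0.2], [0.1, 0.9],   # third example looks mislabeled
    [0.2, 0.8], [0.1, 0.9], [0.3, 0.7],
]

issues = flag_label_issues(labels, pred_probs)
flagged = {i for i, _, _ in issues}

# Cleaned dataset: drop the flagged examples before retraining the same model.
cleaned_labels = [y for i, y in enumerate(labels) if i not in flagged]
```

In the use case above, the predicted probabilities would come from the ELECTRA-plus-Logistic-Regression baseline obtained out of sample, and the filtered dataset would be used to retrain that same classifier.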

7. DCAI in Real-Life Applications

The DCAI paradigm is becoming fundamental in real-world applications where rapid and reliable decisions are critical. Two domains are particularly instructive.

In industrial robotics and predictive maintenance [94], AI systems must anticipate equipment failures in real time by processing large streams of continuously updated sensor data. These streams frequently exhibit class imbalance between failure and normal-operation events and are subject to distribution shifts as machinery ages or operating conditions change. A DCAI approach addresses these challenges through the data reduction and active learning techniques described in Section 4.5, along with the drift-aware monitoring mechanisms discussed in Section 3.3. Rather than retraining models from scratch when performance degrades, the data-centric pipeline detects the root cause of degradation at the data level—whether it be noise, label drift, or loss of representativeness—and triggers targeted remediation.

In emergency healthcare, streams of clinical data, vital signs, and heterogeneous patient records are generated and integrated in real time. Completeness, timeliness, and representativeness, as defined in Section 1.2, are essential for improving the accuracy and fairness of clinical decision support models, as shown in [12]. In that work, the fine-tuning strategies described in Section 4.6 have also been directly applied, demonstrating that data quality and curation, rather than architectural choices, are the primary determinants of clinical predictive performance.

In both domains, the revised CRISP-DM methodology presented in Section 5 provides the governance structure that transforms these techniques into a reproducible, auditable, and continuously improving pipeline—one in which data quality, rather than model complexity, is the primary driver of sustained performance. Additionally, investing in data curation, standardization, and engineering enables AI systems to respond more confidently and robustly to urgent circumstances than model-centric approaches, where data is treated as static or secondary. This constitutes one of the main research and application directions of DCAI in the contemporary industrial and healthcare sectors.

Despite its promise, the DCAI paradigm is not without limitations. Iterative data curation is resource-intensive: repeated cycles of relabeling, enrichment, and quality assessment require sustained human expertise and computational effort, which may be prohibitive in low-resource settings or for organizations lacking dedicated data-engineering teams. Moreover, DCAI is inherently dependent on the availability of sufficient data to begin with: in domains characterized by data scarcity, the margin for data-centric improvement may be structurally limited.

8. Infrastructure for Intensive DCAI: Features and Solutions

To enhance the efficiency of DCAI applications, the underlying infrastructure must be able to handle data-intensive workloads; provide adequate storage volumes and high-speed and concurrent access to data; and support data-intensive computing. An international data-centric model of cloud-federation infrastructure inherently matches these requirements and maximizes the performance of DCAI applications, since it promotes global data management/governance and the integration and distribution of computing resources in a few central nodes accessible by scientific and commercial collaborations. As shown in Figure 3, a data-centric infrastructure is realized by decoupling storage and computing centers, providing a services backbone, and integrating these distributed resources under a common access layer and data governance. In this way, the federation overcomes the limitations of isolated data centers, creating a unified, scalable, and interoperable ecosystem appropriate for the most demanding scientific AI workloads. The key features are virtualization, layering, scalability, flexibility, heterogeneity, and interoperability.

Specifically, virtualization allows the abstraction of physical resources into isolated units (virtual machines or containers), enabling different projects to run their specific AI tools on shared hardware. This ensures a consistent, reproducible environment, which is vital for MLOps. Layering provides logical separation into service tiers (IaaS, PaaS, and SaaS, which stand for Infrastructure as a Service, Platform as a Service, and Software as a Service), allowing the provider/administrator to manage hardware while delivering specialized platforms to the users for their analysis. In other words, users can focus on their models rather than infrastructure maintenance. Scalability is the capacity to increase (or decrease) computing power and storage. An Infrastructure-as-Code approach and the decoupling of storage and computing centers in the data-centric model support almost instant, automatic scaling to meet variable demands. A flexible and heterogeneous infrastructure easily adapts to diverse, rapidly changing needs, offering configurable, on-demand resource allocation. This is essential because AI workloads fluctuate significantly, demanding bursts of CPU cores for data preparation one day and dedicated GPUs for training the next. Indeed, in heterogeneous systems, computing accelerators are combined with traditional high-performance computing for compute-intensive and data-intensive workloads that have relied on fundamental parallel and distributed multicore architectures. Finally, interoperability enables different IT systems to exchange data and share resources and functionalities with minimal end-user intervention.

Therefore, DCAI can substantially benefit from the integration of modern computing accelerators. In this context, devices such as GPUs and emerging accelerators such as TPUs play a crucial role. Their highly parallel architectures are particularly well suited to matrix and tensor operations, which dominate both large-scale data preprocessing and deep learning model training. Compared to CPU-based solutions, these accelerators can drastically reduce training time while enabling efficient handling of large data batches, with improved energy efficiency and cost–performance ratios for equivalent workloads. The availability of large-scale GPU-enabled computing resources within the infrastructure enables the deployment of fully GPU-accelerated DataOps and MLOps pipelines. This capability is not limited to model training, extending to the entire data-centric workflow, including data ingestion, transformation, annotation, validation, and continuous model updating. As a result, the infrastructure supports shorter iteration cycles, faster experimentation, and more reliable deployment of DCAI systems.

For DCAI applications, reliable and well-documented data interoperability is fundamental. Its three pillars are data lineage, versioning, and maintenance. Several dedicated open-source infrastructure and software solutions are available. For example, OpenLineage (OpenLineage documentation https://openlineage.io/, accessed on 25 February 2026) is an open-source platform for the collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs, providing the information needed to identify the root causes of issues and to reveal the impact of changes. It is compatible with products of different ecosystems (Apache, OpenStack, etc.) thanks to its own open standard and its aggregating collector, Marquez (Marquez documentation https://marquezproject.ai/, accessed on 25 February 2026). Another popular option is Pachyderm (Pachyderm documentation https://github.com/pachyderm/pachyderm, accessed on 25 February 2026), since it is Kubernetes-native, data-agnostic, and highly scalable. It offers both data lineage and versioning while automating complex data transformation pipelines. Data versioning allows rolling back to a previous dataset version or creating dataset branches to test pipelines under development. It is worth noting that DVC (DVC documentation https://dvc.org/, accessed on 25 February 2026) can be the best option for data versioning in existing Git-based and small-scale development workflows, since it natively extends Git features to data. The most comprehensive tool for handling data movements in ML or DCAI applications is the feature store, which provides a range of services (sometimes including data lineage and versioning): feature definition; automated transforms; storage of preprocessed features and training datasets; feature ingestion in ML workflows; feature sharing and discovery; online ML serving; monitoring and alerting; security and data governance; and integration with third-party data and ML tools.
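To make the lineage and versioning ideas above concrete, the following minimal Python sketch uses only the standard library. It does not reproduce the API of OpenLineage, Pachyderm, or DVC: the names `dataset_version`, `LineageLog`, `record`, and `upstream` are hypothetical, and identifying a snapshot by a truncated SHA-256 hash of its JSON serialization is an assumption made for illustration. Each dataset snapshot receives a content-derived version, and each run records which input versions produced which output versions, so that an output can be traced back to its sources.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def dataset_version(rows):
    """Return a content hash identifying this exact dataset snapshot (illustrative scheme)."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass
class LineageLog:
    """Append-only log of runs; each run links input versions to output versions."""
    events: list = field(default_factory=list)

    def record(self, job, inputs, outputs):
        # inputs/outputs map a dataset name to the version consumed/produced by this run.
        self.events.append({
            "job": job,
            "inputs": dict(inputs),
            "outputs": dict(outputs),
            "time": datetime.now(timezone.utc).isoformat(),
        })

    def upstream(self, name, version):
        """List every (name, version) pair the given dataset version was derived from."""
        found = []
        frontier = [(name, version)]
        while frontier:
            target_name, target_version = frontier.pop()
            for event in self.events:
                if event["outputs"].get(target_name) == target_version:
                    for parent in event["inputs"].items():
                        if parent not in found:
                            found.append(parent)
                            frontier.append(parent)
        return found


# Toy pipeline: raw -> clean -> augmented (all data below is made up).
raw = [{"id": 1, "label": "cat"}, {"id": 2, "label": None}]
clean = [row for row in raw if row["label"] is not None]
augmented = clean + [{"id": 3, "label": "cat"}]

raw_v = dataset_version(raw)
clean_v = dataset_version(clean)
aug_v = dataset_version(augmented)

log = LineageLog()
log.record("drop_missing_labels", {"raw": raw_v}, {"clean": clean_v})
log.record("augment", {"clean": clean_v}, {"augmented": aug_v})

# Trace the augmented snapshot back to the snapshots it depends on.
print(log.upstream("augmented", aug_v))
```

Because versions are derived from content, re-running a step on unchanged data yields the same identifier, while any change to the data yields a new one. Rolling back then amounts to selecting an earlier identifier, which is the behavior the versioning tools above provide at scale.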
Among the open-source providers, the most widely adopted feature stores are Feast (Feast documentation https://docs.feast.dev/, accessed on 25 February 2026) and Hopsworks (Hopsworks documentation https://www.hopsworks.ai/product-capabilities/feature-store, accessed on 25 February 2026). Feast is fully open-source but does not include feature definition and automated transforms. This limitation can be overcome by combining it with other tools such as Pachyderm. Feast has the advantages of being a stand-alone feature store and of integrating with third-party MLOps platforms. Hopsworks provides more services and is available in different distributions (open and commercial). It is part of the unified Hopsworks platform for feature engineering, real-time ML, and production AI. The mlops.community site (MLOpsCommunity site https://mlops.community, accessed on 25 February 2026) provides a more detailed comparison of the principal open-source and commercial feature stores.
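The core feature-store services listed above (feature definition, automated transforms, storage, training dataset creation, and online serving) can be illustrated with a toy in-memory sketch in Python. It is not the Feast or Hopsworks API: the class `MiniFeatureStore`, its methods, and the example features are hypothetical names introduced only for illustration. The point it shows is that one registered transform feeds both the training rows and the online lookup, so training and serving see the same feature values.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureDefinition:
    """A named feature plus the transform that computes it from one raw record."""
    name: str
    transform: Callable[[dict], float]


class MiniFeatureStore:
    """Toy in-memory feature store (hypothetical; not a real library API)."""

    def __init__(self):
        self._definitions: Dict[str, FeatureDefinition] = {}
        # entity id -> latest computed feature values
        self._online: Dict[str, Dict[str, float]] = {}

    def register(self, definition: FeatureDefinition) -> None:
        self._definitions[definition.name] = definition

    def ingest(self, entity_key: str, records: List[dict]) -> None:
        # Apply every registered transform to each raw record and keep the latest values.
        for record in records:
            values = {name: d.transform(record) for name, d in self._definitions.items()}
            self._online[str(record[entity_key])] = values

    def lookup(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        # Online serving path: fetch precomputed features for a single entity.
        row = self._online[str(entity_id)]
        return {name: row[name] for name in names}

    def training_rows(self, names: List[str]) -> List[dict]:
        # Offline path: the same stored values, returned for all entities in a stable order.
        return [
            {"entity_id": entity_id, **{name: row[name] for name in names}}
            for entity_id, row in sorted(self._online.items())
        ]


# Made-up example data and features.
store = MiniFeatureStore()
store.register(FeatureDefinition("amount_eur", lambda r: r["amount_cents"] / 100))
store.register(FeatureDefinition("is_weekend", lambda r: float(r["weekday"] >= 5)))
store.ingest("customer", [
    {"customer": "c1", "amount_cents": 1250, "weekday": 6},
    {"customer": "c2", "amount_cents": 400, "weekday": 2},
])

print(store.lookup("c1", ["amount_eur", "is_weekend"]))
print(store.training_rows(["amount_eur"]))
```

Production feature stores add persistence, access control, monitoring, and discovery on top of this pattern; the sketch only shows why centralizing feature definitions helps keep training and inference data consistent.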

Considering the critical role of AI/ML and DCAI, national and international projects have been launched to upgrade the existing infrastructures so that they can fully support these tools. Table 5 reports the principal tools investigated and adopted by the collaboration between the NRRP INFN Cloud (https://www.cloud.infn.it/, accessed on 25 February 2026) and ICSC [95] projects, providing a practical benchmark of the essential services that a modern cloud federation should have to be able to run AI pipelines.

In conclusion, these infrastructure solutions represent one concrete instantiation of the federated DCAI model described above and demonstrate its feasibility in a large-scale scientific setting. Similar principles can be adopted and adapted in other national or international contexts.

9. Conclusions

In this paper, we argued that many of the current limitations of AI primarily stem not from insufficiently powerful models but from the way data are produced, curated, and used. Problems such as poor generalization, sensitivity to distribution shifts, vulnerability to adversarial perturbations, and lack of transparency are typically addressed by introducing more complex architectures or heavier training procedures. However, our analysis shows that these issues are, to a large extent, consequences of inadequate data rather than of inadequate models.

This observation motivates the DCAI paradigm, which treats data as the primary driver of performance, robustness, and trustworthiness. Instead of repeatedly redesigning learning algorithms, DCAI focuses on improving the quality, representativeness, and semantic consistency of the data on which these algorithms are trained and evaluated. By structuring the AI lifecycle around training-data development, inference data design, and data maintenance, we have shown how it becomes possible to explicitly determine how changes in data affect the behavior of models and to attribute performance improvements to concrete, observable data interventions.

As a perspective article, the methodological framework proposed here is intended to offer a coherent and principled foundation for future experimental and applied research rather than provide comprehensive empirical validation of all its components.

A central result of this work is the reinterpretation of the CRISP-DM methodology from a data-centric point of view. While CRISP-DM already recognized the importance of data preparation and iterative refinement, it was developed in a context where datasets were relatively static and models were the main object of optimization. The revised framework proposed here places data collection, labeling, and curation at the core of the process and turns deployment into a continuous phase of data and model co-evolution. This makes the methodology better suited to modern AI systems, which operate on evolving data streams and are increasingly subject to the requirements of traceability, accountability, and regulatory compliance.

The discussion on GenAI further illustrates why a data-centric perspective is now indispensable. Large generative models owe their capabilities to massive datasets, yet their weaknesses, such as hallucinations, bias amplification, and instability under distribution shifts, also originate in the data on which they are trained. DCAI provides a structured way of integrating generative models into controlled data pipelines, where data quality, semantic validity, and governance can be explicitly managed rather than left implicit.

The implications of this paradigm shift go beyond purely technical considerations. Treating data as a strategic and evolving asset changes how AI projects are organized, how responsibilities are distributed, and how trust in automated systems is established. It requires closer collaboration between domain experts, data engineers, and model developers and supports the growing demand for transparency and auditability in sensitive application areas.

From a research perspective, this work reveals several open directions. We need better ways to measure data quality in relation to model robustness and fairness, more expressive semantic representations that can support both learning and reasoning, and experimental protocols that make it possible to distinguish the effect of data interventions from that of model changes. We also need scalable methods for managing and curating data in large, heterogeneous, and continuously evolving environments, especially when generative models are part of the pipeline.

Overall, DCAI offers a concrete and principled response to the growing gap between the apparent power of modern AI systems and their fragile behavior in real-world settings. By shifting attention from models to data, it provides a foundation for building AI systems that are not only accurate but also reliable, understandable, and trustworthy.

Author Contributions

Conceptualization, D.M., A.P., M.A., T.B., M.T.C., R.M.D., D.D., D.E., V.P., M.S., V.S. and G.V.; Methodology, D.M., A.P., M.A., T.B., M.T.C., R.M.D., D.D., D.E., V.P., M.S., V.S. and G.V.; Resources, T.B., M.T.C., D.D., D.E., V.S. and G.V.; Writing—original draft preparation, D.M., A.P., M.A., T.B., M.T.C., R.M.D., D.D., D.E., V.P., M.S., V.S. and G.V.; Writing—review and editing, D.M. and V.P.; Supervision, D.M. and A.P.; Project administration, D.M.; Funding acquisition, D.M. In addition, section-level contributions are as follows: D.M. supervised the work and wrote the executive summary and Section 1 and Section 5; A.P. supervised the work and wrote the executive summary and Section 4.4; M.A. contributed to Section 2.3; T.B., M.T.C., D.D., D.E., V.S. and G.V. contributed to Section 8; R.M.D. contributed to Section 4.4; V.P. contributed to Section 1, Section 2, Section 3 and Section 4, Section 6 and Section 7; M.S. contributed to Section 2.3 and Section 3.1. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministero dell’Università e della Ricerca (Italian Ministry of University and Research), grant number PE00000013.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

This paper represents the joint outcome of five partners within the Italian project FAIR-Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU. These partners, spanning four different spokes, collectively contributed to the activities of FAIR Transversal Project 7–Data-Centric AI and Infrastructures. The contributing institutions are the University of Bari Aldo Moro (Spoke 6), the University of Calabria (Spoke 9), the University of Naples Federico II (Spoke 3), Sapienza University of Rome (Spoke 5), and INFN, the National Institute for Nuclear Physics (Spoke 6). Their coordinated efforts reflect the interdisciplinary and multi-institutional nature of FAIR, reinforcing the central role of data-centric methodologies in advancing trustworthy and robust AI systems.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zha, D.; Bhat, Z.P.; Lai, K.H.; Yang, F.; Hu, X. Data-centric AI: Perspectives and Challenges. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), Minneapolis, MN, USA, 27–29 April 2023; pp. 945–948. [Google Scholar] [CrossRef]
Ng, A. Unbiggen AI. IEEE Spectrum, 9 February 2022.
Miller, R.; Whelan, H.; Chrubasik, M.; Whittaker, D.; Duncan, P.; Gregório, J. A Framework for Current and New Data Quality Dimensions: An Overview. Data 2024, 9, 151. [Google Scholar] [CrossRef]
Pipino, L.L.; Lee, Y.W.; Wang, R.Y. Data quality assessment. Commun. ACM 2002, 45, 211–218. [Google Scholar] [CrossRef]
Northcutt, C.; Jiang, L.; Chuang, I. Confident Learning: Estimating Uncertainty in Dataset Labels. J. Artif. Int. Res. 2021, 70, 1373–1411. [Google Scholar] [CrossRef]
He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
Quiñonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; Lawrence, N.D. Dataset Shift in Machine Learning; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar] [CrossRef]
Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.F.; Dennison, D. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; MIT Press: Cambridge, MA, USA, 2015; Available online: https://dl.acm.org/doi/10.5555/2969442.2969519 (accessed on 25 February 2026).
Majeed, A.; Hwang, S.O. A Data-Centric AI Paradigm for Socio-Industrial and Global Challenges. Electronics 2024, 13, 2156. [Google Scholar] [CrossRef]
Luley, P.P.; Deriu, J.M.; Yan, P.; Schatte, G.A.; Stadelmann, T. From Concept to Implementation: The Data-Centric Development Process for AI in Industry. In Proceedings of the 2023 10th IEEE Swiss Conference on Data Science (SDS), Zurich, Switzerland, 22–23 June 2023; pp. 73–76. [Google Scholar] [CrossRef]
Xu, X.; Wu, Z.; Qiao, R.; Verma, A.; Shu, Y.; Wang, J.; Niu, X.; He, Z.; Chen, J.; Zhou, Z.; et al. Position Paper: Data-Centric AI in the Age of Large Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 11895–11913. [Google Scholar] [CrossRef]
Pasquadibisceglie, V.; Appice, A.; Malerba, D.; Fiameni, G. Leveraging a large language model (LLM) to predict hospital admissions of emergency department patients. Expert Syst. Appl. 2025, 240, 128224. [Google Scholar] [CrossRef]
Pasquadibisceglie, V.; Recchia, V.; Appice, A.; Malerba, D.; Fiameni, G. GANDALF: A LLM-based approach to map bark beetle outbreaks in semantic stories of Sentinel-2 images. In Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, Catania, Italy, 31 March–4 April 2025; SAC ’25, pp. 1074–1081. [Google Scholar] [CrossRef]
Casciani, A.; Bernardi, M.L.; Cimitile, M.; Marrella, A. Enhancing next activity prediction in process mining with Retrieval-Augmented Generation. Inf. Syst. 2026, 137, 102642. [Google Scholar] [CrossRef]
Umer, F.; Adnan, N. Generative artificial intelligence: Synthetic datasets in dentistry. BDJ Open 2024, 10, 13. [Google Scholar] [CrossRef] [PubMed]
Nieberl, M.; Zeiser, A.; Timinger, H.; Friedrich, B. Enhancing the Performance of Computer Vision Systems in Industry: A Comparative Evaluation Between Data-Centric and Model-Centric Artificial Intelligence. Electronics 2025, 14, 4366. [Google Scholar] [CrossRef]
Chen, Y.; Yan, Z.; Zhu, Y. A comprehensive survey for generative data augmentation. Neurocomputing 2024, 600, 128167. [Google Scholar] [CrossRef]
Bhuyan, S.S.; Sateesh, V.; Mukul, N.; Galvankar, A.; Mahmood, A.; Nauman, M.; Rai, A.; Bordoloi, K.; Basu, U.; Samuel, J. Generative artificial intelligence use in healthcare: Opportunities for clinical excellence and administrative efficiency. J. Med. Syst. 2025, 49, 10. [Google Scholar] [CrossRef]
Desai, A.P.; Ravi, T.; Luqman, M.; Mallya, G.; Kota, N.; Yadav, P. Opportunities and Challenges of Generative-AI in Finance. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData), Washington, DC, USA, 15–18 December 2024; pp. 4913–4920. [Google Scholar] [CrossRef]
Chen, I.Y.; Joshi, S.; Ghassemi, M. Treating health disparities with artificial intelligence. Nat. Med. 2020, 26, 16–17. [Google Scholar] [CrossRef] [PubMed]
Giuffrè, M.; Shung, D.L. Harnessing the power of synthetic data in healthcare: Innovation, application, and privacy. npj Digit. Med. 2023, 6, 186. [Google Scholar] [CrossRef]
Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Papernot, N.; Anderson, R.; Gal, Y. AI models collapse when trained on recursively generated data. Nature 2024, 631, 755–759. [Google Scholar] [CrossRef] [PubMed]
Aragon, C.; Guha, S.; Kogan, M.; Muller, M.; Neff, G. Human-Centered Data Science: An Introduction; MIT Press: Cambridge, MA, USA, 2022. [Google Scholar]
Dellermann, D.; Calma, A.; Lipusch, N.; Weber, T.; Weigel, S.; Ebel, P. The future of human-AI collaboration: A taxonomy of design knowledge for hybrid intelligence systems. arXiv 2021, arXiv:2105.03354. [Google Scholar] [CrossRef]
Tan, Z.; Li, D.; Wang, S.; Beigi, A.; Jiang, B.; Bhattacharjee, A.; Karami, M.; Li, J.; Cheng, L.; Liu, H. Large Language Models for Data Annotation and Synthesis: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 930–957. [Google Scholar] [CrossRef]
Huang, T.H.; Cao, C.; Bhargava, V.; Sala, F. The ALCHEmist: Automated Labeling 500x CHEaper than LLM Data Annotators. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2024; pp. 62648–62672. [Google Scholar] [CrossRef]
Mei, Y.; Song, S.; Fang, C.; Yang, H.; Fang, J.; Long, J. Capturing Semantics for Imputation with Pre-trained Language Models. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 61–72. [Google Scholar] [CrossRef]
Vito, G.; Starace, L.L.L.; Martino, S.; Ferrucci, F.; Palomba, F. Large language models in software engineering: A focus on issue report classification and user acceptance test generation. In Proceedings of the Ital-IA Intelligenza Artificiale-Thematic Workshops co-located with the 4th CINI National Lab AIIS Conference on Artificial Intelligence (Ital-IA 2024), Naples, Italy, 29–30 May 2024; pp. 48–53. [Google Scholar]
Jaimovitch-López, G.; Ferri, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez-Quintana, M.J. Can language models automate data wrangling? Mach. Learn. 2022, 112, 2053–2082. [Google Scholar] [CrossRef]
Alviano, M.; Macrì, P.; Reiners, L.A.R. ASP Chef Chats with Large Language Models. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2025, Montreal, QC, Canada, 16–22 August 2025; pp. 10989–10993. [Google Scholar] [CrossRef]
Stonebraker, M.; Ilyas, I.F. Data integration: The current status and the way forward. IEEE Data Eng. Bull. 2018, 41, 3–9. [Google Scholar]
Gilyazev, R.A.; Turdakov, D.Y. Active Learning and Crowdsourcing: A Survey of Optimization Methods for Data Labeling. Program. Comput. Softw. 2018, 44, 476–491. [Google Scholar] [CrossRef]
Wan, M.; Zha, D.; Liu, N.; Zou, N. In-Processing Modeling Techniques for Machine Learning Fairness: A Survey. ACM Trans. Knowl. Discov. Data 2023, 17, 35. [Google Scholar] [CrossRef]
Pereira, R.C.; Abreu, P.H.; Rodrigues, P.P.; Figueiredo, M.A. Imputation of data Missing Not at Random: Artificial generation and benchmark analysis. Expert Syst. Appl. 2024, 249, 123654. [Google Scholar] [CrossRef]
Ciavotta, M.; Cutrona, V.; De Paoli, F.; Nikolov, N.; Palmonari, M.; Roman, D. Supporting Semantic Data Enrichment at Scale. In Technologies and Applications for Big Data Value; Curry, E., Auer, S., Berre, A.J., Metzger, A., Perez, M.S., Zillner, S., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 19–39. [Google Scholar] [CrossRef]
Deng, D. DBSCAN Clustering Algorithm Based on Density. In Proceedings of the 2020 7th International Forum on Electrical Engineering and Automation (IFEEA), Hefei, China, 25–27 September 2020; pp. 949–953. [Google Scholar] [CrossRef]
Riccio, D.; Tortora, G.; Sangiovanni, M. RAZOR: Refining Accuracy by Zeroing Out Redundancies. arXiv 2024, arXiv:2410.14254. [Google Scholar] [CrossRef]
Li, K.; Wang, F.; Yang, L.; Liu, R. Deep feature screening: Feature selection for ultra high-dimensional data via deep neural networks. Neurocomputing 2023, 538, 126186. [Google Scholar] [CrossRef]
García-Gil, D.; Luque-Sánchez, F.; Luengo, J.; García, S.; Herrera, F. From Big to Smart Data: Iterative ensemble filter for noise filtering in Big Data classification. Int. J. Intell. Syst. 2019, 34, 3260–3274. [Google Scholar] [CrossRef]
Aversano, L.; Bernardi, M.L.; Cimitile, M.; Iammarino, M.; Verdone, C. A data-aware explainable deep learning approach for next activity prediction. Eng. Appl. Artif. Intell. 2023, 126, 106758. [Google Scholar] [CrossRef]
Pasquadibisceglie, V.; Donadello, I.; Appice, A.; Lanz, O.; Maggi, F.M.; Fiameni, G.; Malerba, D. Multimodal predictive process monitoring and its application to explainable clinical pathways. Inf. Syst. 2026, 139, 102698. [Google Scholar] [CrossRef]
Wang, Z.; Wang, P.; Liu, K.; Wang, P.; Fu, Y.; Lu, C.T.; Aggarwal, C.C.; Pei, J.; Zhou, Y. A Comprehensive Survey on Data Augmentation. IEEE Trans. Knowl. Data Eng. 2025, 38, 47–66. [Google Scholar] [CrossRef]
Pasquadibisceglie, V.; Appice, A.; Castellano, G.; Malerba, D. JARVIS: Joining Adversarial Training With Vision Transformers in Next-Activity Prediction. IEEE Trans. Serv. Comput. 2024, 17, 1593–1606. [Google Scholar] [CrossRef]
Chung, Y.; Kraska, T.; Polyzotis, N.; Tae, K.H.; Whang, S.E. Slice Finder: Automated Data Slicing for Model Validation. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 1550–1553. [Google Scholar] [CrossRef]
Liu, J.; Shen, Z.; He, Y.; Zhang, X.; Xu, R.; Yu, H.; Cui, P. Towards Out-Of-Distribution Generalization: A Survey. arXiv 2023, arXiv:2108.13624. [Google Scholar] [CrossRef]
Pauwels, S.; Calders, T. Incremental Predictive Process Monitoring: The Next Activity Case. In Proceedings of the Business Process Management: 19th International Conference, BPM 2021, Rome, Italy, 6–10 September 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 123–140. [Google Scholar] [CrossRef]
Pasquadibisceglie, V. Handling concept drifts with traditional process discovery algorithms. J. Intell. Inf. Syst. 2025, 64, 179–213. [Google Scholar] [CrossRef]
Kumar, D.; Addula, S.R.; Lind, M.; Brown, S.; Odion, S. AI-Driven Hybrid Deep Learning and Swarm Intelligence for Predictive Maintenance of Smart Manufacturing Robots in Industry 4.0. Electronics 2026, 15, 715. [Google Scholar] [CrossRef]
Burch, M.; Weiskopf, D. On the Benefits and Drawbacks of Radial Diagrams. In Handbook of Human Centric Visualization; Huang, W., Ed.; Springer: New York, NY, USA, 2014; pp. 429–451. [Google Scholar] [CrossRef]
Van Aken, D.; Pavlo, A.; Gordon, G.J.; Zhang, B. Automatic Database Management System Tuning Through Large-scale Machine Learning. In Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL, USA, 14–19 May 2017; SIGMOD ’17, pp. 1009–1024. [Google Scholar] [CrossRef]
Carlini, N.; Erlingsson, Ú.; Papernot, N. Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications. arXiv 2019, arXiv:1910.13427. [Google Scholar] [CrossRef]
Stonebraker, M.; Rezig, E.K. Machine Learning and Big Data: What is Important? IEEE Data Eng. Bull. 2019, 42, 3–7. [Google Scholar]
Lakshminarayan, K.; Harp, S.; Goldman, R.; Samad, T. Imputation of missing data using machine learning techniques. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; Simoudis, E., Han, J., Fayyad, U., Eds.; AAAI Press: Washington, DC, USA, 1996; pp. 140–145. [Google Scholar]
Jiang, Z.; Zhou, K.; Liu, Z.; Li, L.; Chen, R.; Choi, S.H.; Hu, X. An information fusion approach to learning with instance-dependent label noise. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Čreslovnik, D.; Košmerlj, A.; Ciavotta, M. Using historical and weather data for marketing and category management in ecommerce: The experience of EW-shopp. In Proceedings of the 12th European Conference on Software Architecture: Companion Proceedings, ECSA ’18, Madrid, Spain, 24–28 September 2018. [Google Scholar] [CrossRef]
Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature Selection: A Data Perspective. ACM Comput. Surv. 2017, 50, 94. [Google Scholar] [CrossRef]
Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with Noisy Labels. In Proceedings of the Advances in Neural Information Processing Systems; Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
Liu, Y.P.; Xu, N.; Zhang, Y.; Geng, X. Label Distribution for Learning with Noisy Labels. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20, Yokohama, Japan, 7–15 January 2021; Bessiere, C., Ed.; ACM: New York, NY, USA, 2021; pp. 2568–2574. [Google Scholar] [CrossRef]
Andresini, G.; Appice, A.; Ienco, D.; Recchia, V. DIAMANTE: A data-centric semantic segmentation approach to map tree dieback induced by bark beetle infestations via satellite images. J. Intell. Inf. Syst. 2024, 62, 1531–1558. [Google Scholar] [CrossRef]
Recchia, V.; Andresini, G.; Appice, A.; Fontana, G.; Malerba, D. An Attention-Based CNN Approach to Detect Forest Tree Dieback Caused by Insect Outbreak in Sentinel-2 Images. In Proceedings of the Discovery Science; Pedreschi, D., Monreale, A., Guidotti, R., Pellungrini, R., Naretto, F., Eds.; Springer: Cham, Switzerland, 2025; pp. 183–199. [Google Scholar] [CrossRef]
Putrama, I.M.; Martinek, P. Heterogeneous data integration: Challenges and opportunities. Data Brief 2024, 56, 110853. [Google Scholar] [CrossRef] [PubMed]
Malerba, D.; Pasquadibisceglie, V. Data-Centric AI. J. Intell. Inf. Syst. 2024, 62, 1493–1502. [Google Scholar] [CrossRef]
Oved, A.; Shlomov, S.; Zeltyn, S.; Mashkif, N.; Yaeli, A. Snap: Semantic stories for next activity prediction. Proc. AAAI Conf. Artif. Intell. 2025, 39, 28871–28877. [Google Scholar] [CrossRef]
Antoniadi, A.M.; Du, Y.; Guendouz, Y.; Wei, L.; Mazo, C.; Becker, B.A.; Mooney, C. Current challenges and future opportunities for XAI in machine learning-based clinical decision support systems: A systematic review. Appl. Sci. 2021, 11, 5088. [Google Scholar] [CrossRef]
Hogan, A.; Blomqvist, E.; Cochez, M.; D’amato, C.; Melo, G.D.; Gutierrez, C.; Kirrane, S.; Gayo, J.E.L.; Navigli, R.; Neumaier, S.; et al. Knowledge Graphs. ACM Comput. Surv. 2021, 54, 71. [Google Scholar] [CrossRef]
Cimiano, P.; Paulheim, H. Knowledge graph refinement: A survey of approaches and evaluation methods. Semant. Web 2017, 8, 489–508. [Google Scholar] [CrossRef]
Xue, B.; Zou, L. Knowledge Graph Quality Management: A Comprehensive Survey. IEEE Trans. Knowl. Data Eng. 2023, 35, 4969–4988. [Google Scholar] [CrossRef]
Peng, C.; Xia, F.; Naseriparsa, M.; Osborne, F. Knowledge graphs: Opportunities and challenges. Artif. Intell. Rev. 2023, 56, 13071–13102. [Google Scholar] [CrossRef]
Masmoudi, M.; Ben Abdallah Ben Lamine, S.; Karray, M.H.; Archimede, B.; Baazaoui Zghal, H. Semantic Data Integration and Querying: A Survey and Challenges. ACM Comput. Surv. 2024, 56, 209. [Google Scholar] [CrossRef]
Futia, G.; Vetrò, A. On the integration of knowledge graphs into deep learning models for a more comprehensible AI—Three challenges for future research. Information 2020, 11, 122. [Google Scholar] [CrossRef]
Zhang, J.; Chen, B.; Zhang, L.; Ke, X.; Ding, H. Neural, symbolic and neural-symbolic reasoning on knowledge graphs. AI Open 2021, 2, 14–35. [Google Scholar] [CrossRef]
Angles, R.; Gutierrez, C. Survey of graph database models. ACM Comput. Surv. 2008, 40, 1. [Google Scholar] [CrossRef]
Angles, R.; Arenas, M.; Barceló, P.; Hogan, A.; Reutter, J.; Vrgoč, D. Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv. 2017, 50, 68. [Google Scholar] [CrossRef]
Baader, F.; Horrocks, I.; Lutz, C.; Sattler, U. An Introduction to Description Logic, 1st ed.; Cambridge University Press: Cambridge, MA, USA, 2017. [Google Scholar] [CrossRef]
Krötzsch, M. Ontologies for Knowledge Graphs? In Proceedings of the 30th International Workshop on Description Logics, Montpellier, France, 18–21 July 2017; Artale, A., Glimm, B., Kontchakov, R., Eds.; CEUR Workshop Proceedings; RWTH Aachen University: Aachen, Germany, 2017; Volume 1879. [Google Scholar]
Lenzerini, M.; Lepore, L.; Poggi, A. Metamodeling and metaquerying in OWL2QL. Artif. Intell. 2021, 292, 103432. [Google Scholar] [CrossRef]
Brickley, D.; Guha, R.V. RDF Schema 1.1. W3c Recommendation, World Wide Web Consortium (W3C): 2014. Available online: https://www.bibsonomy.org/bibtex/9ff17d493abee300f9fff3bff7d2a339 (accessed on 24 February 2026).
ter Horst, H.J. Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Web Semant. 2005, 3, 79–115. [Google Scholar] [CrossRef]
Franconi, E.; Gutierrez, C.; Mosca, A.; Pirrò, G.; Rosati, R. The Logic of Extensional RDFS. In Proceedings of the 12th International Semantic Web Conference–Part I; ISWC ’13; Springer: Berlin/Heidelberg, Germany, 2013; pp. 101–116. [Google Scholar] [CrossRef]
de Bruijn, J.; Heymans, S. Logical Foundations of (e)RDF(S): Complexity and Reasoning. In The Semantic Web; Aberer, K., Choi, K.S., Noy, N., Allemang, D., Lee, K.I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 86–99. [Google Scholar] [CrossRef]
Delfino, R.M.; Lenzerini, M.; Poggi, A. RDFS Knowledge Graphs Through the Lens of Logic: Semantics and Query Answering. In Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025); Frontiers in Artificial Intelligence and Applications; IOS Press: Amsterdam, The Netherlands, 2025; Volume 413, pp. 1511–1518. [Google Scholar] [CrossRef]
Tharwat, A.; Schenck, W. A survey on active learning: State-of-the-art, practical challenges and research directions. Mathematics 2023, 11, 820. [Google Scholar] [CrossRef]
Moles, L.; Andres, A.; Echegaray, G.; Boto, F. Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets. Mathematics 2024, 12, 1898.
Yang, J.; Wang, H.; Wu, S.; Chen, G.; Zhao, J. Towards Controlled Data Augmentations for Active Learning. In Proceedings of the 40th International Conference on Machine Learning; Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; JMLR.org: Norfolk, MA, USA, 2023; Volume 202, pp. 39524–39542.
Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.M.; Chen, W.; et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell. 2023, 5, 220–235.
Apóstolo, D.; Santos, M.S.; Lorena, A.C.; Henriques Abreu, P. Pycol: A Python package for dataset complexity measures. Neurocomputing 2025, 640, 130311.
Chapman, P.; Clinton, J.; Kerber, R.; Khabaza, T.; Reinartz, T.; Shearer, C.; Wirth, R. CRISP-DM 1.0: Step-by-Step Data Mining Guide; SPSS Inc.: Chicago, IL, USA, 2000; Volume 9, pp. 1–73.
Wirth, R.; Hipp, J. CRISP-DM: Towards a standard process model for data mining. In Proceedings of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining, Manchester, UK, 11–13 April 2000; Volume 1, pp. 29–39.
Martínez-Plumed, F.; Contreras-Ochando, L.; Ferri, C.; Hernández-Orallo, J.; Kull, M.; Lachiche, N.; Ramírez-Quintana, M.J.; Flach, P. CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories. IEEE Trans. Knowl. Data Eng. 2021, 33, 3048–3061.
Barrak, A.; Eghan, E.E.; Adams, B. On the Co-evolution of ML Pipelines and Source Code - Empirical Study of DVC Projects. In Proceedings of the 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Honolulu, HI, USA, 9–12 March 2021; pp. 422–433.
Ratner, A.; Bach, S.H.; Ehrenberg, H.; Fries, J.; Wu, S.; Ré, C. Snorkel: Rapid training data creation with weak supervision. VLDB J. 2020, 29, 709–730.
Testi, M.; Ballabio, M.; Frontoni, E.; Iannello, G.; Moccia, S.; Soda, P.; Vessio, G. MLOps: A Taxonomy and a Methodology. IEEE Access 2022, 10, 63606–63618.
Nadǎş, M.; Dioşan, L.; Tomescu, A. Synthetic Data Generation Using Large Language Models: Advances in Text and Code. IEEE Access 2025, 13, 134615–134633.
Azeta, J.; Omeche, T.T.; Daniyan, I.; Abiola, J.O.; Daniyan, L.; Phuluwa, H.S.; Muvunzi, R. Artificial intelligence and robotics in predictive maintenance: A comprehensive review. Front. Mech. Eng. 2026, 11, 2025.
Grandi, C.; Bettoni, D.; Boccali, T.; Carlino, G.; Cesini, D.; dell’Agnello, L.; Donvito, G.; Salomoni, D.; Zoccoli, A. ICSC: The Italian National Research Centre on HPC, Big Data and Quantum computing. EPJ Web Conf. 2024, 295, 10003.
Retico, A.; Avanzo, M.; Boccali, T.; Bonacorsi, D.; Botta, F.; Cuttone, G.; Martelli, B.; Salomoni, D.; Spiga, D.; Triann, A.; et al. Enhancing the impact of Artificial Intelligence in Medicine: A joint AIFM-INFN Italian initiative for a dedicated cloud-based computing infrastructure. Phys. Medica 2021, 91, 140–150.
Salomoni, D.; Campos, I.; Gaido, L.; de Lucas, J.M.; Solagna, P.; Gomes, J.; Matyska, L.; Fuhrman, P.; Hardt, M.; Donvito, G.; et al. INDIGO-DataCloud: A Platform to Facilitate Seamless Access to E-Infrastructures. J. Grid Comput. 2018, 16, 381–408.
Ceccanti, A.; Hardt, M.; Wegh, B.; Millar, A.P.; Caberletti, M.; Vianello, E.; Licehammer, S. The INDIGO-Datacloud Authentication and Authorization Infrastructure. J. Phys. Conf. Ser. 2017, 898, 102016.
Gagliardi, F.; Jones, B.; Reale, M.; Burke, S. European DataGrid Project: Experiences of Deploying a Large Scale Testbed for E-science Applications. Lect. Notes Comput. Sci. 2002, 2459, 480–499.
Antonacci, M.; Costantini, A.; Donvito, G.; Giommi, L.; Grandi, C.; Martelli, B.; Spiga, D.; Serra, E.; Savarese, G.; Vianello, E. The evolution of INFN’s Cloud Platform: Improvements in Orchestration and User Experience. EPJ Web Conf. 2025, 337, 01113.
Savarese, G.; Giommi, L.; Calanducci, A.; Casale, A.; Costantini, A.; Donvito, G.; Fanzago, F.; Gasparetto, J.; Grandi, C.; Martelli, B.; et al. Plan for a renewed PaaS Orchestration solution in the DataCloud Project at INFN. In Proceedings of the International Symposium on Grids and Clouds (ISGC2025); Academia Sinica Grid Computing Centre (ASGC): Taipei, Taiwan, 2025.
Barisits, M.; Beermann, T.; Berghaus, F.; Bockelman, B.; Bogado, J.; Cameron, D.; Christidis, D.; Ciangottini, D.; Dimitrov, G.; Elsing, M.; et al. Rucio: Scientific Data Management. Comput. Softw. Big Sci. 2019, 3, 11.
Kiryanov, A.; Álvarez Ayllón, A.; Salichos, M.; Keeble, O. FTS3-A File Transfer Service for Grids, HPCs and Clouds. In International Symposium on Grids and Clouds (ISGC) 2015; Academia Sinica: Taipei, Taiwan, 2015.
Anderlini, L.; Barbetti, M.; Bianchini, G.; Ciangottini, D.; Pra, S.D.; Michelotto, D.; Spiga, D. Developing Artificial Intelligence in the Cloud: The AI Infn Platform. Comput. Sci. 2025, 26, 20.
Camerlingo, M.T. ML-based classification in an open-source framework for the ALICE heavy-flavour analysis. EPJ Web Conf. 2025, 337, 01049.
Rossi, F.; Battaglieri, M.; Gavalian, G.; Ragusa, E.; Gastaldo, P. Real Time implementation of Artificial Intelligence compression algorithm for High-Speed Streaming Readout signals. EPJ Web Conf. 2025, 337, 01135.
Ciangottini, D.; Spiga, D.; Memon, A.S.; Manzi, A.; Filipcic, A.; Troja, A.; Fanzago, F.; Bianchini, G.; Sgaravatto, M.; Prica, T.; et al. Unlocking the compute continuum: Scaling out from cloud to HPC and HTC resources. EPJ Web Conf. 2025, 337, 01296.

Figure 1. (Left) A model-centric perspective, in which algorithmic adaptability is considered the primary driver of progress (data is static, and the model is dynamic). (Right) A data-centric perspective, highlighting the evolving nature of data as the fundamental source of improvement (the model is static, and data is dynamic).

Figure 2. The CRISP-DM process. The six phases are arranged in a circular process emphasizing iteration. Bidirectional arrows denote stronger interdependence between specific phases; the outer circle indicates that deployment may trigger new cycles.

Figure 3. Data-centric infrastructure model. Lake nodes represent distributed data storage sites; CPU and HPC centers provide computing resources; commercial cloud nodes offer elastic capacity. Interconnections exceeding 1 Tb/s ensure high-speed data transfer across the federation.

Table 1. Breakdown of the six CRISP-DM phases into their main tasks and expected outputs, as defined in the original CRISP-DM 1.0 specification [87,88].

Phase	Main Tasks and Expected Outputs
Business Understanding	- Determine Business Objectives—background, business success criteria
	- Assess Situation—inventory of resources, requirements, assumptions, constraints, risks, and contingencies
	- Determine Data Mining Goals—goals, success criteria
	- Produce Project Plan—project plan, initial assessment of tools and techniques
Data Understanding	- Collect Initial Data—initial data collection report
	- Describe Data—data description report
	- Explore Data—data exploration report
	- Verify Data Quality—data quality report
Data Preparation	- Data-Set Description
	- Select Data—rationale for inclusion/exclusion
	- Clean Data—data-cleaning report
	- Construct Data—derived attributes, generated records
	- Integrate Data—merged data
	- Format Data—reformatted data
Modeling	- Select Modeling Technique—technique, assumptions
	- Generate Test Design
	- Build Model—parameter settings, models, model description
	- Assess Model—assessment, revised parameters
Evaluation	- Evaluate Results—alignment with business success criteria
	- Review Process—review of process
	- Determine Next Steps—possible actions, decisions
Deployment	- Plan Deployment
	- Plan Monitoring and Maintenance
	- Produce Final Report—report, presentation
	- Review Project—experience documentation

Table 2. Extension of CRISP-DM data preparation tasks from the DCAI perspective. Each original task is preserved and extended to reflect the epistemological shift from pre-modeling preparation to continuous, versioned data curation.

Original CRISP-DM Task	Extended DCAI Task
Select Data	Move beyond simple selection to prioritize representative examples and order training data strategically, improving learning efficiency (e.g., via curriculum learning).
Clean Data	Extend cleaning to include systematic outlier detection and refinement of labels to ensure ongoing semantic reliability.
Construct Data	Extend feature construction to introduce synthetic examples and external contextual features to enhance generalization and expressiveness.
Integrate Data	Extend beyond technical merging to include semantic harmonization. This ensures that when heterogeneous sources are combined, the resulting units of analysis are not only technically coherent but also conceptually valid.
Format Data	Move beyond reshaping datasets for modeling, ensuring that transformations are well documented, transparent, traceable, and aligned with ethical and regulatory requirements.
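
As an illustration of the extended Clean Data task in Table 2, the following is a minimal sketch of systematic outlier flagging. It assumes a single numeric feature and uses the modified z-score based on the median absolute deviation (MAD), with the conventional constant 0.6745 and cut-off 3.5. The function name flag_outliers and the sample readings are hypothetical and are not taken from this paper; other detectors could be substituted.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. It is less sensitive to extreme values than
    a mean/standard-deviation z-score.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median: scores are undefined,
        # so nothing is flagged in this sketch.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sensor readings with one implausible value at index 5.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]
outliers = flag_outliers(readings)
print(outliers)  # -> [5]
```

In a data-centric workflow, flagged records would typically be sent for review or relabeling rather than silently dropped, so that each curation decision remains traceable.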

Table 3. CRISP-DM phases vs. revised DCAI phases.

CRISP-DM Phase	Revised DCAI Phase	Main Conceptual Difference
Business Understanding	Understanding Business and Data Requirements	Business goals are inherently data-dependent.
Data Understanding	Data Collection and Labeling	The focus shifts from exploratory data inspection to the active construction of the dataset (data acquisition and labeling governance).
Data Preparation	Data Curation	Data preparation is reinterpreted as a continuous, dataset-versioned, and auditable curation process (e.g., bias correction, label refinement, and enrichment) instead of a one-off pre-modeling activity.
Modeling	Model Training	Training is performed on stabilized, versioned datasets to enable causal attribution of performance changes to data interventions rather than to algorithmic exploration. The focus shifts from model search to controlled learning on curated data.
Evaluation	Evaluation (Business and Data Achievements)	Evaluation no longer assesses only business performance. Instead, continuous quality reporting is performed to explicitly verify whether data-centric interventions causally improve robustness, interpretability, reliability, and trustworthiness, alongside business KPIs.
Deployment	Deployment and Continuous Data-Centric Operations	Deployment is extended to include governance logs for automated data quality monitoring, drift detection, post-deployment data curation, and feedback loops between operational data and dataset evolution. The system remains “data-alive” after release.
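
The drift detection mentioned for the Deployment and Continuous Data-Centric Operations phase in Table 3 can be sketched as follows. This is a minimal example under stated assumptions: a single numeric feature, the two-sample Kolmogorov-Smirnov statistic as the drift measure, and an illustrative alert threshold of 0.3. The function names ks_statistic and drift_detected, the threshold, and the sample data are hypothetical and are not prescribed by this paper.

```python
from bisect import bisect_right


def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the
    reference (training-time) sample and the live (operational) sample.
    """
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    # The gap between two step functions can only change at sample points.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_detected(reference, live, threshold=0.3):
    """Flag drift when the KS statistic exceeds an assumed threshold."""
    return ks_statistic(reference, live) > threshold


# Hypothetical feature values: training-time data and two operational batches.
reference = list(range(100))
same_batch = list(range(100))
shifted_batch = [x + 50 for x in range(100)]

print(drift_detected(reference, same_batch))
print(drift_detected(reference, shifted_batch))
```

A monitoring job could run such a check per feature on each incoming batch and record the statistic in the governance log, so that later dataset revisions can be traced to the observed shift.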

Table 4. Alignment between DCAI stages and revised DCAI CRISP-DM phases. The three macro-stages of the DCAI lifecycle (Section 3) are mapped onto the corresponding revised CRISP-DM phases (Section 5.2).

Three DCAI Stages	Corresponding Revised DCAI Phases (from Table 3)
Training Data Development	Understanding Business and Data Requirements → Data Collection and Labeling → Data Curation
Inference Data Development	Model Training → Evaluation (Business and Data Achievements)
Data Maintenance	Deployment and Continuous Data-Centric Operations

Table 5. Partial list of the tools available via Data-Cloud and in the INFN solution portfolio [96]. Further details can be found in the cited references and URLs. The full INFN Cloud catalogue is available at https://www.cloud.infn.it/services-catalouge/ (accessed on 25 February 2026).

Action	Tools	Comment	Reference/URL
Authentication/Authorization	INDIGO-IAM	Industry-grade authorization and authentication mechanism, including the definition of roles and groups.	[97,98]
Cloud Orchestration	INDIGO-PaaS	Coordination of the provisioning of virtualized computation resources on both private and public cloud management frameworks.	[99,100,101]
Storage Services	storm, S3, Rados, Pandora, Sync&Share	Distributed and redundant, at the PB level with optional ISO 27001 certification. Complete integration with authentication and authorization.	storm url (https://italiangrid.github.io/storm/, accessed on 25 February 2026), Amazon S3 url (https://aws.amazon.com/s3, accessed on 25 February 2026), Rados Documentation (https://docs.ceph.com/en/reef/man/8/rados/, accessed on 25 February 2026), INFN pandora url (https://pandora.infn.it/, accessed on 25 February 2026)
Data Management and Transfer	Rucio, FTS	Data Management and Transfer tools scaling at the Exabyte level.	[102,103]
Remote Streaming	Xrootd, WebDAV, Kafka	High-speed remote data access, fully integrated with the authorization and authentication service.	xrootd url (http://www.xrootd.org/, accessed on 25 February 2026), WebDAV url (http://www.webdav.org/, accessed on 25 February 2026), Kafka url (http://kafka.apache.org/, accessed on 25 February 2026)
Machine Learning processing environment (suited for DCAI R&D)	AI_INFN	Multi-user environment providing access to specialized hardware (GPU and NVMe) in a scalable and transparent manner. It supports both interactive and distributed processing.	AI_INFN Documentation (https://ai-infn.baltig-pages.infn.it/wp-1/docs/, accessed on 25 February 2026) [104]
Multi-purpose analysis processing environments (suited for DCAI R&D)	JupyterHub PaaS	Scalable, transparent, multi-user environment providing access to resources within Data-Cloud, based on Virtual Machines and Docker. It supports both interactive and distributed processing.	JupyterHub PaaS Documentation (https://guides.cloud.infn.it/users_guides/sysadmin/compute/jh_with_persistence/, accessed on 25 February 2026)
	JupyterHub SaaS	Local service provided by ReCaS (one of the INFN Cloud sites), based on Kubernetes.	INFN ReCaS SaaS url (https://www.recas-bari.it/index.php/it/recas-bari-i-servizi-it/recas-bari-i-servizi/cloud-recas-software-as-a-service, accessed on 25 February 2026) [105], Kubernetes url (https://kubernetes.io/, accessed on 25 February 2026)
Anonymization Service	DICOM ToolKit libraries and DicomAnonymizer Python package	Required for sensitive data, such as medical datasets.	DICOM ToolKit url (https://www.dcmtk.org/, accessed on 25 February 2026)
DCAI	Heterogeneous resources	Different hardware implementations allowing one to find the best platform with which to achieve real-time performance for specific applications, such as the AI compression algorithm for High-Speed Streaming Readout signals.	[106]
Offloading	interLink, Virtual-Kubelet	A “transparent offloading” of containerized payloads using Virtual-Kubelet and the interLink extension, creating a common cloud-native interface with which to access any number of external hardware machines and any type of backend.	interLink (https://intertwin-eu.github.io/interLink/, accessed on 25 February 2026) [107]

