1. Introduction
1.1. Motivation and Context
The increasing deployment of Artificial Intelligence (AI) in high-risk decision-making contexts—including healthcare, finance, critical infrastructures, and public administration—has intensified demands for transparency, accountability, and effective human oversight throughout the algorithmic lifecycle.
Importantly, this article does not aim to design or improve an intrusion detection algorithm. The anomaly-detection setting is used solely as an illustrative instantiation to demonstrate how decisions and their supporting artefacts can be recorded, linked, and later reconstructed for audit and accountability under the GDPR and AI Act.
Within the European regulatory landscape, the combined requirements of the General Data Protection Regulation (GDPR) [1] and the Artificial Intelligence Act (AI Act) [2] impose stringent obligations on how automated decision systems must be designed, documented, and audited, requiring organisations to treat transparency, accountability, and human oversight as first-class design constraints rather than post hoc extensions.
The synthetic, IDS-inspired setting is therefore best understood as a controlled test harness: it allows the paper to stress-test evidence generation, linkage, and governance gates under reproducible conditions (fixed seeds, explicit parameters, and deterministic artefact paths), while keeping the methodological focus on audit reconstruction rather than on detector optimisation. In many regulated deployments, sharing operational datasets is constrained by privacy, security, contractual, and confidentiality obligations; accordingly, the manuscript prioritises a rerunnable evidence protocol and a clearly specified real-data instantiation procedure over publishing domain-sensitive data.
From a technical perspective, the rapid evolution of machine learning has exposed a persistent gap between mainstream engineering practices and legal–regulatory expectations.
The literature on Explainable Artificial Intelligence (XAI) has consistently emphasised interpretability, traceability, and auditability as central pillars of algorithmic trustworthiness [3,4], yet these principles are rarely translated into concrete artefacts and metrics that can be inspected by data protection officers, auditors, or regulators. In many production environments, explanation mechanisms remain ad hoc, unversioned, and weakly connected to formal compliance processes, hindering the systematic demonstration of conformity with the requirements of the GDPR and AI Act [5].
The security- and privacy-critical domains form a particularly demanding subset of high-risk AI applications. Examples include fraud detection, anomaly detection in critical infrastructures, cyberthreat detection in network traffic, and risk scoring in electronic public services. In such settings, engineering teams must ensure not only accurate and robust models but also operational evidence that these models behave in a stable, explainable, and accountable manner over time. This combination of operational criticality and regulatory scrutiny calls for integrated frameworks that embed XAI, compliance-by-design, and trustworthy Machine Learning Operations (MLOps) principles directly into the lifecycle of high-risk AI systems.
Existing approaches typically address these concerns in a fragmented manner. XAI methods are often implemented as add-ons to existing pipelines; governance frameworks remain largely conceptual; and regulatory analyses frequently lack explicit links to the artefacts actually produced by machine learning workflows. There remains a lack of a unified operational framework that connects technical metrics and artefacts—such as models, explanations, logs, and drift indicators—to concrete regulatory requirements in a way that is both implementable with mainstream tools and auditable in practice.
This need is especially evident in cybersecurity operations, such as security operations centres (SOCs) and intrusion detection systems (IDSs), where teams must justify alerts and interventions in environments subject to strict privacy and logging constraints.
1.2. Problem Statement
This work addresses the following central question: How can we design and implement an operational framework that integrates explainability, compliance-by-design, and trustworthy MLOps into high-risk AI systems in a way that produces verifiable evidence of conformity with the GDPR and the AI Act?
Current solutions often focus primarily on model performance and ad hoc explanations or on high-level governance principles that are not operationally implemented in machine learning pipelines. There is a lack of end-to-end architectures that (i) explicitly map technical artefacts to regulatory requirements, (ii) provide systematic compliance logging and tamper-evident evidence bundles, and (iii) can be instantiated in realistic, security- and privacy-critical scenarios without requiring bespoke tooling.
Addressing this gap is essential to operationalising European regulatory obligations in a technically grounded manner and supporting European digital sovereignty in the governance of high-risk AI systems.
1.3. Research Objectives and Contributions
The overarching goal of this article is to propose and validate a modular XAI compliance-by-design framework that bridges the gap between regulatory requirements and machine learning practices in high-risk AI systems. This research is guided by the following questions (RQs):
RQ1: How can XAI techniques, compliance-by-design principles, and trustworthy MLOps practices be integrated into a single modular framework for high-risk AI systems?
RQ2: To what extent can such a framework produce concrete, verifiable artefacts—such as models, explanations, logs, and evidence bundles—that support GDPR- and AI Act-aligned auditability and accountability?
RQ3: How does the proposed framework behave when instantiated in an illustrative anomaly-detection scenario using a synthetic, security-relevant tabular dataset in terms of predictive performance, explainability, and drift monitoring as first-class, versioned artefacts that support auditability and accountability?
In response to these questions, the article contributes the following:
1. A conceptual XAI Compliance-by-Design framework linking the lifecycle of high-risk AI systems to regulatory requirements from the GDPR and the AI Act, emphasising traceability, oversight, and risk management.
2. A modular MLOps-orientated architecture that integrates data preprocessing, model training, explainability, drift monitoring, and compliance monitoring, designed to be implementable with widely used open-source tools.
3. A technical–regulatory correspondence matrix mapping specific metrics and artefacts—such as SHAP reports, drift statistics, model lineage logs, and evidence bundles—to relevant legal and standardisation provisions.
4. A reusable, drop-in artefact kit defining minimal evidence schemas and templates (e.g., manifests, compliance logs, and decision dossiers) that can be integrated into existing MLflow pipelines to produce consistent, audit-ready outputs without custom tooling.
5. An end-to-end proof-of-concept implementation in Python 3.14.3 instantiated on a synthetic, security-relevant tabular dataset in an illustrative anomaly-detection setting. The implementation uses a Random Forest classifier combined with SHAP and LIME explanations, producing versioned models, explanation artefacts, drift indicators, and tamper-evident evidence bundles. The purpose is to exercise the audit and accountability workflow rather than to optimise intrusion detection performance.
6. An empirical assessment of model performance, global and local explainability, and stability under dataset shift and distributional drift, together with a discussion of the implications for trustworthy MLOps, regulatory governance, and European digital sovereignty.
Together, these contributions demonstrate that XAI, regulatory alignment, and compliance documentation can be embedded directly into the technical fabric of high-risk AI pipelines rather than being treated as external or purely conceptual layers.
1.4. Structure of the Paper
The remainder of this paper is structured as follows.
Section 2 reviews the state of the art in XAI, compliance-by-design, and AI accountability, identifying the conceptual and operational gaps that motivate the proposed framework.
Section 3 describes the research design, the framework-orientated methodology, and the illustrative case study.
Section 4 presents the XAI-Compliance-by-Design framework, detailing its architectural logic, functional layers, and technical–regulatory correspondence matrix.
Section 5 reports the implementation details and experimental validation, including model performance, feature importance analysis, global explainability, and drift monitoring.
Section 6 discusses the technical, regulatory, and strategic implications of the results, and
Section 7 concludes the article and outlines directions for future research.
2. Related Work and State of the Art
Recent research in Explainable Artificial Intelligence (XAI) has increasingly focused on the pressing needs of cybersecurity- and privacy-critical applications. These demands are particularly prominent within the context of the rapidly evolving European regulatory environment.
This section provides a critical analysis of the technological, scientific, and legal progress that defines the pillars of explainability, auditability, and compliance-by-design required for trustworthy high-risk AI systems.
The discussion is structured into the following five subsections: (1) foundations of XAI; (2) compliance-by-design and AI accountability; (3) a comparative review of existing frameworks; (4) regulatory and ethical foundations (GDPR, AI Act, and digital sovereignty); and (5) the research gap and conceptual positioning of the current contribution.
2.1. Foundations of Explainable Artificial Intelligence
Explainable Artificial Intelligence (XAI) has emerged in response to the growing demand for transparency in the algorithmic outputs generated by complex, opaque machine learning models [6]. Foundational studies, such as Doshi-Velez and Kim [4], have defined interpretability as a measurable property within the broader algorithmic lifecycle. Interpretability may involve intrinsic transparency, where simple models are inherently explainable, or post hoc techniques designed to clarify decisions made by black-box systems [3,6].
Techniques such as LIME, SHAP, TreeSHAP, and surrogate models provide local and global explanations to support human-understandable justifications for automated decisions. However, these approaches may be affected by instability, bias, and high computational cost [3,4]. In practical settings, explainability has become fundamental in fostering institutional trust, especially in cybersecurity- and privacy-critical applications. Studies highlight that explanation quality hinges on robustness, stability, and contextual relevance—all attributes vital for trustworthy, auditable MLOps pipelines [7]. Similar robustness and efficiency requirements arise in other mission-critical safety domains, where adaptive mechanisms are used to maintain reliable decision support under changing operational conditions [8].
Although these contributions lay the scientific foundation for linking explainable AI with regulatory obligations, most foundational studies do not yet provide standardised, audit-ready compliance metrics. Addressing this challenge is key to enabling compliance-by-design in high-risk AI systems, consistent with evolving regulatory expectations.
2.2. Compliance-by-Design and AI Accountability
Compliance-by-design strengthens the integration of regulatory requirements into the entire algorithmic lifecycle, embedding legal and ethical controls within the engineering of AI systems rather than relying solely on retrospective verification [9]. This paradigm, rooted in privacy-by-design under Article 25 of the GDPR [1], expands to include transparency, human oversight, risk management, and auditability in a unified operational framework.
The current literature provides limited guidance on systematically implementing these controls in continuously evolving large-scale AI pipelines. Best practices increasingly require translating process controls into technical artefacts—such as automated risk assessments, decision logs, and traceable audit trails—that are continuously updated and legally mapped to GDPR and AI Act obligations.
As a result, compliance-by-design initiatives often remain theoretical, lacking proven workflows for evidence curation, versioning, and traceability. The ISO/IEC 42001 standard strengthens the discipline of audit management by mandating explicit controls, ongoing documentation, and the generation of audit-ready regulatory evidence—including risk categorisation, bias testing, and incident reporting throughout the model lifecycle.
Continuous accountability requires automated management, documentation, and monitoring of AI behaviours, with clear roles assigned and readiness for audit. These elements are critical for trustworthy AI governance, particularly in high-risk cybersecurity and privacy applications.
2.3. Comparative Analysis of Existing Frameworks
Several frameworks have sought to align explainability with regulatory requirements. However, most exhibit operational limitations, including the absence of standardised processes for generating, versioning, and maintaining compliance evidence—such as decision logs, model documentation, and audit trails—across the entire AI lifecycle.
Table 1 summarises the most relevant contributions from 2020 to 2025.
Operationally robust frameworks must address the following crucial compliance pillars: automated lineage tracking, code and model versioning, continuous drift monitoring, and granular, audit-grade trails that meet legal standards for high-risk AI under the EU AI Act [2].
Despite these conceptual contributions, only a minority of frameworks currently offer machine-verifiable mechanisms for continuous, native MLOps compliance—such as automated lineage tracking, versioned decision logs, model cards, and tamper-evident audit trails. Recent standards, including EU AI Act Article 96 and ISO/IEC 42001 [16], now require compliance artefacts to be continuously retrievable and mapped to explicit legal controls. However, most reviewed frameworks have not yet met these operational and legal benchmarks.
The solution presented in this article directly addresses these deficiencies by integrating real-time evidence generation, robust versioning, and comprehensive model lifecycle governance into a unified, auditable approach to AI compliance.
2.4. Regulatory and Ethical Foundations: GDPR, AI Act, and Digital Sovereignty
The European regulatory framework provides the foundation for trustworthy and accountable AI operations. The GDPR [1] establishes the principles of transparency, accountability, and lawfulness applicable to automated decision-making systems processing personal data. The AI Act [2] reinforces this structure through a risk-based classification of AI systems and mandatory requirements for documentation, testing, traceability, and human oversight in high-risk applications.
ISO/IEC 42001:2023 [16] extends this governance architecture by defining a structured management system for AI, detailing controls for oversight, risk management, documentation, and continuous monitoring aligned with both the GDPR and the AI Act.
Recent work by Ahangar et al. [5] and Lozano-Murcia et al. [17] emphasises that European digital sovereignty depends on verifiable technical infrastructures capable of producing trustworthy, auditable, and reproducible evidence—precisely the type of evidence targeted by the framework proposed in this article.
2.5. Research Gap and Conceptual Positioning
A persistent gap in the literature concerns the limited operationalisation of explainability metrics—such as fidelity, stability, and comprehensibility—as measurable indicators of regulatory alignment [3]. Existing frameworks tend to focus on technical explainability or legal obligations, but they rarely offer integrated, audit-ready mechanisms that link the two domains [9].
This disconnect poses practical challenges for engineering and compliance teams, who must translate technical artefacts into regulatory evidence that can be examined by auditors, data protection officers, and supervisory authorities.
The present work addresses this gap by proposing a modular XAI-Compliance-by-Design framework that connects explainability metrics to concrete regulatory requirements and produces verifiable evidence bundles integrated directly into the MLOps pipeline. This enables continuous, reproducible, and audit-ready compliance throughout the lifecycle of high-risk AI systems, aligning technical, organisational, and regulatory dimensions in support of European digital sovereignty.
3. Methodology
This section presents the methodological approach adopted to design, formalise, and operationalise the proposed XAI-Compliance-by-Design framework. The primary focus is the construction of a general, reusable framework that can be instantiated across multiple high-risk AI contexts, rather than the optimisation of any particular machine learning model or domain-specific use case. The anomaly detection scenario described later in this section is used as an illustrative synthetic example, whose sole purpose is to demonstrate the implementability of the framework and the generation of audit-ready artefacts.
Consequently, the main evaluation criterion in this work is not improved intrusion detection performance, but the ability of the proposed pipeline to operationalise regulatory obligations under the GDPR, the AI Act, and ISO/IEC 42001 in a traceable and auditable manner. The synthetic IDS-like scenario is deliberately kept simple to isolate the contribution of the framework and its evidence-generation flow, avoiding domain-specific optimisations that could obscure the compliance-by-design mechanisms that are central to this study.
The methodology combines three mutually reinforcing components: (i) a conceptual model grounded in the GDPR, the AI Act, and ISO/IEC 42001; (ii) a Design Science Research (DSR) process focused on the construction and evaluation of verifiable artefacts; and (iii) an operational MLOps pipeline that implements the framework and produces technical and regulatory evidence.
3.1. Operational Terminology and Evidence Artefacts
To avoid ambiguity, the methodological description adopts a consistent terminology for the operational components and outputs of the framework. The following terminology is used throughout the remainder of the paper:
Pipeline refers to the end-to-end, compliance-orientated Machine Learning Operations (MLOps) workflow that executes data preparation, modelling, explainability, monitoring, and evidence generation as a single reproducible process.
RUN_ID is a globally unique execution identifier assigned at the beginning of each pipeline run and propagated across file names, manifests, and logs to enable end-to-end traceability.
The compliance log denotes the structured, append-only record of lifecycle events produced by the pipeline (e.g., data snapshot creation, model training, explanation generation, and drift checks). The compliance log is designed to support the reconstruction of the full lineage for a given RUN_ID.
The evidence bundle denotes the per-run collection of technical artefacts (models, metrics, explanations, drift reports, and hashes) organised in a deterministic directory structure and indexed by a manifest (e.g., JSON). The evidence bundle is intended to be directly inspectable and reusable for audits or conformity assessment activities.
The decision dossier denotes the per-run decision record that captures the deployment-relevant conclusion (e.g., approve, reject, escalate), together with the minimum justification and pointers to the corresponding evidence bundle. In the implementation, the decision dossier is represented as a machine-readable artefact (e.g., JSON) that can be complemented by a human-readable report.
The technical–regulatory correspondence matrix denotes the mapping between regulatory obligations and the concrete controls/artefacts produced by the pipeline.
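The correspondence matrix can itself be shipped as a machine-readable artefact. The toy sketch below is purely illustrative (the obligation labels and file names are assumptions, not the paper's actual matrix) and shows how an audit query resolves an obligation to evidencing artefacts:

```python
# Toy machine-readable correspondence matrix: each obligation maps to the
# controls/artefacts that evidence it. Entries here are illustrative only.
CORRESPONDENCE_MATRIX = {
    "matrix_version": "1.0",
    "entries": [
        {"obligation": "GDPR Art. 5(2) accountability",
         "artefacts": ["compliance_log.jsonl", "manifest.json"]},
        {"obligation": "AI Act record-keeping (high-risk systems)",
         "artefacts": ["compliance_log.jsonl", "evidence_bundle/"]},
        {"obligation": "AI Act human oversight (high-risk systems)",
         "artefacts": ["decision_dossier.json"]},
    ],
}

def artefacts_for(obligation_substring: str) -> list[str]:
    """Return the artefacts mapped to any obligation matching the query."""
    return sorted({a for e in CORRESPONDENCE_MATRIX["entries"]
                   if obligation_substring.lower() in e["obligation"].lower()
                   for a in e["artefacts"]})
```

Versioning the matrix (`matrix_version`) lets each run record which mapping was in force when its evidence was produced.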
Table 2 summarises the core operational artefacts generated by the methodology and clarifies their roles as audit-ready outputs.
3.2. MLflow-Backed Evidence Tracking and Model Registry Integration
The framework is designed to be compatible with mainstream Machine Learning Operations (MLOps) tooling. In particular, MLflow can be used as the lifecycle substrate for end-to-end traceability by providing the following three practical primitives: (i) experiment tracking (runs, parameters, metrics, and tags); (ii) an artefact store for persistent, queryable evidence; and (iii) a model registry for controlled promotion and versioning of deployable models.
Crucially, MLflow is not treated as a compliance solution per se. Instead, XAI-Compliance-by-Design defines explicit evidence schemas (e.g., manifests, dossiers, and structured logs), governance rules, and audit queries, and it uses MLflow to store and retrieve these artefacts in a consistent and reproducible manner.
Table 3 summarises the operational mapping between the framework’s artefacts and MLflow primitives.
In practice, each pipeline execution starts an MLflow run and logs the following: configuration (parameters and thresholds), evaluation metrics, explainability outputs, drift indicators, and the structured artefacts that make up the evidence bundle and decision dossier. A minimal implementation pattern is illustrated below (Listing 1).
Listing 1. Example of MLflow-backed audit logging for a single RUN_ID.
This integration enables auditors and governance stakeholders to retrieve all relevant artefacts for a given decision by resolving the RUN_ID (or model registry version) and downloading the corresponding evidence bundle and decision dossier. By standardising the naming, manifests, and tags of artefacts, the pipeline supports repeatable audit queries, consistent retention policies, and practical linkage between technical lifecycle events and regulatory obligations.
3.3. Reusable Artefact Kit and Drop-In Integration Pattern
This work is not positioned as a software paper; however, to reduce adoption friction and to address the recurring concern that compliance-by-design proposals remain “only engineering,” the framework is expressed as a small, reusable artefact kit that can be dropped into existing MLflow-backed pipelines. The kit specifies a minimal evidence schema (file names, fields, and linkage rules) so that different organisations can produce comparable, audit-ready outputs without bespoke tooling.
The following machine-readable templates are treated as first-class artefacts and are recorded per RUN_ID: decision_dossier.json, cde_report.json, manifest.json, compliance_log.jsonl, schema_mapping.json, xai_provider_spec.json, and (when monitoring is enabled) drift_calibration.json. Each template uses stable identifiers (e.g., RUN_ID, matrix_version) and explicit pointers to evidence-bundle paths.
Evidence is packaged deterministically so that retrieval is consistent across runs:
evidence_bundle/
  model/
  metrics/
  xai/
  monitoring/
  governance/
  manifest.json
  decision_dossier.json
  cde_report.json
  compliance_log.jsonl
To adopt the kit in an existing pipeline, only three changes are required: (i) start a run and persist RUN_ID and matrix_version as tags; (ii) emit the templates at key lifecycle points (training, explanation, monitoring, and decision); and (iii) log the artefacts to the experiment store (e.g., MLflow) so that audit queries resolve by RUN_ID. This approach preserves domain-specific model choices while standardising the evidence and governance surface that auditors and regulators can inspect.
3.4. Framework-Oriented Methodology and MLOps Pipeline
The framework is operationalised through a generic, compliance-orientated MLOps pipeline designed to be applicable across different high-risk AI domains, with a particular focus on cybersecurity- and privacy-sensitive settings. The pipeline is organised into stages that mirror the lifecycle of an AI system, from data handling to deployment-orientated evidence generation, and each stage is instrumented with compliance logging so that technical events can be systematically traced back to regulatory requirements.
At a high level, the pipeline is structured as a sequence of operational stages. For each stage, the methodology specifies (i) inputs, (ii) processing actions, and (iii) outputs as persistent artefacts recorded in the compliance log and referenced in the evidence bundle manifest:
Stage 1–Environment configuration and context registration. Inputs: configuration parameters (e.g., data source, model family, random seed, and thresholds).
Outputs/artefacts: RUN_ID; environment snapshot (library versions); directory initialisation; compliance log entries binding the run context to all subsequent artefacts.
Stage 2–Data handling and preprocessing. Inputs: raw data (ingested or generated) and a declared schema (numerical/categorical attributes).
Outputs/artefacts: frozen dataset snapshot; schema and summary statistics; preprocessing configuration (e.g., encoding/scaling choices); and compliance log entries supporting the traceability of data preparation.
Stage 3–Model training and validation. Inputs: training split; preprocessing pipeline; and model hyperparameters.
Outputs/artefacts: serialised end-to-end model pipeline; evaluation metrics (accuracy, precision, recall, F1-score, and AUC-ROC); hashes of binary artefacts; and compliance log entries that capture the training and evaluation lineage.
Stage 4–Explainability and monitoring. Inputs: trained model pipeline; evaluation data; explainer configuration; and monitoring configuration (drift indicators and thresholds).
Outputs/artefacts: global and local explanations (SHAP/LIME); monitoring outputs (drift statistics/flags); parameter records; and compliance log entries linking explainability and monitoring artefacts to the corresponding RUN_ID.
Stage 5–Construction of evidence bundles. Inputs: all artefacts produced in Stages 1–4.
Outputs/artefacts: per-run evidence bundle directory structure; manifest indexing artefact paths and hashes; and packaged materials ready for audit inspection and conformity assessment workflows.
Stage 6–Constructing the decision dossier. Inputs: evidence bundle manifest; governance rules/thresholds (where applicable); and oversight metadata.
Outputs/artefacts: per-run decision dossier capturing the deployment-relevant conclusion and the minimal justification with pointers to the evidence bundle.
The pipeline thus serves as a vehicle for instantiating the framework; its structure and logging mechanisms are designed to be transferable to other application domains beyond the illustrative anomaly detection scenario, provided that domain-specific data and models can be mapped to the same evidence and compliance-logging pattern.
3.5. Illustrative Case Study: Synthetic IDS-like Scenario
To demonstrate the practical instantiation of the framework in a cybersecurity-relevant setting, a synthetic network anomaly detection scenario is used as an illustrative example. This case study has a purely demonstrative function: it shows how the framework can be implemented end-to-end and how the corresponding artefacts are generated, versioned, and logged without making domain-level claims about intrusion detection performance.
In this scenario, synthetic network traffic is generated to emulate typical intrusion detection system (IDS) datasets. The generated dataset comprises the following:
Numerical and categorical features: Continuous predictors (e.g., connection duration and packet and byte counts) and categorical variables (e.g., protocol type, service, and flags), processed via a ColumnTransformer with OneHotEncoder for categorical attributes and passthrough for numerical attributes.
Binary target variable: A label representing normal and attack traffic, used solely to demonstrate how the framework manages supervised classification tasks in an IDS-like context.
Class imbalance: A minority proportion of attack events (around 20%), echoing the typical imbalance found in many operational network settings, without claiming to reproduce any specific real-world environment.
Synthetic Dataset Generation Protocol (Reproducible Specification)
To address reproducibility explicitly, the synthetic dataset is generated using a fixed random seed and a parametrised generator that produces (i) numerical features with bounded ranges and plausible heavy-tailed behaviour (e.g., byte volumes), (ii) categorical protocol/service/flag combinations with lightweight consistency constraints, and (iii) an imbalanced binary label. Importantly, these generation rules are designed to exercise the evidence and audit pipeline (data freezing, lineage, artefact logging, and decision reconstruction) rather than to emulate any specific public IDS benchmark.
The generator produces a feature set that mirrors common “network-like” tabular fields used in IDS practice (without claiming fidelity to any specific dataset). In particular, the numerical feature family includes duration, src_bytes, dst_bytes, count, srv_count, and several normalised rates in [0, 1] (e.g., serror_rate, rerror_rate, same_srv_rate). Class-conditional behaviour is introduced by shifting the beta-distribution parameters for rate features: for attack samples, error-related rates are sampled with higher expected values, while “stability” rates such as same_srv_rate are sampled with lower expected values. For normal samples, the inverse holds: lower expected values for error rates and higher expected values for stability rates. This ensures a plausible signal for supervised learning while remaining simple and transparent.
The procedure below describes the generator at an operational level; it can be implemented directly, and because the seed is fixed, it deterministically reproduces the same dataset. The operational protocol for the synthetic IDS-like dataset generator is provided in Listing 2.
Listing 2. Reproducible synthetic IDS-like dataset generation protocol.
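Listing 2 is likewise rendered as an image in the published version. The stdlib-only sketch below captures the protocol's essentials (fixed seed, heavy-tailed numerical features, beta-distributed rate features with class-conditional parameters, and roughly 20% attack prevalence); the concrete distribution parameters are illustrative assumptions, not those of Table 4:

```python
import random

PROTOCOLS = ["tcp", "udp", "icmp"]
SERVICES = ["http", "smtp", "dns", "ftp"]
FLAGS = ["SF", "S0", "REJ"]

def generate_dataset(n_rows: int, seed: int = 42, attack_ratio: float = 0.20):
    """Deterministically generate an imbalanced, IDS-like tabular dataset."""
    rng = random.Random(seed)  # fixed seed -> identical dataset on every run
    rows = []
    for _ in range(n_rows):
        is_attack = rng.random() < attack_ratio
        row = {
            "duration": rng.expovariate(1 / 30),         # heavy-tailed seconds
            "src_bytes": int(rng.lognormvariate(6, 2)),  # heavy-tailed bytes
            "dst_bytes": int(rng.lognormvariate(5, 2)),
            "count": rng.randint(1, 100),
            "srv_count": rng.randint(1, 100),
            "protocol_type": rng.choice(PROTOCOLS),
            "service": rng.choice(SERVICES),
            "flag": rng.choice(FLAGS),
        }
        # Class-conditional beta parameters: attacks -> higher error rates and
        # lower "stability" rates; normal traffic -> the inverse.
        if is_attack:
            row["serror_rate"] = rng.betavariate(5, 2)
            row["rerror_rate"] = rng.betavariate(5, 2)
            row["same_srv_rate"] = rng.betavariate(2, 5)
        else:
            row["serror_rate"] = rng.betavariate(2, 5)
            row["rerror_rate"] = rng.betavariate(2, 5)
            row["same_srv_rate"] = rng.betavariate(5, 2)
        row["label"] = int(is_attack)
        rows.append(row)
    return rows
```

Because the generator is seeded and self-contained, hashing the emitted dataset gives a stable fingerprint that can be recorded in the compliance log to prove which data configuration a run used.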
In addition to the dataset file itself, the generator parameters (Table 4), seed, and basic dataset statistics (shape, missingness checks, and class ratio) are stored as run-scoped evidence. This makes the data configuration auditable and reconstructable via the RUN_ID, reinforcing that the empirical setup is used to test evidence production and decision traceability rather than to benchmark IDS performance.
The dataset is frozen and stored in a persistent data layer together with metadata describing its dimensionality, feature types, and class distribution. These elements are registered in the compliance log to provide an auditable record of the data configuration used in this particular instantiation and to support the reconstruction of the experimental setup.
The simplicity of this synthetic scenario is intentional; it removes confounding factors associated with complex real-world SOC environments and allows the evaluation to focus on whether the framework and pipeline produce the expected lineage records, explanation artefacts, and compliance logs. The empirical setting should therefore be interpreted as a minimal, controlled environment for exercising the compliance-by-design machinery, rather than as an attempt to advance the state of the art in intrusion detection.
3.6. Operational Vignettes: Real-World Decision Auditing Scenarios
The synthetic IDS-like illustration described above is used to exercise the end-to-end evidence and governance workflow in a controlled setting. However, the contribution of this article is not limited to cybersecurity; the same decision provenance and audit-ready evidence pattern applies to high-impact, regulated decision pipelines in multiple domains.
To make this transferability explicit, we provide three concise operational vignettes. They are not additional experiments and do not introduce new datasets; rather, they illustrate how inputs, decisions, evidence artefacts, and audit questions map to the proposed evidence-centric pipeline and to MLflow-backed run tracking (RUN_ID/run_id).
3.6.1. Vignette A: Consumer Credit Scoring (Financial Services)
Input: Application data (income, liabilities, history), feature engineering configuration, policy thresholds (e.g., risk cut-offs), and the model version to be assessed.
Decision: Approve, decline, or route to manual review.
Artefacts to record: a unique RUN_ID; a frozen data snapshot and preprocessing pipeline configuration; the registered model reference (registry version); an explanation report (global drivers and per-decision rationale); a decision dossier linking the decision to the evidence package; and a compliance log capturing who/what triggered the decision and which policy gate outcomes applied.
Audit questions supported: Which model and preprocessing configuration produced the decision? Which inputs were used and were they complete at the time of the decision? What rationale was recorded for the outcome and what evidence is available to support contestability and human oversight?
3.6.2. Vignette B: Clinical Decision Support (Triage and Risk Assessment)
Input: Patient observations (vital signs and laboratory results), institutional protocols, the model version, and the clinician’s context (e.g., ward and urgency category).
Decision: Clinical risk flagging and prioritisation (e.g., urgent review recommended), with an explicit opportunity for clinician override.
Artefacts to record: RUN_ID and timestamps; data completeness and provenance indicators; explanation artefacts supporting interpretability; drift-monitoring outputs where applicable; the decision dossier, including the final action taken and whether it was accepted or overridden by a clinician; and a compliance log recording oversight actions and traceable policy checks.
Audit questions supported: What evidence supports this recommendation and how was it communicated to the decision maker? Was the output reviewed/overridden, by whom, and when? Which model version was used, and what monitoring signals (e.g., drift) were present at the time of decision?
3.6.3. Vignette C: Security Operations Centre (SOC) Alert Triage
Input: Alerts emitted by existing detectors (including IDS outputs), contextual telemetry (asset criticality and historical behaviour), and operational policies (escalation thresholds and playbooks).
Decision: Escalate, monitor, or close an alert, including justification for the action taken.
Artefacts to record: A RUN_ID per triage action; references to the trigger alert(s) and contextual evidence; the applied policy gates and their outcomes; a decision dossier capturing the triage decision and rationale; and a compliance log allowing later audits of operator actions, decision timing, and evidence lineage.
Audit questions supported: Why was an alert escalated or dismissed and what evidence was available at the time? What policies and thresholds were applied? Can an auditor reconstruct the full chain from alert emission to human action, including any overrides or exceptions?
Across these scenarios, the unifying principle is that the system must be able to reconstruct what was decided, why it was decided, under which model and data conditions, and who exercised oversight. In the proposed approach, these properties are captured as queryable artefacts associated with a RUN_ID and retrievable via the MLflow tracking and artefact store.
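As a sketch of how such reconstruction queries might look against a conventional per-run artefact layout (the artifacts/<RUN_ID>/ directory convention and file names are assumptions; with MLflow, the equivalent lookups go through the tracking and artefact store):

```python
import json
from pathlib import Path

def reconstruct_decision(artefact_root: str, run_id: str) -> dict:
    """Answer 'what was decided, and on what evidence?' for one RUN_ID.

    Assumes a per-run layout such as
    artifacts/<RUN_ID>/{decision_dossier.json, manifest.json, ...}.
    """
    run_dir = Path(artefact_root) / run_id
    view = {
        "run_id": run_id,
        # Everything the auditor can retrieve for this run.
        "artefacts": sorted(p.name for p in run_dir.iterdir() if p.is_file()),
    }
    dossier = run_dir / "decision_dossier.json"
    if dossier.exists():
        view["decision"] = json.loads(dossier.read_text())
    return view
```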
3.6.4. External Validity: Minimal Real-World Dataset Instantiation Protocol (Fallback)
Reviewer concerns about external validity typically arise when an evaluation is based solely on synthetic data. In this manuscript, the synthetic IDS-like scenario is intentionally used as a controlled environment to exercise the auditability machinery (evidence bundles, decision dossiers, compliance logs, and CDE gates) rather than to benchmark detection performance. To address external validity without shifting the contribution towards IDS optimisation, this subsection specifies a minimal, reproducible protocol for instantiating the same pipeline on any real-world dataset.
The protocol is designed as a drop-in procedure: it requires no changes to the core framework, only an adaptation layer that maps dataset-specific fields into the technical pipeline and evidence schema. The emphasis remains on traceability and audit readiness (not on maximising predictive scores).
The following inputs are required, all of which are recorded under a unique RUN_ID (MLflow run_id): (i) dataset identifier and acquisition channel; (ii) a data dictionary (feature semantics, types, and allowed ranges); (iii) preprocessing specifications (cleaning rules, handling of missingness, and encoding); (iv) task definitions and label semantics (including any known label noise); and (v) governance parameters (evidence schema, correspondence-matrix version, and CDE thresholds).
The following steps define the minimal instantiation workflow:
- 1.
Dataset selection and documentation: Select a dataset suitable for a regulated decision context (e.g., tabular decision data, sensor logs, or operational telemetry). Document the dataset’s provenance (source and collection context), licensing constraints, and justify why it is appropriate for an external validity check.
- 2.
Protection and risk screening: Document privacy/security constraints (e.g., presence of personal data, sensitive attributes, and access limitations). Record a brief risk screening note (e.g., whether a Data Protection Impact Assessment is required in the target domain) and the adopted minimisation strategy (feature exclusion and pseudonymisation where relevant).
- 3.
Schema mapping: Map the dataset fields into the framework’s evidence schema. Concretely, produce a schema_mapping.json artefact that specifies feature names, types, transformations, and any derived features.
- 4.
Reproducible split and preprocessing: Apply a deterministic split strategy and preprocessing pipeline, recording random seeds and preprocessing configuration as run parameters. Store a frozen snapshot of the split indices (or hash pointers) as evidence.
- 5.
Model training (illustrative): Train a simple, well-documented baseline model (the model family is not the contribution). Log model hyperparameters and training configuration. The goal is to generate stable artefacts for auditability exercises.
- 6.
Explainability artefacts: Generate the declared global and local explanation outputs (as in the synthetic case) and store them under the run artefact directory.
- 7.
Monitoring signals: Compute minimal monitoring indicators required by governance (e.g., drift statistics when historical windows are available). Log the thresholds and the computed indicators as run evidence.
- 8.
Execution of the CDE gate and decision capture: Execute the evidence gates and produce cde_report.json. Emit decision_dossier.json, explicitly linking the decision to evidence pointers and the correspondence-matrix version.
- 9.
Packaging and integrity: Generate manifest.json enumerating all evidence artefacts, including hashes for tampering evidence. Append all stage events to compliance_log.jsonl.
A real-world instantiation is considered “audit-ready” when the run contains the following: schema_mapping.json, frozen data split pointers (or hashes), preprocessing configuration, a registered model reference (or model artefact), explanation artefacts, monitoring outputs (when applicable), cde_report.json, decision_dossier.json, compliance_log.jsonl, and manifest.json. All elements must be retrievable via RUN_ID/MLflow and must satisfy the integrity check declared in the manifest.
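The packaging-and-integrity step of the workflow above can be sketched as follows. This is a minimal illustration of hash-based tamper evidence over a per-run artefact directory, not the framework's exact manifest schema.

```python
import hashlib
import json
from pathlib import Path

def _hash_artefacts(run_dir: Path) -> dict:
    """SHA-256 of every evidence file except the manifest itself."""
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in sorted(run_dir.iterdir())
            if p.is_file() and p.name != "manifest.json"}

def build_manifest(run_dir: str) -> dict:
    """Enumerate evidence artefacts with hashes and persist manifest.json."""
    d = Path(run_dir)
    manifest = {"artefacts": _hash_artefacts(d)}
    (d / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest

def verify_manifest(run_dir: str) -> bool:
    """Re-hash artefacts and compare against the recorded manifest."""
    d = Path(run_dir)
    recorded = json.loads((d / "manifest.json").read_text())["artefacts"]
    return recorded == _hash_artefacts(d)
```

Any post hoc modification of an artefact changes its digest, so `verify_manifest` fails and the run is no longer considered audit-ready.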
To make the external-validity expectation operational without shifting the contribution towards IDS benchmarking, we define a minimal checklist of verifiable conditions. An instantiation is considered externally valid for the claims of this article if conditions EV1–EV8 are satisfied and recorded under the same RUN_ID, together with the declared correspondence-matrix version.
The verifiable conditions EV1–EV8 are summarized in
Table 5.
This protocol enables direct audit questions that are independent of domain-specific performance claims as follows: (i) Which evidence supported this decision, and where is it stored? (ii) Which correspondence-matrix version and governance thresholds were applied? (iii) Did any evidence gate fail, and what remediation was triggered? (iv) Can the full lineage from dataset snapshot to decision dossier be reconstructed for this run?
If a concrete real-world dataset is not provided within this manuscript, the above protocol serves as the operational specification for external-validity instantiation. It defines what must be recorded, how it is linked, and what constitutes a successful audit-ready run, allowing third parties to reproduce the procedure on their own sector-appropriate datasets without changing the scope of the contribution.
3.7. Explainability Layer: SHAP and LIME
The explainability layer is designed to be model- and domain-agnostic. Its role is to (i) generate auditable explanation artefacts (global and local), (ii) make the explainer choice and configuration explicit and versioned per run, and (iii) support lightweight, repeatable checks over explanation quality properties that matter for governance (e.g., stability and parsimony). Importantly, this layer does not aim to introduce a novel IDS model or to benchmark detection performance; explanations are treated as first-class evidence objects that must be reproducible, retrievable, and linkable to a specific RUN_ID.
To avoid binding the framework to a single technique, explainers are treated as interchangeable providers with a minimal interface. Operationally, each run records an xai_provider_spec.json artefact declaring the following: provider_name, provider_version, scope (global/local), target_model_id (MLflow reference), feature_schema_hash, and configuration (sampling limits, background data strategy, random seed, and output formats). The provider then emits the following: (i) global explanations (e.g., ranked feature attributions), (ii) local explanations for selected instances, and (iii) a compact xai_metrics.json file with governance-orientated metrics (e.g., stability and parsimony), all stored under the run artefact directory.
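The minimal provider interface can be sketched as a Python Protocol. The method names below are illustrative assumptions; the recorded fields mirror the xai_provider_spec.json description above.

```python
from typing import Any, Dict, Protocol

class ExplanationProvider(Protocol):
    """Interchangeable explainer with a minimal, declarative surface."""
    provider_name: str
    provider_version: str

    def explain_global(self, model: Any, X: Any) -> Dict[str, float]:
        """Ranked feature attributions over a dataset."""
        ...

    def explain_local(self, model: Any, x: Any) -> Dict[str, float]:
        """Per-instance attributions for one decision."""
        ...

def provider_spec(p: ExplanationProvider, target_model_id: str,
                  seed: int) -> dict:
    """Assemble the per-run xai_provider_spec.json payload (fields abridged)."""
    return {
        "provider_name": p.provider_name,
        "provider_version": p.provider_version,
        "target_model_id": target_model_id,  # e.g., an MLflow model reference
        "configuration": {"random_seed": seed},
    }
```

Because providers only need to satisfy this structural interface, SHAP, LIME, or a permutation-importance baseline can be swapped without touching the evidence schema.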
In the illustrative scenario, two widely used methods (SHAP and LIME) are retained for continuity, and one additional lightweight, model-agnostic method is included to demonstrate pluggability without expanding the scope.
SHAP: Shapley values are computed via a TreeExplainer on a representative subset of the transformed dataset. The outputs include global summary plots and feature importance statistics, showing how explanation artefacts can be incorporated into audit-ready evidence bundles and revisited during audits.
LIME: Instance-level explanations are generated using a LimeTabularExplainer, focussing on selected cases (e.g., representative decisions, false positives, and false negatives). The purpose is to demonstrate how local explanations are captured, stored, and linked to compliance events to support human oversight and documentation of decision rationales.
Permutation Feature Importance (PFI): As a simple model-agnostic baseline explainer, permutation importance is computed by permuting each feature and measuring the induced change in a chosen loss or score on a frozen validation split. This provides a second global attribution view that is easy to reproduce, inexpensive to compute, and suitable for governance checks when model-specific explainers are unavailable.
To respond to governance needs without turning the manuscript into an XAI benchmarking study, the framework records two minimal metrics per run, stored in xai_metrics.json:
Stability (global attribution stability): Compute the Spearman rank correlation between global feature rankings obtained under two explanation reruns that differ only by a controlled perturbation (e.g., bootstrap resampling of the background/validation subset or a fixed random seed change). The run stores the correlation value (and the top list k used) as evidence that the explanations are not arbitrarily unstable.
Parsimony (attribution sparsity): Measure the proportion of attribution mass explained by the top-k features (or, equivalently, the number of features needed to reach a fixed cumulative mass threshold). This provides a compact indicator of how concentrated the explanation is, supporting human interpretability and audit summarisation.
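Both metrics can be computed in a few lines of NumPy. The sketch below is an assumption about the concrete formulas: it uses the closed-form Spearman coefficient (valid for untied rankings) over attribution magnitudes, and top-k mass concentration for parsimony.

```python
import numpy as np

def rank_stability(attr_a: dict, attr_b: dict) -> float:
    """Spearman rank correlation between two global attribution rankings.

    Ranks features by |attribution| in each rerun; assumes no ties.
    """
    feats = sorted(set(attr_a) & set(attr_b))
    ra = np.argsort(np.argsort([-abs(attr_a[f]) for f in feats]))
    rb = np.argsort(np.argsort([-abs(attr_b[f]) for f in feats]))
    n = len(feats)
    d2 = float(((ra - rb) ** 2).sum())
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))  # Spearman's rho (no ties)

def parsimony_top_k(attr: dict, k: int) -> float:
    """Share of total |attribution| mass carried by the top-k features."""
    mags = sorted((abs(v) for v in attr.values()), reverse=True)
    total = sum(mags)
    return sum(mags[:k]) / total if total else 0.0
```

Both values are scalars, so they can be stored directly in xai_metrics.json alongside the k and perturbation settings used.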
All explanation-related operations—including provider declarations, configuration parameters, sampling strategies, seeds, and file locations—are logged as part of the compliance record, enabling reproducibility, systematic re-analysis, and audit re-execution across different instantiations of the framework.
3.8. Evaluation Criteria and Assessment Model
The evaluation focusses on assessing the framework and its artefacts rather than optimising the underlying illustrative model. The following three complementary dimensions are considered:
- 1.
Model performance (illustrative): Standard metrics such as accuracy, precision, recall, F1-score, and AUC-ROC are computed to confirm that the toy model behaves in a plausible manner. These metrics are reported to contextualise the explanation and compliance artefacts and to provide a basic characterisation of predictive behaviour, not as the primary contribution of the work.
- 2.
Explanation properties: The fidelity, stability, and comprehensibility of the explanation are assessed in a lightweight manner and are recorded as auditable artefacts. Concretely, the run stores (i) the explanation provider specification (xai_provider_spec.json); (ii) global and local explanation outputs; and (iii) a compact xai_metrics.json file containing the following two minimal quantitative indicators: a global-attribution stability score (e.g., Spearman rank correlation of feature rankings across controlled reruns) and a parsimony indicator (e.g., top-k mass concentration). The objective is to verify that the framework can systematically document, revisit, and compare explanations across runs, rather than to exhaustively benchmark XAI techniques.
- 3.
Compliance and governance indicators: Coverage of regulatory obligations (percentage of GDPR, AI Act and ISO/IEC 42001 requirements assigned to technical controls in the technical regulatory correspondence matrix), completeness of evidence bundles, and the ability to reconstruct, from the compliance log, the complete data–model–explanation–decision lineage for a given RUN_ID. These indicators capture the extent to which the pipeline supports continuous, audit-ready governance of high-risk AI systems.
Operationally, the assessment is performed at the level of a single RUN_ID. For each run, the methodology checks (i) artefact completeness (presence of the expected files and plots referenced in the evidence bundle manifest), (ii) lineage reconstructability (the ability to re-link data snapshots, model pipelines, explanations, monitoring outputs, and the final decision dossier through the compliance log), and (iii) regulatory coverage reporting (the proportion of obligations represented in the technical–regulatory correspondence matrix that can be supported by concrete artefacts produced in the run). These checks are intentionally defined as lightweight, auditable procedures so that they can be replicated by third parties without requiring access to proprietary tooling. These criteria are used to determine whether the framework fulfils its design objectives of linking technical artefacts to regulatory requirements in an operational, audit-ready manner. In other words, the success of the study is primarily measured by the degree to which the pipeline can generate coherent and reusable evidence of conformity to GDPR, the AI Act, and the ISO/IEC 42001 obligations, with model performance playing a secondary and contextual role.
3.9. Limitations, Ethical Considerations, and Reproducibility
The methodological choices entail several limitations, which are intentionally aligned with the paper’s scope: demonstrating an audit-ready evidence pipeline rather than proposing or benchmarking a new Intrusion Detection System (IDS) algorithm.
The use of synthetic data deliberately constrains the external validity of the empirical results: the anomaly detection scenario is a simplified example and does not aim to replicate the full complexity, variability, or adversarial characteristics of real-world network environments. Likewise, the selection of a Random Forest classifier is illustrative and serves to exercise the evidence lifecycle (data snapshot, model artefacts, explanations, drift checks, and decision logging). As such, the contribution is not an IDS performance claim; it is the operationalisation of traceability and auditability across RUN_ID-anchored executions.
Although a concrete real-world dataset would strengthen external validity for domain-specific deployment claims, it is not required to validate the core auditability contribution of this paper. The external-validity protocol in
Section 3.6.4 specifies the minimal evidence, linkage, and integrity conditions needed to create the same audit-ready workflow on sector-appropriate data under realistic privacy and access constraints.
In real-world deployments, biased data or unbalanced subgroup coverage can lead to discriminatory outcomes even when explanations are available. The proposed governance approach mitigates this by treating bias checks as evidence artefacts: each run can produce and log a compact bias_report.json (or equivalent) containing subgroup statistics, basic parity indicators, and notes on data representativeness. These artefacts are referenced in the manifest of the evidence package and linked to the elements of the technical–regulatory correspondence matrix, allowing auditors to verify whether bias testing was performed, what the results were, and what remediation (if any) was triggered.
A practical risk of audit-orientated pipelines is the over-collection of data or metadata “because it might be useful later”. To avoid this anti-pattern, the framework assumes an explicit purpose_spec.json and a data_minimisation_record.json per run (or per dataset snapshot) describing the purpose, the minimal feature set retained, and the rationale for each category of evidence. The decision dossier should include a short proportionality note for high-impact outcomes, making the necessity/proportionality reasoning auditable instead of implicit.
Storing evidence for auditability can conflict with confidentiality if sensitive raw inputs are logged. The intended operating mode is therefore evidence-by-reference rather than evidence-by-duplication: the manifest binds to a data snapshot identifier and cryptographic hashes, while access to raw records is governed by role-based controls (e.g., MLflow permissions and organisational access policies). A retention_policy.json (or retention_policy.yaml) can be logged along with each run to make retention decisions explicit and reviewable. When evidence bundles are exported for external review, the framework supports redaction and minimised disclosure (e.g., storing aggregates, pseudonymised identifiers, and hashes), preserving auditability without leaking personal data.
Explainability can be misused to provide a veneer of transparency for systems that remain fundamentally opaque or operate in settings with significant power asymmetries. To mitigate this risk, the methodology emphasises the following: (i) explicit documentation of modelling assumptions and trade-offs in the decision dossier; (ii) systematic recording of all relevant artefacts and gate outcomes; and (iii) meaningful human oversight with auditable override pathways (e.g., override_record.json linked from the dossier) when decisions are contested or high-risk. This preserves the ability to scrutinise not only what the model predicted but also how governance actors interpreted, accepted, or overruled the outcome.
Reproducibility is supported through the complete documentation of the pipeline in Python, versioned dependencies, fixed random seeds, and structured storage of data references, models, explanations, drift artefacts, and compliance logs. Execution identifiers (RUN_ID) and file paths are consistently registered, enabling other researchers, auditors, or regulators to reinstate the framework and repeat the illustrative experiments under comparable conditions, or to adapt the same methodology to different high-risk AI domains, subject to their own data protection impact assessments and sector-specific risk analyses.
4. Proposed Framework: XAI-Compliance-by-Design
This section presents the modular
XAI-Compliance-by-Design framework, designed to natively integrate explainability, accountability, and continuous regulatory compliance into the lifecycle of high-risk AI systems. The framework explicitly addresses the disconnect identified in
Section 2 between technical explanation metrics—such as fidelity, stability, and comprehensibility—and the legal requirements imposed by GDPR [
1], AI Act [
2], and ISO/IEC 42001:2023 [
16].
The framework is structured around the following two tightly coupled and synchronised flows: an
upstream technical flow responsible for managing data, models, explanations, and technical evidence, and a
downstream governance flow that translates legal and policy requirements into enforceable technical controls and continuous monitoring [
3,
9]. At the centre of the architecture, a
Compliance-by-Design Engine (CDE) orchestrates the alignment between these flows, ensuring that technical metrics and artefacts are systematically mapped to regulatory obligations and that audit-ready evidence is produced throughout the system lifecycle.
4.1. Conceptual Overview and Architectural Logic
The conceptual architecture of the
XAI-Compliance-by-Design framework is depicted in
Figure 1. It is grounded in two complementary flows, mediated by the CDE, that jointly ensure technical robustness and regulatory alignment throughout the lifecycle of high-risk AI systems.
The upstream technical flow (technical pipeline) captures the operational lifecycle of the AI system: it starts with data acquisition and preparation, proceeds through model training, evaluation, and explainability, and culminates in the generation of structured technical evidence (e.g., explanation reports, drift metrics, and lineage records). In operational terms, these outputs are registered in the compliance log and packaged into a per-run evidence bundle, making each run traceable and externally inspectable.
The downstream governance flow (governance pipeline) formalises the translation of legal, regulatory, and organisational requirements into enforceable controls. It starts from obligations derived from the GDPR, the AI Act, and ISO/IEC 42001 and instantiates them as concrete policies, monitoring rules, audit procedures, and decision criteria (e.g., thresholds and escalation rules) that constrain the technical pipeline and shape the decision dossier.
The Compliance-by-Design Engine (CDE) synchronises the two flows at runtime. It maintains the technical–regulatory correspondence matrix and uses it to (i) verify that required artefacts are produced and correctly linked (via RUN_ID), (ii) compute compliance and coverage indicators, and (iii) trigger actions when governance expectations are not met (e.g., escalation, retraining, policy updates, or intensified monitoring).
A key feature of the architecture is bidirectional feedback between homologous modules across the two flows. Governance components consume explanation reports, drift metrics, and lineage evidence produced by the technical pipeline, while the governance outcome (alerts, updated rules, or approval/hold decisions) feeds back into the technical pipeline through threshold adjustments, retraining triggers, or interface adaptations. This closes the evidence–oversight loop and supports continuous alignment between technical operations and evolving regulatory expectations.
Overall, the architecture emphasises modularity, separation of concerns, and responsiveness: technical components can evolve (e.g., model families or explainers) without breaking the governance layer, while regulatory or policy changes can be reflected in the CDE and propagated downstream without a complete redesign of the technical stack.
4.2. Compliance-by-Design Engine: Policy-as-Code and Evidence Gates
The conceptual role of the Compliance-by-Design Engine (CDE) becomes operational only when it is instantiated as an explicit set of enforceable checks and actions that run alongside the technical pipeline. In practice, the CDE implements “policy-as-code” by turning governance expectations (derived from the technical–regulatory correspondence matrix) into concrete machine-verifiable evidence gates. These gates determine whether a pipeline execution is eligible for promotion, deployment, or escalation, and they ensure that each decision is accompanied by a minimal, auditable set of artefacts.
Operationally, the CDE consumes the following inputs: (i) the correspondence matrix (including required controls and evidence pointers); (ii) governance thresholds and escalation rules; (iii) a declared evidence schema (required artefacts per stage); and (iv) the run context anchored by a RUN_ID (or an MLflow run_id when MLflow is used as the lifecycle substrate). The CDE produces outputs that are both decision-relevant and audit-facing: (i) a decision outcome (e.g., approve, hold, escalate); (ii) a structured decision dossier that records the minimal justification and links to the evidence bundle; and (iii) compliance indicators (coverage, completeness, and integrity) registered in the compliance log and, when applicable, exposed as MLflow tags and artefacts.
To make the orchestration precise and unambiguous, the CDE can be modelled as a small set of gates that are evaluated at the end of each run (or at pre-deployment checkpoints in a CI/CD setting).
Table 6 summarises a minimal implementable gate set that is sufficient to support auditability under the GDPR and the AI Act while remaining compatible with standard MLOps tooling.
When MLflow is used, the same gate outcomes can be recorded without introducing bespoke infrastructure: gate results are stored as structured artefacts (e.g., cde_report.json) and as tags/metrics associated with the corresponding MLflow run. This supports practical audit queries such as retrieving all runs held due to missing evidence, all runs escalated due to integrity mismatches, or all runs whose regulatory coverage falls below a declared threshold. The key architectural point is that the CDE does not replace MLOps tooling; it adds explicit, audit-orientated decision logic that binds technical artefacts to governance expectations and makes lifecycle decisions reproducible, reviewable, and regulator-facing.
In summary, expressing the CDE as policy-as-code gates turns “continuous compliance” into an operational property: pipeline executions either satisfy the minimal evidence schema and can be promoted, or they are deterministically held or escalated with an auditable rationale captured in the decision dossier and compliance log.
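A minimal policy-as-code sketch of such a gate set is given below. The required-artefact list and the coverage threshold are illustrative assumptions, not the exact contents of Table 6; the point is that gate evaluation is deterministic and leaves a machine-readable trace.

```python
import json
from pathlib import Path

# Illustrative minimal evidence schema (an assumption, not Table 6 verbatim).
REQUIRED_ARTEFACTS = ["schema_mapping.json", "decision_dossier.json",
                      "manifest.json", "compliance_log.jsonl"]

def evaluate_gates(run_dir: str, coverage: float,
                   coverage_threshold: float = 0.8) -> dict:
    """Evaluate minimal evidence gates and emit a cde_report payload."""
    present = {p.name for p in Path(run_dir).iterdir() if p.is_file()}
    missing = [a for a in REQUIRED_ARTEFACTS if a not in present]
    if missing:
        outcome = "hold"        # deterministic hold on missing evidence
    elif coverage < coverage_threshold:
        outcome = "escalate"    # regulatory coverage below declared threshold
    else:
        outcome = "approve"
    report = {"outcome": outcome, "missing_artefacts": missing,
              "coverage": coverage, "coverage_threshold": coverage_threshold}
    # The report itself becomes run-scoped evidence.
    (Path(run_dir) / "cde_report.json").write_text(json.dumps(report, indent=2))
    return report
```

In an MLflow setting, the same outcome would additionally be attached to the run as tags/metrics so that held or escalated runs can be queried later.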
4.3. Functional Layers
The framework is decomposed into functional layers that span both technical and governance flows, ensuring native compatibility between explainability, operational management, and regulatory oversight [
18,
19,
20,
21]. In line with
Figure 1, the upstream flow emphasises data processing, modelling, and explanation, while the downstream flow focusses on verification, continuous auditing, and human oversight [
22,
23]. Concretely, each layer is responsible for producing or consuming well-defined artefacts (e.g., compliance log entries, evidence bundle elements, and decision dossier inputs).
The main functional layers and their relationships with regulatory requirements and technical metrics are as follows:
Data Layer: Manages data acquisition, preprocessing, and data quality checks, ensuring integrity, provenance, and traceability. Typical outputs include snapshots of the dataset, schemas, and summary statistics registered in the compliance log and referenced in the evidence package; this supports Articles 5 and 25 of GDPR and provides input for impact assessments [
24,
25].
Model Layer: Covers model development, validation, and lifecycle monitoring with an emphasis on versioning, documentation, and change control. It produces serialised model artefacts, evaluation metrics, and lineage metadata (including hashes) so that any deployed behaviour can be traced back to a specific run and configuration, supporting accountability under the AI Act and GDPR [
1,
2].
Explanation Layer: Implements explanation techniques and quality properties (e.g., SHAP/LIME outputs, fidelity, and stability checks) to generate human-understandable justifications and structured evidence for transparency duties. Explanation artefacts are stored and referenced as first-class elements of the evidence package and made available to audit and oversight functions [
26,
27,
28,
29].
Audit Layer: Coordinates the continuous evaluation of compliance, integrity, and monitoring obligations. It maintains audit trails, supports log management, and enables traceable responses to access/contestation requests (e.g., GDPR Article 22) and post-market monitoring under the AI Act [
22,
30,
31]. Operationally, this layer consumes evidence bundle elements and compliance log events to produce audit-ready reports and indicators.
Interface Layer: Provides mechanisms for visualising explanations and supporting meaningful human oversight, including review, feedback, and override capabilities. It operationalises human-in-the-loop processes and ensures that governance outcomes can be captured as decision dossiers and routed into subsequent technical actions [
18,
24].
At the core, the
Compliance-by-Design Engine (CDE) aggregates XAI metrics and monitoring outputs, gathers governance feedback, updates compliance indicators, and triggers actions (e.g., retraining, policy updates, enhanced monitoring, or escalation) when thresholds are breached [
9,
15,
30,
31,
32]. This orchestration ensures that compliance is not a one-off activity but a continuous process embedded in the operational fabric of the AI system.
To ground these design choices,
Table 7 summarises the architectural principles incorporated into the framework, linking them to seminal references and their concrete manifestations in the proposed design.
These principles reinforce that the framework is not only domain-relevant in the context of European AI regulation but also grounded in established software and systems architecture practices, facilitating adoption in complex, multi-stakeholder environments.
4.4. Technical and Regulatory Correspondence Matrix
A central artefact of the framework is the
technical–regulatory correspondence matrix, which formalises how concrete technical controls, checks, and artefacts relate to explicit legal and normative expectations. To avoid an overly abstract mapping, the matrix decomposes high-level obligations into
sub-obligations that are (i) operationally checkable and (ii) supported by run-scoped evidence pointers. The matrix is maintained by the Compliance-by-Design Engine (CDE), versioned as a machine-readable artefact (e.g.,
correspondence_matrix.json) and bound to each pipeline execution through a
matrix_version tag (
Section 3.2).
Table 8 summarises an illustrative, more granular slice of the correspondences used in this work.
In practice, the correspondence matrix serves three functions: (i) it informs framework design and configuration by clarifying which artefacts must be produced for a given regulatory context; (ii) it guides the implementation of compliance logging and evidence bundles; and (iii) it supports quantitative, audit-facing indicators (e.g., coverage ratios) used in governance dashboards and regulatory reporting.
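For concreteness, the following sketch shows how a machine-readable matrix slice and a coverage-ratio indicator might look; the schema, obligation identifiers, and field names are illustrative assumptions rather than the framework's exact format:

```python
# Hypothetical minimal slice of correspondence_matrix.json; the obligation
# rows and field names are illustrative, not the paper's published schema.
MATRIX = {
    "matrix_version": "v1.2.0",
    "rows": [
        {"obligation": "GDPR Art. 22 (contestation support)",
         "sub_obligation": "local explanation available per decision",
         "control": "LIME per-instance reports",
         "evidence_pointer": "evidence_bundles/{run_id}/lime/"},
        {"obligation": "AI Act post-market monitoring",
         "sub_obligation": "drift indicators computed per interval",
         "control": "KS/KL drift report",
         "evidence_pointer": "evidence_bundles/{run_id}/drift_report.json"},
    ],
}

def coverage_ratio(matrix: dict, produced_pointers: set) -> float:
    """Fraction of matrix rows whose declared evidence pointer was produced."""
    rows = matrix["rows"]
    covered = sum(1 for r in rows if r["evidence_pointer"] in produced_pointers)
    return covered / len(rows)
```

A governance dashboard could then report, per run, which obligations in scope lack evidence pointers.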
Matrix Versioning and Update Protocol (Governance, Triggers, and Auditability)
Because regulatory guidance, case law, organisational policies, and technical standards evolve, the correspondence matrix must be treated as a governed artefact rather than a static table. The framework therefore adopts a minimal update protocol that ensures (i) controlled evolution, (ii) retrocompatible audit reconstruction, and (iii) auditable change history.
Each matrix version has an owner (e.g., compliance lead), at least one reviewer (e.g., technical lead/risk owner) and an approver authorised to publish a new version. The published version is immutable and is referenced by matrix_version in MLflow.
Updates are initiated when (i) the regulatory corpus changes (new guidance, delegated acts, standards updates), (ii) the system scope changes (new model family, modality, or deployment context), (iii) incidents or audit findings identify evidence gaps, or (iv) organisational policies revise risk thresholds or oversight requirements.
The matrix follows semantic versioning vMAJOR.MINOR.PATCH. MAJOR increases when the obligation set or semantics change in a way that affects audit interpretation; MINOR increases when new sub-obligations, controls, or evidence pointers are added without breaking prior interpretation; and PATCH captures editorial fixes (typos, clarifications) that do not affect the mapping logic.
Each pipeline run stores the matrix_version tag and logs the exact machine-readable matrix file as a run artefact (or stores its hash + immutable URI). Auditors can therefore retrieve the precise mapping used at decision time, even if newer versions exist. When a MAJOR change occurs, a migration_notes.md artefact records how rows were split/merged or reinterpreted, preserving longitudinal comparability.
Each published version of the matrix is accompanied by a change log (e.g., matrix_changelog.md) and a cryptographic hash recorded in the compliance log. The CDE can enforce that only approved matrix versions (whitelisted) are accepted for deployment-eligible runs, thereby preventing ungoverned or ad hoc mappings.
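A minimal sketch of the hash-based whitelist check described above, assuming SHA-256 digests of published matrix files are recorded in the compliance log (function names are illustrative):

```python
import hashlib

def matrix_digest(matrix_bytes: bytes) -> str:
    """SHA-256 digest recorded when a matrix version is published."""
    return hashlib.sha256(matrix_bytes).hexdigest()

def matrix_is_approved(matrix_bytes: bytes, approved_digests: set) -> bool:
    """CDE-side check: only whitelisted matrix versions are deployment-eligible."""
    return matrix_digest(matrix_bytes) in approved_digests
```

Because the published file is immutable, the digest pins the exact mapping in force for each run.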
The update workflow is as follows: (1) submit a change request with rationale and impacted rows; (2) review for regulatory adequacy and technical feasibility; (3) approve and publish a new matrix version; (4) update CDE gate rules that depend on the new rows/thresholds; (5) run a short regression check to confirm that evidence pointers are generated; and (6) record the update as auditable artefacts under the matrix version.
4.5. Integration Within MLOps Pipelines
The framework is designed to be integrated into CI/CD-oriented MLOps pipelines, which automate the building, testing, and deployment of machine learning models while maintaining observability and control. This integration ensures that changes in models, data, or configurations are systematically validated and that both technical and regulatory requirements are enforced prior to deployment.
Concretely, integration proceeds through the following mechanisms:
Policy-aware CI/CD stages: CI pipelines are extended with policy linting, unit tests for explainability components, and compliance gates. For example, a build may be blocked if explanation reports are missing, if drift metrics exceed configured thresholds, or if mandatory documentation artefacts (e.g., model cards and data schemas) are absent [
39].
Evidence-aware training stages: During training, the pipeline requires that data snapshots, lineage metadata, and explanation outputs be stored in structured locations and referenced in the compliance log. This enforces the generation of audit-ready artefacts as a condition for promoting models to later stages.
Governance-aware deployment stages: Deployment pipelines enforce policies that prevent the promotion of models lacking human override mechanisms, decision provenance logging, or post-deployment monitoring hooks. Candidate releases are evaluated against the correspondence matrix and CDE indicators prior to approval [
32,
40].
Post-deployment observability and revalidation: In production, telemetry collectors feed drift detectors, explainers, and governance dashboards. Periodic revalidation routines reassess model performance and explanation properties; where appropriate, audit metadata can be anchored on immutable ledgers (e.g., blockchain-based records) to reinforce integrity and non-repudiation [
41].
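The policy-aware CI/CD gate above can be sketched as a pure check over the artefacts present for a candidate build; the required-document set and return structure are illustrative assumptions, not the framework's definitive policy:

```python
# Illustrative mandatory documentation set for a build (assumed names).
REQUIRED_DOCS = {"model_card.md", "data_schema.json", "explanation_report.html"}

def ci_compliance_gate(present_artefacts: set,
                       drift_score: float,
                       drift_threshold: float,
                       required: set = REQUIRED_DOCS) -> dict:
    """Block a build when mandatory documentation is missing or drift is too high."""
    missing = sorted(required - present_artefacts)
    if missing:
        return {"status": "blocked", "reason": f"missing artefacts: {missing}"}
    if drift_score > drift_threshold:
        return {"status": "blocked", "reason": "drift metric above configured threshold"}
    return {"status": "pass", "reason": None}
```

In a real pipeline, this check would run as a CI stage whose non-zero exit status prevents promotion.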
By embedding the XAI-Compliance-by-Design framework into MLOps pipelines, the design–evidence–governance loop is effectively closed: design decisions generate artefacts, artefacts feed governance assessments, and governance outcomes, in turn, constrain and inform subsequent design and deployment decisions. This cyclical integration supports continuous compliance and retrospective audits while remaining agnostic to the specific domain or model family used in any given instantiation.
5. Implementation and Experimental Validation
This section reports the operational instantiation of the
XAI-Compliance-by-Design framework and its experimental validation in the synthetic anomaly detection scenario introduced in
Section 3. The focus is not on optimising predictive performance for intrusion detection but rather on demonstrating that the framework can be implemented end-to-end with mainstream tools and that it produces verifiable audit-ready artefacts that support compliance with the GDPR, the AI Act, and ISO/IEC 42001.
The implementation materialises the dual-flow architecture described in
Section 4 through a Python-based pipeline organised into clearly delineated stages: environment configuration and compliance logging; data handling and preprocessing; model training and evaluation; explainability and drift monitoring; and, finally, evidence bundle and decision dossier construction.
5.1. Implementation Overview
The framework is instantiated using a lightweight, reproducible toolchain built on widely adopted open-source components. The core implementation relies on pandas for data handling, scikit-learn for preprocessing and model training, SHAP and LIME for global and local explainability, and standard Python libraries for hashing and file management.
To align traceability and model lineage with mainstream MLOps primitives, each pipeline execution is tracked as an MLflow run (Section 3.2). The MLflow run_id is used as the primary execution identifier and is treated as the framework RUN_ID. Run-scoped metadata (parameters, thresholds, and configuration switches) are recorded as MLflow parameters and tags, while technical outputs (metrics, reports, and serialised artefacts) are persisted as MLflow artefacts. This design supports retrospective audit reconstruction by allowing all evidence to be queried and retrieved by RUN_ID.
For portability, the implementation also maintains a local directory layout mirroring the evidence schema; the authoritative audit surface, however, remains the MLflow artefact store:
data_lake/: frozen datasets and transformed feature matrices (logged as run artefacts).
models/: serialised pipelines encapsulating both preprocessing and classifier (logged and optionally registered).
evidence_bundles/: explanation plots, drift reports, compliance logs, and JSON manifests aggregating technical evidence (logged as a single artefact directory).
decision_dossiers/: machine-readable deployment decisions and justifications (logged as run artefacts).
A structured compliance log is maintained as a JSONL file (compliance_log.jsonl) and logged as a run artefact under the evidence bundle. Each record includes a timestamp, RUN_ID, stage identifier, event type, a short description, optional regulatory references, and an extensible payload with structured metadata. To support tamper-evident lineage, a manifest file (manifest.json) indexes the evidence bundle and stores SHA-256 digests for key binary artefacts (e.g., serialised models and reports). These mechanisms jointly operationalise the Data, Model, and Audit Layers of the framework by enforcing consistent execution contexts, explicit linkage between artefacts, and verifiable integrity.
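The compliance-log record and the tamper-evident digest can be sketched as follows; the record fields mirror those listed above, while the exact payload schema is implementation-specific:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_event(run_id, stage, event_type, description,
                regulatory_refs=None, payload=None) -> dict:
    """One JSONL record for compliance_log.jsonl (field names per Section 5.1)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "stage": stage,
        "event_type": event_type,
        "description": description,
        "regulatory_refs": regulatory_refs or [],
        "payload": payload or {},
    }

def append_event(log_path, event: dict) -> None:
    """Append a record to the JSONL compliance log."""
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")

def sha256_digest(data: bytes) -> str:
    """Digest stored in manifest.json for key binary artefacts."""
    return hashlib.sha256(data).hexdigest()
```

An auditor can later recompute the digest of a retrieved artefact and compare it against the manifest entry.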
5.2. CDE Gate Enforcement and Audit Artefact Emission
The Compliance-by-Design Engine (CDE) described in
Section 4.2 is implemented as a set of
evidence gates executed at well-defined checkpoints (pre-deployment and post-run). Concretely, the pipeline evaluates minimal evidence requirements (completeness, lineage linkage, integrity, drift thresholds, and explanation availability) and produces a structured decision result. The outcome is recorded in a
decision dossier and is linked to the entire evidence package through the
RUN_ID/MLflow
run_id.
Operationally, gate execution produces a compact machine-readable report (cde_report.json) that lists each gate, its pass/fail status, the justification fields used for evaluation, and references (paths or artefact URIs) to supporting evidence. The CDE decision state is also reflected as MLflow tags (e.g., cde_status=approved|hold|escalate, matrix_version=…), enabling programme-level dashboards and audit queries.
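The gate-evaluation step that produces a cde_report.json-style structure can be sketched as follows; the gate identifiers, predicates, and justification fields are illustrative, not the framework's definitive rule set:

```python
def evaluate_gates(gates, context: dict) -> dict:
    """Run each gate predicate over run context and assemble a compact report.

    gates: list of (gate_id, predicate, justification_fields); names illustrative.
    """
    results = []
    for gate_id, predicate, fields in gates:
        passed = bool(predicate(context))
        results.append({
            "gate": gate_id,
            "status": "pass" if passed else "fail",
            "justification": {f: context.get(f) for f in fields},
        })
    status = "approved" if all(r["status"] == "pass" for r in results) else "hold"
    return {"run_id": context.get("run_id"), "cde_status": status, "gates": results}

# Two illustrative gates: evidence completeness and a drift threshold check.
GATES = [
    ("G1_evidence_complete", lambda c: c["evidence_complete"], ["evidence_complete"]),
    ("G4_drift_within_threshold",
     lambda c: c["drift_score"] <= c["drift_threshold"],
     ["drift_score", "drift_threshold"]),
]
```

The resulting cde_status value is what would be mirrored as an MLflow tag for dashboarding.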
Table 9 summarises the minimal audit artefacts emitted per run and how they persist via MLflow.
5.3. Experimental Design and Reproducibility Controls
To maximise reproducibility and minimise implicit assumptions, the proof of concept is implemented as a configuration-driven pipeline in which all run-critical parameters are captured as first-class metadata. Concretely, each pipeline execution produces (i) an environment snapshot, (ii) a run configuration record, (iii) deterministic artefact paths keyed by RUN_ID, and (iv) integrity metadata (hashes) for binary artefacts. These controls are intended to ensure that an external reviewer can reconstruct what was run, with which inputs and parameters, and which outputs were produced, without relying on tacit knowledge.
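These controls can be sketched as follows; the directory names mirror the layout in Section 5.1, while the snapshot fields are illustrative:

```python
import platform
import sys
from pathlib import Path

def environment_snapshot() -> dict:
    """Minimal environment record logged per run (fields illustrative)."""
    return {"python": sys.version.split()[0], "platform": platform.platform()}

def artefact_paths(run_id: str) -> dict:
    """Deterministic, RUN_ID-keyed artefact locations mirroring the evidence schema."""
    bundle = Path("evidence_bundles") / run_id
    return {
        "bundle": bundle,
        "manifest": bundle / "manifest.json",
        "compliance_log": bundle / "compliance_log.jsonl",
        "decision_dossier": Path("decision_dossiers") / run_id / "decision_dossier.json",
    }
```

Because paths are a pure function of RUN_ID, an external reviewer can locate every artefact without tacit knowledge of the run.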
The experimental design is intentionally simple and is organised to support traceability rather than performance optimisation as follows:
Data generation and freezing. A synthetic IDS-like dataset is generated (or loaded) and frozen as a run-scoped snapshot. The generator configuration (e.g., sample size, class proportions, feature schema, and rule parameters) is stored alongside the snapshot of the dataset.
Train–test protocol. A stratified train–test split is applied (80/20), and the random seed used for splitting is recorded. This ensures that the evaluation set is stable across re-executions.
Model configuration. The model family (Random Forest) and all hyperparameters supplied at runtime are persisted (e.g., number of trees, depth controls, class handling, and feature sampling). The complete end-to-end scikit-learn Pipeline (preprocessing + classifier) is serialised as a single artefact.
Explainability configuration. Explainer type and settings (SHAP and LIME), including background/conditioning data choice, sampling limits, and output formats, are stored per run.
Drift monitoring configuration. The set of drift indicators, statistical tests, and thresholds are recorded (e.g., distributional distance measures for categorical features and non-parametric tests for numerical features).
Table 10 summarises the minimal metadata recorded per
RUN_ID to support third-party replication and audit-style reconstruction.
5.4. Framework Instantiation in a Synthetic IDS-like Scenario
To illustrate the operational behaviour of the framework, an IDS-inspired synthetic anomaly detection scenario is used as an illustrative example. The case study has a purely demonstrative purpose, outlined as follows: it shows how the proposed architecture can be instantiated end-to-end and how the corresponding artefacts are generated and logged. No claims are made about operational performance in real-world intrusion detection.
The synthetic dataset comprises 10,000 instances of network-like traffic with a mixture of numerical and categorical features as follows:
Numerical attributes represent connection duration, byte volumes, local counts of recent connections, and rate-based indicators (e.g., error and service ratios).
Categorical attributes model protocol type, service, and connection flag, with basic consistency constraints (e.g., http, ftp, and ssh mapped to tcp, dns mostly mapped to udp).
A binary target label distinguishes between normal and attack traffic, with an intentionally imbalanced class distribution of approximately 80% normal and 20% attack.
The dataset is frozen in the data_lake/ directory with a RUN_ID-specific file name and is registered in the compliance log, including shape, feature types, and class proportions.
The preprocessing follows the framework described in
Section 4.3. The features (
X) and the target (
y) are separated; categorical attributes (
protocol_type,
service,
flag) are encoded in a one-hot way via a
ColumnTransformer, while the numerical attributes are passed through. A
RandomForestClassifier is then encapsulated in a
scikit-learn Pipeline together with the preprocessor. An 80/20 stratified split is used for training and testing, preserving the class imbalance.
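The preprocessing and training stage described above can be sketched with scikit-learn as follows; the synthetic frame and toy label rule below are simplified stand-ins for the paper's generator, not its actual configuration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Simplified synthetic IDS-like frame (assumed feature names).
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "duration": rng.exponential(1.0, n),
    "src_bytes": rng.integers(0, 10_000, n),
    "protocol_type": rng.choice(["tcp", "udp", "icmp"], n),
    "service": rng.choice(["http", "ftp", "dns"], n),
    "flag": rng.choice(["SF", "S0", "REJ"], n),
})
y = (df["duration"] > 1.0).astype(int)  # toy label rule, for illustration only

# One-hot encode categorical attributes; pass numerical attributes through.
categorical = ["protocol_type", "service", "flag"]
preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

# End-to-end pipeline serialised as a single artefact in the framework.
model = Pipeline([
    ("preprocess", preprocessor),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# 80/20 stratified split with a recorded seed, preserving class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42)
model.fit(X_train, y_train)
```

Serialising the full Pipeline (rather than the classifier alone) keeps preprocessing and model lineage inseparable for audit purposes.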
Standard predictive metrics are computed on the held-out test set (e.g., accuracy, precision, recall, and F1-score for the attack class) and stored as run-scoped artefacts to support traceability and policy checks. In the context of this article, these metrics are not used to evaluate or advance intrusion detection; they function as reproducible outputs that can be referenced by RUN_ID and retrieved via MLflow to justify lifecycle decisions (e.g., promotion, hold, or escalation) under the proposed governance flow. Because the dataset is synthetic and deliberately separable, metric values should be interpreted only as contextual artefacts to exercise the compliance-by-design machinery.
The illustrative run-scoped metrics recorded for traceability are summarised in
Table 11.
5.4.1. Auditability-Focused Comparative Evaluation: Baseline vs. Framework
To provide empirical validation without changing the scope of this work, the evaluation below focuses on auditability outcomes rather than intrusion-detection effectiveness. Concretely, we compare (i) a baseline MLOps trace that logs only conventional run metadata (parameters, metrics, and a serialised model artefact) against (ii) the XAI-Compliance-by-Design instantiation that augments the same run with Compliance-by-Design Engine (CDE) gates and explicit audit artefacts (evidence bundle, manifest, compliance log, and decision dossier) persisted under a single RUN_ID in MLflow. Consequently, we intentionally exclude IDS-centric diagnostic plots and benchmarks (e.g., confusion matrices, ROC curves, or detector-to-detector comparisons) as primary evidence because they are orthogonal to the audit reconstruction and compliance artefact claims evaluated here.
We operationalise auditability using measurable, run-scoped indicators that can be computed directly from artefact presence, linkage, and integrity properties as follows:
Evidence completeness (EC): Whether the minimal set of required artefact types is present per RUN_ID (data snapshot reference, model lineage, metrics report, explanation artefacts, drift report, manifest.json, compliance_log.jsonl, cde_report.json, and decision_dossier.json).
Lineage linkage completeness (LC): Whether all artefacts are (a) explicitly keyed by RUN_ID and (b) cross-referenced through a single index (the evidence manifest) and a single justification record (the decision dossier).
Integrity coverage (IC): Whether tamper-evident digests exist for key binary artefacts and can be validated from the manifest.
Regulatory evidence coverage (RC): Whether the run exposes an explicit mapping to the technical–regulatory correspondence matrix (e.g., matrix_version tag) and provides evidence pointers for the obligations in scope.
Audit reconstruction effort (ARE): A practical proxy for time to audit, expressed as the number of deterministic retrieval steps needed to assemble an “audit package” for a given RUN_ID. In the framework case, this is dominated by a single MLflow query plus a manifest-guided fetch, whereas baseline reconstruction typically requires manual aggregation.
Gate enforcement (GE): Whether policy checks are executed and recorded as a lifecycle decision state (approve/hold/escalate) with justification fields.
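The artefact-presence indicators (EC and IC) can be computed directly from manifest contents; the artefact-type identifiers below are illustrative labels for the minimal set listed under EC:

```python
# Illustrative identifiers for the minimal artefact types required per RUN_ID.
REQUIRED_TYPES = {
    "data_snapshot", "model_lineage", "metrics_report", "explanations",
    "drift_report", "manifest", "compliance_log", "cde_report", "decision_dossier",
}

def evidence_completeness(present_types: set) -> float:
    """EC: fraction of required artefact types present for a run."""
    return len(REQUIRED_TYPES & present_types) / len(REQUIRED_TYPES)

def integrity_coverage(manifest_entries: list) -> float:
    """IC: fraction of key binary artefacts with a recorded SHA-256 digest."""
    if not manifest_entries:
        return 0.0
    with_digest = sum(1 for e in manifest_entries if e.get("sha256"))
    return with_digest / len(manifest_entries)
```

LC, RC, ARE, and GE are computed analogously from cross-references, matrix tags, retrieval steps, and the recorded decision state.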
Table 12 summarises the resulting comparison. The baseline trace can support basic reproducibility (“what parameters and metrics were used?”) but fails to provide governance-grade audit reconstruction (“why was the model promoted, under which policy and with which evidence?”). In contrast, the proposed framework makes the audit surface explicit and queryable by
RUN_ID, allowing deterministic reconstruction of both technical provenance and compliance-relevant decision rationale.
5.4.2. Audit-Facing Outputs and Audit Queries
In this proof of concept, the primary outputs of interest are audit-facing artefacts rather than IDS-specific performance visualisations. Each run produces a decision_dossier.json, a compliance_log.jsonl, a manifest.json (hash-indexed evidence bundle), and a compact cde_report.json that captures gate outcomes. When MLflow is used as the lifecycle substrate, these files are logged as run artefacts and can be retrieved deterministically via the MLflow run_id/RUN_ID.
For illustration, the following excerpts show the minimal structure of the run-scoped artefacts (fields abbreviated for readability):
Representative excerpts are provided for the decision dossier (Listing 3), the compliance log (Listing 4), and the evidence manifest (Listing 5).
These artefacts enable direct audit questions to be answered without ambiguity. Examples include the following: (i) Which exact model and data snapshot supported this decision? (ii) Which gates failed and why was the decision held or escalated? and (iii) Which regulatory obligations were claimed as covered for this run and what concrete evidence pointers support them? In an MLflow-backed deployment, such questions are operationalised through run-level retrieval (by run_id) and governance-level filtering (e.g., by cde_status tags), allowing auditors and oversight teams to reconstruct decisions from evidence rather than from informal narratives.
Listing 3. Decision dossier excerpt (JSON).
Listing 4. Compliance log excerpt (JSONL; one event per line).
Listing 5. Evidence manifest excerpt (JSON).
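The governance-level filtering described above can be sketched as a helper that builds an MLflow search filter string from the framework's tags; the helper is illustrative, and the retrieval calls in the comments assume a configured MLflow tracking backend:

```python
def audit_filter(cde_status: str = None, matrix_version: str = None) -> str:
    """Build an MLflow search_runs filter string over the framework's run tags."""
    clauses = []
    if cde_status:
        clauses.append(f"tags.cde_status = '{cde_status}'")
    if matrix_version:
        clauses.append(f"tags.matrix_version = '{matrix_version}'")
    return " and ".join(clauses)

# With an MLflow client available, audit retrieval would look roughly like:
#   runs = mlflow.search_runs(filter_string=audit_filter(cde_status="hold"))
#   mlflow.artifacts.download_artifacts(run_id=some_run_id,
#                                       artifact_path="evidence_bundles")
```

This keeps the audit query deterministic: one filtered search, followed by a manifest-guided artefact fetch.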
The entire training pipeline, the transformed feature matrix, and the training labels are serialised under the RUN_ID; a SHA-256 hash of the model file is computed and stored in the compliance log. This ensures that any future audit can reconstruct exactly which data and configuration led to a given model version and performance profile.
In addition, a Gini-based feature-importance plot is generated from the Random Forest model to provide a baseline view of the relative impact of encoded features on the classifier. The top-ranked attributes (such as
serror_rate,
rerror_rate, and
cat:flag_SF) dominate the importance profile, reflecting the synthetic rules used to generate the dataset. Although this graph is not used directly for regulatory mapping, it serves as a useful reference to compare traditional importance measures with SHAP-based attributions, as discussed in
Section 5.5.
5.5. Explainability, Drift Monitoring, and Evidence Generation
Beyond predictive metrics, the proof-of-concept instantiation generates explainability and monitoring artefacts as first-class outputs and binds them to governance-ready records. Concretely, the pipeline produces both global and local explanation artefacts as follows:
Global explainability (SHAP). Aggregated feature-attribution summaries are produced to support model-level transparency and to allow comparisons against traditional importance measures (
Figure 2). These artefacts are stored in the evidence bundle and are referenced in the manifest so that an auditor can verify which explainer outputs correspond to which
RUN_ID and model hash.
Local explainability (LIME). Per-instance explanation records are generated for selected test cases to support human oversight in decision contexts (e.g., why a specific flow is labelled as attack). Local explanations are stored as run-scoped artefacts and are linked to the decision dossier whenever an escalation or approval policy requires instance-level justification.
For monitoring, the implementation includes lightweight drift indicators intended to demonstrate the operational link between distribution shifts, risk controls, and evidence production. Numerical features are monitored using non-parametric distribution tests (e.g., Kolmogorov–Smirnov), while categorical features are monitored through distributional distance measures (e.g., KL divergence over category frequencies). The set of monitored attributes, test parameters, and alert thresholds is recorded per run (
Section 5.3), and the resulting drift report is written into the evidence package.
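The two indicator families can be sketched as follows, using SciPy's two-sample Kolmogorov–Smirnov test for numerical features and a frequency-based KL divergence for categorical ones (the epsilon smoothing is an illustrative choice):

```python
import math
from scipy.stats import ks_2samp

def categorical_drift(ref_counts: dict, live_counts: dict, eps: float = 1e-9) -> float:
    """KL divergence over category frequencies (reference vs. live window)."""
    cats = set(ref_counts) | set(live_counts)
    ref_total = sum(ref_counts.values())
    live_total = sum(live_counts.values())
    kl = 0.0
    for c in cats:
        p = ref_counts.get(c, 0) / ref_total + eps   # reference frequency
        q = live_counts.get(c, 0) / live_total + eps  # live frequency
        kl += p * math.log(p / q)
    return kl

def numeric_drift(reference, live, alpha: float = 0.05) -> dict:
    """Two-sample Kolmogorov-Smirnov test for one numerical feature."""
    stat, pvalue = ks_2samp(reference, live)
    return {"ks_stat": float(stat), "p_value": float(pvalue),
            "drift": bool(pvalue < alpha)}
```

Per-feature results of this form would then be aggregated into the drift report written to the evidence package.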
To make monitoring evidence operational rather than merely descriptive, drift thresholds must be calibrated and tied to explicit actions that are auditable. In this framework, thresholds are treated as governance parameters that are derived from a declared reference window (e.g., the training distribution or a stable historical period) and an agreed alert budget (e.g., a target false-alarm rate per monitoring interval). Concretely, the organisation can estimate a baseline distribution of drift scores (per indicator) on reference samples or rolling windows and set thresholds by quantile (e.g., 95th) or budget-constrained calibration. The calibration itself is logged as a versioned artefact (e.g., drift_calibration.json) that records the reference window, the calibration procedure, the resulting thresholds, the date, and the governance owner/approver; the active threshold set is then recorded per RUN_ID as MLflow parameters/tags.
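A minimal sketch of quantile-based threshold calibration and the corresponding drift_calibration.json record follows; the nearest-rank quantile and the record fields are illustrative choices:

```python
from datetime import date

def calibrate_threshold(baseline_scores, quantile: float = 0.95) -> float:
    """Nearest-rank quantile over drift scores from the declared reference window."""
    xs = sorted(baseline_scores)
    idx = min(len(xs) - 1, int(round(quantile * len(xs))))
    return xs[idx]

def calibration_record(indicator, threshold, method, reference_window, owner) -> dict:
    """One entry of the versioned drift_calibration.json artefact (fields illustrative)."""
    return {
        "indicator": indicator,
        "threshold": threshold,
        "method": method,
        "reference_window": reference_window,
        "calibrated_on": date.today().isoformat(),
        "owner": owner,
    }
```

The active threshold set produced this way would then be attached to each run as MLflow parameters/tags.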
When new monitoring results are produced, the Compliance-by-Design Engine (CDE) evaluates them through Gate G4 (
Table 6) and applies a simple auditable playbook as follows: (i)
continue when indicators remain below the threshold; (ii)
intensify monitoring and require review when indicators approach the threshold (pre-alert); and (iii)
hold, rollback, or retrain when thresholds are exceeded, depending on the declared risk posture and operational constraints. The selected action and its justification are recorded in the decision dossier and in
cde_report.json, together with the drift metrics, the active thresholds, and any human oversight/approval, so that auditors can reconstruct not only that drift was detected but also
how the organisation responded and whether the response followed declared policy.
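The G4 playbook can be sketched as a simple mapping from drift score to action; the pre-alert margin is an illustrative governance parameter, not a prescribed value:

```python
def drift_playbook(score: float, threshold: float,
                   pre_alert_margin: float = 0.8) -> str:
    """Map a drift score to the G4 playbook action (margin is illustrative)."""
    if score >= threshold:
        return "hold_or_retrain"          # threshold exceeded
    if score >= pre_alert_margin * threshold:
        return "intensify_monitoring"     # pre-alert band
    return "continue"                     # below threshold
```

The chosen action, together with the score and active threshold, would be recorded in the decision dossier and cde_report.json.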
Finally, the pipeline consolidates all per-run artefacts into (i) an evidence bundle indexed by a manifest, and (ii) a decision dossier that captures the deployment-relevant outcome together with minimal justification and pointers to the underlying evidence. This ensures that each result reported in this section is not only described in narrative form but is also available as a concrete, inspectable artefact linked to a specific RUN_ID, allowing both replication and audit-style review.
6. Discussion
The results reported in
Section 5 show that the
XAI-Compliance-by-Design framework can be instantiated as an end-to-end pipeline that systematically produces technical, explainability, monitoring, and governance artefacts, whose primary purpose is to support regulatory compliance and auditability in high-risk AI systems. The empirical setting is intentionally synthetic and simplified; therefore, it is used to exercise the compliance flow, the evidence lifecycle, and the governance interfaces, rather than to claim operational intrusion-detection effectiveness.
6.1. Scope of Claims: Demonstrated, Plausible, and Future
To avoid over-claiming, the discussion separates three types of statements.
The article demonstrates (i) an executable instantiation of the architecture; (ii) systematic generation of audit-facing artefacts (e.g., compliance logs, explanation reports, drift monitoring outputs, evidence bundles, and decision dossiers) indexed by a unique
RUN_ID; and (iii) an explicit mapping between these artefacts/metrics and regulatory objectives through the technical–regulatory correspondence matrix (
Table 8).
It is plausible that, under realistic organisational constraints and with domain data, the same evidence-centric design can reduce compliance friction, improve the traceability of deployment decisions, and support meaningful human oversight by making governance criteria explicit and testable. However, these effects depend on the quality of the organisational controls, the realism of the governance thresholds, and the participation of the relevant stakeholders.
The strongest remaining needs are empirical validation in real high-risk settings, stakeholder-centred evaluation (e.g., auditors and data protection officers), and extensions to heterogeneous model families and modalities beyond tabular classifiers.
6.2. Alignment with the Research Objectives
The first research question (RQ1) asked how Explainable Artificial Intelligence (XAI) techniques, compliance-by-design principles, and trustworthy MLOps practices can be integrated into a single modular framework for high-risk AI systems. The architecture proposed in
Section 4 and its realisation in
Section 5 show that such integration is feasible through a dual-flow design mediated by the Compliance-by-Design Engine (CDE). The upstream technical flow (data, model, explanation, and monitoring) and the downstream governance flow (policy, oversight, audit, and decision-making) remain logically distinct, yet they are synchronised via shared artefacts and compliance logging.
The second research question (RQ2) concerned the extent to which such a framework can produce concrete verifiable artefacts that support GDPR- and AI Act-aligned auditability and accountability. The implementation shows that every major stage of the lifecycle leaves a structured trace: datasets and schemas in the Data Layer; serialised pipelines and hashes in the Model Layer; SHAP and LIME outputs in the Explanation Layer; drift reports and lifecycle events in the Audit Layer; and evidence bundles and decision dossiers in the Interface and Governance dimensions.
Table 8 formalises how these artefacts are mapped to regulatory objectives and legal bases, while
Table 9 illustrates how they are aggregated per
RUN_ID.
The third research question (RQ3) focused on the behaviour of the framework when instantiated in an anomaly detection scenario using a synthetic, security-relevant tabular dataset. In the reference run reported in
Section 5, the pipeline yields stable, reproducible predictive outputs (e.g., summary metrics stored as run artefacts;
Table 11). This is expected given the controlled construction of the synthetic data, where class separation is deliberately clear; consequently, these metrics are not interpreted as evidence of real-world IDS effectiveness. Their role is methodological: they provide stable, reproducible outputs that allow the framework’s evidence and governance flow to be exercised end-to-end. Importantly, the same mechanism can encode deployment policies that rely on risk-relevant thresholds (e.g., minority-class recall, false-positive budgets, or stability under drift), thereby supporting documented rejection, escalation, or approval decisions in realistic settings where performance is typically lower and distribution shift is common.
6.3. From Model-Centric to Evidence-Centric Governance
A central conceptual shift embodied in the framework is the move from model-centric to evidence-centric governance. Traditional AI development practices often treat the model as the primary artefact and address explainability or compliance later, if at all. In contrast, the XAI-Compliance-by-Design framework treats the model as one component within an ecosystem of artefacts that collectively support accountability.
From this perspective, the most relevant outputs of the pipeline are therefore not limited to the classifier and its performance figures; they also include the set of artefacts summarised in
Table 9. The compliance log, SHAP and LIME artefacts, drift reports, evidence bundles, and decision dossiers are designed to be inspectable and re-usable across audits, investigations, or re-certification processes. They embody the principle that high-risk AI deployment decisions should be grounded in a structured body of evidence that goes beyond model performance alone.
In addition, explicit recording of explanations and monitoring activities enables the management of the evidence lifecycle. Explanations, drift signals, and deployment decisions are not treated as isolated events but as part of a continuous narrative over time that can be reconstructed from the RUN_ID-keyed artefacts. This is particularly relevant for accountability obligations under the GDPR and the AI Act, including the contestation of automated decisions and post-market monitoring requirements for high-risk systems.
6.4. Implications for High-Risk AI Practice
Although the case study uses synthetic IDS-like data, the architectural patterns are directly applicable to real high-risk AI deployments. Three implications are particularly relevant, while remaining contingent on real-world validation.
First, the pipeline illustrates that compliance-by-design can be operationalised with widely available tools. The use of pandas, scikit-learn, SHAP, and LIME suggests that the main barrier is not tooling, but engineering discipline: ensuring that lifecycle events are logged with sufficient metadata and that artefacts are organised to support reconstruction and review.
Second, drift monitoring and explainability monitoring become first-class operational controls rather than optional add-ons. Drift detection, as implemented via KL divergence and Kolmogorov–Smirnov tests in
Section 5.5, supports continuous monitoring and post-market surveillance. Similarly, explanation artefacts provide global and local rationales that can be inspected, stress-tested, and compared over time.
Third, the decision dossier formalises deployment criteria and escalation rules as explicit, inspectable policy. This supports meaningful human oversight by making acceptance conditions, thresholds, and exceptions auditable. In operational contexts, such criteria may include performance constraints, stability under drift, mandatory instance-level explanations for high-impact decisions, or enhanced review for uncertain or novel cases.
6.5. Limitations and Threats to Validity
Several limitations and threats to validity delimit the conclusions that can be drawn from the present instantiation.
The empirical demonstration relies on synthetic data and therefore does not capture the complexity of operational environments, including data quality issues, non-stationarity, adversarial behaviour, and organisational constraints on data access and logging.
The metrics and indicators used (performance, drift tests, explanation outputs) are proxies for regulatory objectives such as transparency, accountability, and oversight. While the correspondence matrix structures this mapping, the adequacy of any proxy depends on domain context and how organisations translate legal requirements into operational thresholds.
The compliance flow is exercised under conditions where logging and evidence generation succeed by design. Real deployments must handle partial failures, missing artefacts, conflicting signals, and human-in-the-loop exceptions, which are not stress-tested in the toy scenario.
The evaluation does not include stakeholder studies with auditors, data protection officers, or operational teams. As a result, usability, governance workload, and organisational adoption barriers are not empirically assessed.
The implementation illustrates one model family (Random Forests) and a limited set of XAI and drift metrics. Other model families (e.g., deep learning, ensembles, and foundation models) and assurance methods (e.g., robustness testing, uncertainty quantification, and counterfactual explanations) may reveal different trade-offs.
6.6. Future Directions
The above limitations motivate a concrete research agenda. A first direction is the application of the framework to real high-risk AI systems (e.g., credit scoring, clinical decision support, fraud detection, or public-sector risk scoring) with domain-specific governance thresholds and evidence requirements. A second direction is to extend the artefact taxonomy and the correspondence matrix to heterogeneous model families and modalities, including large language models and computer vision systems, together with appropriate explanation and drift-monitoring strategies. A third direction concerns the standardisation of evidence bundles and decision dossiers, potentially aligned with emerging assurance and certification practices, to facilitate inter-organisational comparability and regulatory uptake.
Finally, future work should explore how the CDE indicators can be exposed through organisational dashboards and decision-support systems for AI governance. By making compliance status, explanation quality, and drift signals visible to non-technical stakeholders, such interfaces could strengthen meaningful human oversight and further reduce the gap between technical practice and regulatory accountability.
7. Conclusions and Future Work
This article addressed the problem of how to operationalise transparency, accountability, and human oversight in high-risk AI systems in a way that is concretely aligned with the obligations arising from the GDPR, the AI Act, and ISO/IEC 42001. The literature review highlighted a persistent gap between, on the one hand, technical advances in Explainable Artificial Intelligence (XAI) and MLOps and, on the other hand, the legal–regulatory requirements for auditability, documentation and continuous monitoring of high-risk AI systems. Existing approaches tend either to treat explainability as an ad hoc add-on to machine learning pipelines or to remain at the level of high-level governance principles without providing an operational mapping between technical artefacts and regulatory obligations.
In response to this gap, the article proposed the XAI-Compliance-by-Design framework, a modular dual-flow architecture that natively integrates explainability, compliance-by-design, and trustworthy MLOps into the lifecycle of high-risk AI systems. The framework distinguishes an upstream technical flow that focuses on data, models, explanations, and monitoring and a downstream governance flow that instantiates regulatory and organisational requirements such as operational controls, policies, and oversight mechanisms. At the centre of the architecture, the Compliance-by-Design Engine (CDE) maintains a technical–regulatory correspondence matrix that links concrete metrics and artefacts—such as explanation fidelity and stability, drift indicators, decision provenance logs, and model lineage—to specific legal and normative requirements. The framework is instantiated and activated through a Python-based pipeline that produces structured evidence bundles and decision dossiers, demonstrating how audit-ready artefacts can be generated as a natural outcome of standard engineering processes.
With respect to RQ1, the results show that XAI techniques, compliance-by-design principles, and trustworthy MLOps practices can be integrated into a single modular framework by separating technical and governance concerns while synchronising them via shared artefacts and compliance logs. The dual-flow design, mediated by the CDE, allows the explainability components (SHAP and LIME), the management of models and data, and the regulatory controls to evolve independently while remaining tightly aligned through the technical–regulatory correspondence matrix and the policy-aware pipeline stages. Regarding RQ2, the implementation confirms that the framework can systematically produce concrete and verifiable artefacts that support auditability and accountability under GDPR, the AI Act, and ISO/IEC 42001. Each pipeline execution yields serialised models with cryptographic hashes, structured compliance logs, global and local explanation artefacts, drift reports, evidence bundles, and machine-readable deployment decisions, all indexed by a unique RUN_ID and traceable throughout the lifecycle. Finally, in relation to RQ3, the synthetic, IDS-inspired anomaly detection scenario demonstrates that the framework behaves as intended in a security-relevant setting by exercising the full evidence and governance flow end-to-end. As expected in a controlled synthetic setting, the observed predictive performance is very strong; therefore, these metrics are not used to claim operational IDS effectiveness but to provide stable artefacts for testing the compliance pipeline. However, the compliance flow highlights why deployment decisions should be grounded in risk-aware criteria (e.g., minority-class recall, false-positive budgets, and stability under drift) encoded in the decision dossier rather than in a single aggregate metric.
A central implication of this work is the shift from model-centric to evidence-centric governance. In the proposed framework, the model is only one component within a broader ecosystem of artefacts that collectively support accountability. The primary outputs of the pipeline are therefore not limited to classifiers and their performance metrics, but include the full chain of evidence required for regulatory scrutiny: explanation reports, drift indicators, lineage records, compliance logs, and deployment rationales. This evidence-centric perspective aligns technical practice with the accountability and auditability principles embedded in European regulation and helps ensure that high-risk AI deployments are based on a structured, reproducible body of technical and governance evidence rather than on isolated performance figures.
However, the work has limitations. The empirical demonstration is confined to synthetic data and does not include an evaluation on real operational datasets; therefore, external validity and operational performance remain to be demonstrated in future deployments. The explanation layer focuses on SHAP and LIME, and the drift monitoring component is limited to relatively simple distributional tests; other explanation methods, robustness assessments, and monitoring strategies could reveal different trade-offs. The evaluation does not include user studies with auditors, data protection officers, or operational teams, nor does it address organisational and cultural factors that may affect the adoption of evidence-centric governance practices. Finally, while the technical–regulatory correspondence matrix provides a structured mapping for the considered regulatory corpus, its completeness and granularity would need to be revisited as guidance, case law, and standards evolve.
These limitations suggest several directions for future work. A first line of research concerns the application of the XAI-Compliance-by-Design framework to real high-risk AI systems, such as credit scoring, clinical decision support, fraud detection, or public-sector risk scoring, using operational datasets and organisational constraints. This would enable a more comprehensive empirical validation of the effectiveness of the framework in supporting audits, impact assessments, and supervisory reviews. A second line involves extending the framework to non-tabular and foundation models, including computer vision and large language models, incorporating explanation techniques, drift detectors, and robustness tests tailored to these modalities. A third direction is the standardisation of evidence bundles and decision dossiers, potentially harmonised with emerging AI assurance and certification schemes, to facilitate comparability and regulatory uptake across organisations and sectors. Finally, future work should explore richer governance interfaces, such as dashboards that expose compliance status, explanation quality, and drift signals to non-technical stakeholders, thereby reinforcing meaningful human oversight and contributing to the operationalisation of European digital sovereignty in AI governance.