Operationalising an End-to-End MLOps Lifecycle for Energy Forecasting: Implementation and Controlled Evaluation on ClearML

Zhao, Xun; Ma, Zheng Grace; Jørgensen, Bo Nørregaard

doi:10.3390/info17060576

by Xun Zhao, Zheng Grace Ma and Bo Nørregaard Jørgensen *

SDU Center for Energy Informatics, Maersk Mc-Kinney Moller Institute, The Faculty of Engineering, University of Southern Denmark, 5230 Odense, Denmark

* Author to whom correspondence should be addressed.

Information 2026, 17(6), 576; https://doi.org/10.3390/info17060576 (registering DOI)

Submission received: 15 May 2026 / Revised: 4 June 2026 / Accepted: 8 June 2026 / Published: 10 June 2026

(This article belongs to the Special Issue Computational Modelling and Data Analytics in Smart Cities—2nd Edition)

Abstract

Operational energy-forecasting pipelines require traceable execution from data ingestion to monitoring, yet few studies evaluate whether such pipelines continue to enforce quality controls when inputs or configurations are degraded. This study implements a previously proposed seven-phase forecasting lifecycle as a configuration-driven system on a self-hosted ClearML platform. The implementation is organised into five architectural domains: data and configuration, lifecycle phases and gates, orchestration, document artifact governance, and human-in-the-loop oversight. The pipeline is evaluated through six runs on four years of hourly electricity-consumption data from a Norwegian kindergarten building. Two baseline runs, in automatic and human-in-the-loop modes, demonstrate end-to-end execution and produce an XGBoost champion model with a 24-h-ahead test RMSE of 1.19 kW. Four controlled variants then test the validation-route logic by injecting missing data, shuffled consumption values, restrictive feature selection, and missing foundation-document sections. The first three variants are detected by phase-level sub-checkpoints, while the fourth is detected by Gate 0 through document-structure validation. The runs exercise revise-and-recover, override-then-terminate, and immediate-abort response pathways. The evaluation therefore demonstrates lifecycle execution, validation-route behaviour, and artifact traceability under controlled conditions; claims about live-deployment performance and multi-building generalisation are out of scope and identified as next steps.

Keywords: MLOps; machine learning lifecycle; energy forecasting; building energy consumption; ClearML; human-in-the-loop; gate-based governance; XGBoost; data quality; concept drift

1. Introduction

Energy forecasting underpins critical operational and planning decisions across power systems, district networks, and building energy management [1,2]. The literature on energy load prediction is large and predominantly focused on model accuracy, with extensive reviews of statistical, machine learning, and deep learning approaches for buildings and grid-level demand [3,4,5,6]. Comparatively few studies, however, report on the engineering of the operational lifecycle that produces and sustains these models, and fewer still demonstrate that such a lifecycle continues to enforce its quality controls when its inputs are degraded.

The challenges of moving machine learning models from development into long-term production have been articulated in general terms by Sculley et al. [7] and Paleyes et al. [8]. The machine learning operations (MLOps) community has since converged on a vocabulary of pipelines, artifact registries, monitoring, and human oversight to address these challenges [9,10,11,12]. Process-model proposals such as CRISP-ML(Q) [13] further extend this perspective by attaching quality assurance to each lifecycle phase. What is largely absent from the energy forecasting literature is an implementation that instantiates this vocabulary as a concrete, governed, and evaluable system on real building data, and that exposes its behaviour to controlled fault conditions.

This study builds on two preceding works but is intended to stand alone as an implementation and controlled route-evaluation paper. The earlier lifecycle framework [14] defined a seven-phase forecasting process, decision gates G0–G6, document artifacts, human-in-the-loop checkpoints, and loopback rules. It also presents a dedicated comparative analysis showing advantages in functional coverage, workflow logic, and governance over CRISP-ML(Q) [13] and previously published end-to-end ML pipelines for energy applications. The subsequent platform-mapping study [15] compared candidate MLOps platforms and selected ClearML as the implementation substrate. That study used a PRISMA-informed review of 256 records to compare 13 MLOps platforms across the seven-phase lifecycle and identified four capability gaps: governance workflow automation, automated data quality validation, feature management, and deployment and monitoring support under nonstationary conditions. The comparison shows that commercial platforms such as Amazon SageMaker and Google Vertex AI offer stronger end-to-end integration and production readiness, while open-source platforms such as Kubeflow and ClearML offer modular flexibility that requires additional integration effort to achieve end-to-end operation. ClearML was selected as the implementation substrate of the present paper because its modular flexibility allows the governance-workflow-automation gap to be addressed through application-level logic above the platform’s primitives.

The present paper implements that lifecycle as an executable system and evaluates whether its configuration propagation, checkpoint escalation, gate routing, artifact governance, and recovery paths behave as specified under both normal and degraded conditions. The verification-and-governance design that distinguishes this work from general-purpose MLOps platforms and existing energy-forecasting pipelines comes from [14], and the platform-selection rationale comes from [15]; the present paper contributes at the operationalisation-and-evaluation layer above them. Specifically, this study contributes the following:

(1) An executable, configuration-driven operationalisation of the previously proposed seven-phase forecasting lifecycle on a self-hosted ClearML platform, organised into five responsibility-based architectural domains.
(2) A controlled validation-route evaluation of the lifecycle’s two-tier checkpoint-and-gate logic under four deliberately constructed degraded-input and degraded-configuration conditions. The evaluation separates automatic fault detection from operator disposition and exercises three response pathways: revise-and-recover, override-then-terminate, and immediate-abort.
(3) An operational characterisation of ClearML as the MLOps substrate for the implemented lifecycle, distinguishing platform-provided services, including experiment tracking, dataset versioning, artifact storage, and remote execution, from application-level governance logic, including checkpoint routing, human-in-the-loop decisions, document validation, and pause-resume handling.

The unit of analysis is therefore the behaviour of the implemented lifecycle and its validation routes, not forecasting-model novelty, live-deployment performance, or cross-building generalisation. The single-building scope is a design choice required by the route-evaluation objective: introducing building heterogeneity would conflate route behaviour with dataset-dependent forecasting behaviour and obscure the causal attribution of detection events to the injected conditions.

This paper differs from the two preceding studies in both object and evidence. Paper [14] proposed the seven-phase lifecycle, the gates G0–G6, the document artifact family, the human-in-the-loop (HIL) checkpoints, and the loopback rules as a paper specification, validated on a single happy-path office-building case study; it did not instantiate the specification on a concrete MLOps platform, did not propose an architectural decomposition for the implementation, and did not test the validation routes under degraded conditions. Paper [15] surveyed 13 MLOps platforms against the lifecycle phases and selected ClearML as the implementation substrate, but did not implement the lifecycle on the selected platform and did not evaluate its behaviour empirically. The contributions of the present paper, listed above, supply these three missing elements respectively. In particular, the present paper provides the operational-level evidence that the two-tier sub-checkpoint plus main-gate design advances over CRISP-ML(Q)’s flat per-phase quality assurance: variant V4 (Section 4.2) is a structural foundation-document fault that no Phase 0 sub-checkpoint is designed to detect but that Gate 0’s existence-and-structure check catches by design, a behaviour a flat per-phase QA step cannot supply. The object of study is therefore the observed behaviour of a running lifecycle under baseline and deliberately degraded conditions, including whether checkpoints, gates, escalation, recovery, and artifact traceability operate as specified.

The remainder of the paper is organised as follows. Section 2 describes the implementation architecture across the five domains within a ClearML platform. Section 3 establishes the implementation context. Section 4 reports controlled operational evaluation evidence. Section 5 discusses the results, the operational role of the MLOps platform, design limitations, and future work. Section 6 concludes the paper.

2. Implementation Architecture of the Forecasting Lifecycle

The implementation realises the seven-phase lifecycle of [14] as an executable system. This section describes the implementation architecture rather than reintroducing the lifecycle specification itself. The architecture is organised by operational responsibility into five domains: data and configuration, lifecycle phases and gates, orchestration, document artifact governance, and human-in-the-loop oversight. Each domain owns one class of runtime concern (input contracts, transformation logic, control flow, provenance, and human oversight, respectively), and no concern is shared across domains.

Figure 1 reproduces the lifecycle blueprint, showing the seven-phase structure with gates G0 to G6 and loopback edges, and constitutes the specification view. The lifecycle progresses linearly from project foundation through data acquisition, exploratory analysis and cleaning, feature engineering, model development, deployment, and continuous monitoring, with named workflows decomposing each phase. Several phases also carry inline binary checks that initiate local rework or upstream loopbacks before the formal gate runs, providing intermediate validation against the spec. The gates themselves admit advancement only on cross-functional sign-off against predetermined criteria, and their fail edges define the loopback targets, with Gate 6 closing the lifecycle by routing monitoring outcomes back to a previous phase. Step-level detail, the document artifact family, and the stakeholder responsibility matrix sit in the supporting text of [14].

Figure 2 presents the implementation architecture of the present study, showing the five architectural domains operating within a ClearML platform [16], and constitutes the operational view. The ClearML server provides the platform backdrop that hosts four services consumed across all phases. Within this platform, Domain 2 occupies the centre as the horizontal chain of phases and gates from P0 to P6. Three peripheral domains attach to this spine, namely Domain 1 to the left feeding external data and YAML configuration into the early phases, Domain 5 to the right as the human-in-the-loop oversight subsystem with its dashboard, audit trail, and on-demand path to revise the configuration documents, and Domain 3 across the top as the orchestrator responsible for phase sequencing, gate outcome routing, loopback routing, and human-in-the-loop (HIL) escalation. Domain 4 spans the bottom as the append-only artifact and document governance layer that records the document trail, datasets, registered models, and experiment logs produced by every phase. The edges drawn between phases capture the operational routing repertoire, including gate-level loopbacks, sub-phase cross-phase loopbacks driven by internal checks, and the multi-target escalation routes from Gate 6 to upstream phases that close the monitoring cycle when remediation is required. The arrangement makes the operational responsibilities and their coupling explicit, in contrast with the linear specification view of Figure 1.

Every cross-domain interaction is mediated by a typed, persisted artefact rather than a direct in-memory call, which is the property required for the audit trail and for the recoverability of any pipeline pause. Specifically, D1 communicates with D2 via Doc01 and Doc02; D2 emits artefacts and gate outcomes to D4; D2 and D5 exchange checkpoint triggers and HIL decision records; D3 controls D2 via phase-sequencing signals and gate-routing decisions; and D4 and D5 share decision-artefact persistence and pause-state files.

This channel inventory also fixes the inter-domain dependency structure, which a file-level import-graph analysis of the implementation confirms quantitatively. Counting Python import edges between the files of each domain, the governance and oversight domains (D4, D5) are pure dependency sinks with zero efferent coupling (Ce = 0): the rest of the system reaches them only through file and artefact channels rather than through code imports, which is the property that gives the design its audit-trail and pause-state recoverability. The orchestrator (D3) is conversely a pure control source with zero afferent coupling (Ca = 0), and the instability metric I = Ce/(Ca + Ce) accordingly rises monotonically from the artefact sinks (D4, D5 at I = 0) through the input and phase domains (D1, D2), to the control plane (D3 at I = 1.0), matching the intended control direction; the inter-domain dependency graph is otherwise acyclic. A finer-grained module-level coupling and cohesion analysis is identified in Section 5.4 as a follow-up direction.
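The instability calculation described above can be illustrated with a minimal sketch. The edge list below is hypothetical: the counts are invented to reproduce the qualitative pattern reported in the text (Ce = 0 for D4 and D5, Ca = 0 for D3) and are not the paper's measurements. The sketch also assumes that D4 and D5 have at least one incoming edge, since I = Ce/(Ca + Ce) is only defined when Ca + Ce > 0.

```python
from collections import defaultdict

# Hypothetical inter-domain import edges: (importer, imported, edge count).
# Illustrative values only; they are NOT the paper's measured counts.
IMPORT_EDGES = [
    ("D3", "D2", 7),
    ("D3", "D1", 2),
    ("D3", "D5", 3),
    ("D2", "D1", 4),
    ("D2", "D4", 6),
    ("D2", "D5", 5),
    ("D1", "D4", 1),
]


def instability(edges):
    """Return {domain: (Ca, Ce, I)} with I = Ce / (Ca + Ce).

    Edge counts are assumed positive, so every listed domain has Ca + Ce > 0.
    """
    ca = defaultdict(int)  # afferent coupling: incoming import edges
    ce = defaultdict(int)  # efferent coupling: outgoing import edges
    for importer, imported, count in edges:
        ce[importer] += count
        ca[imported] += count
    return {
        d: (ca[d], ce[d], ce[d] / (ca[d] + ce[d]))
        for d in sorted(set(ca) | set(ce))
    }


metrics = instability(IMPORT_EDGES)
for domain, (ca_d, ce_d, i_d) in metrics.items():
    print(f"{domain}: Ca={ca_d} Ce={ce_d} I={i_d:.2f}")
```

With these invented counts, I is 0 for D4 and D5, rises through D1 and D2, and reaches 1.0 for D3, which mirrors the ordering described in the text.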

2.1. Data and Configuration (D1)

Domain D1 forms the entry layer of the pipeline, where every measurement stream and every operator-authored parameter the system depends on is received, declared, and converted into the governed artefacts that the seven downstream phases consume. It covers the input contracts and configuration sources that feed the pipeline (Figure 2, Domain 1). External data enters as long-format comma-separated values (CSV) time series: an electricity stream of hourly meter readings and an aligned weather stream from the co-located station. To support both live operational deployment and offline reproduction from a single codebase, the pipeline decouples every phase from the underlying data source. Live and offline inputs are accessed through a common DataProvider abstraction whose concrete implementation is selected by a factory at runtime, so the same phase logic operates on either source without code changes. The abstraction is realised as an abstract base class declaring a data-access method, a connection-validation method, and a source-type identifier, together with a fixed output schema to which every concrete provider must conform. The runtime factory dispatches on the configured source type, so adding or switching a data source requires only a new conforming provider rather than any change to the downstream phase logic.
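The provider pattern described above might look roughly like the following sketch. Only the name DataProvider and its three declared elements (data access, connection validation, source-type identifier) come from the text; the method names, the output columns, the OfflineCsvProvider class, the registry, and make_provider are assumptions made for illustration.

```python
from abc import ABC, abstractmethod

# Assumed fixed output schema; the real column names are not given in the text.
OUTPUT_COLUMNS = ("timestamp_utc", "series_id", "value")


class DataProvider(ABC):
    """Common contract that every concrete data source must satisfy."""

    @property
    @abstractmethod
    def source_type(self) -> str:
        """Identifier the factory dispatches on."""

    @abstractmethod
    def validate_connection(self) -> bool:
        """Return True when the source is reachable."""

    @abstractmethod
    def fetch(self, start: str, end: str) -> list:
        """Return rows (dicts) whose keys match OUTPUT_COLUMNS."""


class OfflineCsvProvider(DataProvider):
    """Hypothetical offline provider backed by already-parsed CSV rows."""

    source_type = "offline_csv"

    def __init__(self, rows):
        self._rows = rows

    def validate_connection(self):
        return True

    def fetch(self, start, end):
        # ISO-8601 strings in one format compare correctly as plain text.
        return [r for r in self._rows if start <= r["timestamp_utc"] < end]


# Hypothetical registry: a new source only needs a new entry here.
_REGISTRY = {OfflineCsvProvider.source_type: OfflineCsvProvider}


def make_provider(config, **kwargs):
    """Dispatch on the configured source type and return a ready provider."""
    try:
        cls = _REGISTRY[config["source_type"]]
    except KeyError:
        raise ValueError(f"unknown source_type: {config.get('source_type')!r}") from None
    provider = cls(**kwargs)
    if not provider.validate_connection():
        raise ConnectionError(f"{provider.source_type}: source not reachable")
    return provider


rows = [
    {"timestamp_utc": "2021-01-01T00:00:00", "series_id": "electricity", "value": 3.2},
    {"timestamp_utc": "2021-01-01T01:00:00", "series_id": "electricity", "value": 3.4},
]
provider = make_provider({"source_type": "offline_csv"}, rows=rows)
window = provider.fetch("2021-01-01T00:00:00", "2021-01-01T01:00:00")
print(len(window), set(window[0]) == set(OUTPUT_COLUMNS))
```

In this arrangement a live provider would be added as another DataProvider subclass and registry entry, leaving the phase code that calls fetch unchanged.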

The configuration sources are three YAML files. The task intake template records the stakeholder-facing fields, namely dataset selection, target horizon, candidate model families, gate criteria, and execution mode. A per-dataset profile, referenced by the intake through an active_dataset_profile pointer, declares the dataset-specific input contracts, for example, file paths, column mappings, time zones, the meter-reading mode, and any per-phase overrides for ingestion and gate limits. The HIL configuration template defines the checkpoint registry, decision schemas, routing rules, and the maximum loopback policy. Phase 0 in Domain 2 reads the intake together with its referenced dataset profile and emits two phase-zero documents: Doc01 (Project Specification), which captures the task intake fields, and Doc02 (Requirements and Criteria), which captures the dataset schema, data-quality thresholds, feature group definitions, train/validation/test split, and per-family hyperparameter search ranges that the orchestrator uses for parameterisation downstream.

After Phase 0, all downstream phases load their task parameters from Doc01 and Doc02 rather than directly re-reading the intake template or dataset profile. These two documents therefore act as the governed configuration boundary of the run: every downstream parameter is traceable to Phase 0, and any approved revision is propagated by regenerating or updating the relevant Phase 0 documents before re-execution. The HIL configuration template is read directly by the orchestrator and the checkpoint manager at runtime, since checkpoint behaviour is operator-facing; routing decisions and pause-state are recorded in the audit trail under their own provenance. The dataset profile is the portability seam of the architecture. Switching the pipeline to a new dataset requires only authoring a new dataset profile YAML (with the dataset’s paths, column mappings, units, time zones, history range, meter-reading mode, and any per-phase overrides) and updating the active dataset profile pointer in the intake template. Intra-dataset variants are realised as alternative profiles on the same underlying data through the same pointer. Reconfiguring HIL behaviour requires editing the HIL template alone.

The three YAML sources are parsed once in Phase 0 and folded into the two foundation documents that all downstream phases consume, so configuration enters the pipeline at a single point rather than being re-read per phase. Field-level validation of the configuration values themselves is not enforced at parse time in the present implementation; strengthening configuration handling, including at the point of intake, is part of the agentic-oversight extension identified in Section 5.4.
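As a rough illustration of this single entry point, the sketch below folds an intake and its referenced dataset profile into Doc01 and Doc02. Plain dictionaries stand in for the parsed YAML. Apart from the active_dataset_profile pointer and the two document titles, every field name, path, and value is a hypothetical placeholder, and build_foundation_docs is not the authors' routine.

```python
# Hypothetical parsed YAML contents; only active_dataset_profile is named
# in the text, and all values below are illustrative placeholders.
intake = {
    "active_dataset_profile": "kindergarten_hourly",
    "target_horizon_hours": 24,
    "candidate_models": ["rnn", "lstm", "transformer", "xgboost"],
    "execution_mode": "automatic",
}

profiles = {
    "kindergarten_hourly": {
        "paths": {"electricity": "data/electricity.csv", "weather": "data/weather.csv"},
        "timezone": "Europe/Oslo",
        "quality_thresholds": {"max_missing_ratio": 0.05},
        "split": {"train": 0.7, "validation": 0.15, "test": 0.15},
    }
}


def build_foundation_docs(intake, profiles):
    """Fold the intake and its referenced profile into Doc01 and Doc02."""
    name = intake["active_dataset_profile"]
    if name not in profiles:
        raise KeyError(f"intake points at unknown dataset profile: {name!r}")
    profile = profiles[name]
    doc01 = {
        "doc_id": "Doc01",
        "title": "Project Specification",
        "dataset": name,
        "target_horizon_hours": intake["target_horizon_hours"],
        "candidate_models": list(intake["candidate_models"]),
        "execution_mode": intake["execution_mode"],
    }
    doc02 = {
        "doc_id": "Doc02",
        "title": "Requirements and Criteria",
        "schema": {"paths": dict(profile["paths"]), "timezone": profile["timezone"]},
        "quality_thresholds": dict(profile["quality_thresholds"]),
        "split": dict(profile["split"]),
    }
    return doc01, doc02


doc01, doc02 = build_foundation_docs(intake, profiles)

# A downstream phase reads its parameters from the documents, not the YAML.
max_missing = doc02["quality_thresholds"]["max_missing_ratio"]
print(doc01["dataset"], max_missing)
```

Under this sketch, switching datasets amounts to adding a profile entry and changing the pointer in the intake; the downstream read of Doc02 stays the same.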

Through the DataProvider abstraction, the three layered-configuration sources, and the two phase-zero documents, D1 establishes the single boundary at which external inputs enter the pipeline and localises the points at which the dataset, the task definition, or the governance policy can be reconfigured without touching the seven downstream phases.

2.2. Lifecycle Phases and Gates (D2)

Domain D2 is the executable backbone of the pipeline, where every transformation, quality control, and routing decision that converts the governed inputs from D1 into a deployable, continuously monitored model takes place. It covers the seven phases and the seven decision gates that bracket them, together with the forward and loopback edges shown in Figure 2 (centre panel). The forward path proceeds through seven phases: P0 (Foundation), P1 (Ingestion), P2 (Exploratory Data Analysis [EDA] and Cleaning), P3 (Feature Engineering), P4 (Training), P5 (Deployment), and P6 (Monitoring), with each phase concluded by its corresponding gate, G0 to G6. In brief, P0 reads the intake template and dataset profile and emits the two foundation documents Doc01 and Doc02; P1 ingests the electricity and weather streams onto a UTC-aligned hourly grid and registers the integrated dataset; P2 performs exploratory analysis and cleaning to produce the cleaned dataset; P3 generates derived features and selects a final feature subset; P4 trains and ranks the model families through a two-stage hyperparameter optimisation and registers the champion model; P5 retrieves the champion model and deploys the FastAPI inference service and Streamlit dashboard; P6 runs the delayed-evaluation monitoring cycle and computes drift and quality signals. The full specification, including the stakeholder responsibility matrix and step-level detail, remains in [14].

Phase 4 trains and ranks four candidate model families: recurrent neural network (RNN), long short-term memory network (LSTM), Transformer, and XGBoost (Extreme Gradient Boosting). These families are included to exercise the lifecycle across recurrent, attention-based, and tree-based forecasting approaches, not to claim novelty in model architecture. All four are treated as interchangeable candidates within the same governed training, validation, artifact-registration, and gate-review procedure.

Two layers of validation are applied within every phase, with deliberately different roles. Sub-checkpoints (e.g., HIL-P1-1.1.6, HIL-P2-2.1.6, HIL-P3-3.2.5, HIL-P4-4.3.4) are technical-sufficiency checks that fire mid-phase, compute a phase-step metric against a configured threshold, and trigger narrow-scope local iteration (fix this step and retry) when the metric falls short. Decision gates G0 through G6 fire at the end of a phase and serve as governance review: they verify that the phase’s sub-checkpoints have all passed and, in addition, check documentation completeness, compliance status, risk acceptance, and multi-stakeholder sign-off. A gate can therefore fail even when every sub-checkpoint has passed, namely when the phase output is technically sufficient but governance obligations remain open. The two roles are distinct in scope, in review constituency, and in the failure classes they catch, and this division of labour is principled rather than incidental. Sub-checkpoints validate local technical sufficiency within a single phase step, where the necessary evidence is locally available, while main gates validate phase-level completeness, cross-document structural integrity, and governance readiness, where the required evidence spans multiple artifacts produced earlier in the same phase or in upstream phases.
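The difference between the two layers can be sketched as follows. The checkpoint identifier is taken from the text, but the metric attached to it, the threshold, the governance item names, and the SubCheckpoint and evaluate_gate helpers are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class SubCheckpoint:
    """Mid-phase technical check: one metric against one threshold."""
    checkpoint_id: str
    metric: float
    threshold: float
    higher_is_better: bool = True

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.metric >= self.threshold
        return self.metric <= self.threshold


def evaluate_gate(sub_checkpoints, governance):
    """End-of-phase review: sub-checkpoint results plus governance items."""
    failed = [c.checkpoint_id for c in sub_checkpoints if not c.passed()]
    open_items = [name for name, done in governance.items() if not done]
    return {
        "passed": not failed and not open_items,
        "failed_sub_checkpoints": failed,
        "open_governance_items": open_items,
    }


# Hypothetical metric: a missing-value ratio of 2% against a 5% limit.
checks = [SubCheckpoint("HIL-P2-2.1.6", metric=0.02, threshold=0.05,
                        higher_is_better=False)]
governance = {
    "documentation_complete": True,
    "compliance_cleared": True,
    "risk_accepted": True,
    "stakeholder_sign_off": False,  # still open
}
outcome = evaluate_gate(checks, governance)
print(outcome)
```

Here the sub-checkpoint passes, yet the gate fails because one governance item is still open, which is the case the paragraph above singles out.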

The architecture supports three loop families, all visible in Figure 2. Gate loopbacks (dashed blue) return control to a defined recovery point on gate failure. For Gates 0, 1, 3, 4, and 5 the recovery point is intra-phase. Gate 2 is the only formal gate whose single recovery point lies in an earlier phase, escalating to Phase 1. Gate 6 is the only gate with multi-target recovery, routing a Phase 6 governance failure to Phase 0 for re-scoping, Phase 1 for data root-cause analysis, or Phase 4 for retraining, with the destination selected by the operator’s classification. Sub-phase loopbacks (dashed green) carry mid-phase escalations to an upstream phase when an internal step diagnoses a root cause located outside the current phase, such as a domain mismatch in Phase 2, a deployment-readiness failure in Phase 5, or the post-training root-cause classifier in Phase 4 attributing a performance gap to data. Internal remediation loops, denoted by a loop marker on the phase block, are intra-phase iterations that resolve before the gate is reached. A G6 pass advances the pipeline to the next monitoring cycle.
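The gate-loopback targets listed above can be written down as a small lookup, shown here as a sketch. The phase targets follow the text; the classification labels for Gate 6 and the resolve_loopback helper are hypothetical, and the exact intra-phase recovery step is not modelled because the text does not specify it.

```python
# Recovery phase for each single-target gate, as described in the text:
# G0, G1, G3, G4, G5 recover inside their own phase; G2 escalates to P1.
GATE_RECOVERY = {
    "G0": "P0",
    "G1": "P1",
    "G2": "P1",
    "G3": "P3",
    "G4": "P4",
    "G5": "P5",
}

# Gate 6 is multi-target; these classification labels are hypothetical.
G6_TARGETS = {
    "re_scoping": "P0",
    "data_root_cause": "P1",
    "retraining": "P4",
}


def resolve_loopback(gate, classification=None):
    """Return the phase to which a failed gate routes control."""
    if gate == "G6":
        if classification not in G6_TARGETS:
            raise ValueError("Gate 6 needs an operator classification: "
                             f"{sorted(G6_TARGETS)}")
        return G6_TARGETS[classification]
    if gate not in GATE_RECOVERY:
        raise ValueError(f"unknown gate: {gate!r}")
    return GATE_RECOVERY[gate]


print(resolve_loopback("G2"))
print(resolve_loopback("G6", "retraining"))
```

Keeping the routing in data rather than in branching code is one way to make the set of legal backward transitions easy to inspect.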

Through the seven-phase forward path, the two-layer validation pattern that pairs mid-phase sub-checkpoints with end-of-phase gates, and the three loop families, D2 turns the lifecycle into a finite set of legal transitions between phases, so that every forward advance is governed and every backward step is routed to a defined recovery point under explicit, traceable rules.

2.3. Orchestrator (D3)

Domain D3 is the control plane of the pipeline, where the lifecycle defined in D2 is turned into an executable sequence of tasks and the heavy compute steps are dispatched to remote workers.

The seven-phase logic is built as application code rather than as a declarative ClearML pipeline, because the pipeline layer is geared towards training and data-processing workflows and does not natively model the post-deployment monitoring and governance loops that close the lifecycle here. In particular, the gate-outcome routing, the cross-phase and intra-phase loopback families, and the human-in-the-loop escalation handler from D2 are not naturally expressible in ClearML’s predominantly static, DAG (directed acyclic graph)-based pipeline model. A quantitative wall-clock comparison against a declarative-DAG orchestration would require a parallel reimplementation of the lifecycle, including shim code to express the monitoring, loopback, and escalation features that the DAG model does not natively support. Such a measurement would be dominated by the shim layer rather than by the orchestration substrate itself, so a controlled comparison is identified in Section 5.4 as a follow-up direction.

The custom orchestrator is a Python service whose responsibilities are phase sequencing along the forward path, gate-outcome routing (advance versus loopback), loopback-target resolution (intra-phase, cross-phase, or operator-classified), and HIL escalation when a gate or sub-checkpoint failure requires operator review. It schedules phase execution, dispatches training jobs to a remote agent through a ClearML queue, persists pause-state files between phases, reconciles pause-state with the ClearML task model, and assembles the per-run audit trail.

The control flow is deterministic and sequential by design: the orchestrator advances one phase at a time within a single pipeline instance, and the asynchronous concerns in the implementation, namely the externally submitted operator decisions and the Phase 4 trial dispatch to the gpu-train queue, are isolated behind synchronous polling boundaries so that gate transitions remain reproducible and auditable. Because automatic retries are disabled, every loopback is a single operator-directed re-execution of the recovery-point-to-current phase range that terminates in either a passing gate or an abort, so the pipeline cannot loop unboundedly, and the number of loopbacks per run is set by operator dispositions rather than by the pipeline itself.
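A minimal sketch of such a sequential, operator-directed control loop is given below. It covers a single pass from P0 to P6 and is not the authors' orchestrator: the three decision tokens, the callback signatures, and the run_pipeline function are assumptions made for illustration.

```python
PHASES = ["P0", "P1", "P2", "P3", "P4", "P5", "P6"]
DECISIONS = {"revise", "override", "abort"}  # hypothetical decision tokens


def run_pipeline(run_phase, ask_operator, recovery_point):
    """Advance one phase at a time; loop back only on an operator decision.

    run_phase(phase) -> bool       True when the phase's gate passes
    ask_operator(phase) -> str     one of DECISIONS
    recovery_point(phase) -> str   phase to restart from on "revise"
    """
    audit = []
    i = 0
    while i < len(PHASES):
        phase = PHASES[i]
        passed = run_phase(phase)
        audit.append((phase, "pass" if passed else "fail"))
        if passed:
            i += 1
            continue
        # No automatic retry: a detected failure always goes to the operator.
        decision = ask_operator(phase)
        if decision not in DECISIONS:
            raise ValueError(f"unknown decision: {decision!r}")
        audit.append((phase, decision))
        if decision == "abort":
            return "aborted", audit
        if decision == "override":
            i += 1  # accept the failure and move on
        else:
            i = PHASES.index(recovery_point(phase))  # re-run from recovery point
    return "completed", audit


# Simulated run: P2 fails once, the operator revises, P1 and P2 are re-run.
attempts = {"P2": 0}


def fake_phase(phase):
    if phase == "P2":
        attempts["P2"] += 1
        return attempts["P2"] > 1
    return True


status, audit = run_pipeline(fake_phase, lambda p: "revise", lambda p: "P1")
print(status, audit)
```

Because the loop re-enters a phase only after an explicit operator decision, the number of loopbacks in a run equals the number of "revise" decisions recorded in the audit list.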

The orchestrator runs on a self-hosted ClearML platform that exposes four services consumed across the seven phases (top of Figure 2): three storage-tier services and one remote-queue execution service. The experiment tracker logs per-run metrics and parameters. The dataset manager records versioned datasets with parent–child lineage. The artifact store keeps typed binary artifacts, including the document trail and the per-trial model files; the champion model file is uploaded as a task artifact on the producing trial task, and is reached through the document chain, which keeps a single source of lineage for governance and audit.

The remote-queue service handles graphics processing unit (GPU) dispatch. In Phase 4, each hyperparameter trial is created as a ClearML task and enqueued on the gpu-train queue, where a clearml-agent worker on a compute unified device architecture (CUDA)-enabled host pulls each task, runs the training remotely, and reports its metrics and artifacts back.

Through the application-code orchestrator, its consumption of the four self-hosted ClearML platform services, and the queue-based GPU offload, D3 binds the lifecycle defined in D2 to a concrete execution model that keeps end-to-end control in a single Python process while letting compute-heavy training run on a dedicated worker.

2.4. Document Artifact Governance (D4)

Domain D4 is the provenance layer of the pipeline, where every artifact produced across the seven phases is captured, versioned, and made retrievable for downstream consumption and audit.

D4 covers the artifact and document governance subsystem (bottom of Figure 2). The ClearML artifact store records five families of typed artifacts produced across the seven phases: the Doc01–Doc64 structured-JSON document trail, which includes a Phase Report on gate pass and a Log Issues document on gate fail; the dataset profile, captured within the document chain so that the configuration behind any output is recoverable; the champion model file from Phase 4; the experiment logs from Phase 4 hyperparameter trials and from every per-phase ClearML task; and the versioned datasets along the parent–child lineage.

The document trail comprises 37 documents, which reflect two refinements over the 35 in [14]: Phase 5 is expanded from two to four sub-phase documents (Doc51–Doc54), and its gate documents are shifted to Doc55 and Doc56. Phases 0–4 and Phase 6 match [14] exactly. Table 1 summarises the document family by phase and trigger; Phase Reports and Log Issues documents are generated by every phase gate and are listed once per row to avoid repetition.

Documents persist both locally and as ClearML artifacts, allowing local inspection and cross-phase retrieval. All artifacts, all human-in-the-loop decisions, and all gate outcomes are stored in an append-only audit trail, in line with the auditability requirements identified by Sculley et al. [7] and Paleyes et al. [8] and the traceability requirements articulated by Mora-Cantallops et al. [17]. The traceability chain extends from the intake template through Doc01 and Doc02 to every downstream phase artifact.
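One simple way to realise the local side of an append-only trail is a JSON Lines file that is only ever opened in append mode, sketched below. This is an assumption about a possible storage format rather than a description of the implementation; the record fields and both helper functions are hypothetical, and the upload of the same records as ClearML artifacts is not shown.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_audit_record(path, kind, payload):
    """Append one record; existing lines are never rewritten."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": kind,
        "payload": payload,
    }
    with open(path, "a", encoding="utf-8") as fh:  # append mode only
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_audit_trail(path):
    """Return all records in the order they were written."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


trail_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
append_audit_record(trail_path, "gate_outcome", {"gate": "G0", "passed": False})
append_audit_record(trail_path, "hil_decision",
                    {"checkpoint": "HIL-P6-G6", "decision": "abort"})
trail = read_audit_trail(trail_path)
print(len(trail), trail[0]["kind"])
```

Append mode alone does not prevent tampering; it only ensures that the pipeline's own writes never overwrite earlier records.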

Doc01 and Doc02 are structured-JSON documents that carry the configuration boundary consumed by all downstream phases. Each is emitted in two states within a Phase 0 run: a draft at sub-phase 0.1.4 and an approved version at sub-phase 0.2.4, the latter incorporating the compliance (Doc03) and technical-feasibility (Doc04) summaries and superseding the draft in the downstream consumption path. The foundation-document chain is checked at Gate 0, which verifies the existence of Doc01 to Doc04 and the structural completeness of the compliance and feasibility documents before the lifecycle advances; this is the fault class exercised by variant V4 in Section 4.2. Traceability and recoverability rest on ClearML artifact lineage rather than on a declared content schema: each document is stored as a named, timestamped artifact, so the configuration behind any output is reconstructable from the artifact store. Because Doc01 and Doc02 are composed programmatically by a fixed Phase 0 routine rather than authored by hand, their structure is determined by that routine rather than by hand-edited input. Formal JSON-schema validation of document content and a monotonic version counter across revision loopbacks are not currently enforced; both are deferred hardening steps identified in Section 5.4 as follow-up directions.

Through the five typed artifact families, the document chain that links the task intake to every phase output, and the append-only audit trail with dual-location persistence, D4 converts the pipeline’s working products into versioned, retrievable evidence that supports both forward consumption by later phases and backward reconstruction during governance review.
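The Gate 0 existence-and-structure check described in this section can be sketched as below. The document identifiers come from the text; the required section names for Doc03 and Doc04 are hypothetical, since the text does not list them, and check_gate0 is an illustrative helper rather than the implemented gate.

```python
REQUIRED_DOCS = ("Doc01", "Doc02", "Doc03", "Doc04")

# Hypothetical required sections for the compliance and feasibility documents.
REQUIRED_SECTIONS = {
    "Doc03": ("regulations", "data_privacy"),
    "Doc04": ("data_availability", "compute_resources"),
}


def check_gate0(documents):
    """Return (passed, issues) for the foundation-document chain."""
    issues = []
    # Existence check over Doc01 to Doc04.
    for doc_id in REQUIRED_DOCS:
        if doc_id not in documents:
            issues.append(f"{doc_id}: missing")
    # Structural completeness of the compliance and feasibility documents.
    for doc_id, sections in REQUIRED_SECTIONS.items():
        doc = documents.get(doc_id)
        if doc is None:
            continue  # already reported as missing
        for section in sections:
            if not doc.get(section):
                issues.append(f"{doc_id}: section '{section}' missing or empty")
    return (not issues), issues


# A V4-style fault: Doc04 exists but lacks one of its sections.
docs = {
    "Doc01": {"title": "Project Specification"},
    "Doc02": {"title": "Requirements and Criteria"},
    "Doc03": {"regulations": "reviewed", "data_privacy": "no personal data"},
    "Doc04": {"data_availability": "four years, hourly"},
}
passed, issues = check_gate0(docs)
print(passed, issues)
```

The check needs all four documents at once, which is why it sits at the gate rather than at any single Phase 0 step.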

2.5. Human-in-the-Loop Oversight (D5)

Domain D5 is the human-oversight layer of the pipeline, where escalations from D2’s gates and sub-checkpoints are surfaced to an authorised operator, decisions are recorded, and routing back into the execution flow is performed under explicit, auditable rules.

The implementation defines 20 checkpoints across the seven phases, of which 17 are enabled in the default route-test configuration. The three disabled checkpoints, all in Phase 6 (HIL-P6-6.1.5, HIL-P6-6.1.10, HIL-P6-6.3.1), are computed automatically from thresholds and schedules declared in Phase 0 and are excluded from the operator decision flow in both execution modes, so only Gate 6 (HIL-P6-G6) requires human review in Phase 6.

The subsystem is built from four application-level components (right-hand side of Figure 2): a Checkpoint Manager (persists pause-state and decision artefacts, pauses and resumes the pipeline), a Decision Manager (dispatches the decision-option set, applies routing rules, writes the audit trail), a HIL Dashboard (surfaces the failed artefact, the decision options, and an editable Doc01/Doc02 view for revision-and-resume), and a Notification Bridge (asynchronous operator alerts).

The checkpoint registry distinguishes four checkpoint types: input checkpoints accept a configuration submission; gate checkpoints record a pass, fail, or revise decision; decision checkpoints record a yes-or-no (or domain-specific equivalent) outcome with optional routing; and classification checkpoints record a root-cause categorisation with multi-target routing. Triggers fire at two levels. At sub-gates the checkpoint fires mid-phase, and a failure causes the phase to exit early before its main gate. At the formal decision gates G0 through G6 the checkpoint fires at the end of the phase. In both cases, a failure routes to the operator for review. The two execution modes differ only in which routine non-failure checkpoints are surfaced to the operator. Table 2 lists representative entries from the registry, showing only the most relevant decision options and omitting the three disabled checkpoints.

A checkpoint failure is always detected before operator disposition. The automatic checkpoint or gate logic first evaluates the relevant metric, artifact, or document structure and records the failure condition. The human-in-the-loop subsystem is then invoked to determine the disposition of that already-detected condition. Across checkpoint types, the operator’s disposition falls into one of three response categories: approve and continue, revise and rerun from a registered recovery point, or terminate the run. The exact decision token is checkpoint-specific, as shown in Table 2, but every routing target in the registry maps to one of these three categories.

The pipeline supports two configurable execution modes that determine how human-in-the-loop checkpoints behave at runtime. Automatic mode treats every routine checkpoint as approved using its registered default decision and runs without operator pauses; if an automatic check at a sub-gate or main gate detects a failure, the orchestrator escalates that failure to the operator for adjudication, so HIL fallback on detected failures is inherent to automatic mode. Human-in-the-loop mode pauses at every enabled checkpoint whose firing precondition is satisfied on the current run, regardless of whether the underlying automatic check passes or fails, and requires explicit operator approval at each one.

Enabled checkpoints fall into three behavioural classes that determine whether they fire on a given run. Routine checkpoints fire on every pipeline pass through their step. Trigger-conditional checkpoints fire only when an upstream condition is satisfied, for example a sub-gate failure, a performance-check outcome that requires adjudication, or a quality-assessment finding. Schedule-gated checkpoints fire on a configured cadence, for example the governance review interval declared in the dataset profile. Trigger-conditional and schedule-gated checkpoints do not surface as operator pauses on a run that does not satisfy their precondition, even in human-in-the-loop mode.
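
The interaction between the two execution modes and the three behavioural classes can be summarised in a short sketch. The enumerations, the `Checkpoint` dataclass, and the function `pauses_for_operator` are hypothetical names introduced for illustration; the sketch encodes only the rules stated above and is not the orchestrator's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    AUTOMATIC = "automatic"
    HIL = "human_in_the_loop"


class FiringClass(Enum):
    ROUTINE = "routine"
    TRIGGER_CONDITIONAL = "trigger_conditional"
    SCHEDULE_GATED = "schedule_gated"


@dataclass
class Checkpoint:
    checkpoint_id: str
    firing_class: FiringClass
    enabled: bool = True


def pauses_for_operator(cp, mode, precondition_met, check_failed):
    """Decide whether a checkpoint surfaces as an operator pause."""
    if not cp.enabled:
        return False  # disabled checkpoints never enter the decision flow
    if check_failed:
        return True  # a detected failure escalates in both modes
    if mode is Mode.AUTOMATIC:
        return False  # routine checkpoints take their default decision
    if cp.firing_class is FiringClass.ROUTINE:
        return True  # HIL mode pauses on every routine checkpoint reached
    # Trigger-conditional and schedule-gated checkpoints need their precondition.
    return precondition_met
```

In this reading, the first two operational regimes correspond to `Mode.AUTOMATIC` with `check_failed` false or true, and the third to `Mode.HIL`.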

The two modes and three checkpoint classes together produce the three operational regimes that the experiment matrix in Section 4 distinguishes: automatic execution that completes without escalation, automatic execution where a detected failure triggers HIL fallback, and human-in-the-loop execution that surfaces every enabled routine checkpoint reached on the current run. The first two regimes share a single configuration; their distinction is whether a failure fired during the run.

Through the four application-level components, the typed checkpoint registry, the pause–decide–resume workflow, and the two execution modes that combine with the three checkpoint classes to produce three operational regimes, D5 makes each escalation a structured event with a registered trigger, a typed decision, a recorded routing target, and an entry in the audit trail, ensuring that operator authority over the pipeline is exercised within the same evidence chain that governs its automatic execution.

3. Implementation Context

Section 3 establishes the implementation context that the controlled operational evaluation in Section 4 depends on, namely the runtime environment, the dataset and target, and the experimental matrix that the variant runs implement.

3.1. Runtime Environment

The pipeline runtime was captured from the execution environments used for the reported runs. The main software stack included PyTorch 2.9.0+cu128 [18], scikit-learn 1.8.0 [19], pandas 2.3.2, NumPy 2.3.3 [20], XGBoost 3.2.0 [21], ClearML SDK 2.0.2, FastAPI 0.118.0, and Streamlit 1.49.1. Exact package versions were recorded through per-task environment snapshots and should be interpreted as the reproducibility environment for this implementation rather than as requirements of the lifecycle design. Streamlit serves separate forecast/monitoring and HIL dashboards. The ClearML server runs as a Docker-based self-hosted deployment reachable at localhost.

All experiments ran on a single workstation equipped with an AMD Ryzen 9 5900X processor (12 cores and 24 threads, base clock 3.7 GHz), 64 GB of DDR4 system memory, and an NVIDIA GeForce RTX 3060 GPU (12 GB VRAM, driver 591.86). The host runs Ubuntu under Windows Subsystem for Linux 2 on Windows 11 Enterprise 25H2 (build 26200, WSL2 kernel 6.6.87.2-microsoft-standard-WSL2).

The orchestrator and Phases 0, 1, 2, 3, 5, and 6 run in a central processing unit (CPU)-only conda environment on the host. Phase 4 training trials are dispatched to a clearml-agent worker running in a separate CUDA-enabled conda environment on the same host. The two environments are kept separate to isolate the CUDA runtime and its PyTorch dependencies from the CPU phases; both run Python in the same 3.x line, with the specific minor version pinned in the released environment files for reproducibility.

Phase 4 training is configured for reproducibility on the GPU. A fixed seed of 42 is set across Python’s random, NumPy, and PyTorch at the start of every trial, and PyTorch’s cuDNN backend is run in deterministic mode (cudnn.deterministic = True, cudnn.benchmark = False). The PyTorch wheel used on the GPU worker bundles CUDA 12.8 and cuDNN 9.10.2. Residual run-to-run variance from non-deterministic CUDA kernels in deep-learning operators is acknowledged but not separately controlled. Each phase task records its pip freeze, the orchestrator’s git commit hash, and the CUDA toolkit version in the ClearML task metadata for cross-checking against the released code archive.
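
A minimal sketch of such a seeding routine is given below. The function name `seed_everything` is hypothetical, and the guarded imports are an assumption made so that the sketch also runs in an environment without NumPy or PyTorch; the library calls themselves are the standard seeding entry points of each package.

```python
import random


def seed_everything(seed=42):
    """Seed Python, NumPy, and PyTorch and request deterministic cuDNN."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed in this environment
    try:
        import torch

        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # no effect when CUDA is unavailable
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed in this environment
    return seed
```

Calling the routine at the start of every trial, as described above, fixes the random streams but does not remove the residual variance from non-deterministic CUDA kernels.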

3.2. Dataset and Target

The dataset used in this study is an hourly electricity-consumption record for a single building, labelled L14 throughout this paper. L14 is a Norwegian kindergarten instrumented with an hourly-resolution main-feed electricity meter and aligned with weather observations from a co-located meteorological station that records temperature, wind speed, wind direction, humidity, and solar radiation. The record covers four full years from 1 January 2018 to 31 December 2021, with 35,063 aligned hourly records after timestamp alignment and preprocessing. Mean consumption is 7.60 kW with a standard deviation of 4.64 kW. Figure 3 summarises the operational profile of the L14 record across four panels: the hourly time series across the full 2018 to 2021 record (top left), the daily usage pattern averaged across all days (top right), the marginal distribution of hourly usage values (bottom left), and the average usage by calendar month (bottom right). The four panels together reveal the kindergarten building’s signature, namely a flat overnight baseline near 4 to 5 kW, a pronounced daytime ramp peaking at around 11 to 12 kW between mid-morning and early afternoon, a bimodal usage distribution that separates off-hours from operating-hours regimes, and a seasonal pattern with the highest demand in winter and a sharp summer dip in July when the kindergarten is closed for summer break.

Figure 4 shows the corresponding weather context, summarised as monthly averages of the four scalar weather variables that the integrated dataset feeds into Phase 3 feature engineering: 2 m air temperature, global solar radiation, wind speed, and relative humidity. The Nordic seasonality is visible in all four: temperature ranges from a January mean near freezing to a July mean of roughly 20 °C, solar radiation rises from near zero in December to a June peak above 200 W/m², wind speed is relatively flat at around 2 m/s with a small spring high and a November dip, and relative humidity is anti-correlated with temperature, peaking at about 90% in late autumn and dropping to about 65% in spring. Wind direction is omitted from this figure because a monthly mean is not meaningful for a circular variable; it enters the model as cyclic sine and cosine encodings produced in Phase 3.

The forecasting task is a 24-step-ahead hourly load-forecasting problem: at each forecast origin, the model predicts electricity demand for the next 24 hourly intervals, and evaluation metrics are computed over all hourly predictions in the test window. The aligned modelling table contains 35,063 hourly records over the period 1 January 2018 to 31 December 2021. An 80/10/10 temporal split is used for model development and testing. After applying the 168 h input window and 24 h forecast horizon, the test window yields 146 daily forecast origins, corresponding to 3504 evaluated hourly predictions. The test window runs from 7 August 2021 22:00 to 31 December 2021 23:00.
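
The relationship between the test window, the number of forecast origins, and the number of evaluated predictions can be checked with a short sketch. It assumes that one forecast origin is counted per complete calendar day inside the test window, which is consistent with the reported figures but is not stated explicitly above; the function name is hypothetical.

```python
from datetime import datetime, timedelta

HORIZON_HOURS = 24


def count_complete_days(start, end):
    """Count calendar days fully covered by hourly stamps from start to end."""
    first_midnight = start.replace(hour=0, minute=0, second=0, microsecond=0)
    if start > first_midnight:
        first_midnight += timedelta(days=1)  # skip the partial first day
    end_exclusive = end + timedelta(hours=1)  # end stamp is inclusive
    last_midnight = end_exclusive.replace(hour=0, minute=0, second=0, microsecond=0)
    return max((last_midnight - first_midnight).days, 0)


# Test window reported above: 7 August 2021 22:00 to 31 December 2021 23:00.
origins = count_complete_days(datetime(2021, 8, 7, 22), datetime(2021, 12, 31, 23))
predictions = origins * HORIZON_HOURS
```

Under this assumption the sketch gives 146 origins and 3504 hourly predictions, matching the values reported for the test window.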

Phases 5 and 6 (post-deployment serving and monitoring) run on the synthetic provider of the DataProvider abstraction described in Section 2.1. The synthetic stream is generated by averaging the L14 record across the four-year history at each (month, day, hour) slot and stamping the resulting one-year cycle as 2022, which is then replayed through the same abstraction. Consequently, the Phase 5 and Phase 6 results should be interpreted as validation of pipeline mechanics, including model retrieval, serving, monitoring-cycle execution, delayed evaluation, and threshold checks, rather than as evidence of live operational monitoring performance. Both the synthetic and the live provider implementations conform to the common DataProvider interface contract specified in Section 2.1, so downstream phase logic is unchanged between the two paths; a live-versus-offline performance comparison is left to the live-deployment evaluation identified in Section 5.4.
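
The construction of the synthetic stream can be sketched as follows, assuming the historical record is available as (timestamp, kW) pairs. The function name is hypothetical, and skipping the 29 February slot, which has no counterpart in 2022, is an assumption of this sketch rather than a documented behaviour of the synthetic provider.

```python
from collections import defaultdict
from datetime import datetime


def build_synthetic_year(records, target_year=2022):
    """Average kW per (month, day, hour) slot and stamp it onto target_year."""
    slots = defaultdict(list)
    for ts, kw in records:
        slots[(ts.month, ts.day, ts.hour)].append(kw)
    out = []
    for (month, day, hour), values in slots.items():
        try:
            stamp = datetime(target_year, month, day, hour)
        except ValueError:
            # Slot does not exist in the target year (29 February); skipped
            # here as an assumption.
            continue
        out.append((stamp, sum(values) / len(values)))
    return sorted(out)
```

Because every slot is an average over the four-year history, the replayed cycle carries no inserted drift, which is why the Phase 5 and Phase 6 results validate pipeline mechanics only.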

3.3. Experimental Matrix

The experimental matrix defines the set of runs reported in Section 4 and what each run is designed to evaluate. It comprises six runs. The L14 baseline is executed in the two execution modes described in Section 2.5, namely automatic (B-Auto) and human-in-the-loop (B-HIL), and four single-fault variants are executed in automatic mode with human-in-the-loop fallback, each constructed to exercise one specific detection point along the pipeline. The evaluation protocol was fixed before executing the variant runs. Each variant changed exactly one input or configuration dimension relative to the L14 baseline while keeping the remaining pipeline configuration unchanged. The variants were executed in automatic mode until a checkpoint or gate condition fired. Human input was introduced only after the automatic logic had recorded a failure or warning condition, and the operator was restricted to the pre-registered decision options for that checkpoint. The operator was a member of the implementation team, and the permitted decision options and intended response pathways for V1–V4 were fixed before executing the variant runs. The operator did not intervene before the relevant checkpoint or gate condition had been recorded by the automatic logic. The operator was not blinded to the injected condition; therefore, the variants evaluate whether the implemented lifecycle detects and routes the condition as specified, not whether an independent operator would diagnose an unknown field failure. This protocol separates fault detection by the implemented lifecycle from operator disposition after escalation. Table 3 summarises the matrix; the variant configurations and intended detection points are explained in detail in the paragraphs that follow.

In this study, “controlled evaluation” means that each variant modifies one input or configuration dimension relative to the L14 baseline, while the remaining pipeline configuration is kept fixed. This design allows the observed detection point and operator response pathway to be attributed to the injected condition. This is a fault-injection evaluation design: deliberately constructed faults are used to verify that the implemented detection and routing logic activates as specified, and the four variants therefore characterise the validation routes themselves rather than the empirical prevalence of these faults in field deployments. Characterising the prevalence and structure of naturally occurring faults, as opposed to deliberately constructed probes, is a complementary question that a fault-injection design does not address and is identified as a follow-up direction in Section 5.4. The two designs answer different questions: the present one asks whether the routes activate as specified, and the other asks how often and in what form faults arise in the field.

Each variant was evaluated against four route-validation criteria: whether the injected condition was detected before operator disposition, whether the detection occurred at the intended checkpoint or gate, whether the registered routing response was executed, and whether the corresponding decision and artifact evidence were recorded in the audit trail.

The variants exercise four independent dimensions of the pipeline through alternative L14 dataset profiles or alternative intake templates. V1 (Missing Data) uses a corrupted L14 source file produced by a random uniform deletion of 40% of rows, intended to trigger the data-completeness sub-criterion at HIL-P1-1.1.6. V2 (Shuffled Consumption) shuffles the consumption time series while keeping all weather and calendar inputs aligned to the original timestamps, intended to trigger the temporal-structure check at HIL-P2-2.1.6. V3 (Restrictive Feature Selection) configures the feature selection step to retain only the top three features per group, deliberately violating the L14 profile’s minimum feature-coverage expectation. Unlike V1 and V2, this condition is not a raw data-quality fault but a profile-relative configuration anomaly, intended to trigger HIL-P3-3.2.5 for operator disposition. V4 (Missing Doc03/Doc04 Sections) removes the compliance_assessment and technical_feasibility sections from the intake template so that Phase 0 produces Doc03 with empty applicable_policies and required_evidence and Doc04 with empty compute_resources and data_infrastructure, intended to trigger Gate 0’s existence-and-structure check on the foundation document chain. V1 and V2 target sub-checkpoint detection of data-quality or temporal-structure faults, V3 targets sub-checkpoint escalation of a profile-relative configuration anomaly, and V4 targets the main-gate tier on a structural fault that no Phase 0 sub-checkpoint is designed to detect. V2 is the explicit defence-in-depth case, where an upstream sub-checkpoint warning can be overridden and a downstream sub-checkpoint catches the consequence.
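
The two data-level injections, V1 and V2, can be sketched as follows. The sketch assumes rows held as a list of dictionaries with a `consumption` key; the function names, the row layout, and the fixed seeds are hypothetical and are not taken from the variant-generation scripts.

```python
import random


def drop_rows(rows, fraction=0.40, seed=0):
    """V1-style fault: delete a uniformly random fraction of rows."""
    rng = random.Random(seed)
    n_drop = round(len(rows) * fraction)
    dropped = set(rng.sample(range(len(rows)), n_drop))
    return [row for i, row in enumerate(rows) if i not in dropped]


def shuffle_consumption(rows, seed=0):
    """V2-style fault: permute consumption, leave all other columns in place."""
    rng = random.Random(seed)
    values = [row["consumption"] for row in rows]
    rng.shuffle(values)
    return [dict(row, consumption=value) for row, value in zip(rows, values)]
```

The first function changes the row count and leaves the surviving rows intact, whereas the second preserves the row count, the timestamps, and the marginal distribution of consumption while breaking its temporal alignment with the other columns.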

Across the matrix, the data-source configuration is the L14 record for Phases 0 to 4 and the synthetic provider for Phases 5 and 6. The four model families exercised in Phase 4 by the L14 dataset profile are RNN, LSTM, Transformer, and XGBoost.

The four variants are constructed validation probes designed to exercise selected detection points. Each isolates one degraded-input or degraded-configuration condition so that the observed detection event can be attributed to the injected condition. Together with the two baseline runs, the matrix provides operational evidence that the unmodified pipeline executes a complete, governed run end-to-end and that the selected checkpoint and gate routes fire when their triggering conditions are present.

4. Controlled Operational Evaluation

Section 4 reports the controlled operational evaluation evidence collected from the six runs defined in the experimental matrix of Section 3.3. Section 4.1 reports the outcomes of the two baseline runs of L14, Section 4.2 reports the outcomes of the four controlled variant runs, and Section 4.3 reports the computational cost and platform overhead measured across all six runs.

4.1. Baseline Execution Outcomes (L14, Automatic and Human-in-the-Loop Modes)

Two complete runs of the L14 baseline were executed under identical pipeline configuration, differing only in execution mode: the automatic-mode run (B-Auto) and the human-in-the-loop run (B-HIL). Together they exercise the pipeline’s full Phase 0 to Phase 6 forward path under the two configurable execution modes.

Under automatic mode, all seven phases completed in sequence and all seven gates G0 through G6 passed without any loopback being triggered. Phase 4 ran its full hyperparameter optimisation and selected an XGBoost champion model with a 24-h-ahead test root mean square error (RMSE) of 1.19 kW; the per-phase wall-clock breakdown for this run is reported in Section 4.3.

Under human-in-the-loop mode, the pipeline paused at every routine checkpoint whose firing precondition was satisfied on the run. Of the 17 enabled checkpoints (representative entries in Table 2), 15 fired and were approved by the operator with rationale and reviewed-artefact entries written to the audit trail. The remaining two are trigger-conditional and did not fire because their preconditions were not satisfied on a run that completed without faults: HIL-P1-1.1.6, a data-quality check that escalates only on a negative finding, and HIL-P4-4.3.4, a root-cause classification triggered only by a performance-check failure.

L14 was applied to the pipeline through configuration alone, with the active dataset profile and the standard task-intake template fed into Phase 0 at runtime and no pipeline code modified.

Phase 1 ingested two streams onto a UTC-aligned hourly grid: the L14 electricity meter record and the co-located weather record. The initial quality check recorded 24 missing values in the consumption column and per-column missing counts of 0 to 120 in the weather variables (Rh fully populated; T, SolGlob, Ws, and Wd each below 0.4% missing); the per-column summary is given in Table 4, where each column spans 35,063 hourly records.

Phase 2 produced the cleaned dataset that Phase 3 consumes. Exploratory analysis confirmed the diurnal and weekly patterns visible in the Section 3.2 profile (working-hours peak, overnight floor, weekday-versus-weekend contrast). The cleaning step then resolved the per-column gaps recorded in Table 4, applied variable-specific outlier rules, and emitted the cleaned dataset as the second link in the ClearML lineage chain.

Phase 3 generated derived features across five categories: temporal cyclic encodings (hour, day-of-year, month sin/cos); calendar indicators (working hours, school day, holiday, weekday/weekend); weather summaries and lagged values; consumption lags (1 h to 168 h shifts plus rolling means); and interactions (calendar × consumption, weather × calendar). Each candidate feature was scored by combining a primary effect-size or correlation metric with mutual information, and the composite-score selection retained 45 features for downstream modelling. The selected features ranked by composite score are shown in Figure 5, colour-coded by category. The highest-scoring features are consumption lag and rolling-mean variables (composite scores above 0.6); calendar indicators and calendar × consumption interactions occupy the middle tier; and the raw weather summary features appear in the lower half of the ranking. The selected feature set is registered as the third link in the lineage chain, with the temporal-split and per-fold subset versions following as the fourth and fifth links.

Phase 4 produced four per-family winners, each ranked by validation RMSE within its 24-trial Stage 4.4 pool. To contextualise the forecasting performance, a seasonal-naive baseline was computed on the same temporal split. The baseline predicts each target value using the value observed one week earlier, i.e., y[t + h] = y[t + h − 168], and is used only as a reference for forecast credibility rather than as the main evaluation object of this paper. Test-set performance is summarised in Table 5. For reference, the seasonal-naive baseline on the same test split achieves RMSE 1.725 kW, mean absolute error (MAE) 1.168 kW, mean absolute percentage error (MAPE) 19.12%, coefficient of determination (R²) 0.861 (per the val_ranked Stage 4.4 naive-seasonal statistics record). Relative to the seasonal-naive RMSE of 1.725 kW, the XGBoost champion’s RMSE of 1.19 kW corresponds to an approximately 31% reduction in RMSE.
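
The seasonal-naive rule and the relative RMSE reduction can be written out as a short sketch. The function names are hypothetical, and the series is assumed to be a plain list of hourly values; only the lag of 168 h and the two RMSE values are taken from the text above.

```python
import math


def seasonal_naive(series, season=168):
    """Predict each value from index `season` onward by the value one week earlier."""
    return [series[t - season] for t in range(season, len(series))]


def rmse(actual, predicted):
    """Root mean square error over paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def relative_reduction(baseline, model):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - model) / baseline
```

With the reported values, `relative_reduction(1.725, 1.19)` evaluates to approximately 0.31, which is the 31% reduction quoted above.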

Although the central evidence of this paper concerns the gate-enforced validation route rather than absolute forecasting accuracy, feature selection and hyperparameter search are nevertheless exercised automatically within every run, and their outcomes can be read as an internal sensitivity exploration. On the L14 baseline, Phase 3 reduces the engineered feature pool to the 45 features shown in Figure 5 through a composite score that weighs a primary effect-size or correlation metric equally with mutual information, with correlation-based pruning and per-category retention across the five feature groups. Phase 4 then performs a two-stage search per family: Stage 4.3 evaluates the 31 non-empty subsets of the five feature groups by validation RMSE and forwards the top three, and Stage 4.4 runs an eight-cell hyperparameter grid on each forwarded subset, so each per-family result in Table 5 is the winner of a search over feature-group composition and hyperparameter settings rather than a single fixed configuration. The most informative outcome of this search is that the XGBoost champion is selected using only the temporal and calendar feature groups, with the weather, lag, and interaction groups dropped, while the three remaining families converge within a narrow 1.40 to 1.50 kW RMSE band; this indicates that the L14 signal is dominated by calendar and temporal structure and that the champion ranking is stable across the explored configuration range.
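
The size of this search space follows directly from the counts given above and can be verified with a short sketch. The group labels are shorthand for the five feature categories of Phase 3, and the constant and function names are hypothetical.

```python
from itertools import combinations

# Shorthand labels for the five Phase 3 feature categories.
FEATURE_GROUPS = ("temporal", "calendar", "weather", "lag", "interaction")
TOP_K = 3        # subsets forwarded from Stage 4.3
GRID_CELLS = 8   # hyperparameter cells per forwarded subset in Stage 4.4
FAMILIES = 4     # RNN, LSTM, Transformer, XGBoost


def non_empty_subsets(groups):
    """Enumerate every non-empty combination of feature groups."""
    return [c for r in range(1, len(groups) + 1) for c in combinations(groups, r)]


subsets = non_empty_subsets(FEATURE_GROUPS)   # 2**5 - 1 combinations
trials_per_family = TOP_K * GRID_CELLS
total_trials = trials_per_family * FAMILIES
```

The enumeration yields 31 subsets, 24 Stage 4.4 trials per family, and 96 trials in total, in agreement with the per-family pool size and the trial count reported in Section 4.3.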

The deployment champion is XGBoost, selected as the family with the lowest test root mean square error among the four val-ranked Stage 4.4 winners; the val ranking is consistent with this choice (validation RMSE 1.509 kW for XGBoost versus 1.512 kW for RNN, 1.551 kW for LSTM, and 1.587 kW for Transformer). The XGBoost champion is trained with 400 estimators, max_depth 6, and learning_rate 0.05 on the time and calendar feature group selected at Phase 4.3; weather inputs are not retained by the feature-subset search for this family on this dataset, so the deployed model is weather-free under the L14 feature-selection regime. This result should be interpreted as specific to the L14 dataset, the selected forecast horizon, and the configured feature-selection regime; it is not intended as a general claim that weather variables are unimportant for building-energy forecasting. Champion selection here is an operational rule of the pipeline (the family with the lowest RMSE among the Stage 4.4 winners is registered as the deployment model) rather than a claim of statistically significant superiority over the other families; a formal pairwise significance test of forecasting accuracy belongs to a forecasting-performance study and is outside the lifecycle-evaluation scope of the present paper, as identified in Section 5.4.

The diurnal bias of the four Phase 4 family champions on the L14 test set is shown in Figure 6, plotted as the hourly mean signed residual (predicted minus measured) from 00:00 to 23:00. The XGBoost trace stays within approximately ±0.25 kW across all 24 h. The RNN and LSTM traces follow similar shapes, oscillating between roughly −0.7 kW and +0.9 kW with co-located positive peaks at hour 5 and hour 17. The Transformer trace has the largest amplitude, remaining positive between hour 8 and hour 16 and reaching approximately +1.0 kW at hour 8, with negative values in the early morning and evening hours.
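
The quantity plotted in Figure 6 can be computed as in the following sketch, which assumes three aligned sequences holding the hour of day, the predicted value, and the measured value of each test-set prediction. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def hourly_mean_signed_residual(hours, predicted, measured):
    """Mean of (predicted - measured) grouped by hour of day."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, pred, meas in zip(hours, predicted, measured):
        sums[hour] += pred - meas
        counts[hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sorted(sums)}
```

A positive value at a given hour indicates systematic over-prediction at that hour, and a negative value indicates under-prediction.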

A representative seven-day segment of the champion’s forecast against measured consumption is shown in Figure 7, covering 8 August 2021 to 14 August 2021 (the first complete week of the test window). The measured hourly demand (black) and the XGBoost 24-h-ahead forecast (gold) are plotted on a common axis. On the five weekdays (9 August 2021 to 13 August 2021) consumption rises from an overnight floor near 4 kW to daytime peaks of approximately 14 to 17 kW; on 8 August 2021 and 14 August 2021 consumption stays close to the overnight floor across the full day.

Phase 5 retrieved the champion model file by following the document chain and then brought up the FastAPI service and Streamlit dashboard. Phase 5 used synthetic data through the DataProvider fallback. The deployment results record that the pipeline mechanics (model retrieval, service startup, data connection scaffolding) executed without error. The operator-facing forecast view of the dashboard is shown in Figure 8. It comprises the 24-h-ahead forecast as a time-series overlay on the 168 h history and as an hourly chart; three top-row metric cards (24 h forecast total in kW with peak hour, the corresponding measured value with the kW delta against the forecast, and the data-through timestamp); a right-column information panel containing a data-quality card, an expandable model-metadata view, a feature-groups summary, and a forecast-accuracy card; and a sidebar that selects the data source (synthetic, live, or fallback) and the forecast end-time and triggers forecast generation.

Phase 6 ran 16 monitoring cycles to exercise the monitoring-loop mechanics: delayed evaluation, population-stability-index drift computation, and threshold checks against the baseline monitoring summary. Because the synthetic stream is an averaged replay of the L14 four-year record (Section 3.2), the cycles do not carry inserted drift; the run therefore confirms loop execution and threshold-check behaviour rather than detection performance under live drift conditions, the latter being identified as the priority extension in Section 5.4. The Phase 6 monitoring view of the dashboard, accessed through the Phase 5/Phase 6 tab selector, is shown in Figure 9. It comprises four top-row metric cards (current MAPE, data completeness, prediction outliers, average prediction); a quality-evaluation status badge; and a monitoring-cycle panel reporting the last cycle, the next check time, time remaining, the last Gate 6 review, and the next governance audit.
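
For reference, the population stability index can be computed as in the following sketch, which uses the standard definition over a set of bins. The bin edges, the floor `eps` applied to empty bins, and the function name are assumptions of this sketch; the binning and thresholds used by the Phase 6 monitoring loop are not specified here.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population stability index between two samples over shared bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of (sorted) edges strictly below the value.
            counts[sum(v > e for e in edges)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

Identical samples give an index of zero, and each bin contributes a non-negative term, so the index grows as the two distributions diverge.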

4.2. Controlled Variant Experiments

The four variants were executed in automatic mode with human-in-the-loop fallback. Each variant triggered a targeted condition that the pipeline detected automatically before operator intervention and then escalated for disposition. V1, V2, and V3 were caught at the sub-checkpoint tier within their targeted phases, while V4 was caught at the main-gate tier at Gate 0 on a structural foundation-document fault that no P0 sub-checkpoint is designed to detect. Figure 10 records the run-level outcomes across the six runs, and per-variant narratives follow.

Variant V1 (Missing Data) triggered HIL-P1-1.1.6 during the initial data-quality check. The operator selected REVISE and changed the active dataset profile from the corrupted missing-row source back to the clean L14 source. The orchestrator looped back to Phase 1, reloaded the revised Phase 0 configuration documents, and re-executed the downstream phases. The run recovered and subsequently passed Gates G1 through G6 on the corrected data path.

Variant V2 (Shuffled Consumption) shuffled the consumption values while keeping timestamps, weather variables, and calendar variables unchanged. HIL-P2-2.1.6 detected the resulting temporal-structure mismatch during EDA. The operator overrode the warning to test downstream defence-in-depth behaviour. The pipeline continued to Phase 4, where the post-training performance check failed with test RMSE 4.77 kW and test R² −0.04. The operator then selected ABORT. V2 therefore demonstrates that an overridden upstream warning can still be caught by an independent downstream performance criterion.

Variant V3 (Restrictive Feature Selection) retained only the top three features per category, producing 15 selected features and violating the L14 profile’s minimum feature-coverage expectation. HIL-P3-3.2.5 escalated the profile-relative anomaly. After reviewing the per-category coverage table, the operator selected ABORT and the pipeline terminated before Gate 3. V3 therefore exercises immediate termination at the sub-checkpoint tier.

Variant V4 (Missing Doc03/Doc04 Sections) removed the compliance_assessment and technical_feasibility sections from the intake template. Phase 0 consequently produced Doc03 and Doc04 with empty structural fields. No Phase 0 sub-checkpoint is designed to validate this cross-document structure, but Gate 0 detected the incomplete foundation-document chain and escalated to the operator. After reviewing the generated Log Issues document, the operator selected FAIL and the pipeline terminated without loopback. V4 therefore exercises immediate termination at the main-gate tier.

Across the four variant runs, the experiment design isolates two operational properties of the pipeline. First, every variant condition was surfaced by the checkpoint-and-gate logic before any human review took place. V1 (data-completeness breach) and V2 (temporal-structure breach) trigger threshold-driven fault conditions at sub-checkpoints; V3 (configuration anomaly relative to the L14 minimum-feature-coverage requirement) surfaces at HIL-P3-3.2.5 as anomaly escalation requiring operator disposition rather than as a strict fault detection in the same sense as V1 and V2; and V4 (structural foundation-document fault, with Doc03 policies and Doc04 compute_resources empty) is caught at Gate 0 on the main-gate tier through the existence-and-structure check on Doc01 to Doc04, on a fault class that no P0 sub-checkpoint is designed to detect. Second, as a within-run specificity indication, in each variant run the gates and sub-checkpoints other than the targeted detection point passed under their configured criteria, so the gate machinery did not over-trip on conditions injected elsewhere in the pipeline; because the gates are deterministic threshold tests evaluated on a single run per condition, this is an observation of no spurious trips rather than an estimated false-positive rate.

4.3. Computational Cost

End-to-end wall-clock time for the run that produced the champion is dominated by Phase 4 hyperparameter optimisation. Per-phase wall-clock time for the L14 baseline is reported in Table 6. The lightweight phases (P0–P3, P5) each complete in seconds to minutes per cycle and contribute under ten minutes in total to a run. Phase 4’s full Stage 4.4 sweep ran 96 trials (24 per family across four families) sequentially on a single GPU-equipped worker and spanned multiple calendar days; the four val-ranked winning trials had verified per-trial durations of 33.9 min (XGBoost), 3.3 min (RNN), 4.7 min (LSTM), and 15.6 min (Transformer). The orchestrator and the lightweight phases ran in a CPU-only Python 3.12 environment, while Phase 4 trials were dispatched to a CUDA-enabled Python 3.13 agent on the same host: the neural-network trials used the RTX 3060 GPU, and the XGBoost trials were CPU-bound on the host’s Ryzen 9 5900X with negligible GPU activity. The interpretive implications of single-worker sequential execution are discussed in Section 5.2.

Storage volumes per run are characterised by registered objects rather than byte sizes. The dataset registry records the five datasets along the lineage chain; the integrated dataset alone is 14.4 MB across CSV and Parquet representations. The document artifact store contains 29 of the 37 unique document IDs on a run that completed without faults, with the eight conditional documents (seven gate-fail and the quality-fail Doc62) not firing. Phase 4 contributes the bulk of the artifact volume through per-trial scalar logs and plots.

5. Discussion

Table 7 provides the roadmap for this discussion, linking each operational claim to the evidence reported in Section 4 and to the corresponding scope limitation. The following sections expand these claims by discussing the lifecycle outcomes, the two-tier validation behaviour, the role of ClearML, and the remaining limitations.

5.1. Operational Insights from the Implementation

The six-run experimental matrix supports a set of observations. The seven-phase lifecycle of [14], with its two-tier verification, Doc01–Doc64 governance chain, and HIL-routed fault-recovery rules, was proposed as a paper specification; the present paper is the first empirical study to exercise these elements as a running governed system under controlled conditions, and the observations below rest on that empirical evidence. They fall into two groups, corresponding to the first two contributions stated in Section 1. Section 5.1.1 reads the baseline outcomes as insights about the configuration-driven operationalisation of the seven-phase lifecycle. Section 5.1.2 reads the four controlled-variant outcomes as insights about the two-tier validation system that distinguishes this implementation from a paper-only specification.

5.1.1. Configuration-Driven Operationalisation of the Lifecycle

The lifecycle is operationalisable end-to-end. The baseline automatic execution completed all seven phases without human intervention, all seven gates passed, the document chain produced its full complement of artifacts, with Phase 5 and Phase 6 entries generated under the synthetic-data configuration described in Section 3.2, and the four ClearML services (experiment tracker, dataset manager, artifact store, and remote-queue execution) carried the platform-side workload throughout. The configuration-driven design allowed L14 to be applied using only a dataset profile and the standard intake template. The multi-family training results in Table 5 show that the gate-enforced pipeline produces directly comparable models across families on the same data preparation. The champion test RMSE of 1.19 kW at a 24 h horizon (R² 0.934, MAPE 15.1%; Table 5), compared with the seasonal-naive baseline RMSE of 1.725 kW on the same test split, indicates that the implemented lifecycle produces a credible forecasting model. This result supports the operational evaluation but is not the primary contribution of the paper. A finer point: the deployed XGBoost champion uses the time and calendar feature group only, with weather inputs not retained by the Stage 4.3 feature-subset search for this family on this dataset, indicating that on this kindergarten’s record under the configured selection regime the diurnal-and-calendar structure carries the bulk of the predictable signal.
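The comparison between the champion and the seasonal-naive baseline can be expressed as a relative RMSE reduction. The sketch below is a worked calculation using the two RMSE values reported above; the helper names `rmse` and `skill_vs_baseline` are introduced here for illustration and are not part of the implementation.

```python
import math


def rmse(actual, predicted):
    """Root mean square error over paired observations and forecasts."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def skill_vs_baseline(model_rmse, baseline_rmse):
    """Relative RMSE reduction: 0 means no better than the baseline, 1 is perfect."""
    return 1.0 - model_rmse / baseline_rmse


# Values reported in the text: champion 1.19 kW, seasonal-naive 1.725 kW.
skill = skill_vs_baseline(1.19, 1.725)
print(f"RMSE reduction vs seasonal naive: {skill:.1%}")  # prints 31.0%
```

On these figures the champion reduces test RMSE by roughly 31% relative to the seasonal-naive baseline on the same split.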

The Doc01 and Doc02 propagation pattern does two jobs at once. The pattern was introduced in this implementation as a traceability mechanism, since every downstream phase loads its parameters from the Doc01 and Doc02 generated by Phase 0, so every observable phase behaviour is traceable through the document chain to the original task intake. Variant V1 made a second job visible. When the operator paused at HIL-P1-1.1.6 and revised Doc01 and Doc02 to point at a corrected data source, the pipeline picked up the revisions on resumption and re-executed the affected phases with no further code or configuration changes. The same propagation that gives traceability also gives revision-driven recovery. A pipeline that does not enforce this kind of single-source-of-configuration discipline cannot offer either property cleanly.
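The dual role of the propagation pattern can be illustrated with a toy phase that reads its parameters only from a foundation document on disk. This is a sketch under assumptions: the JSON file layout, the `data_source` field, the file names, and the function names are hypothetical stand-ins for Doc01 and Doc02, chosen only to show why revising the document is sufficient for the revision to take effect on re-execution.

```python
import json
import os
import tempfile


def load_foundation(path):
    """Read the foundation document fresh on every call (no hidden state)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def run_ingestion_phase(foundation_path):
    """Hypothetical phase: every parameter comes from the foundation document."""
    params = load_foundation(foundation_path)
    return {"phase": "P1", "data_source": params["data_source"]}


with tempfile.TemporaryDirectory() as tmp:
    doc_path = os.path.join(tmp, "doc01.json")  # stand-in for Doc01

    with open(doc_path, "w", encoding="utf-8") as fh:
        json.dump({"data_source": "faulty.csv"}, fh)
    first_run = run_ingestion_phase(doc_path)

    # Operator revision at the checkpoint: only the document changes.
    with open(doc_path, "w", encoding="utf-8") as fh:
        json.dump({"data_source": "corrected.csv"}, fh)
    second_run = run_ingestion_phase(doc_path)
```

The phase code is identical in both runs; the only difference is the document it loads, which is also the record that makes the behaviour traceable.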

The audit-trail mechanism produces a complete record of decisions and artifacts even on a fault-free run, not only on failure. The baseline human-in-the-loop run paused the pipeline at every routine checkpoint reached on a fault-free run and recorded an approval decision with rationale at each one, including each Phase 5 sub-checkpoint (HIL-P5-5.1.3, HIL-P5-5.2.4, HIL-P5-5.3.3, and HIL-P5-5.4.3) and Gate 5. The same record is generated by Gate review even when no fault was detected. The implementation made the presence of this mechanism visible because the baseline was run in both human-in-the-loop mode and automatic mode.

In short, the implementation evidence supports the first contribution: the lifecycle of [14] is operationalisable as a governed, configuration-driven system when the single-source-of-truth discipline anchored on Doc01 and Doc02 is enforced across all seven phases.

5.1.2. Two-Tier Validation Under Controlled Faults

The controlled variants show that the two-tier validation design changes the behaviour of the running lifecycle rather than merely documenting review points. Sub-checkpoints catch phase-local technical conditions while the required evidence is still local to a phase, whereas main gates catch cross-artifact and governance-level conditions that require evidence spanning multiple documents or phases. This separation is visible in V1–V3, which are detected before their corresponding gates, and in V4, which is detected at Gate 0 through a foundation-document structure check.

Defence in depth emerged as a structural property of the design rather than a feature deliberately built. Variant V2 is the centrepiece evidence. An automatic detection at HIL-P2-2.1.6 was overridden by the operator, the pipeline ran through Phases 3 and 4, and the Phase 4 post-training performance check, escalating to HIL-P4-4.3.4, caught the downstream consequence of the override using an entirely different metric: model performance rather than temporal autocorrelation. The criteria of the two checkpoints are independent, and that independence is what makes layering work. The implication is that checkpoint independence should be treated as a first-order design constraint when adding new checkpoints, since a checkpoint whose decision is conditioned on an upstream checkpoint decision adds an audit-trail entry but no genuine detection redundancy.
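Checkpoint independence can be made concrete with two checks that consume different evidence and share no state. The sketch below is illustrative only: the lag of 24 hours, the `min_acf` threshold, the synthetic sine-shaped load, and both function names are assumptions, not the criteria configured in the implementation.

```python
import math
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return covariance / variance


def temporal_structure_check(series, lag=24, min_acf=0.3):
    """Upstream check: passes if the daily autocorrelation is preserved."""
    return autocorrelation(series, lag) >= min_acf


def performance_check(model_rmse, baseline_rmse):
    """Downstream check: passes if the model beats the baseline. It takes no
    input from the upstream check, so an upstream override cannot mask it."""
    return model_rmse < baseline_rmse


# Synthetic four-week hourly load with a daily cycle (toy data, not L14).
load = [10 + 5 * math.sin(2 * math.pi * h / 24) for h in range(24 * 28)]
shuffled = load[:]
random.Random(0).shuffle(shuffled)  # V2-like loss of temporal order

ordered_passes = temporal_structure_check(load)
shuffled_passes = temporal_structure_check(shuffled)
```

Because `performance_check` is a function of model and baseline error alone, it retains its detection value even when the operator overrides the temporal-structure result.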

The two execution modes serve different evaluation purposes. Automatic mode demonstrates routine execution without unnecessary operator pauses. Human-in-the-loop mode demonstrates that routine approvals, rationales, and reviewed artifacts can be captured in the audit trail. The auto-with-fallback regime used for V1–V4 separates automatic detection from operator disposition: the pipeline runs without human input until a checkpoint or gate condition fires, after which the operator selects one of the registered response options. This separation is central to the controlled evaluation because it makes the detection event attributable to the implemented lifecycle rather than to operator vigilance.
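The auto-with-fallback regime described above can be sketched as a loop that consults the operator only after a check has fired. The function name, the log format, and the option labels `"revise"`, `"override"`, and `"abort"` are hypothetical; they loosely mirror the three response pathways but are not the registered option names of the implementation.

```python
def run_with_fallback(checks, operator):
    """Run checks in order; call the operator only when a check fires.

    checks   -- list of (name, zero-argument callable returning True on pass)
    operator -- callable taking the check name and returning a decision label
    """
    log = []
    for name, check in checks:
        if check():
            log.append((name, "pass"))
            continue
        # Detection is recorded before any human input is requested.
        log.append((name, "fired"))
        decision = operator(name)
        log.append((name, f"operator:{decision}"))
        if decision == "abort":
            break
    return log


operator_calls = []


def scripted_operator(name):
    """Stand-in for the human operator; records that it was consulted."""
    operator_calls.append(name)
    return "abort"


demo_log = run_with_fallback(
    [("check-A", lambda: True), ("check-B", lambda: False), ("check-C", lambda: True)],
    scripted_operator,
)
```

In the resulting log, the `fired` entry always precedes the operator entry, which is the ordering that lets a detection be attributed to the pipeline rather than to the operator.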

Surface area at human-in-the-loop escalation is matched to failure type. V1 surfaced ingestion metrics (the missing-row count from Doc12 against the Doc02 threshold). V2 surfaced an analytical artifact (the autocorrelation profile from EDA) rather than a numeric flag. V3 surfaced a per-category coverage table from feature selection. This pattern is consistent with the data-quality-by-design principle articulated by Schelter et al. [22] and the validation-as-first-class-citizen view of Polyzotis et al. [23], and it gives the operator the right evidence to diagnose the cause rather than only the symptom. A governance system that escalates a numeric flag without the underlying artifact forces the operator to do a separate forensic step before deciding, and the design here avoids that.
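The idea of escalating evidence rather than a bare flag can be sketched for the V1 case. The `Escalation` record, its field names, and the `completeness_check` function are assumptions made for illustration; the checkpoint ID is the one named for V1 in Section 5.1.1, and the threshold value in the example is a placeholder, not the Doc02 value.

```python
from dataclasses import dataclass, field


@dataclass
class Escalation:
    """Hypothetical escalation record shown to the operator."""
    checkpoint: str
    reason: str
    evidence: dict = field(default_factory=dict)


def completeness_check(rows, max_missing_fraction):
    """Return None on pass, or an Escalation carrying the supporting evidence."""
    missing = [i for i, value in enumerate(rows) if value is None]
    fraction = len(missing) / len(rows)
    if fraction <= max_missing_fraction:
        return None
    return Escalation(
        checkpoint="HIL-P1-1.1.6",
        reason="missing-row fraction above threshold",
        evidence={
            "missing_count": len(missing),
            "total_rows": len(rows),
            "fraction": fraction,
            "threshold": max_missing_fraction,
            "first_missing_indices": missing[:5],
        },
    )


# Toy ingestion result with two of four rows missing (placeholder threshold).
escalation = completeness_check([1.0, None, 2.0, None], max_missing_fraction=0.25)
```

The operator receives the counts, the threshold, and the location of the first gaps in one record, so the cause can be assessed without a separate forensic step.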

In short, the controlled variants support the second contribution by showing that the implemented lifecycle can detect, escalate, route, and resolve selected degraded-input and degraded-configuration conditions through the intended validation tiers and operator response pathways. Together, these observations support treating the two-tier separation as a structural property of the validation design.

5.2. Operational Role of the MLOps Platform

The pipeline is built on a single self-hosted MLOps platform that provides four ClearML services exercised across the seven phases: three storage-tier services (experiment tracker, dataset manager, artifact store) plus a remote-queue execution service that handles GPU dispatch through clearml-agent workers attached to named queues. In this implementation, ClearML provides the MLOps substrate but not the governance logic itself. Experiment tracking, dataset versioning, artifact storage, and remote queue execution are platform services. Checkpoint routing, human-in-the-loop decision handling, document validation, pause-resume behaviour, and loopback resolution are application-level services implemented around the platform. Table 8 maps the four services to the seven phases that exercise them. The remainder of this section reflects on what the platform delivered and where it imposed friction.

All four services supported the implementation without requiring replacement by custom substitutes. The experiment tracker carried the full Phase 4 hyperparameter optimisation, with each of the 96 training runs logged together with its hyperparameter values, training and validation curves, test metrics, and per-task resource scalars; absent this service a custom training-run logger would have been needed. Dataset versioning underpins the five-link lineage chain, where each dataset records its parent and rollback to any earlier version is a single API call. Artifact storage doubles as both a model registry and the cross-phase document store: Phase 5 retrieves the champion by scoping to the most recent Phase 4 orchestrator task and reading Doc45 to identify the winning child trial, while any document in the Doc01 to Doc64 family is retrievable by name from a later phase and returns its most recent version. The remote-queue layer splits CPU and GPU work cleanly, with the orchestrator and the lightweight phases running on the CPU and Phase 4 trials dispatched to the CUDA-enabled agent on the same host. These platform characteristics broadly correspond to those identified in MLOps platform reviews [10,11,24,25].

Several required capabilities were outside ClearML’s native scope and were therefore implemented at the application layer. The platform offers no native human-in-the-loop interface, so the Dashboard, Checkpoint Manager, and Decision Manager described in Section 2.5 are application code, with only the storage of decision artifacts and pause-state files relying on the platform. Content-aware artifact search is also absent, since the document family is stored as opaque blobs and cross-document governance queries have to be implemented in application code by retrieving and parsing artifacts. Pause-state persistence revealed a more subtle gap: the platform task model assumes a task is alive when artifacts are written, while the orchestrator pauses between phases when no task is active, and the implementation falls back to local file storage for pause-states in those moments and reconciles to platform artifacts when a task becomes active. Application-level memory is similarly unmanaged, since Phase 4 caches the foundation documents at module load time and the cache must be cleared explicitly after operator revision before re-execution. Finally, the helper layer that wraps the platform API would require rewriting on migration, even though the lifecycle architecture itself is platform-agnostic.
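The pause-state workaround described above (write locally while no task is active, reconcile once a task exists) can be sketched as follows. This is not the implementation's code: the class name, the `.pause.json` file suffix, and the in-memory dictionary standing in for platform artifact storage are all assumptions, and no ClearML calls are used.

```python
import json
import os
import tempfile


class PauseStateStore:
    """Hypothetical store with local fallback and later reconciliation."""

    SUFFIX = ".pause.json"

    def __init__(self, local_dir):
        self.local_dir = local_dir
        self.platform = {}  # stand-in for platform artifact storage

    def _local_path(self, run_id):
        return os.path.join(self.local_dir, run_id + self.SUFFIX)

    def save(self, run_id, state, task_active):
        """Write to the platform stand-in if a task is alive, else to disk."""
        if task_active:
            self.platform[run_id] = state
            return "platform"
        with open(self._local_path(run_id), "w", encoding="utf-8") as fh:
            json.dump(state, fh)
        return "local"

    def reconcile(self):
        """Move locally held pause states to the platform stand-in."""
        moved = []
        for name in sorted(os.listdir(self.local_dir)):
            if not name.endswith(self.SUFFIX):
                continue
            path = os.path.join(self.local_dir, name)
            with open(path, encoding="utf-8") as fh:
                state = json.load(fh)
            run_id = name[: -len(self.SUFFIX)]
            self.platform[run_id] = state
            os.remove(path)
            moved.append(run_id)
        return moved


with tempfile.TemporaryDirectory() as tmp:
    store = PauseStateStore(tmp)
    # The orchestrator pauses between phases: no task is active.
    where = store.save("run-v1", {"paused_at": "HIL-P1-1.1.6"}, task_active=False)
    # A task becomes active on resumption, so local state is reconciled.
    reconciled = store.reconcile()
    platform_copy = store.platform["run-v1"]
    leftover_files = os.listdir(tmp)
```

The design keeps the platform as the system of record while tolerating the interval in which no task exists to receive the artifact.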

The implication for MLOps platform selection is that platform-native capabilities and application-level governance requirements should be separated explicitly. In this implementation, ClearML made tracking, dataset versioning, artifact storage, and remote execution inexpensive, while HIL workflow, content-aware document validation, pause-state handling, and lifecycle-specific routing required custom application logic.

The wall-clock figures in Table 6 quantify the dominant cost as Phase 4 hyperparameter optimisation. The 96-trial Stage 4.4 sweep ran sequentially on the single GPU-equipped worker, with verified per-trial durations of the four val-ranked winners at 33.9 min (XGBoost, the champion), 3.3 min (RNN), 4.7 min (LSTM), and 15.6 min (Transformer); XGBoost dominated per-trial cost despite being CPU-bound on that worker. Two improvements are prerequisites for industrial deployment: per-task SDK-time instrumentation, to expose what fraction of each phase’s wall-clock is platform overhead versus net compute, and additional ClearML workers on the training queue to parallelise Phase 4 trials within and across families. Storage volumes per run are dominated by the per-trial Phase 4 experiment-tracking artifacts, and multi-building deployment over months would warrant a retention policy aligned with the audit-trail requirements of the deploying organisation.

5.3. Implementation Limitations

Several limitations of the present implementation are stated directly below, with the improvement that would address each one.

One scope note concerns Phase 5 and Phase 6, which were run in an emulation of live operation rather than against a live feed. The deployment and monitoring results confirm that pipeline mechanics function correctly, including model retrieval, service startup, drift detection, delayed evaluation, and quality thresholds, but they were obtained under emulation. The replay stream that drives the emulation is generated by averaging the four-year L14 record at each (month, day, hour) slot (see Section 3.2), so it lacks the intra-day variability and inter-annual drift that a live feed would exhibit. Consequently, the observed drift values and quality metrics reflect this averaged-replay structure rather than live operating behaviour.
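The effect of the slot-averaging construction can be seen in a small example. The sketch below averages readings at each (month, day, hour) slot across years, as described for the replay stream; the function name and the three toy readings are assumptions and are not drawn from the L14 record.

```python
from collections import defaultdict
from datetime import datetime


def average_replay_profile(records):
    """Average load at each (month, day, hour) slot across all years.

    records -- iterable of (timestamp, load_kw) pairs
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, load_kw in records:
        slot = (timestamp.month, timestamp.day, timestamp.hour)
        sums[slot] += load_kw
        counts[slot] += 1
    return {slot: sums[slot] / counts[slot] for slot in sums}


# Toy readings for the same calendar slot in three years with a rising trend.
records = [
    (datetime(2020, 1, 15, 8), 4.0),
    (datetime(2021, 1, 15, 8), 6.0),
    (datetime(2022, 1, 15, 8), 8.0),
]
profile = average_replay_profile(records)
```

The year-on-year increase in the toy readings collapses to a single value of 6.0 kW for the slot, which illustrates why a replay built this way carries no inter-annual drift for the monitoring phase to detect.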

A second limitation is that configuration revision within the human-in-the-loop workflow remains manual. When an escalation calls for a Doc01 or Doc02 edit, the dashboard surfaces the relevant content alongside the diagnostic evidence, but the operator must locate the affected parameter in the underlying configuration or generated document file and edit it directly before resuming the run. This is functionally sufficient, since the pipeline reloads the revised configuration on re-execution and the audit trail records the change, but it places the burden of identifying and applying the correct edit on the operator.

A third limitation is that the pipeline does not autonomously interpret prior phase results to inform the next action. The orchestrator advances phases according to gate outcomes and a fixed routing table, and when a sub-checkpoint surfaces a fault, the response set is encoded as the three operator response pathways described in Section 5.1.2. The pipeline itself does not reason over the content of previous-phase output documents to recommend which pathway best fits the observed condition; that judgement rests with the operator.

These limitations also define the validity boundary of the evaluation. The controlled variants verify whether known degraded-input and degraded-configuration conditions activate the specified checkpoint and gate routes, but they do not constitute a blinded operator study, a live-deployment reliability test, or a multi-building generalisation study. The operator was part of the implementation team, and the variants therefore evaluate lifecycle detection and routing behaviour rather than independent human diagnosis of unknown field failures.

5.4. Future Work

Four priority directions follow directly from the limitations above and would substantially strengthen the work.

The first is live data deployment in Phases 5 and 6. Switching these two phases from emulation to a live data source is a configuration change rather than a code change, because the DataProvider abstraction was designed for this purpose. A live-deployment evaluation over a multi-month operational window would compare the drift signals (PSI, KL divergence, rolling-MAPE trend) against ground-truth labelled drift events, extending the Phase 5 and Phase 6 results from validation of pipeline mechanics under controlled stationary conditions to validation of drift-detection efficacy under nonstationary conditions. Auto-retraining triggered by drift detection would become a natural extension that closes the loop between monitoring and training.

The second is an agentic layer that assists and extends the human-in-the-loop workflow, addressing the manual-revision and gate-routed-automaton limitations together. Two capabilities sit naturally within this layer. A guided-revision component would interpret the diagnostic evidence at escalation, identify the affected configuration fields, and propose the corresponding Doc01 or Doc02 edits with provenance recorded automatically, so that the operator approves a proposed revision rather than authoring it manually. For well-characterised conditions, a more capable agent could apply the revision directly without further human intervention. A decision component above the orchestrator would interpret each phase’s output documents against the gate criteria and historical run statistics and propose the next action, with the operator retaining final authority and the agent contributing a recommendation rather than autonomous control. Together, these capabilities would convert the pipeline from a gate-routed automaton into a result-aware system and would couple naturally with automated threshold adaptation. This layer is also the intended route to stronger configuration handling, interpreting the configuration against the diagnostic evidence at intake rather than relying on a static field-level schema, which addresses the parse-time configuration-validation gap noted in Section 2.1.
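Of the drift signals named for the live-deployment evaluation, the population stability index (PSI) is the simplest to state: it sums (a_i − e_i)·ln(a_i / e_i) over bins, where e_i and a_i are the reference and current bin proportions. The sketch below is one common formulation under assumptions made here: equal-width bins over the reference range, ten bins, clamping of out-of-range values to the edge bins, and a small floor `eps` to avoid division by zero. The binning used in the implementation is not described in this section.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp to edge bins
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


# Toy samples: a uniform reference and a copy shifted upward by 0.5.
reference = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in reference]

psi_same = psi(reference, reference)
psi_shifted = psi(reference, shifted)
```

Identical samples give a PSI of zero, while the shifted sample gives a large positive value; comparing such values against labelled drift events is what the proposed live evaluation would add.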

The third is a stakeholder-engaged evaluation of the gate layer’s broader governance role. The present study evaluates the sub-checkpoint layer in a single-team research setting and exercises the main-gate layer at Gate 0 on one structural foundation-document fault, V4. However, the gate layer’s remaining distinguishing criteria, namely cross-stakeholder review, audit-chain integrity, compliance-status verification, and authorisation sign-off, presuppose stakeholder roles outside the implementation team. A deployment-setting study, ideally co-conducted with a building operator and a compliance officer, would inject realistic governance scenarios, such as an unclosed compliance review propagating across a gate, an incomplete Phase Report sign-off, or a risk recorded as accepted at one gate but not signed off by the responsible role, and verify that the gate layer surfaces these issues on the audit chain and routes them through the appropriate review constituency. Evaluation criteria would be both qualitative and quantitative, covering whether the document chain, dashboard, and audit trail support each stakeholder’s review at the gates.

The fourth is multi-building generalisation across climate zones and occupancy types. Applying the pipeline through dataset profiles to at least three buildings spanning at least two climate zones (for example a hospital, an office, and a residential building), with the pipeline code held fixed and each new building adapted by authoring a new dataset profile only, would test whether the dataset-profile abstraction is portable. A fixed evaluation protocol (same forecast horizon, same gate criteria, same audit-trail completeness check) would then distinguish two separable questions: whether the lifecycle mechanics generalise (does every building complete the same gate routes) and whether the dataset-profile abstraction supports forecasting performance across building types and climates (does each building reach acceptable test RMSE under its own profile). This would test external validity, expose any dataset-specific assumptions in the gate thresholds, and demonstrate the configuration-driven adaptation contribution at scale [4,26,27]. Industrial deployment in a production building-management context is the natural follow-on once multi-building generalisation is established, and would also serve as the setting for the stakeholder-engaged governance evaluation identified above.

Several further directions follow from the peer-review process and are noted here for completeness. A finer-grained module-level coupling and cohesion analysis of the implementation repository would complement the operational-domain dependency structure described in Section 2. Concurrent multi-pipeline and multi-model orchestration, with explicit treatment of decision ordering and audit-trail write-serialisation and with event-driven rather than polled handling, would extend the present single-pipeline sequential design. A statistical characterisation of checkpoint and gate decision outcomes, estimating false-positive and false-negative behaviour through an injected-fault population with ground-truth labels and threshold sweeps that trace the detection trade-off per gate, would move the present demonstration toward measured detection characteristics. An empirical fault-distribution study on longitudinal field data would characterise the prevalence and structure of naturally occurring faults, complementary to the present fault-injection probe study. A parallel declarative-DAG reimplementation of the lifecycle would enable a wall-clock benchmark against the application-code orchestrator. An external sensitivity analysis, with seed-variance estimation, feature-group ablations, and hyperparameter perturbation, would extend the internal Stage 4.3 and Stage 4.4 search reported in Section 4.1. A formal model-comparison study, for example pairwise Diebold–Mariano testing of forecasting accuracy with multiple-comparison control, would address comparative model accuracy as a question distinct from the lifecycle evaluation of the present paper. A content schema and monotonic versioning scheme for the foundation documents Doc01 and Doc02 would replace the present draft-to-approved status label and overwrite-in-place persistence. Selective phase re-execution with phase-output checkpointing, already identified in Section 5.3, would reduce the wall-clock cost of operator-approved loopbacks.

6. Conclusions

This paper presented an executable implementation and controlled validation-route evaluation of a seven-phase machine learning lifecycle for energy forecasting introduced in [14], using ClearML as the MLOps substrate selected in [15]. The pipeline was implemented as a configuration-driven system linking checkpoint enforcement, document artifact governance, and human-in-the-loop oversight. It was exercised on the L14 baseline in automatic and human-in-the-loop modes, and on four controlled variants in automatic mode with human-in-the-loop fallback. The variants were detected by the implemented checkpoint-and-gate logic before human review: V1, V2, and V3 at the sub-checkpoint tier, and V4 at the main-gate tier on a structural foundation-document fault. Together, the variants exercised revise-and-recover, override-then-terminate, and immediate-abort response pathways.

The implementation yields three operational findings, one per contribution. First, the seven-phase lifecycle is operationalisable end-to-end as a governed, configuration-driven system, with the L14 baseline applied through a dataset profile and the standard intake template alone (Section 5.1.1). Second, the two-tier validation system is operationally consequential across the four degraded-input and degraded-configuration conditions exercised by the variants, with the tested sub-checkpoint and main-gate tiers each detecting fault classes outside the other tier’s intended scope, and with the auto-with-fallback regime making detection events attributable to the implemented checkpoint-and-gate logic rather than to operator vigilance (Section 5.1.2). Third, the platform delivers on the four ClearML services it was selected for and requires application-level engineering for the four capabilities it does not reach (Section 5.2).

The main contribution is therefore not the absolute forecasting accuracy, although the XGBoost champion achieved a credible test RMSE of 1.19 kW, R² of 0.934, and MAPE of 15.1%. Rather, the contribution is the demonstration that a governed machine-learning lifecycle for energy forecasting can be implemented as a traceable, revisable, and auditable system on a single MLOps platform, and that its two-tier validation design can be exercised under controlled degraded-input and degraded-configuration conditions. The current evidence is bounded to one building dataset, synthetic post-deployment monitoring, and a single-team operator setting; live Phase 5–6 deployment, stakeholder-engaged gate evaluation, and multi-building validation remain necessary next steps.

Author Contributions

Conceptualization, B.N.J., Z.G.M. and X.Z.; methodology, X.Z., Z.G.M. and B.N.J.; software, X.Z.; validation, X.Z. and B.N.J.; formal analysis, X.Z. and B.N.J.; investigation, X.Z. and B.N.J.; resources, B.N.J. and Z.G.M.; data curation, X.Z.; writing of the original draft, X.Z.; review and editing, B.N.J. and Z.G.M.; visualisation, X.Z.; supervision, B.N.J. and Z.G.M.; project administration, B.N.J.; funding acquisition, Z.G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is part of the project titled “Automated Data and Machine Learning Pipeline for Cost-Effective Energy Demand Forecasting in Sector Coupling” (jr. Nr. RF-23-0039; Erhvervsfyrtårn Syd Fase 2), which is supported by The European Regional Development Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The hourly electricity-consumption record analysed in this study (building L14) is part of the ADRENALIN multi-national sub-metered building energy dataset, publicly available on Zenodo at https://doi.org/10.5281/zenodo.19553414 (accessed on 30 April 2026) under a CC BY 4.0 licence. Configuration files defining the L14 dataset profile, the four controlled variants (V1–V4), the human-in-the-loop checkpoint registry, and the per-run audit trails (gate decisions, operator decisions, document artifacts Doc01–Doc64) used to produce the reported results are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AS	Artifact Store
CPU	Central Processing Unit
CUDA	Compute Unified Device Architecture
DAG	Directed Acyclic Graph
DM	Dataset Manager
EDA	Exploratory Data Analysis
ET	Experiment Tracker
GPU	Graphics Processing Unit
HIL	Human-in-the-Loop
LSTM	Long Short-Term Memory
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
MLOps	Machine Learning Operations
R²	Coefficient of Determination
RE	Remote Execution
RMSE	Root Mean Square Error
RNN	Recurrent Neural Network
SDK	Software Development Kit
XGBoost	Extreme Gradient Boosting

References

1. Hong, T.; Pinson, P.; Wang, Y.; Weron, R.; Yang, D.; Zareipour, H. Energy Forecasting: A Review and Outlook. IEEE Open Access J. Power Energy 2020, 7, 376–388.
2. Wang, Z.; Srinivasan, R.S. A review of artificial intelligence based building energy use prediction: Contrasting the capabilities of single and ensemble prediction models. Renew. Sustain. Energy Rev. 2017, 75, 796–808.
3. Bourdeau, M.; Zhai, X.Q.; Nefzaoui, E.; Guo, X.; Chatellier, P. Modeling and forecasting building energy consumption: A review of data-driven techniques. Sustain. Cities Soc. 2019, 48, 101533.
4. Sun, Y.; Haghighat, F.; Fung, B.C.M. A review of the-state-of-the-art in data-driven approaches for building energy prediction. Energy Build. 2020, 221, 110022.
5. Amasyali, K.; El-Gohary, N.M. A review of data-driven building energy consumption prediction studies. Renew. Sustain. Energy Rev. 2018, 81, 1192–1205.
6. Pham, A.-D.; Ngo, N.-T.; Truong, T.T.H.; Huynh, N.-T.; Truong, N.-S. Predicting energy consumption in multiple buildings using machine learning for improving energy efficiency and sustainability. J. Clean. Prod. 2020, 260, 121082.
7. Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M.; Crespo, J.-F.; Dennison, D. Hidden Technical Debt in Machine Learning Systems. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; Curran Associates, Inc.: Red Hook, NY, USA, 2015; pp. 2503–2511.
8. Paleyes, A.; Urma, R.-G.; Lawrence, N.D. Challenges in Deploying Machine Learning: A Survey of Case Studies. ACM Comput. Surv. 2022, 55, 114.
9. Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 2023, 11, 31866–31879.
10. Symeonidis, G.; Nerantzis, E.; Kazakis, A.; Papakostas, G.A. MLOps—Definitions, Tools and Challenges. In Proceedings of the 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 26–29 January 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 0453–0460.
11. Testi, M.; Ballabio, M.; Frontoni, E.; Iannello, G.; Moccia, S.; Soda, P.; Vessio, G. MLOps: A Taxonomy and a Methodology. IEEE Access 2022, 10, 63606–63618.
12. Lwakatare, L.E.; Raj, A.; Crnkovic, I.; Bosch, J.; Olsson, H.H. Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions. Inf. Softw. Technol. 2020, 127, 106368.
13. Studer, S.; Bui, T.B.; Drescher, C.; Hanuschkin, A.; Winkler, L.; Peters, S.; Müller, K.-R. Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Mach. Learn. Knowl. Extr. 2021, 3, 392–413.
14. Zhao, X.; Ma, Z.G.; Jørgensen, B.N. An End-to-End Data and Machine Learning Pipeline for Energy Forecasting: A Systematic Approach Integrating MLOps and Domain Expertise. Information 2025, 16, 805.
15. Zhao, X.; Ma, Z.G.; Jørgensen, B.N. A Systematic Lifecycle-Referenced Capability Mapping of MLOps Platforms for Energy Forecasting. Information 2026, 17, 328.
16. Allegro AI. ClearML: Open-Source MLOps Platform Documentation. Available online: https://clear.ml (accessed on 30 April 2026).
17. Mora-Cantallops, M.; Sánchez-Alonso, S.; García-Barriocanal, E.; Sicilia, M.-A. Traceability for Trustworthy AI: A Review of Models and Tools. Big Data Cogn. Comput. 2021, 5, 20.
18. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035.
19. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
20. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362.
21. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery (ACM): New York, NY, USA, 2016; pp. 785–794.
22. Schelter, S.; Lange, D.; Schmidt, P.; Celikel, M.; Biessmann, F.; Grafberger, A. Automating Large-Scale Data Quality Verification. Proc. VLDB Endow. 2018, 11, 1781–1794.
23. Polyzotis, N.; Roy, S.; Whang, S.E.; Zinkevich, M. Data Management Challenges in Production Machine Learning. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD), Chicago, IL, USA, 14–19 May 2017; Association for Computing Machinery (ACM): New York, NY, USA, 2017; pp. 1723–1726.
24. Zaharia, M.; Chen, A.; Davidson, A.; Ghodsi, A.; Hong, S.A.; Konwinski, A.; Murching, S.; Nykodym, T.; Ogilvie, P.; Parkhe, M.; et al. Accelerating the Machine Learning Lifecycle with MLflow. IEEE Data Eng. Bull. 2018, 41, 39–45.
25. Baylor, D.; Breck, E.; Cheng, H.-T.; Fiedel, N.; Foo, C.Y.; Haque, Z.; Haykal, S.; Ispir, M.; Jain, V.; Koc, L.; et al. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Halifax, NS, Canada, 13–17 August 2017; Association for Computing Machinery (ACM): New York, NY, USA, 2017; pp. 1387–1395.
26. Fan, C.; Sun, Y.; Zhao, Y.; Song, M.; Wang, J. Deep learning-based feature engineering methods for improved building energy prediction. Appl. Energy 2019, 240, 35–45.
27. Somu, N.; Raman, M.R.G.; Ramamritham, K. A hybrid model for building energy consumption forecasting using long short term memory networks. Appl. Energy 2020, 261, 114131.

Figure 1. Seven-phase lifecycle blueprint with gates G0 to G6 and loopback edges (specification view), reproduced from [14].

Figure 2. Implementation architecture of the forecasting lifecycle (operational view). Five architectural domains, namely data and configuration (D1), lifecycle phases and gates (D2), the orchestrator (D3), document artifact governance (D4), and human-in-the-loop oversight (D5), operate within a ClearML platform that exposes experiment tracking, dataset versioning, artifact storage, and remote queue execution.

Figure 3. Target electricity load analysis across the full observation period.

Figure 4. Weather context of the L14 record, summarised as monthly averages over 2018–2021 from the pipeline-input weather file, showing 2 m air temperature, global solar radiation, wind speed, and relative humidity. Wind direction is omitted because monthly averaging is not meaningful for a circular variable.
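
The remark on wind direction can be illustrated with a small sketch. Assuming directions are expressed in degrees, the arithmetic mean of 350° and 10° is 180°, which points the opposite way, whereas a circular mean (averaging unit vectors) stays at north. The function name and sample values are illustrative assumptions and are not taken from the pipeline.

```python
import math

def circular_mean_deg(angles_deg):
    """Circular mean of directions in degrees, via summed unit vectors."""
    sin_sum = sum(math.sin(math.radians(a)) for a in angles_deg)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0

# Two hypothetical readings on either side of north.
sample = [350.0, 10.0]
arithmetic = sum(sample) / len(sample)   # naive average of the raw angles
circular = circular_mean_deg(sample)

# Angular distance from north (0 deg), robust to a result just below 360.
distance_from_north = min(circular, 360.0 - circular)
print(arithmetic, distance_from_north)
```

The naive average lands due south, while the circular mean is within rounding error of north, which is why a plain monthly average of wind direction would be misleading.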

Figure 5. Composite-score ranking of the 45 features retained by Phase 3 selection on the L14 dataset, shown as five per-category panels (lag, calendar, interaction, temporal, weather). The composite score combines a primary effect-size or correlation metric with mutual information, both computed from the training partition only.

Figure 6. Diurnal bias of the four Phase 4 family champions on the L14 test set, computed as the hourly mean signed residual (predicted minus measured) across all 146 forecast origins of the test window. Values above zero indicate over-forecasting and values below zero indicate under-forecasting.
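
As a minimal sketch of the quantity plotted in Figure 6, the hourly mean signed residual can be computed by grouping residuals (predicted minus measured) by hour of day. The function name and the toy values below are assumptions for illustration; the article's own implementation is not shown in the source.

```python
from collections import defaultdict
from statistics import fmean

def diurnal_bias(hours, predicted, measured):
    """Mean signed residual (predicted minus measured) per hour of day."""
    residuals = defaultdict(list)
    for hour, y_hat, y in zip(hours, predicted, measured):
        residuals[hour].append(y_hat - y)
    return {hour: fmean(vals) for hour, vals in sorted(residuals.items())}

# Toy data: two forecast origins, covering hours 0 and 1 only (hypothetical kW values).
hours = [0, 1, 0, 1]
predicted = [5.0, 6.0, 7.0, 4.0]
measured = [4.0, 6.5, 6.0, 4.5]

bias = diurnal_bias(hours, predicted, measured)
print(bias)  # hour 0 is over-forecast (positive), hour 1 under-forecast (negative)
```

With the full test window, each hour's mean would be taken over the 146 forecast origins named in the caption.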

Figure 7. Hourly measured (black) and XGBoost 24-h-ahead forecast (gold) electricity consumption for L14 over the first complete week of the test window (8 August 2021 to 14 August 2021).

Figure 8. Phase 5 deployment dashboard for the L14 baseline, showing the operator-facing forecast view.

Figure 9. Phase 6 automated-monitoring view of the deployment dashboard for the L14 baseline, accessed through the Phase 5/Phase 6 tab selector.

Figure 10. Run-level outcomes across the six experimental runs of the L14 baseline and the four controlled variants. Each row reports the per-phase gate result from P0 to P6 and the final outcome, with cell colour encoding the gate state (legend at the bottom of the figure). The annotation under each variant row records the auto-detection point, the escalation route, and the operator decision pathway.

Table 1. Document artifact family by phase (condensed).

Phase	Phase Docs (Input/Intermediate)	Pass Doc (PR)	Fail Doc (LI)
P0	Doc01 (Project Spec, v1 + v2), Doc02 (Reqs & Criteria, v1 + v2), Doc03 (Compliance), Doc04 (Tech Feasibility)	Doc05 (PR0)	Doc06 (LI0)
P1	Doc11 (Source Inventory), Doc12 (Initial Data Quality), Doc13 (DQ Validation), Doc14 (Integration)	Doc15 (PR1)	Doc16 (LI1)
P2	Doc21 (EDA Report), Doc22 (Preproc Spec)	Doc23 (PR2)	Doc24 (LI2)
P3	Doc31 (Feature Spec), Doc32 (FS Report)	Doc33 (PR3)	Doc34 (LI3)
P4	Doc41 (Candidate Exps), Doc42 (Data Prep), Doc43 (Training), Doc44 (Performance), Doc45 (Final Model)	Doc46 (PR4)	Doc47 (LI4)
P5	Doc51 to Doc54 (one per sub-phase)	Doc55 (PR5)	Doc56 (LI5)
P6	Doc61 (Cycle Report), Doc62 (Disqualified predictions)	Doc63 (PR6)	Doc64 (LI6)

Table 2. Human-in-the-loop checkpoint registry.

Checkpoint	Type	Question or Role	Decision Options	Routing on Fail
HIL-P0-0.1.1	input	Submit the task configuration	SUBMIT	(advances P0 on submit)
HIL-P3-3.2.5	decision	Is the feature subset sufficient?	YES/ABORT/REVISE	continue/abort pipeline/loop to step 3.2.2
HIL-P4-4.3.4	classification	Root cause: data quality or model setup?	CONTINUE/REVISE (DATA_ISSUE or MODEL_ISSUE)/ABORT	proceed/loop (DATA_ISSUE→P1, MODEL_ISSUE→step 4.1.1)/terminate
HIL-P2-G2	gate	Is the Phase 2 output ready to advance to Phase 3?	PASS/FAIL/REVISE	(continue)/loopback per specification/loopback per specification + Doc01/Doc02 revision
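
The registry in Table 2 can be read as a lookup from checkpoint and operator decision to a routing target. The sketch below encodes the four rows as a plain Python structure; the class name, the `route` method, and the compound `REVISE:DATA_ISSUE` style keys are assumptions made for illustration and do not describe the article's actual code.

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    checkpoint_id: str
    kind: str  # input, decision, classification, or gate
    routes: dict = field(default_factory=dict)  # decision -> routing target

    def route(self, decision):
        """Return the routing target for an operator decision."""
        if decision not in self.routes:
            raise ValueError(f"{decision!r} is not valid for {self.checkpoint_id}")
        return self.routes[decision]

# Rows transcribed from Table 2.
REGISTRY = {
    cp.checkpoint_id: cp
    for cp in [
        Checkpoint("HIL-P0-0.1.1", "input", {"SUBMIT": "advance P0"}),
        Checkpoint("HIL-P3-3.2.5", "decision", {
            "YES": "continue",
            "ABORT": "abort pipeline",
            "REVISE": "loop to step 3.2.2",
        }),
        Checkpoint("HIL-P4-4.3.4", "classification", {
            "CONTINUE": "proceed",
            # Compound keys are an assumption of this sketch.
            "REVISE:DATA_ISSUE": "loop to P1",
            "REVISE:MODEL_ISSUE": "loop to step 4.1.1",
            "ABORT": "terminate",
        }),
        Checkpoint("HIL-P2-G2", "gate", {
            "PASS": "continue",
            "FAIL": "loopback per specification",
            "REVISE": "loopback per specification + Doc01/Doc02 revision",
        }),
    ]
}

print(REGISTRY["HIL-P4-4.3.4"].route("REVISE:DATA_ISSUE"))
```

Keeping the routes as data rather than branching logic mirrors the configuration-driven approach described in the article: a new checkpoint becomes a new entry, and an unknown decision fails loudly instead of silently continuing.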

Table 3. Experimental matrix. Each row specifies a run, its configuration relative to the baseline L14 dataset profile and intake template, the execution mode in which it is run, the detection point or points it is intended to exercise, the validation tier of that detection point, and the intended operator response pathway.

Run	Setup	Mode	Expected Issue	Response
B-Auto	Normal L14 baseline	Automatic	None	Completes without intervention
B-HIL	Normal L14 baseline	Human-in-the-loop	None	Operator approves checkpoints
V1	40% rows deleted	Auto with HIL fallback	Dataset/profile issue	Operator revises dataset profile; run recovers
V2	Consumption data shuffled	Auto with HIL fallback	Data inconsistency	Operator first overrides, then later terminates
V3	Feature selection limited to top 3 per group	Auto with HIL fallback	Profile-relative feature-coverage anomaly	Operator inspects and aborts
V4	Missing intake-template sections	Auto with HIL fallback	Gate 0 structure issue	Operator inspects documents and aborts

Table 4. Per-column quality summary of the L14 inputs after Phase 1 ingestion over 1 January 2018 to 31 December 2021. The aligned merged grid contains 35,063 hourly records per column. Missing-value counts are pre-cleaning; Phase 2 resolves these gaps.

Stream/Variable	Unit	Missing Values
Electricity: main_meter	kW	24
Weather: T (air temperature)	°C	110
Weather: Rh (relative humidity)	%	0
Weather: SolGlob (global solar)	W/m²	120
Weather: Ws (wind speed)	m/s	114
Weather: Wd (wind direction)	°	118
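
To put the counts in Table 4 in proportion, the short sketch below converts each missing-value count into a share of the 35,063 hourly records. The counts and the record total come from the table; the variable names are illustrative.

```python
TOTAL_RECORDS = 35_063  # hourly records per column (Table 4 caption)

# Missing-value counts per column, transcribed from Table 4.
MISSING = {
    "main_meter": 24,
    "T": 110,
    "Rh": 0,
    "SolGlob": 120,
    "Ws": 114,
    "Wd": 118,
}

missing_pct = {name: 100.0 * n / TOTAL_RECORDS for name, n in MISSING.items()}
worst = max(missing_pct, key=missing_pct.get)

for name, pct in missing_pct.items():
    print(f"{name:10s} {pct:.3f}%")
print("largest gap share:", worst)
```

Every column is missing less than 0.5% of its records, with global solar radiation showing the largest share.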

Table 5. Phase 4 four-family comparison on the L14 baseline at a 24 h horizon. Rows are ordered by ascending test root mean square error. The test evaluation comprises 3504 hourly predictions generated from 146 daily forecast origins in the final test window.

Family	Test RMSE (kW)	Test MAE (kW)	Test MAPE (%)	Test R²
XGBoost	1.19	0.89	15.1	0.934
RNN	1.40	1.00	16.8	0.908
LSTM	1.43	1.01	16.8	0.905
Transformer	1.50	1.03	16.7	0.896
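
For reference, the four metrics in Table 5 can be sketched as follows. The source does not state the exact formulas used, so the definitions below are the standard ones and should be read as an assumption; the sample values are hypothetical. The sketch also checks the caption's arithmetic of 146 daily origins times 24 hourly steps.

```python
import math

def forecast_metrics(measured, predicted):
    """RMSE, MAE, MAPE (%), and R^2 under their standard definitions."""
    n = len(measured)
    errors = [p - m for p, m in zip(predicted, measured)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined where the measured value is zero; this sketch assumes none.
    mape = 100.0 * sum(abs(e) / abs(m) for e, m in zip(errors, measured)) / n
    mean_m = sum(measured) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((m - mean_m) ** 2 for m in measured)
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "mape": mape, "r2": r2}

# 146 daily origins x 24 hourly steps gives the prediction count in the caption.
n_predictions = 146 * 24

# Hypothetical measured and predicted loads in kW.
metrics = forecast_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
print(n_predictions, metrics)
```

Note that MAPE weights errors at low load more heavily than RMSE does, so the two metrics need not rank models identically; in Table 5 the Transformer has the highest RMSE but a MAPE slightly below that of the RNN and LSTM.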

Table 6. Representative wall-clock measurements for the L14 baseline. Lightweight phases are reported from a regression run with a reduced Phase 4 trial budget. The Phase 4 row reports the verified duration of the XGBoost champion trial only. The full Stage 4.4 sweep consisted of 96 sequential trials on a single worker; its total duration therefore cannot be inferred by summing the rows in this table.

Phase	Wall-Clock (s)
P0 Scoping	71
P1 Data ingestion + QC	91
P2 EDA	85
P3 Feature engineering	94
P4 Training (champion XGBoost trial)	2036
P5 Deployment	211

Table 7. Summary of operational claims, supporting evidence, and scope limitations.

Claim	Evidence	Scope Limitation
The lifecycle executes end-to-end as a governed ClearML implementation.	B-Auto and B-HIL complete Phases 0–6, with Gates G0–G6 passing.	One building dataset; Phases 5–6 use synthetic data.
HIL mode produces an auditable decision trail.	B-HIL records operator approvals, rationales, and reviewed artifacts at routine checkpoints.	Single-team operator setting; stakeholder governance remains untested.
Selected sub-checkpoints surface phase-local issues before main gates.	V1, V2, and V3 escalate at HIL-P1-1.1.6, HIL-P2-2.1.6, and HIL-P3-3.2.5.	Variants are constructed probes, not naturally occurring failures.
Main gates can catch structural document faults.	V4 is detected at Gate 0 through foundation-document structure checks.	Only Gate 0 is tested for this structural fault class.
ClearML provides the MLOps substrate, while governance logic is application-level.	ClearML supports tracking, datasets, artifacts, and remote execution; routing, HIL decisions, validation, and pause-resume logic are custom-built.	Platform integration is ClearML-specific.

Table 8. ClearML services exercised per phase. ET = Experiment Tracker; DM = Dataset Manager; AS = Artifact Store; RE = Remote-Execution layer (clearml-agent and named queues). ✓ = service used in the phase; — = not used.

Phase	ET	DM	AS	RE
P0 Foundation	✓	—	✓	—
P1 Ingestion	✓	✓	✓	—
P2 EDA/Cleaning	✓	✓	✓	—
P3 Feature Eng & Sel.	✓	✓	✓	—
P4 Training	✓	✓	✓	✓
P5 Deployment	✓	—	✓	—
P6 Monitoring	✓	—	✓	—

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhao, X.; Ma, Z.G.; Jørgensen, B.N. Operationalising an End-to-End MLOps Lifecycle for Energy Forecasting: Implementation and Controlled Evaluation on ClearML. Information 2026, 17, 576. https://doi.org/10.3390/info17060576

