1. Introduction
Energy forecasting underpins critical operational and planning decisions across power systems, district networks, and building energy management [1,2]. The literature on energy load prediction is large and predominantly focused on model accuracy, with extensive reviews of statistical, machine learning, and deep learning approaches for buildings and grid-level demand [3,4,5,6]. Comparatively few studies, however, report on the engineering of the operational lifecycle that produces and sustains these models, and fewer still demonstrate that such a lifecycle continues to enforce its quality controls when its inputs are degraded.
The challenges of moving machine learning models from development into long-term production have been articulated in general terms by Sculley et al. [7] and Paleyes et al. [8]. The machine learning operations (MLOps) community has since converged on a vocabulary of pipelines, artifact registries, monitoring, and human oversight to address these challenges [9,10,11,12]. Process-model proposals such as CRISP-ML(Q) [13] further extend this perspective by attaching quality assurance to each lifecycle phase. What is largely absent from the energy forecasting literature is an implementation that instantiates this vocabulary as a concrete, governed, and evaluable system on real building data, and that exposes its behaviour to controlled fault conditions.
This study builds on two preceding works but is intended to stand alone as an implementation and controlled route-evaluation paper. The earlier lifecycle framework [
14] defined a seven-phase forecasting process, decision gates G0–G6, document artifacts, human-in-the-loop checkpoints, and loopback rules. It also presented a dedicated comparative analysis showing advantages in functional coverage, workflow logic, and governance over CRISP-ML(Q) [13] and previously published end-to-end ML pipelines for energy applications. The subsequent platform-mapping study [
15] compared candidate MLOps platforms and selected ClearML as the implementation substrate. That study used a PRISMA-informed review of 256 records to compare 13 MLOps platforms across the seven-phase lifecycle and identified four capability gaps: governance workflow automation, automated data quality validation, feature management, and deployment and monitoring support under nonstationary conditions. The comparison shows that commercial platforms such as Amazon SageMaker and Google Vertex AI offer stronger end-to-end integration and production readiness, while open-source platforms such as Kubeflow and ClearML offer modular flexibility that requires additional integration effort to achieve end-to-end operation. ClearML was selected as the implementation substrate of the present paper because its modular flexibility allows the governance-workflow-automation gap to be addressed through application-level logic above the platform’s primitives.
The present paper implements that lifecycle as an executable system and evaluates whether its configuration propagation, checkpoint escalation, gate routing, artifact governance, and recovery paths behave as specified under both normal and degraded conditions. Relative to general-purpose MLOps platforms and energy-forecasting pipelines, the verification-and-governance design is inherited from [14] and the platform-selection rationale from [15]; the present paper contributes at the operationalisation-and-evaluation layer above them. Specifically, this study contributes the following:
An executable, configuration-driven operationalisation of the previously proposed seven-phase forecasting lifecycle on a self-hosted ClearML platform, organised into five responsibility-based architectural domains.
A controlled validation-route evaluation of the lifecycle’s two-tier checkpoint-and-gate logic under four deliberately constructed degraded-input and degraded-configuration conditions. The evaluation separates automatic fault detection from operator disposition and exercises three response pathways: revise-and-recover, override-then-terminate, and immediate-abort.
An operational characterisation of ClearML as the MLOps substrate for the implemented lifecycle, distinguishing platform-provided services, including experiment tracking, dataset versioning, artifact storage, and remote execution, from application-level governance logic, including checkpoint routing, human-in-the-loop decisions, document validation, and pause-resume handling.
The unit of analysis is therefore the behaviour of the implemented lifecycle and its validation routes, not forecasting-model novelty, live-deployment performance, or cross-building generalisation. The single-building scope is a design choice required by the route-evaluation objective: introducing building heterogeneity would conflate route behaviour with dataset-dependent forecasting behaviour and obscure the causal attribution of detection events to the injected conditions.
This paper differs from the two preceding studies in both object and evidence. Paper [
14] proposed the seven-phase lifecycle, the gates G0–G6, the document artifact family, the HIL checkpoints, and the loopback rules as a paper specification, validated on a single happy-path office-building case study; it did not instantiate the specification on a concrete MLOps platform, did not propose an architectural decomposition for the implementation, and did not test the validation routes under degraded conditions. Paper [
15] surveyed 13 MLOps platforms against the lifecycle phases and selected ClearML as the implementation substrate, but did not implement the lifecycle on the selected platform and did not evaluate its behaviour empirically. The contributions of the present paper, listed above, supply these three missing elements respectively. In particular, the present paper provides the operational-level evidence that the two-tier sub-checkpoint plus main-gate design advances over CRISP-ML(Q)’s flat per-phase quality assurance: variant V4 (
Section 4.2) is a structural foundation-document fault that no Phase 0 sub-checkpoint is designed to detect but that Gate 0's existence-and-structure check catches by design, a behaviour a flat per-phase QA step cannot supply. The object of study is therefore the observed behaviour of a running lifecycle under baseline and deliberately degraded conditions, including whether checkpoints, gates, escalation, recovery, and artifact traceability operate as specified.
The remainder of the paper is organised as follows. Section 2 describes the implementation architecture across the five domains within a ClearML platform. Section 3 establishes the implementation context. Section 4 reports controlled operational evaluation evidence. Section 5 discusses the results, the operational role of the MLOps platform, design limitations, and future work. Section 6 concludes the paper.
2. Implementation Architecture of the Forecasting Lifecycle
The implementation realises the seven-phase lifecycle of [14] as an executable system. This section describes the implementation architecture rather than reintroducing the lifecycle specification itself. The architecture is organised by operational responsibility into five domains: data and configuration, lifecycle phases and gates, orchestration, document artifact governance, and human-in-the-loop oversight. Each domain owns one class of runtime concern (input contracts, transformation logic, control flow, provenance, or human oversight), and no concern is shared across domains.
Figure 1 reproduces the lifecycle blueprint, showing the seven-phase structure with gates G0 to G6 and loopback edges, and constitutes the specification view. The lifecycle progresses linearly from project foundation through data acquisition, exploratory analysis and cleaning, feature engineering, model development, deployment, and continuous monitoring, with named workflows decomposing each phase. Several phases also carry inline binary checks that initiate local rework or upstream loopbacks before the formal gate runs, providing intermediate validation against the spec. The gates themselves admit advancement only on cross-functional sign-off against predetermined criteria, and their fail edges define the loopback targets, with Gate 6 closing the lifecycle by routing monitoring outcomes back to a previous phase. Step-level detail, the document artifact family, and the stakeholder responsibility matrix sit in the supporting text of [
14].
Figure 2 presents the implementation architecture of the present study, showing the five architectural domains operating within a ClearML platform [
16], and constitutes the operational view. The ClearML server provides the platform backdrop that hosts four services consumed across all phases. Within this platform, Domain 2 occupies the centre as the horizontal chain of phases and gates from P0 to P6. Three peripheral domains attach to this spine, namely Domain 1 to the left feeding external data and YAML configuration into the early phases, Domain 5 to the right as the human-in-the-loop oversight subsystem with its dashboard, audit trail, and on-demand path to revise the configuration documents, and Domain 3 across the top as the orchestrator responsible for phase sequencing, gate outcome routing, loopback routing, and human-in-the-loop (HIL) escalation. Domain 4 spans the bottom as the append-only artifact and document governance layer that records the document trail, datasets, registered models, and experiment logs produced by every phase. The edges drawn between phases capture the operational routing repertoire, including gate-level loopbacks, sub-phase cross-phase loopbacks driven by internal checks, and the multi-target escalation routes from Gate 6 to upstream phases that close the monitoring cycle when remediation is required. The arrangement makes the operational responsibilities and their coupling explicit, in contrast with the linear specification view of
Figure 1.
Every cross-domain interaction is mediated by a typed, persisted artefact rather than a direct in-memory call, which is the property required for the audit trail and for the recoverability of any pipeline pause. Specifically, D1 communicates with D2 via Doc01 and Doc02; D2 emits artefacts and gate outcomes to D4; D2 and D5 exchange checkpoint triggers and HIL decision records; D3 controls D2 via phase-sequencing signals and gate-routing decisions; and D4 and D5 share decision-artefact persistence and pause-state files.
This channel inventory also fixes the inter-domain dependency structure, which a file-level import-graph analysis of the implementation confirms quantitatively. Counting Python import edges between the files of each domain, the governance and oversight domains (D4, D5) are pure dependency sinks with zero efferent coupling (Ce = 0): the rest of the system reaches them only through file and artefact channels rather than through code imports, which is the property that gives the design its audit-trail and pause-state recoverability. The orchestrator (D3) is conversely a pure control source with zero afferent coupling (Ca = 0), and the instability metric I = Ce/(Ca + Ce) accordingly rises monotonically from the artefact sinks (D4, D5 at I = 0) through the input and phase domains (D1, D2), to the control plane (D3 at I = 1.0), matching the intended control direction; the inter-domain dependency graph is otherwise acyclic. A finer-grained module-level coupling and cohesion analysis is identified in
Section 5.4 as a follow-up direction.
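The instability calculation described above can be illustrated with a minimal sketch. The function name, the edge-list representation, and in particular the edge counts below are hypothetical and are not the values measured on the implementation; the sketch only shows how I = Ce/(Ca + Ce) is derived from inter-domain import edges.

```python
from collections import Counter


def instability_by_domain(import_edges):
    """Return I = Ce / (Ca + Ce) per domain from (importer, imported) pairs."""
    ce = Counter()  # efferent coupling: import edges leaving a domain
    ca = Counter()  # afferent coupling: import edges arriving at a domain
    domains = set()
    for importer, imported in import_edges:
        domains.update((importer, imported))
        if importer == imported:
            continue  # intra-domain imports are not inter-domain coupling
        ce[importer] += 1
        ca[imported] += 1
    scores = {}
    for domain in sorted(domains):
        total = ca[domain] + ce[domain]
        # A domain with no inter-domain edges has undefined instability.
        scores[domain] = ce[domain] / total if total else None
    return scores


# Hypothetical edge list for illustration only (NOT the paper's measured counts).
EDGES = [
    ("D3", "D2"), ("D3", "D2"), ("D3", "D1"), ("D3", "D5"),
    ("D2", "D1"), ("D2", "D4"), ("D2", "D5"),
    ("D1", "D4"),
]
scores = instability_by_domain(EDGES)
for domain, value in scores.items():
    print(domain, value)
```

With this made-up edge list the sinks D4 and D5 score 0, the control source D3 scores 1, and D1 and D2 fall in between, which is the ordering the text describes.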
2.1. Data and Configuration (D1)
Domain D1 forms the entry layer of the pipeline, where every measurement stream and every operator-authored parameter the system depends on is received, declared, and converted into the governed artefacts that the seven downstream phases consume. It covers the input contracts and configuration sources that feed the pipeline (Figure 2, Domain 1). External data enters as long-format comma-separated values (CSV) time series: an electricity stream of hourly meter readings and an aligned weather stream from the co-located station. To support both live operational deployment and offline reproduction from a single codebase, the pipeline decouples every phase from the underlying data source. Live and offline inputs are accessed through a common DataProvider abstraction whose concrete implementation is selected by a factory at runtime, so the same phase logic operates on either source without code changes. The abstraction is realised as an abstract base class declaring a data-access method, a connection-validation method, and a source-type identifier, together with a fixed output schema to which every concrete provider must conform. The runtime factory dispatches on the configured source type, so adding or switching a data source requires only a new conforming provider rather than any change to the downstream phase logic.
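A minimal sketch of the provider pattern described above is given below. Only the name DataProvider comes from the text; the method names, the output columns, the `offline_csv` source type, and the `make_provider` factory are assumptions introduced for illustration, not the implementation's actual interface.

```python
import csv
import io
from abc import ABC, abstractmethod

# Assumed fixed output schema; the paper does not list the actual columns.
OUTPUT_COLUMNS = ("timestamp", "series_id", "value")


class DataProvider(ABC):
    """Common contract every concrete data source must satisfy."""

    source_type: str  # identifier the runtime factory dispatches on

    @abstractmethod
    def validate_connection(self) -> bool:
        """Return True when the underlying source is reachable."""

    @abstractmethod
    def fetch(self) -> list:
        """Return rows as dicts keyed by OUTPUT_COLUMNS."""


class OfflineCsvProvider(DataProvider):
    """Hypothetical offline provider reading long-format CSV text."""

    source_type = "offline_csv"

    def __init__(self, csv_text):
        self._csv_text = csv_text

    def validate_connection(self):
        return bool(self._csv_text.strip())

    def fetch(self):
        reader = csv.DictReader(io.StringIO(self._csv_text))
        if tuple(reader.fieldnames or ()) != OUTPUT_COLUMNS:
            raise ValueError(f"schema mismatch: {reader.fieldnames}")
        return list(reader)


# Registry keyed by source type; a new source only adds an entry here.
PROVIDERS = {cls.source_type: cls for cls in (OfflineCsvProvider,)}


def make_provider(config):
    """Runtime factory: dispatch on the configured source type."""
    source_type = config["source_type"]
    if source_type not in PROVIDERS:
        raise ValueError(f"unknown source type: {source_type!r}")
    return PROVIDERS[source_type](config["csv_text"])
```

Under these assumptions, a live provider would be a second subclass registered in `PROVIDERS`, and phase code that only calls `make_provider(...).fetch()` would not change.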
The configuration sources are three YAML files. The task intake template records the stakeholder-facing fields, namely dataset selection, target horizon, candidate model families, gate criteria, and execution mode. A per-dataset profile, referenced by the intake through an active_dataset_profile pointer, declares the dataset-specific input contracts, for example, file paths, column mappings, time zones, the meter-reading mode, and any per-phase overrides for ingestion and gate limits. The HIL configuration template defines the checkpoint registry, decision schemas, routing rules, and the maximum loopback policy. Phase 0 in Domain 2 reads the intake together with its referenced dataset profile and emits two phase-zero documents: Doc01 (Project Specification), which captures the task intake fields, and Doc02 (Requirements and Criteria), which captures the dataset schema, data-quality thresholds, feature group definitions, train/validation/test split, and per-family hyperparameter search ranges that the orchestrator uses for parameterisation downstream.
After Phase 0, all downstream phases load their task parameters from Doc01 and Doc02 rather than directly re-reading the intake template or dataset profile. These two documents therefore act as the governed configuration boundary of the run: every downstream parameter is traceable to Phase 0, and any approved revision is propagated by regenerating or updating the relevant Phase 0 documents before re-execution. The HIL configuration template is read directly by the orchestrator and the checkpoint manager at runtime, since checkpoint behaviour is operator-facing; routing decisions and pause-state are recorded in the audit trail under their own provenance. The dataset profile is the portability seam of the architecture. Switching the pipeline to a new dataset requires only authoring a new dataset profile YAML (with the dataset’s paths, column mappings, units, time zones, history range, meter-reading mode, and any per-phase overrides) and updating the active dataset profile pointer in the intake template. Intra-dataset variants are realised as alternative profiles on the same underlying data through the same pointer. Reconfiguring HIL behaviour requires editing the HIL template alone.
The three YAML sources are parsed once in Phase 0 and folded into the two foundation documents that all downstream phases consume, so configuration enters the pipeline at a single point rather than being re-read per phase. Field-level validation of the configuration values themselves is not enforced at parse time in the present implementation; strengthening configuration handling, including at the point of intake, is part of the agentic-oversight extension identified in
Section 5.4.
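The single-entry configuration flow can be sketched as follows, with plain dictionaries standing in for the parsed YAML sources. Apart from `active_dataset_profile`, Doc01, and Doc02, every field name and the override precedence are assumptions made for illustration.

```python
def build_foundation_docs(intake, profiles):
    """Fold the intake and its referenced dataset profile into Doc01/Doc02."""
    name = intake["active_dataset_profile"]
    if name not in profiles:
        raise KeyError(f"intake points at unknown dataset profile: {name!r}")
    profile = profiles[name]
    doc01 = {  # Project Specification: stakeholder-facing task fields
        "doc": "Doc01",
        "dataset_profile": name,
        "target_horizon_hours": intake["target_horizon_hours"],
        "model_families": list(intake["model_families"]),
        "execution_mode": intake["execution_mode"],
    }
    doc02 = {  # Requirements and Criteria: dataset contract and limits
        "doc": "Doc02",
        "column_mappings": dict(profile["column_mappings"]),
        "timezone": profile["timezone"],
        # Assumed precedence: profile overrides win over intake-level limits.
        "gate_limits": {**intake["gate_criteria"],
                        **profile.get("gate_overrides", {})},
    }
    return doc01, doc02


# Hypothetical configuration values for illustration only.
INTAKE = {
    "active_dataset_profile": "office_a",
    "target_horizon_hours": 24,
    "model_families": ["rnn", "lstm", "transformer", "xgboost"],
    "execution_mode": "automatic",
    "gate_criteria": {"max_missing_ratio": 0.05},
}
PROFILES = {
    "office_a": {"column_mappings": {"kwh": "value"}, "timezone": "UTC"},
    "office_a_strict": {
        "column_mappings": {"kwh": "value"},
        "timezone": "UTC",
        "gate_overrides": {"max_missing_ratio": 0.01},
    },
}
doc01, doc02 = build_foundation_docs(INTAKE, PROFILES)
```

Switching to the intra-dataset variant in this sketch only requires changing the pointer, for example `{**INTAKE, "active_dataset_profile": "office_a_strict"}`, after which the regenerated Doc02 carries the overridden limit.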
Through the DataProvider abstraction, the three layered-configuration sources, and the two phase-zero documents, D1 establishes the single boundary at which external inputs enter the pipeline and localises the points at which the dataset, the task definition, or the governance policy can be reconfigured without touching the seven downstream phases.
2.2. Lifecycle Phases and Gates (D2)
Domain D2 is the executable backbone of the pipeline, where every transformation, quality control, and routing decision that converts the governed inputs from D1 into a deployable, continuously monitored model takes place. It covers the seven phases and the seven decision gates that bracket them, together with the forward and loopback edges shown in
Figure 2 (centre panel). The forward path proceeds through seven phases: P0 (Foundation), P1 (Ingestion), P2 (Exploratory Data Analysis [EDA] and Cleaning), P3 (Feature Engineering), P4 (Training), P5 (Deployment), and P6 (Monitoring), with each phase concluded by its corresponding gate, G0 to G6. In brief, P0 reads the intake template and dataset profile and emits the two foundation documents Doc01 and Doc02; P1 ingests the electricity and weather streams onto a UTC-aligned hourly grid and registers the integrated dataset; P2 performs exploratory analysis and cleaning to produce the cleaned dataset; P3 generates derived features and selects a final feature subset; P4 trains and ranks the model families through a two-stage hyperparameter optimisation and registers the champion model; P5 retrieves the champion model and deploys the FastAPI inference service and Streamlit dashboard; P6 runs the delayed-evaluation monitoring cycle and computes drift and quality signals. The full specification, including the stakeholder responsibility matrix and step-level detail, remains in [
14].
Phase 4 trains and ranks four candidate model families: recurrent neural network (RNN), long short-term memory network (LSTM), Transformer, and XGBoost (Extreme Gradient Boosting). These families are included to exercise the lifecycle across recurrent, attention-based, and tree-based forecasting approaches, not to claim novelty in model architecture. All four are treated as interchangeable candidates within the same governed training, validation, artifact-registration, and gate-review procedure.
Two layers of validation are applied within every phase, with deliberately different roles. Sub-checkpoints (e.g., HIL-P1-1.1.6, HIL-P2-2.1.6, HIL-P3-3.2.5, HIL-P4-4.3.4) are technical-sufficiency checks that fire mid-phase, compute a phase-step metric against a configured threshold, and trigger narrow-scope local iteration when the metric falls short, namely fix this step and retry. Decision gates G0 through G6 fire at the end of a phase and serve as governance review: they verify that the phase's sub-checkpoints have all passed and, in addition, check documentation completeness, compliance status, risk acceptance, and multi-stakeholder sign-off. A gate can therefore fail even when every sub-checkpoint passed, namely when the phase output is technically sufficient but governance obligations remain open. The two roles are distinct in scope, in review constituency, and in the failure classes they catch. The division of labour is principled rather than incidental. Sub-checkpoints validate local technical sufficiency within a single phase step, where the necessary evidence is locally available, while main gates validate phase-level completeness, cross-document structural integrity, and governance readiness, where the required evidence spans multiple artifacts produced earlier in the same phase or in upstream phases.
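The distinction between the two validation layers can be sketched as below. The class and field names, the "metric must reach threshold" direction, and the example values are assumptions; only the checkpoint identifier and the four governance obligations are taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class SubCheckpoint:
    """Mid-phase technical-sufficiency check on one step metric."""
    checkpoint_id: str
    metric: float
    threshold: float

    def passed(self):
        # Assumed direction: higher is better; real checks may differ.
        return self.metric >= self.threshold


@dataclass
class GateReview:
    """End-of-phase governance review over the whole phase."""
    gate_id: str
    sub_checkpoints: list
    documentation_complete: bool
    compliance_cleared: bool
    risk_accepted: bool
    signoffs: dict = field(default_factory=dict)  # role -> signed (bool)

    def open_items(self):
        items = [f"sub-checkpoint {c.checkpoint_id}"
                 for c in self.sub_checkpoints if not c.passed()]
        if not self.documentation_complete:
            items.append("documentation")
        if not self.compliance_cleared:
            items.append("compliance")
        if not self.risk_accepted:
            items.append("risk acceptance")
        items += [f"sign-off: {role}"
                  for role, signed in self.signoffs.items() if not signed]
        return items

    def passed(self):
        return not self.open_items()


# Hypothetical Phase 1 review: technically sufficient, one sign-off open.
review = GateReview(
    gate_id="G1",
    sub_checkpoints=[SubCheckpoint("HIL-P1-1.1.6", metric=0.98, threshold=0.95)],
    documentation_complete=True,
    compliance_cleared=True,
    risk_accepted=True,
    signoffs={"data_owner": True, "energy_manager": False},
)
print(review.passed(), review.open_items())
```

In this example the gate fails although its only sub-checkpoint passed, which is the case the paragraph singles out.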
The architecture supports three loop families, all visible in
Figure 2. Gate loopbacks (dashed blue) return control to a defined recovery point on gate failure; for Gates 0, 1, 3, 4, and 5 the recovery point is intra-phase, Gate 2 is the only formal gate whose recovery point lies in an earlier phase by escalating to Phase 1, and Gate 6 is the only gate with multi-target recovery, routing a Phase 6 governance failure to Phase 0 for re-scoping, Phase 1 for data root-cause, or Phase 4 for retraining, with the destination selected by the operator’s classification. Sub-phase loopbacks (dashed green) carry mid-phase escalations to an upstream phase when an internal step diagnoses a root cause located outside the current phase, such as a domain mismatch in Phase 2, a deployment-readiness failure in Phase 5, or the post-training root-cause classifier in Phase 4 attributing a performance gap to data. Internal remediation loops, denoted by a loop marker on the phase block, are intra-phase iterations that resolve before the gate is reached. A G6 pass advances the pipeline to the next monitoring cycle.
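The gate-loopback rules stated above reduce to a small routing table, sketched here under stated assumptions. The function name and the classification tokens (`rescope`, `data_root_cause`, `retrain`) are hypothetical, and the sketch returns only the target phase number because the exact intra-phase recovery step is not specified in this section.

```python
# Gates whose failure recovers within the same phase.
INTRA_PHASE_GATES = {0, 1, 3, 4, 5}

# Gate 6 destinations, selected by the operator's classification.
# The token names are hypothetical labels for the three stated targets.
G6_TARGETS = {"rescope": 0, "data_root_cause": 1, "retrain": 4}


def resolve_gate_loopback(gate, classification=None):
    """Return the phase number a failed gate routes back to."""
    if gate in INTRA_PHASE_GATES:
        return gate  # recovery point lies inside the same phase
    if gate == 2:
        return 1  # Gate 2 escalates to Phase 1
    if gate == 6:
        if classification not in G6_TARGETS:
            raise ValueError("Gate 6 needs an operator classification")
        return G6_TARGETS[classification]
    raise ValueError(f"unknown gate: {gate}")


for gate in range(6):
    print(f"G{gate} fail -> P{resolve_gate_loopback(gate)}")
print("G6 fail (retrain) -> P%d" % resolve_gate_loopback(6, "retrain"))
```

Sub-phase loopbacks and internal remediation loops are not covered by this table; they would need their own trigger conditions.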
Through the seven-phase forward path, the two-layer validation pattern that pairs mid-phase sub-checkpoints with end-of-phase gates, and the three loop families, D2 turns the lifecycle into a finite set of legal transitions between phases, so that every forward advance is governed and every backward step is routed to a defined recovery point under explicit, traceable rules.
2.3. Orchestrator (D3)
Domain D3 is the control plane of the pipeline, where the lifecycle defined in D2 is turned into an executable sequence of tasks and the heavy compute steps are dispatched to remote workers.
The seven-phase logic is built as application code rather than as a declarative ClearML pipeline, because the pipeline layer is geared towards training and data-processing workflows and does not natively model the post-deployment monitoring and governance loops that close the lifecycle here. Within that scope, the gate-outcome routing, the cross-phase and intra-phase loopback families, and the human-in-the-loop escalation handler from D2 are not naturally expressible in ClearML’s predominantly static DAG (directed acyclic graph)-based pipeline model. A quantitative wall-clock comparison against a declarative-DAG orchestration would require a parallel reimplementation of the lifecycle, including shim code to express these monitoring, loopback, and escalation features that the DAG model does not natively support; such a measurement would be dominated by that shim layer rather than reflecting the orchestration substrate itself, and a controlled comparison is therefore identified in
Section 5.4 as a follow-up direction. The custom orchestrator is a Python service whose responsibilities are phase sequencing along the forward path, gate-outcome routing (advance versus loopback), loopback-target resolution (intra-phase, cross-phase, or operator-classified), and HIL escalation when a gate or sub-checkpoint failure requires operator review. It schedules phase execution, dispatches training jobs to a remote agent through a ClearML queue, persists pause-state files between phases, reconciles pause-state with the ClearML task model, and assembles the per-run audit trail.
The control flow is deterministic and sequential by design: the orchestrator advances one phase at a time within a single pipeline instance, and the asynchronous concerns in the implementation, namely the externally submitted operator decisions and the Phase 4 trial dispatch to the gpu-train queue, are isolated behind synchronous polling boundaries so that gate transitions remain reproducible and auditable. Because automatic retries are disabled, every loopback is a single operator-directed re-execution of the recovery-point-to-current phase range that terminates in either a passing gate or an abort, so the pipeline cannot loop without bound and the number of loopbacks per run is set by operator dispositions rather than by the pipeline itself.
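The sequential, operator-bounded control flow can be sketched as a small loop. This is an illustration under assumptions rather than the orchestrator's code: the function signature, the `revise` and `abort` tokens, and the scripted decision iterator are hypothetical stand-ins for the polling boundary described above.

```python
def run_lifecycle(phases, gate_passes, operator_decisions, recovery_point):
    """Advance one phase at a time; loop back only on an operator decision.

    phases: ordered phase names.
    gate_passes(phase, attempt) -> bool: outcome of the phase's gate.
    operator_decisions: iterator yielding "revise" or "abort".
    recovery_point(phase) -> phase name to re-execute from.
    """
    trail = []  # per-run audit trail of gate outcomes and decisions
    attempts = {phase: 0 for phase in phases}
    index = 0
    while index < len(phases):
        phase = phases[index]
        attempts[phase] += 1
        if gate_passes(phase, attempts[phase]):
            trail.append((phase, "pass"))
            index += 1
            continue
        trail.append((phase, "fail"))
        # No automatic retry: every loopback consumes one operator decision.
        decision = next(operator_decisions)
        trail.append((phase, decision))
        if decision == "abort":
            return "aborted", trail
        if decision != "revise":
            raise ValueError(f"unknown decision: {decision!r}")
        index = phases.index(recovery_point(phase))
    return "completed", trail


# Hypothetical run: the P2 gate fails once, the operator revises from P1.
status, trail = run_lifecycle(
    phases=["P0", "P1", "P2"],
    gate_passes=lambda phase, attempt: not (phase == "P2" and attempt == 1),
    operator_decisions=iter(["revise"]),
    recovery_point=lambda phase: "P1" if phase == "P2" else phase,
)
print(status, trail)
```

Because the loop can move backwards only by consuming a decision, the number of loopbacks in the sketch is bounded by the number of operator dispositions supplied.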
The orchestrator runs on a self-hosted ClearML platform that exposes four services consumed across the seven phases (top of
Figure 2): three storage-tier services and one remote-queue execution service. The experiment tracker logs per-run metrics and parameters. The dataset manager records versioned datasets with parent–child lineage. The artifact store keeps typed binary artifacts, including the document trail and the per-trial model files; the champion model file is uploaded as a task artifact on the producing trial task, and is reached through the document chain, which keeps a single source of lineage for governance and audit.
The remote-queue service handles graphics processing unit (GPU) dispatch. In Phase 4, each hyperparameter trial is created as a ClearML task and enqueued on the gpu-train queue, where a clearml-agent worker on a compute unified device architecture (CUDA)-enabled host pulls each task, runs the training remotely, and reports its metrics and artifacts back.
Through the application-code orchestrator, its consumption of the four self-hosted ClearML platform services, and the queue-based GPU offload, D3 binds the lifecycle defined in D2 to a concrete execution model that keeps end-to-end control in a single Python process while letting compute-heavy training run on a dedicated worker.
2.5. Human-in-the-Loop Oversight (D5)
Domain D5 is the human-oversight layer of the pipeline, where escalations from D2's gates and sub-checkpoints are surfaced to an authorised operator, decisions are recorded, and routing back into the execution flow is performed under explicit, auditable rules.
The implementation defines 20 checkpoints across the seven phases, of which 17 are enabled in the default route-test configuration. The three disabled checkpoints, all in Phase 6 (HIL-P6-6.1.5, HIL-P6-6.1.10, HIL-P6-6.3.1), are computed automatically from thresholds and schedules declared in Phase 0 and are excluded from the operator decision flow in both execution modes, so only Gate 6 (HIL-P6-G6) requires human review in Phase 6.
The subsystem is built from four application-level components (right-hand side of
Figure 2): a Checkpoint Manager (persists pause-state and decision artefacts, pauses and resumes the pipeline), a Decision Manager (dispatches the decision-option set, applies routing rules, writes the audit trail), a HIL Dashboard (surfaces the failed artefact, the decision options, and an editable Doc01/Doc02 view for revision-and-resume), and a Notification Bridge (asynchronous operator alerts).
The checkpoint registry is distinguished into four types: input checkpoints accept a configuration submission; gate checkpoints record a pass, fail, or revise decision; decision checkpoints record a yes-or-no (or domain-specific equivalent) outcome with optional routing; and classification checkpoints record a root-cause categorisation with multi-target routing. Triggers fire at two levels. At sub-gates the checkpoint fires mid-phase, and a failure causes the phase to exit early before its main gate. At the formal decision gates G0 through G6 the checkpoint fires at the end of the phase. In both cases, a failure routes to the operator for review. The two execution modes differ only in which routine non-failure checkpoints are surfaced to the operator.
Table 2 lists representative entries from the registry, showing only the most relevant decision options and omitting the three disabled checkpoints.
A checkpoint failure is always detected before operator disposition. The automatic checkpoint or gate logic first evaluates the relevant metric, artifact, or document structure and records the failure condition. The human-in-the-loop subsystem is then invoked to determine the disposition of that already-detected condition. Across checkpoint types, the operator’s disposition falls into one of three response categories: approve and continue, revise and rerun from a registered recovery point, or terminate the run. The exact decision token is checkpoint-specific, as shown in
Table 2, but every routing target in the registry maps to one of these three categories.
The pipeline supports two configurable execution modes that determine how human-in-the-loop checkpoints behave at runtime. Automatic mode treats every routine checkpoint as approved using its registered default decision and runs without operator pauses; if an automatic check at a sub-gate or main gate detects a failure, the orchestrator escalates that failure to the operator for adjudication, so HIL fallback on detected failures is inherent to automatic mode. Human-in-the-loop mode pauses at every enabled checkpoint whose firing precondition is satisfied on the current run, regardless of whether the underlying automatic check passes or fails, and requires explicit operator approval at each one.
Enabled checkpoints fall into three behavioural classes that determine whether they fire on a given run. Routine checkpoints fire on every pipeline pass through their step. Trigger-conditional checkpoints fire only when an upstream condition is satisfied, for example a sub-gate failure, a performance-check outcome that requires adjudication, or a quality-assessment finding. Schedule-gated checkpoints fire on a configured cadence, for example the governance review interval declared in the dataset profile. Trigger-conditional and schedule-gated checkpoints do not surface as operator pauses on a run that does not satisfy their precondition, even in human-in-the-loop mode.
The two modes and three checkpoint classes together produce the three operational regimes that the experiment matrix in
Section 4 distinguishes: automatic execution that completes without escalation, automatic execution where a detected failure triggers HIL fallback, and human-in-the-loop execution that surfaces every enabled routine checkpoint reached on the current run. The first two regimes share a single configuration; their distinction is whether a failure fired during the run.
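The interaction of the two execution modes with the three checkpoint classes can be sketched as a pair of predicates. The string tokens and function names are hypothetical; the rules encode only what the preceding paragraphs state.

```python
CONDITIONAL_CLASSES = ("trigger_conditional", "schedule_gated")


def fires(checkpoint_class, precondition_met):
    """Whether a checkpoint fires on the current run."""
    if checkpoint_class == "routine":
        return True  # fires on every pass through its step
    if checkpoint_class in CONDITIONAL_CLASSES:
        return precondition_met
    raise ValueError(f"unknown checkpoint class: {checkpoint_class!r}")


def pauses_for_operator(mode, enabled, checkpoint_class,
                        precondition_met, check_failed):
    """Whether the run pauses for an operator decision at this checkpoint."""
    if not enabled or not fires(checkpoint_class, precondition_met):
        return False
    if mode == "hil":
        return True  # pauses whether the automatic check passed or failed
    if mode == "automatic":
        return check_failed  # HIL fallback only on a detected failure
    raise ValueError(f"unknown mode: {mode!r}")


def regime(mode, any_failure_fired):
    """Map a run onto one of the three operational regimes."""
    if mode == "hil":
        return "hil_execution"
    if mode == "automatic":
        return ("automatic_with_hil_fallback" if any_failure_fired
                else "automatic_without_escalation")
    raise ValueError(f"unknown mode: {mode!r}")


print(pauses_for_operator("automatic", True, "routine", True, False))
print(pauses_for_operator("hil", True, "schedule_gated", False, False))
print(regime("automatic", any_failure_fired=True))
```

The first two regimes differ only in the `any_failure_fired` argument, mirroring the statement that they share a single configuration.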
Through the four application-level components, the typed checkpoint registry, the pause–decide–resume workflow, and the two execution modes that combine with the three checkpoint classes to produce three operational regimes, D5 makes each escalation a structured event with a registered trigger, a typed decision, a recorded routing target, and an entry in the audit trail, ensuring that operator authority over the pipeline is exercised within the same evidence chain that governs its automatic execution.