Article

An End-to-End Data and Machine Learning Pipeline for Energy Forecasting: A Systematic Approach Integrating MLOps and Domain Expertise

SDU Center for Energy Informatics, Maersk Mc-Kinney Moller Institute, The Faculty of Engineering, University of Southern Denmark, 5230 Odense, Denmark
*
Author to whom correspondence should be addressed.
Information 2025, 16(9), 805; https://doi.org/10.3390/info16090805
Submission received: 15 August 2025 / Revised: 2 September 2025 / Accepted: 6 September 2025 / Published: 16 September 2025

Abstract

Energy forecasting is critical for modern power systems, enabling proactive grid control and efficient resource optimization. However, energy forecasting projects require systematic approaches that span project inception to model deployment while ensuring technical excellence, domain alignment, regulatory compliance, and reproducibility. Existing methodologies such as CRISP-DM provide a foundation but lack explicit mechanisms for iterative feedback, decision checkpoints, and continuous energy-domain-expert involvement. This paper proposes a modular end-to-end framework for energy forecasting that integrates formal decision gates in each phase, embeds domain-expert validation, and produces fully traceable artifacts. The framework supports controlled iteration, rollback, and automation within an MLOps-compatible structure. A comparative analysis demonstrates its advantages in functional coverage, workflow logic, and governance over existing approaches. A case study on short-term electricity forecasting for a 2560 m² office building validates the framework, producing 24-h-ahead predictions with a recurrent neural network (RNN) that reached an RMSE of 1.04 kWh and an MAE of 0.78 kWh. The results confirm that the framework enhances forecast accuracy, reliability, and regulatory readiness in real-world energy applications.

1. Introduction

Forecasting has long been an essential function in the power and energy industry, with thousands of research papers dedicated to predicting electricity demand, prices, and renewable generation [1]. Accurate energy forecasts inform critical decisions for utilities and grid operators, enabling proactive planning, resource optimization, and integration of renewable energy sources [2]. The ability to anticipate future energy supply and demand underpins a wide range of applications, from balancing electricity grids and scheduling power plants to designing demand response programs and optimizing microgrid operations [2]. In recent years, the rise of smart grids and Internet of Things (IoT) sensor networks has further heightened the need for robust forecasting pipelines that can handle large-scale, real-time data and complex machine learning (ML) models [3].
Despite the abundance of forecasting models in the literature, energy forecasting practitioners often face challenges in end-to-end implementation that are specific to the energy domain. Energy forecasting projects not only require that the right predictive algorithms are chosen but also that data quality is ensured, incorporating energy-system domain knowledge and validating results against physical and operational constraints. Poor data quality or misaligned project goals can lead to the adage “garbage in, garbage out” [4], where even advanced models fail due to fundamental flaws in the input data or problem definition. Moreover, energy forecasting models must account for complex temporal patterns, weather dependencies, and regulatory requirements that generic data science approaches may overlook [5,6,7]. To address these challenges, a structured methodology is crucial. Indeed, successful energy forecasting initiatives often adhere to similar staged approaches that iterate the process of understanding the domain, preparing high-quality data, and refining models [2,8]. However, there is a need for a domain-specific framework tailored to the nuances of energy time-series data and the requirements of forecasting tasks.
Several recent works underscore the importance of end-to-end pipeline design in energy forecasting. Im et al. (2024) developed a Machine Learning Operations (MLOps)-centric [9] data pipeline for real-time energy management, integrating streaming data platforms and databases to deploy forecasting models in production [3]. Their results highlighted the trade-off between model complexity and operational speed [3]. Other researchers have proposed structured workflows that similarly include data preprocessing, feature engineering, model training with hyperparameter tuning, and final model selection as distinct stages [10]. Our work builds on these concepts by providing a detailed, reproducible blueprint tailored to energy forecasting projects (such as day-ahead load prediction or renewable generation forecasting). We also align the pipeline with documentation practices and decision criteria at each phase, which helps in maintaining transparency and facilitating collaboration among different stakeholders.
Structured process models have long been used to guide data mining and data science projects. The Cross-Industry Standard Process for Data Mining (CRISP-DM), introduced in the late 1990s, remains one of the most widely adopted frameworks for analytics projects [11,12]. CRISP-DM breaks projects into six phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment) and allows iterative revisiting of earlier phases if needed [11]. This general workflow has proven valuable for ensuring a systematic approach to developing insights from data. However, as data science practice has evolved, particularly in the energy domain, new challenges have emerged that traditional process models do not fully address. In particular, organizations today face needs for the following:
  • Continuous Iteration and Feedback Loops: Modern energy forecasting ML pipelines often run in a continuous development environment, such as development and operations (DevOps) [13] or MLOps, where models are retrained and redeployed regularly in response to new data, seasonal changes, or performance triggers [14]. While CRISP-DM permits backtracking between phases, it does not explicitly formalize decision points or gating criteria at each phase beyond a general waterfall-style progression [15]. This can make it unclear when or how to iterate or rollback if issues arise.
  • Domain-Expert Involvement: Ensuring that analytical findings make sense in context is critical, especially in high-stakes or domain-specific applications [16]. CRISP-DM emphasizes business understanding at the start and evaluation before deployment, which implicitly call for domain knowledge input. In practice, however, many ML pipelines implemented in pure technical terms neglect sustained involvement of energy domain experts after the initial requirements stage [11]. The separation of domain experts from data scientists can lead to models that technically optimize a metric but fail to align with real-world needs and constraints [17].
  • Documentation, Traceability, and Compliance: With growing concerns about Artificial Intelligence (AI) accountability and regulations, such as the General Data Protection Regulation (GDPR) [18] and the forthcoming European Union Artificial Intelligence Act (EU AI Act) [19], there is a need for rigorous documentation of data sources, modeling decisions, and model outcomes at each step of the pipeline. Traceability, i.e., maintaining a complete provenance of data, processes, and artifacts, is now seen as a key requirement for trustworthy AI [20].
  • Modularity and Reusability: The rise of pipeline orchestration tools (Kubeflow [21], Apache Airflow [22], TensorFlow Extended (TFX) [23], etc.) reflects a best practice of designing ML workflows as modular components arranged in a Directed Acyclic Graph (DAG) [24]. Each component is self-contained (for example, data ingestion, validation, training, evaluation, and deployment as separate steps) and can be versioned or reused. Classical process models were conceptually modular but did not consider the technical implementation aspect. Integrating the conceptual framework with modern pipeline architecture can enhance the clarity and maintainability of data science projects.
The proposed framework addresses these gaps by combining the strengths of established methodologies with new enhancements for iterative control, expert validation, and traceability. In essence, we extend the CRISP-DM philosophy with formal decision gates at each phase, each involving validation steps by multiple stakeholders, all underpinned by thorough documentation practices. The framework is enhanced through selective automation and continuous improvement. Technical validation steps, such as quality checks, artifact versioning, and forecast anomaly detection, can be partially or fully automated, while manual governance gates ensure expert oversight and compliance. The framework remains modular and tool-agnostic, aligning with contemporary MLOps principles, so it can be implemented using pipeline orchestration platforms in practice.
In summary, the key contributions of the proposed framework are as follows:
  • Formalized Feedback Loops: Introduction of explicit decision gates at each phase, enabling iteration, rollback, or re-processing when predefined success criteria are not met. This makes the iterative nature of data science explicit rather than implicit [25] and supports the partial automation of loopback triggers and alerts.
  • Systematic Energy-Domain-Expert Integration: Incorporation of energy domain expertise throughout the pipeline, not just at the beginning and end, to ensure results are plausible and actionable in the energy system context. This goes beyond typical consultation, embedding energy domain feedback formally into the pipeline, particularly for validating data patterns, feature-engineering decisions, and model behavior against energy system knowledge.
  • Enhanced Documentation and Traceability: Each phase produces traceable outputs (datasets, analysis reports, model artifacts, evaluation reports, etc.) that are versioned and documented. This provides end-to-end provenance for the project, responding to calls for AI traceability as a foundation of trust in critical energy infrastructure applications [26].
  • Modularity and Phase Independence: The process is structured into self-contained phases with well-defined inputs and outputs, making it easier to integrate with pipeline orchestration tools and to swap or repeat components as needed. This modular design mirrors the implementation of workflows in platforms like Kubeflow Pipelines and Apache Airflow, which treat each step as a reusable component in a DAG [27]. It facilitates collaboration between data scientists, engineers, and domain experts, and supports consistent, repeatable runs of the pipeline.
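To illustrate this modular structure, the following tool-agnostic Python sketch shows how phases could be wired as self-contained components in a simple DAG. The step names and the run_pipeline helper are hypothetical stand-ins; in practice, an orchestration platform such as Kubeflow Pipelines or Apache Airflow would supply equivalent constructs.

```python
from typing import Callable, Dict, List

# Hypothetical, tool-agnostic components: each step is self-contained and only
# consumes the artifacts produced by its declared upstream dependencies.
def ingest_data(_):         return {"dataset": "raw_load.csv"}
def validate_data(art):     return {"validated": art["dataset"]}
def engineer_features(art): return {"features": f"features({art['validated']})"}
def train_model(art):       return {"model": f"model_on({art['features']})"}
def evaluate_model(art):    return {"report": f"metrics_for({art['model']})"}

# DAG encoded as step -> list of upstream dependencies.
DAG: Dict[str, List[str]] = {
    "ingest": [], "validate": ["ingest"], "features": ["validate"],
    "train": ["features"], "evaluate": ["train"],
}
STEPS: Dict[str, Callable] = {
    "ingest": ingest_data, "validate": validate_data,
    "features": engineer_features, "train": train_model, "evaluate": evaluate_model,
}

def run_pipeline() -> Dict[str, dict]:
    """Execute steps in dependency order, keeping every artifact for traceability."""
    artifacts: Dict[str, dict] = {}
    pending = dict(DAG)
    while pending:
        ready = [s for s, deps in pending.items() if all(d in artifacts for d in deps)]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        for step in ready:
            upstream = {k: v for dep in pending[step] for k, v in artifacts[dep].items()}
            artifacts[step] = STEPS[step](upstream)
            del pending[step]
    return artifacts

print(run_pipeline())
```

Because each step only consumes the artifacts of its declared dependencies, a component can be swapped or re-run without affecting the rest of the pipeline.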
The proposed framework is organized into seven main phases (numbered 0 through 6) that reflect the logical progression of an energy forecasting project. Each phase contains clearly defined objectives, tasks, documentation outputs, and decision gates to decide if the project is ready to advance. This design ensures a systematic workflow where each step’s deliverables feed into the next, and any issues trigger iterative loops for remediation before resources are wasted on flawed inputs or models. Notably, the pipeline framework places strong emphasis on early-phase activities such as project scoping and data validation, which are sometimes under-emphasized in purely algorithm-focused studies. The framework also incorporates domain-expert validation to bridge the gap between data-driven findings and real-world energy system understanding. By doing so, it aims to improve forecast accuracy and credibility, as energy domain knowledge can help identify spurious correlations or anomalies.
The remainder of this paper is organized as follows: Section 2 reviews related work and existing process models, highlighting their coverage and limitations. Section 3 introduces the proposed framework, detailing each phase and the decision gate mechanism. Section 4 provides a comparative discussion of how the framework improves upon the state of the art in terms of functional coverage, workflow logic, and technical features. Section 5 presents a case study that demonstrates the framework in a practical energy forecasting scenario. Finally, Section 6 concludes the paper and discusses future work.

2. Background and Related Work

2.1. Existing Data Science Process Models

The need for structured methodologies in data analytics has led to numerous process models over the years. CRISP-DM is often cited as the de facto standard, providing a generic six-phase approach that is industry- and application-neutral [28]. It gained popularity for its simplicity and generality by defining stable, general tasks for each phase that can adapt to different projects. In CRISP-DM, after the completion of each phase, practitioners may revisit previous phases if necessary until the project meets its success criteria, essentially functioning as a “waterfall with backtracking” process [11]. This flexibility was one of CRISP-DM’s strengths; however, the model itself leaves it to the practitioners to determine when to iterate back and how to document those iterations.
Several extensions and alternatives to CRISP-DM have been proposed to address its shortcomings in modern contexts, though few specifically address energy domain requirements:
  • Framework for ML with Quality Assurance (QA): Cross-Industry Standard Process for Machine Learning with Quality assurance (CRISP-ML(Q)) extends CRISP-DM by adding an explicit Monitoring and Maintenance phase at the end of the lifecycle to handle model degradation in changing environments [11]. It also merges the Business Understanding and Data Understanding phases into one, recognizing their interdependence in practice [11]. Most notably, CRISP-ML(Q) introduces a comprehensive QA methodology across all phases: for every phase, specific requirements, constraints, and risk metrics are defined, and if quality risks are identified, additional tasks (e.g., data augmentation, bias checks) are performed to mitigate them [29]. This ensures that errors or issues in each step are caught as early as possible, improving the overall reliability of the project [11]. This quality-driven approach is informed by established risk management standards, such as International Organization for Standardization (ISO) 9001: Quality Management Systems—Requirements [30], and requires that each stage produces verifiable outputs to demonstrate compliance with quality criteria. Although CRISP-ML(Q) improves overall model governance and reliability, it is designed to be application-neutral and does not specifically address challenges unique to energy forecasting, such as handling temporal structures, integrating weather-related inputs, or capturing domain-specific operational constraints.
  • Frameworks for Specific Domains: In regulated or specialized industries, variants of CRISP-DM have emerged. For instance, Financial Industry Business Data Model (FIN-DM) [31] augments CRISP-DM by adding a dedicated Compliance phase that focuses on regulatory requirements and risk mitigation (including privacy laws like GDPR), as well as a Requirements phase and a Post-Deployment review phase [32]. This reflects the necessity of embedding governance and legal compliance steps explicitly into the data science lifecycle in domains where oversight is critical.
  • Industrial Data Analysis Improvement Cycle (IDAIC) [33] and Domain-Specific CRISP-DM Variants: Other researchers have modified CRISP-DM to better involve domain experts or to address maintenance. For example, Ahern et al. (2022) [33] propose a process to enable domain experts to become citizen data scientists in industrial settings by renaming and adjusting CRISP-DM phases and adding new ones like Domain Exploration and Operation Assessment for equipment maintenance scenarios [32]. These changes underscore the importance of domain knowledge and continuous operation in certain contexts.
Beyond CRISP-DM and its direct descendants, the rise of MLOps has influenced how we conceptualize the ML lifecycle. MLOps is not a single defined process model, but rather a set of practices and tools to automate and streamline the end-to-end ML lifecycle, often drawing on DevOps principles [9,34]. In an MLOps paradigm, one often defines pipelines that encompass stages such as data ingestion, data validation, model training, model evaluation, deployment, and monitoring in production [35]. These pipelines are frequently implemented using workflow orchestrators, such as Kubeflow, Airflow, and TFX. Such tools allow practitioners to define each step of an ML workflow as a separate component (e.g., a Docker container [36] or script), with dependencies forming a DAG. This enables scalable, repeatable, and traceable execution of the entire ML process [34,37]. Metadata tracking is a built-in feature in many of these platforms; for example, they record which dataset version and parameter configuration were used to train a given model and when that model was deployed [35,37]. This addresses the traceability challenge by answering questions like “Which data and code produced this model and what were its evaluation results?” [35]. However, the focus of most MLOps tools is on automation and reproducibility; they do not inherently ensure that a domain expert has vetted the model’s behavior or that documentation is written in human-readable form—those aspects still rely on the development team’s process.
Hence, existing methodologies each contribute parts of the picture: CRISP-DM provides a high-level scaffold for analytic projects; CRISP-ML(Q) and others inject quality control, maintenance, and, sometimes, compliance; MLOps frameworks deliver technical capabilities for automation and tracking. Yet, there remains a gap in having a unified methodology that formally integrates iterative decision gating, domain validation, and comprehensive traceability in a modular, implementable way. This is the gap our work aims to fill.

2.2. Limitations of Current Approaches

From the brief survey above, a few limitations can be identified in the current state of practice that motivate our framework:
  • Implicit vs. Explicit Iteration: CRISP-DM and its variants generally allow iteration (e.g., CRISP-DM’s arrows looping back, CRISP-ML(Q)’s risk-based checks). However, the decision to loop back is often implicit or left to the judgment of the team, without a formal structure. For instance, CRISP-DM’s Evaluation phase suggests checking if the model meets business objectives; if not, one might return to an earlier stage [11]. But this is a single checkpoint near the end. The proposed framework introduces multiple gated checkpoints with clear criteria (e.g., data quality metrics, model performance thresholds, validation acceptance by domain experts) that must be met to proceed forward, which increases process automation opportunities while also making the workflow logic explicit and auditable.
  • Domain Knowledge Integration: Many pipelines emphasize technical validation (metrics, statistical tests) but underemphasize semantic validation. As noted by Studer et al., in CRISP-ML(Q), separating domain experts from data scientists can risk producing solutions that miss the mark [11]. Some domain-specific processes have tried to bridge this, as in Ahern et al.’s work and other “CRISP-DM for X” proposals [33], but a general solution is not common. The proposed framework elevates domain-expert involvement to the formal phases and suggests including domain experts not only in defining requirements but also in reviewing intermediate results and final models. This provides a structured feedback loop where expert insight can trigger model adjustments.
  • Comprehensive Traceability: Reproducibility and traceability are increasingly seen as essential for both scientific rigor and regulatory compliance. Tools exist to version data and models, but few conceptual frameworks explicitly require that each step’s outputs be recorded and linked to inputs. The High-Level Expert Group on AI’s guidelines for Trustworthy AI [38] enumerate traceability as a requirement, meaning one should document the provenance of training data, the development process, and all model iterations [20]. The proposed framework bakes this in by design: every phase yields an artifact (or set of artifacts) that is stored with metadata. This could be implemented via existing MLOps, but methodologically, we ensure that no phase is “ephemeral”; even exploratory analysis or interim results should be captured in writing or code for future reference. This emphasis exceeds what CRISP-DM prescribed and aligns with emerging standards in AI governance [20].
  • Modularity and Extensibility: Traditional methodologies were described in documents and were not directly operational. Modern pipelines are often executed in software. A good process framework today should be easily translated into pipeline implementations. CRISP-ML(Q) explicitly chose to remain tool-agnostic so that it can be instantiated with various technologies [11]. The proposed framework shares that philosophy: the framework is conceptual, but each phase corresponds to tasks that could be implemented as modules in a pipeline. By designing with clear modular boundaries, we make it easier to integrate new techniques (for instance, if a new data validation library arises, it can replace the validation component without affecting other parts of the process). This modularity also aids reusability: components of the pipeline can be reused across projects, improving efficiency.
In light of these observed limitations, the proposed framework attempts to synthesize these considerations into a single, coherent methodology. The design of the proposed framework is further informed by theoretical perspectives in decision theory and governance in software engineering. Formal decision gates draw on principles of staged quality assurance, where predefined criteria structure the iteration and rollback. Continuous documentation aligns with provenance theory in trustworthy AI, ensuring end-to-end traceability. By embedding domain expert validation throughout, the framework operationalizes requirements based on engineering principles that stress semantic alignment between technical artifacts and real-world system behavior.

3. Proposed Energy Forecasting Pipeline Framework

The proposed framework is a phase-oriented process pipeline designed to extend the standard data science lifecycle with enhanced structural control, traceability, and decision governance. The framework consists of seven logically ordered phases, spanning early scenario understanding to post-deployment monitoring and feedback. Phase 0 defines the forecasting scenario, establishes domain-specific constraints, and derives the success criteria. Phases 1–5 execute the data management, modeling, evaluation, and deployment activities. Phase 6 supports automated monitoring and ensures readiness for update cycles and retraining.
Each phase contains internal decision points that evaluate technical sufficiency, such as data validity, semantic alignment, or model performance, based on criteria derived during Phase 0. These decision points may initiate local iteration within the phase or trigger loopbacks to earlier phases, depending on whether technical, semantic, or operational requirements have been met. This built-in flexibility enables corrective action and refinement while maintaining control over pipeline progression.
At each phase, a formal decision gate serves a broader governance function. Unlike internal technical checks, these gates require cross-functional stakeholder review and sign-off to ensure that
  • All the internal decision points pass the predefined technical requirements and criteria;
  • Artifacts are documented, versioned, and traceable;
  • Outputs and processes comply with technical, operational, regulatory, and domain standards;
  • Risks are evaluated and either mitigated or accepted;
  • The project is ready to transition to the next phase without carrying forward unresolved issues.
They function as structured project control mechanisms that prevent the propagation of incomplete or unverified deliverables, ensuring that transitions across phases are justified and risk-managed.
The framework also incorporates mechanisms for controlled loopbacks, allowing governed transitions back to earlier phases if a deliverable fails to meet either technical thresholds or gate-level governance criteria. These transitions are explicitly tracked and recorded, enabling transparent reprocessing, rollback, and root-cause analysis. This ensures that the system remains adaptable while preserving accountability.
Finally, the framework is enhanced through automation and continuous improvement. By structuring all outputs as versioned, traceable artifacts and aligning them to predefined criteria, the pipeline supports automated quality checks, artifact lineage tracking, and streamlined updates. This ensures that the forecasting process remains resilient to changing data, evolving system behavior, or updated operational constraints while maintaining full reproducibility and governance integrity.
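As a minimal sketch of how a decision gate and its traceable decision record could be implemented, consider the following Python example; the gate identifier, criteria, metric names, and file paths are illustrative assumptions rather than elements prescribed by the framework.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

# Illustrative Gate 1 criteria, assumed to originate from Doc02_RequirementsCriteria.
GATE1_CRITERIA = {"max_missing_ratio": 0.05, "min_timestamp_coverage": 0.98}

def evaluate_gate(gate_id: str, metrics: dict, criteria: dict, sign_offs: list,
                  log_dir: Path = Path("gate_logs")) -> bool:
    """Check phase metrics against predefined criteria and persist a traceable record."""
    checks = {
        "missing_ratio_ok": metrics["missing_ratio"] <= criteria["max_missing_ratio"],
        "coverage_ok": metrics["timestamp_coverage"] >= criteria["min_timestamp_coverage"],
        "signed_off": len(sign_offs) > 0,
    }
    passed = all(checks.values())
    record = {
        "gate": gate_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "metrics": metrics,
        "criteria": criteria,
        "checks": checks,
        "sign_offs": sign_offs,
        "decision": "advance" if passed else "loopback",
    }
    # A content hash makes later tampering with the decision record detectable.
    record["record_hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    log_dir.mkdir(exist_ok=True)
    fname = f"{gate_id}_{record['timestamp'].replace(':', '-')}.json"
    (log_dir / fname).write_text(json.dumps(record, indent=2))
    return passed

# Example: Gate 1 (Data Acquisition Ready) with illustrative metric values.
ok = evaluate_gate("Gate1", {"missing_ratio": 0.02, "timestamp_coverage": 0.995},
                   GATE1_CRITERIA, sign_offs=["domain_expert", "project_manager"])
print("advance to Phase 2" if ok else "loop back to Phase 1")
```

Failing criteria produce a "loopback" decision that is itself archived, so every iteration remains auditable.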
Figure 1 conceptually shows the pipeline. The framework explicitly incorporates energy domain considerations throughout, ensuring that technical decisions align with energy system requirements, physical constraints, and operational needs. Each phase produces versioned artifacts—including datasets, reports, and configuration files—which collectively form a fully traceable project history suitable for regulatory compliance and reliability assessments. A summary of key datasets and documentation artifacts generated in each phase is provided in Appendix A.

3.1. Phase 0—Project Foundation

Phase 0 establishes the fundamental groundwork for energy forecasting projects through two structured, sequential workflows, (i) Scenario Understanding and Alignment (0.1) and (ii) Compliance and Technical Checks (0.2), followed by a decision gate, Gate 0: Foundation Check (0.3). This phase ensures that projects are both strategically aligned and technically feasible before proceeding to subsequent implementation phases. The detailed workflow for Phase 0 is illustrated in Figure 2.

3.1.1. Scenario Understanding and Alignment

The scenario understanding workflow, detailed in Figure 2, begins with task entry (0.1.1) and proceeds through systematic requirement definition. The process first establishes a comprehensive scenario understanding and physical system comprehension (0.1.2), which grounds the project in its use case and the physical characteristics of the system. This step covers use-case analysis, assessment of the physical installation, identification of influencing factors, and model scoping, all of which inform the following phases. The requirements derivation and criteria definition process (0.1.3) then establishes measurable success criteria across project phases. The framework provides phase-specific guidance, from overall forecasting target and horizon requirements to detailed data quality metrics, feature generation criteria, and model performance benchmarks.
This systematic approach leads to the creation of two documents, Doc01_ProjectSpecification and Doc02_RequirementsCriteria (0.1.4), which ensure traceability and serve as reference documentation for the following phases.
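For concreteness, the content of Doc02_RequirementsCriteria could additionally be captured in machine-readable form, as in the following Python sketch, so that later decision gates can evaluate the same criteria automatically; all field names and threshold values are illustrative assumptions.

```python
# Hypothetical machine-readable excerpt of Doc02_RequirementsCriteria.
REQUIREMENTS_CRITERIA = {
    "forecasting_target": "building electricity consumption [kWh]",
    "forecast_horizon_hours": 24,
    "temporal_resolution": "1h",
    "data_quality": {            # evaluated at Gate 1
        "max_missing_ratio": 0.05,
        "min_history_days": 365,
        "plausible_value_range_kwh": [0.0, 50.0],
    },
    "feature_engineering": {     # evaluated at Gate 3
        "required_factor_groups": ["calendar", "weather", "historical_load"],
    },
    "model_performance": {       # evaluated at Gate 4
        "rmse_kwh_max": 1.5,
        "mae_kwh_max": 1.0,
    },
    "compliance": {"gdpr_review_required": True},
}
```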

3.1.2. Compliance and Technical Checks

Following scenario understanding and requirement definition, the workflow takes as input the overall task description and the two documents produced in the previous step (0.2.1) and then conducts compliance verification and feasibility assessment. The process begins with a formal initial compliance check (0.2.2), which verifies that the project adheres to relevant regulatory, contractual, and organizational standards so that data processing is ethical and safe. This check involves reviewing the applicable regulations and standards against both technical and process requirements, and it also covers a review of relevant certifications.
Subsequently, the process proceeds to assess system constraints and technical feasibility (0.2.3). This includes evaluating the availability of computational resources, data transmission capabilities, infrastructure readiness, and any network or storage limitations.
These findings are synthesized and require an update to the previously defined project specification and requirements-and-criteria documents (0.2.4). The results are also documented in a comprehensive Doc03_ComplianceCheckReport and Doc04_SystemTechFeasibilityReport (0.2.5) for traceability.

3.1.3. Gate 0—Foundation Check

The final milestone of Phase 0 is Gate 0: Foundation Check (0.3.1), which serves as a formal entry gate into the forecasting pipeline. It consolidates the outcomes of the scenario alignment, requirement definition, compliance verification, and feasibility assessment. The gate ensures that
  • Specifications, requirements, and success criteria are clearly defined and documented. Compliance obligations are clearly identified and documented. System and infrastructure feasibility have been confirmed.
  • All output artifacts are documented.
  • Outputs and processes are compliant.
  • Risks are identified and accepted.
  • There are no unresolved issues.
Only when all foundational conditions are met does the project proceed to Phase 1: Data Foundation. A comprehensive Doc05_Phase0Report (0.3.2) is output to log the activities completed in this decision gate, and it also includes all stakeholder sign-offs. If any issues occur or if the reviews are not satisfactory, a Doc06_LogIssues (0.3.3) will be created and the process will loop back to the beginning of Phase 0. This structured approach ensures that projects are grounded in realistic assumptions and are compliant with constraints, thereby minimizing the risk of costly midstream redesigns.

3.2. Phase 1—Data Foundation

Phase 1 establishes the data acquisition foundation necessary for energy forecasting through two sequential workflows, (i) Data Identification and Quality Check (1.1) and (ii) Data Ingestion and Validation (1.2), followed by a decision gate, Gate 1: Data Acquisition Ready (1.3). This phase ensures comprehensive data acquisition and validates data integrity before proceeding to the analytical phases. The detailed workflow for Phase 1 is illustrated in Figure 3.

3.2.1. Data Identification and Quality Check

The data identification workflow is initiated with the systematic identification of available and required input (1.1.1) relevant to the project. From the previous phase, we defined the factors needed for the forecasting project; this step involves cataloging or re-analyzing these factors into implementable data features and identifying the available data sources.
The process continues with a quality check of the raw energy data. Starting with the raw energy data input (1.1.2) and energy data loading (1.1.3), a quality assessment module (1.1.4) evaluates feature completeness, data type and schema compatibility, and consistency across time intervals. Figure 4 illustrates a potential sequence of internal steps involved in the initial quality check process. Quality verification (1.1.5) is then conducted against predefined metrics.
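A minimal sketch of such a quality assessment, assuming an hourly series with timestamp and energy_kwh columns and illustrative thresholds, is shown below.

```python
import pandas as pd

def initial_quality_check(df: pd.DataFrame, expected_freq: str = "1h",
                          max_missing_ratio: float = 0.05) -> dict:
    """Schema, completeness, and temporal-consistency checks on raw energy data."""
    df = df.sort_values("timestamp")
    full_index = pd.date_range(df["timestamp"].min(), df["timestamp"].max(),
                               freq=expected_freq)
    report = {
        "has_required_columns": {"timestamp", "energy_kwh"}.issubset(df.columns),
        "numeric_energy": pd.api.types.is_numeric_dtype(df["energy_kwh"]),
        "missing_ratio": float(df["energy_kwh"].isna().mean()),
        "timestamp_coverage": df["timestamp"].nunique() / len(full_index),
        "duplicate_timestamps": int(df["timestamp"].duplicated().sum()),
    }
    report["passes"] = (report["has_required_columns"] and report["numeric_energy"]
                        and report["missing_ratio"] <= max_missing_ratio
                        and report["duplicate_timestamps"] == 0)
    return report

# Example on a small synthetic hourly series.
sample = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=48, freq="1h"),
    "energy_kwh": [1.0] * 48,
})
print(initial_quality_check(sample))
```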
The workflow is designed to prioritize the energy data, as it is central to the forecasting task. This approach avoids unnecessary effort on supplementary feature data until the core dataset has been validated. Once the energy data passes the quality check, the output datasets are saved (1.1.6), and the formal documents Doc11_DataSourceInventory and Doc12_InitialDataQualityReport are generated (1.1.7).
After processing the energy data, the pipeline advances to a decision point (1.1.8) to evaluate whether all relevant influencing factors have been covered. If gaps are identified, the system enters a supplementary-data sourcing process (1.1.9), which includes availability checks (1.1.10). If additional sources are available, they are subjected to the same quality validation process as the energy data (1.1.4), as well as saving the dataset and updating the documents in steps 1.1.6 and 1.1.7. If not, the workflow loops back to identify and select alternatives.
Another branch occurs at 1.1.11 to determine whether the data represents target energy data or supplementary information, triggered when a dataset fails quality checks. The process distinguishes between energy data and other influencing data. If the failing dataset is energy data, the workflow enters a targeted search for alternative sources or data recollection efforts (1.1.12). If the issue involves supplementary data (e.g., temperature or calendar), it reverts to the broader data-source identification process (1.1.9).
Once all datasets have passed validation and all required factors are covered, the pipeline transitions to (ii) Data Ingestion and Validation (1.2), ensuring a high-quality foundation for downstream processing.

3.2.2. Data Ingestion and Validation

Upon receiving the collected datasets (1.2.1), the process executes data loading, transformation, and integration tasks (1.2.2). A review of quality metrics (1.2.3) then evaluates the completeness, consistency, and temporal alignment of the integrated dataset while preserving lineage and supporting future audits.
A validation checkpoint (1.2.4) confirms whether ingestion was successful and overall quality requirements are met; if not, the process loops back to the data-loading step. Upon passing, the integrated dataset is output (1.2.5), ready for downstream processing, and the results are formally documented in Doc13_DataQualityValidationReport and Doc14_IntegrationSummary (1.2.6).

3.2.3. Gate 1—Data Acquisition Ready

Phase 1 culminates in Gate 1: Data Acquisition Ready (1.3.1), a formal evaluation checkpoint that consolidates outcomes from both workflows. The gate confirms that
  • All the necessary data have been identified and collected. Datasets have been successfully ingested and validated. Quality metrics meet established thresholds.
  • All documents and datasets are correctly created.
  • The process complies with data-related technical and process requirements.
  • Risks are accepted.
  • There are no unresolved issues.
Only projects that demonstrably meet these conditions are permitted to log Doc15_Phase1Report (1.3.2) and advance to Phase 2. If the gate fails, issues should be logged in Doc16_LogIssues (1.3.3), and the process should loop back to the beginning of Phase 1. Gate 1 ensures that the data-processing initiative is built upon a complete, reliable, and auditable data foundation—minimizing risks in subsequent analytical phases.

3.3. Phase 2—Data Understanding and Preprocessing

Phase 2 focuses on developing a deep understanding of the integrated datasets and applying rigorous preprocessing techniques to ensure data readiness. This phase consists of two interconnected workflows, (i) Exploratory Data Analysis (EDA) (2.1) and (ii) Data Cleaning and Preprocessing (2.2), followed by Gate 2: Data Preprocessing OK? (2.3). The detailed workflow for Phase 2 is illustrated in Figure 5.

3.3.1. EDA

The EDA workflow begins with the input of integrated datasets (2.1.1), followed by a sequence of statistical and structural analyses. These include distribution analysis (2.1.2), seasonal and cyclical pattern identification (2.1.3), and correlation analysis (2.1.4). These steps support the extraction of high-level data insights and behavioral patterns relevant to energy forecasting.
Subsequently, practitioners document visualized results and recurring patterns (2.1.5) and assess whether observed trends align with known domain knowledge (2.1.6). If misalignment is detected, the process loops back, allowing the team to revisit Phase 1: Data Foundation to re-collect data. Once alignment is confirmed, the system proceeds to generate a comprehensive EDA report, Doc21_EDAReport (2.1.7), to fully record all the data features.
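The analyses in steps 2.1.2 to 2.1.4 could, for example, be summarized as in the following sketch, assuming an hourly integrated dataset with timestamp, energy_kwh, and weather columns (the column names are illustrative).

```python
import pandas as pd

def basic_eda(df: pd.DataFrame) -> dict:
    """Distribution, seasonal-profile, and correlation summaries for the EDA report."""
    ts = df.set_index("timestamp")
    return {
        # 2.1.2 distribution analysis of the target variable
        "load_distribution": ts["energy_kwh"].describe().to_dict(),
        # 2.1.3 seasonal / cyclical patterns: mean load per hour of day and per weekday
        "hour_of_day_profile": ts["energy_kwh"].groupby(ts.index.hour).mean().to_dict(),
        "weekday_profile": ts["energy_kwh"].groupby(ts.index.dayofweek).mean().to_dict(),
        # 2.1.4 correlation of the target with candidate influencing factors
        "correlations": ts.corr(numeric_only=True)["energy_kwh"].drop("energy_kwh").to_dict(),
    }
```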

3.3.2. Data Cleaning and Preprocessing

The process begins with the detection of outliers (2.2.1) and missing values (2.2.2), which triggers appropriate remediation strategies. A variety of techniques can be employed to handle outliers, such as setting them to NaN, applying value capping, or using domain-specific thresholds. Missing data can then be addressed using methods such as linear interpolation or context-informed, domain-guided imputation.
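One possible implementation of these remediation steps is sketched below; the capping quantile and interpolation limit are illustrative and would, in practice, be chosen together with domain experts.

```python
import numpy as np
import pandas as pd

def clean_energy_series(s: pd.Series, cap_quantile: float = 0.999) -> pd.Series:
    """Cap implausible outliers and impute short gaps in a time-indexed energy series."""
    s = s.astype(float)
    s[s < 0] = np.nan                           # negative consumption treated as invalid
    s = s.clip(upper=s.quantile(cap_quantile))  # value capping for extreme spikes
    s = s.interpolate(method="time", limit=3)   # fill gaps of up to 3 consecutive steps
    return s
```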
Following imputation, the workflow conducts consistency and critical pattern validation (2.2.6). This step confirms whether the cleaned data retains structural coherence and preserves key temporal–energy patterns (e.g., seasonality, load cycles). If the dataset fails this validation (2.2.7), the process loops back to reapply outlier handling or to refine the imputation logic. Once the dataset passes validation, the cleaned dataset is output for downstream processing (2.2.8), and the workflow formalizes the results through the creation of a final data-cleaning report, Doc22_DataCleaningReport (2.2.9).

3.3.3. Gate 2—Data Preprocessing OK?

Phase 2 concludes with Gate 2: Data Preprocessing OK? (2.3.1), which serves as a formal checkpoint to verify that
  • Exploratory insights align with domain knowledge, and all critical data anomalies have been addressed.
  • Preprocessing procedures are transparently documented.
  • Outputs and processes comply with technical, operational, regulatory, and domain standards.
  • Risks are evaluated and either mitigated or accepted.
  • The project is ready to transition to the next phase without carrying forward unresolved issues.
Only when these conditions are met does the project move to the generation of Doc23_Phase2Report and advance to Phase 3: Feature Engineering. Otherwise, the process will create a log, Doc24_LogIssues, and loop back to Phase 2 for data reevaluation and further refinement.

3.4. Phase 3—Feature Engineering

Phase 3 transforms preprocessed data into model-ready datasets through two sequential workflows: (i) Feature Engineering and Creation (3.1) and (ii) Feature Analysis and Selection (3.2). The phase concludes with a decision gate, Gate 3: Features OK. This phase ensures that the dataset contains informative, valid, and contextually relevant features structured for modeling tasks. The detailed workflow for Phase 3 is illustrated in Figure 6.

3.4.1. Feature Engineering and Creation

The workflow begins with the input of cleaned datasets (3.1.1). Practitioners generate new features (3.1.2) derived from domain knowledge, temporal structures, or cross-variable interactions. These are then integrated with raw or previously established variables (3.1.3) to form a unified dataset. The resulting output (3.1.4) is a comprehensive feature-enriched dataset that is passed onto the next stage. A formal feature creation report, Doc31_FeatureCreationReport (3.1.5), documents the derivation logic, assumptions, and transformation methods.
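As an example, the feature generation step (3.1.2) might derive calendar, lag, and rolling-window features as in the following sketch; the specific features and column names are illustrative, and a time-indexed DataFrame is assumed.

```python
import pandas as pd

def create_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar, lag, and rolling features from the cleaned, time-indexed data."""
    out = df.copy()
    idx = out.index
    out["hour"] = idx.hour                           # daily cycle
    out["dayofweek"] = idx.dayofweek                 # weekly cycle
    out["is_weekend"] = (idx.dayofweek >= 5).astype(int)
    out["lag_24h"] = out["energy_kwh"].shift(24)     # same hour on the previous day
    out["lag_168h"] = out["energy_kwh"].shift(168)   # same hour in the previous week
    out["rolling_mean_24h"] = out["energy_kwh"].rolling(window=24).mean()
    return out.dropna()  # drop warm-up rows introduced by lags and rolling windows
```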

3.4.2. Feature Analysis and Selection

This stage starts with the full-feature dataset (3.2.1). Feature analysis (3.2.2) evaluates statistical properties, correlation profiles, and relevance to the target variable. A scoring mechanism (3.2.3) can include different techniques to evaluate feature combinations, such as mutual information, permutation importance, or Shapley additive explanations (SHAP) [39] values to prioritize features for selection (3.2.4).
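A mutual-information-based variant of this scoring mechanism could look like the following sketch; scikit-learn is assumed to be available, and the size of the returned shortlist is illustrative.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def score_features(features: pd.DataFrame, target: pd.Series, top_k: int = 10) -> pd.Series:
    """Rank candidate features by mutual information with the forecast target."""
    scores = mutual_info_regression(features, target, random_state=0)
    ranked = pd.Series(scores, index=features.columns).sort_values(ascending=False)
    return ranked.head(top_k)  # shortlist passed to the decision point (3.2.5)
```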
A decision point (3.2.5) assesses whether the selected subset sufficiently captures the forecasting signal and checks if the data quality is satisfactory. If not, the workflow returns to feature generation (3.1.2), enabling the generation of new or adjusted features before repeating the selection loop. Once the validation passes, a refined dataset with only selected features is output (3.2.6), and a feature selection analysis report, Doc32_FeatureSelectionReport, is created (3.2.7).

3.4.3. Gate 3—Features OK

Phase 3 concludes with Gate 3: Features OK (3.3.1), a formal quality checkpoint that synthesizes the outcomes of both workflows. This gate confirms that
  • Feature engineering has been thoroughly conducted, and selected features are sufficient, relevant, and validated for the modeling objective.
  • All artifacts are documented appropriately.
  • The process complies with regulations and standards if applicable.
  • Risks are evaluated and accepted.
  • No unresolved issues remain.
Only projects that satisfy these conditions are allowed to record the log information for the decision gate review (Doc33_Phase3Report (3.3.2)) and to proceed to Phase 4: Model Development and Evaluation. If any issues remain, Doc34_LogIssues (3.3.3) must be generated, and the process should loop back to the beginning of this phase. Gate 3 ensures that model inputs are comprehensive, representative, and structured according to rigorous standards, thereby minimizing the risk of performance degradation or bias during model training and validation.

3.5. Phase 4—Model Development and Evaluation

Phase 4 develops and evaluates predictive models through five structured workflows and one decision gate: (i) Model Scoping (4.1), (ii) Input Data Preparation (4.2), (iii) Model Training and Evaluation (4.3), (iv) Model Refinement (4.4), (v) Result Consolidation (4.5), and Gate 4: Model Valid? (4.6). This phase enables systematic model selection, performance validation, and iterative improvement, forming the technical backbone of the forecasting pipeline. The detailed workflow for Phase 4 is illustrated in Figure 7.

3.5.1. Model Scoping

The process begins by identifying candidate models spanning statistical, ML, and deep learning (DL) categories (4.1.1), according to the project specification. A structured set of model instantiations is then constructed and organized, and the candidate experiments are recorded in Doc41_CandidateExperiments (4.1.2), providing a foundation for transparent comparison and reproducible experimentation in downstream workflows.

3.5.2. Input Data Preparation

The data preparation workflow processes the datasets with selected features (4.2.1) through a model-specific data loader (4.2.2), where steps such as scaling, encoding, and dataset splitting are applied to the input data. Validation checks (4.2.3) confirm integrity, ensuring, for example, that no data leakage or shape inconsistencies exist. Failed checks initiate dataset adjustments (4.2.4), creating a feedback loop within this sub-phase back to dataset creation. Successful validation leads to the generation of the final transformed datasets (4.2.5), producing training, validation, and test sets for model development, along with data preparation summary documentation (Doc42_DataPreparationSummary (4.2.6)).
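A leakage-safe preparation step could, for example, split the data chronologically before fitting any scaler, as in the sketch below; the split ratios and the choice of StandardScaler are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_inputs(df: pd.DataFrame, target_col: str = "energy_kwh",
                   train_frac: float = 0.7, val_frac: float = 0.15):
    """Chronological split with train-only scaler fitting to avoid data leakage."""
    n = len(df)
    i_train, i_val = int(n * train_frac), int(n * (train_frac + val_frac))
    train, val, test = df.iloc[:i_train], df.iloc[i_train:i_val], df.iloc[i_val:]

    scaler = StandardScaler().fit(train.drop(columns=[target_col]))  # fit on train only

    def transform(split: pd.DataFrame):
        X = scaler.transform(split.drop(columns=[target_col]))
        return X, split[target_col].to_numpy()

    return transform(train), transform(val), transform(test), scaler
```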

3.5.3. Model Training and Evaluation

The workflow begins with the initial model training (4.3.1), where a broad range of models are tested across available feature sets, following the predefined experimental design established in the previous step. Model performance is then evaluated through metrics and visual diagnostics (4.3.2), enabling consistent comparisons. Subsequently, the workflow assesses whether any model meets the acceptable performance threshold (4.3.3). If the assessment fails, the process examines whether limitations stem from the model itself or the underlying data (4.3.4).
In the case of data deficiencies, the workflow loops back to Phase 1: Data Foundation to revisit the data sourcing or preprocessing. If the issue lies with the model configuration or selection, the process restarts from the beginning of Phase 4 to explore alternative model structures or categories. If acceptable performance is achieved, the workflow advances to the generation of a performance record (4.3.5), and a top-feature-sets report (Doc43_ModelTrainingReport (4.3.6)), providing critical input for subsequent model refinement.
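The acceptance check at step 4.3.3 can be expressed as a comparison of hold-out metrics against the thresholds fixed in Phase 0, as in the following sketch; the gradient boosting model is a stand-in for illustration, whereas the case study in Section 5 uses an RNN, and the threshold values are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

def train_and_check(X_train, y_train, X_val, y_val,
                    rmse_max: float = 1.5, mae_max: float = 1.0):
    """Train a candidate model and test it against predefined acceptance thresholds."""
    model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
    pred = model.predict(X_val)
    rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
    mae = float(mean_absolute_error(y_val, pred))
    accepted = rmse <= rmse_max and mae <= mae_max   # decision point 4.3.3
    return model, {"rmse_kwh": rmse, "mae_kwh": mae, "accepted": accepted}
```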

3.5.4. Model Refinement

Refinement begins with the selection of top-performing model and feature-set combinations based on prior results (4.4.1), followed by targeted hyperparameter tuning (4.4.2). This stage aims to optimize model accuracy while mitigating overfitting risks. The outcomes are then integrated into performance consolidation and documentation workflows.

3.5.5. Result Consolidation

The final workflow in this phase synthesizes modeling outputs and prepares them for handoff to deployment. It begins with a final statistical analysis and visual comparison (4.5.1), culminating in a comprehensive performance analysis to select the model that will be deployed in the next phase (4.5.2). The process concludes with outputting the selected model artifacts (4.5.3) and creating the final model documentation—Doc44_PerformanceAnalysisReport and Doc45_FinalModelDocumentation (4.5.4)—to support downstream integration.

3.5.6. Gate 4—Model Valid

Phase 4 culminates in Gate 4: Model Valid (4.6.1), a formal review point confirming the model quality and readiness for operationalization. This gate ensures that
  • Data preparation steps, including transformations and splits, have passed integrity checks. Model performance meets defined criteria across relevant metrics and datasets. Selected models are reproducible and aligned with forecasting objectives.
  • All modeling procedures, configurations, and decisions are fully documented.
  • The process meets compliance requirements.
  • Risks are evaluated and accepted.
  • There are no other unresolved issues.
Only projects satisfying these conditions proceed to create a log report, Doc46_Phase4Report (4.6.2), and move forward to Phase 5: Deployment and Implementation. If issues occur, a log report, Doc47_LogIssues, will be generated (4.6.3), and the process will loop back to re-implement this phase. Gate 4 ensures that candidate models are demonstrably valid, thoroughly documented, and aligned with performance expectations. This provides a reliable foundation for subsequent deployment and operational integration.

3.6. Phase 5—Model Deployment and Implementation

Phase 5 transitions validated forecasting models into operational environments through a structured series of deployment workflows. This phase ensures that the model is not only technically integrated within the production infrastructure but also accessible, functional, and reliable for end users. It includes rigorous validation steps that confirm successful setup, data connectivity, and system operability. The detailed workflow for Phase 5 is illustrated in Figure 8.

3.6.1. Deployment Preparation

The deployment process begins with a final validation of model performance and implementation readiness (5.1.1). This involves confirming that accuracy benchmarks have been retained and that the model is reproducible under the target infrastructure. The implementation environment, including software stacks, deployment scripts, and scheduling tools (5.1.2), is then configured. A critical decision point (5.1.3) assesses the overall deployment readiness. If criteria are not met, the process loops back to Phase 4: Model Development and Evaluation for additional refinement and configuration alignment.

3.6.2. User Interface (UI) Setup

Once deployment readiness is established, the UI is developed or updated (5.2.1). This includes constructing access points such as dashboards, web applications, or application programming interfaces (APIs). Output display elements are configured (5.2.2) to ensure that forecasts, metrics, and system messages are clearly communicated to users. Key interactive features such as alerts, filters, and forecast exploration tools are implemented (5.2.3). Following configuration, the system undergoes interface functionality validation (5.2.4), which checks the display correctness, response reliability, and user interaction behavior. If the validation fails, the workflow loops back to the start of the interface setup process for corrective adjustments. Only interfaces that pass validation advance to the subsequent deployment stages, ensuring a robust and user-ready forecasting environment.

3.6.3. Data Connection

The system is connected to designated data sources such as sensors, databases, or third-party APIs (5.3.1). Following connection, the system performs both connection testing and input data quality checks (5.3.2). This step verifies that the incoming data stream is timely, structurally consistent, and within acceptable operational ranges. Specific checks include schema and type conformity and value plausibility, ensuring that the data is suitable for model inference. A formal decision point (5.3.3) determines whether the connection and data quality meet predefined criteria. If either the connection is unstable or the input data is invalid, the process enters a remediation loop (5.3.4) to resolve issues. This may involve reinitializing the connection, applying basic data-repair logic, or triggering alerts to responsible stakeholders. Only after both connection stability and data integrity are confirmed does the system proceed to output the production-ready datasets (5.3.5), create the data connection report (Doc51_DataConnectionReport (5.3.6)), and advance to full deployment.
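The connection and input quality checks in step 5.3.2 could, for example, verify schema conformity, timeliness, and value plausibility before each inference run, as in the sketch below; the column names, plausible ranges, and staleness limit are illustrative assumptions.

```python
import pandas as pd

EXPECTED_COLUMNS = {"timestamp", "energy_kwh", "temperature_c"}          # illustrative
PLAUSIBLE_RANGES = {"energy_kwh": (0.0, 50.0), "temperature_c": (-30.0, 45.0)}

def validate_incoming_batch(df: pd.DataFrame, max_age_minutes: int = 90) -> list:
    """Return a list of detected issues; an empty list means the batch is fit for inference."""
    issues = []
    if not EXPECTED_COLUMNS.issubset(df.columns):
        issues.append("schema mismatch")
        return issues
    latest = pd.to_datetime(df["timestamp"]).max()
    # Timeliness check; assumes naive timestamps in the system's local reference time.
    if (pd.Timestamp.now() - latest) > pd.Timedelta(minutes=max_age_minutes):
        issues.append("data not timely")
    for col, (lo, hi) in PLAUSIBLE_RANGES.items():
        if not df[col].between(lo, hi).all():
            issues.append(f"{col} outside plausible range")
    return issues
```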

3.6.4. System Deployment

With the model, interface, and data pipeline in place, the forecasting system is officially launched (5.4.1). The deployed system then performs initial forecasting and analytical routines (5.4.2), and its behavior is evaluated under production conditions (5.4.3). This includes checking for accurate predictions, proper output generation, and system responsiveness. Failures trigger issue resolution procedures (5.4.4), after which the system is re-evaluated. Once the system passes all operational checks, comprehensive deployment artifacts are generated (5.4.5) to ensure traceability, auditability, and effective knowledge transfer to downstream stakeholders. This includes technical documentation such as deployment reports (Doc52_DeploymentReport), API specifications, UI configurations, and integration logs.

3.6.5. Gate 5—Deployment OK

Phase 5 culminates in Gate 5: Deployment OK (5.5.1), a formal checkpoint that verifies that
  • The model is fully validated and operational in its target environment. Data connections are stable and accurate. UIs are functional and deliver correct outputs.
  • Deployment documentation and system configuration have been properly captured and archived.
  • There are no unfulfilled compliance-related requirements.
  • No unacceptable risks have been identified.
  • No further issues have been identified by any stakeholder.
Only systems meeting all operational, technical, and governance requirements are permitted to advance to Phase 6: Automated Monitoring and Update. Upon successful gate passage, Doc53_Phase5Report (5.5.2) is generated to document the deployment completion. If issues occur, Doc54_LogIssues (5.5.3) is created, and the process loops back for remediation. Gate 5 ensures that the deployed forecasting solution is robust, accessible, and ready for long-term use.

3.7. Phase 6—Automated Monitoring and Update

Phase 6 operationalizes end-to-end automation for monitoring and maintaining deployed forecasting systems. It ensures sustained model performance through real-time tracking, automatic quality evaluation, and structured feedback mechanisms. This phase minimizes manual intervention by automating operations, diagnostics, alerts, and update initiation, thereby supporting reliable model deployment in production environments. The detailed workflow for Phase 6 is illustrated in Figure 9.

3.7.1. Automated Monitoring

Automated monitoring systems are initiated with the activation of scheduled or event-triggered automated operations (6.1.1), enabling the system to continuously ingest real-time data from live sources such as sensors, APIs, or streaming databases (6.1.2). The data is then loaded and transformed according to the model’s requirements (6.1.3). After preprocessing, the deployed model is automatically executed to generate updated forecasts, which are then visualized through dynamic interface updates (6.1.4).
Automated validation mechanisms continuously evaluate output quality (6.1.5) against defined thresholds, historical baselines, and domain-specific expectations. If the output quality is unacceptable, the system automatically generates logs, triggers alerts (6.1.6), and archives anomalous input–output pairs as hard samples for future retraining (6.1.7). This creates a feedback loop where challenging or error-prone cases are flagged and preserved for manual review and future model improvement. If the output passes quality checks, predictions and system logs are archived (6.1.8) to support auditability and post hoc analysis, with results documented in Doc61_AutomaticMonitoringReport (6.1.9). This ensures that even automated outcomes remain fully transparent and traceable.
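The quality evaluation, alerting, and hard-sample archiving loop (steps 6.1.5 to 6.1.8) might be sketched as follows; the deviation threshold, baseline logic, and file paths are illustrative assumptions rather than parts of the framework itself.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def monitor_forecast(forecast_kwh: list, baseline_kwh: list,
                     max_mean_abs_dev: float = 2.0,
                     archive_dir: Path = Path("monitoring")) -> bool:
    """Compare a new forecast against a historical baseline and archive the outcome."""
    deviation = sum(abs(f - b) for f, b in zip(forecast_kwh, baseline_kwh)) / len(forecast_kwh)
    ok = deviation <= max_mean_abs_dev                        # 6.1.5 quality evaluation
    record = {
        "checked_at": datetime.now(timezone.utc).isoformat(),
        "mean_abs_deviation_kwh": deviation,
        "status": "ok" if ok else "alert",
        "forecast_kwh": forecast_kwh,
    }
    archive_dir.mkdir(exist_ok=True)
    target = "archive.jsonl" if ok else "hard_samples.jsonl"  # 6.1.8 vs. 6.1.7
    with open(archive_dir / target, "a") as fh:
        fh.write(json.dumps(record) + "\n")
    if not ok:
        print(f"ALERT: forecast deviates {deviation:.2f} kWh from baseline")  # 6.1.6
    return ok
```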

3.7.2. Gate 6—Monitoring Governance Check

A formal checkpoint, Gate 6 (6.2.1), assesses whether the deployed forecasting system continues to meet operational, quality, and compliance requirements. This governance gate functions at a broader level than the technical checks, confirming that
  • Output predictions remain within acceptable accuracy ranges.
  • Documentation is recorded in accordance with the pipeline’s requirements.
  • Operational and monitoring workflows are compliant and auditable.
  • Risks are identified and acceptable.
  • No unresolved anomalies or system failures persist.
Successful evaluations generate Doc62_Phase6Report (6.2.2) and return the system to passive monitoring. Failures trigger escalation and issue resolution procedures, creating Doc63_LogIssues (6.2.3) and initiating corrective actions, which may require stakeholder investigation. The update process may include tracing back to Phase 0 for re-scoping or refinement of forecasting requirements, to Phase 1 for root-cause analysis of the data, or to Phase 4 for full model retraining, depending on the diagnostic outcomes. This systematic approach ensures continuous system improvement and maintains operational reliability throughout the forecasting system’s lifecycle.

3.7.3. Wait for Next Update Cycle

When no anomalies are detected, the system enters a passive monitoring state and waits for the next scheduled evaluation or update cycle (6.3). These cycles may be defined by calendar intervals (e.g., weekly, monthly) or adaptive triggers (e.g., seasonal changes, usage shifts). This state is automated and requires no manual oversight.

3.8. Stakeholder Roles and Responsibilities

The proposed energy forecasting framework operates through the coordinated involvement of five core stakeholder groups: data scientists and ML engineers, energy-domain experts, project managers, DevOps and MLOps engineers, and compliance officers and QA professionals. Each group contributes specialized expertise across different phases, following a structured governance approach that balances technical rigor with operational efficiency.
While all stakeholders participate collectively in Phase 0 to establish the project foundation, their subsequent involvement follows a structured specialization pattern. Technical development phases (Phases 1–4) are primarily led by data scientists with domain-expert validation, while operational phases (Phases 5–6) shift leadership to DevOps teams with continued compliance oversight. Project managers maintain coordination responsibilities throughout, and compliance officers provide governance checkpoints at all decision gates.

3.8.1. Responsibility Distribution and Governance Structure

Table 1 presents the comprehensive responsibility distribution across all phases, indicating primary leadership (P), active collaboration (A), consultative input (C), and governance oversight (G) for each stakeholder group.

3.8.2. Core Stakeholder Functions

Data scientists and ML engineers serve as primary technical architects for the development phases (Phases 1–4), implementing data quality criteria established in Phase 0, executing preprocessing and feature-engineering workflows, and conducting the complete model development lifecycle. Their involvement continues into operational phases with reduced intensity, focusing on production behavior verification and anomaly investigation.
Energy-domain experts provide critical semantic validation throughout the pipeline, establishing physical system boundaries and operational constraints in Phase 0, validating technical implementations against energy system realities in Phases 1–4 (particularly for the EDA alignment in step 2.1.6 and feature relevance in step 3.2.5), and verifying forecast realism under production conditions in Phases 5–6.
Project managers maintain comprehensive coordination across all phases, facilitating cross-functional collaboration, managing timelines and resources, and governing documentation standards. Their critical function involves facilitating all decision gate reviews (Gates 0–6), coordinating signoffs, managing issue escalation, and maintaining transparent documentation of project iterations and loopback procedures.
DevOps and MLOps engineers assume primary leadership during operational phases (Phases 5–6), leading deployment infrastructure configuration, UI implementation, data pipeline establishment, and automated monitoring system implementation. Their early-phase involvement focuses on infrastructure preparation and integration readiness assessment.
Compliance officers and QA professionals ensure regulatory adherence throughout the pipeline, establishing compliance frameworks in Phase 0, providing oversight for data processing and model development activities in Phases 1–4, and validating that automated systems maintain governance and traceability standards in Phase 6.
This role-aware approach ensures that specialized expertise is applied where most critically needed while maintaining comprehensive oversight and governance throughout the forecasting pipeline development and deployment process.

4. Comparison of Existing Frameworks

To demonstrate the distinctive contributions of the proposed framework, this section presents a systematic comparison with established methodologies across seven critical dimensions: structural coverage, iteration mechanisms, domain-expert involvement, traceability systems, compliance management, modularity and reusability, and QA protocols. This analysis positions the proposed framework within the broader landscape of data science and MLOps methodologies while highlighting its specialized capabilities for energy forecasting applications.
Table 2 provides an overview of how the proposed framework addresses critical methodological gaps through systematic improvements across all comparative dimensions, establishing a comprehensive solution for energy forecasting applications.
The framework synthesizes strengths from existing approaches while addressing their limitations in energy forecasting contexts. It retains CRISP-DM’s intuitive phase structure while incorporating MLOps’ automation and technical rigor, extending both through embedded domain expertise and formal governance mechanisms that support regulatory compliance and organizational accountability. The framework provides both scalability and adaptability, ensuring it can be applied across diverse project scales and technical environments.
The framework’s scalability enables adaptive implementation based on project risk profiles. Exploratory forecasting initiatives may implement lightweight gate reviews, while critical infrastructure deployments can activate comprehensive validation protocols with full audit trails. This flexibility ensures that governance benefits are accessible across diverse energy forecasting applications while maintaining methodological consistency and traceability standards. The framework also accommodates different modes of domain-expert participation, which can be tailored to the project scale and context. Possible adaptations include tiered expert involvement, where concentrated attention is directed toward phases of highest semantic or operational importance, while routine checks rely on automated validation. Another option is semi-automated knowledge encoding, such as rule-based or ontology-driven validation, which can preserve expert input in a reusable form. In addition, sampling-based validation may be applied, where experts periodically review representative subsets of outputs rather than all artifacts. These options are not mandatory but can be employed when necessary and applicable, ensuring that expert knowledge remains integral to the pipeline while preserving workflow efficiency.
Forecasting pipelines in utilities and building operations often need to coexist with established supervisory control and on-premises systems such as supervisory control and data acquisition (SCADA). While these environments are not always designed for cloud-native MLOps, the framework can accommodate integration through pragmatic measures. Possible options include lightweight API connectors that enable data exchange without altering existing systems, hybrid cloud deployment that combines local infrastructure with cloud resources for computation, and compliance-aware data pipelines that respect operational and regulatory constraints. These strategies can be applied selectively when needed, ensuring that the framework remains compatible with diverse technical environments.
Through this systematic integration of technical rigor, domain expertise, compliance management, and adaptive governance, our framework provides a comprehensive methodological foundation specifically designed for the complex requirements of energy forecasting applications in contemporary regulatory and operational environments.

5. Case Study

This case study demonstrates the practical application of the proposed framework through a real-world building electricity usage prediction project. The implementation focuses on short-term forecasting of hourly electricity consumption for the University of Southern Denmark (SDU) OU33 building, producing 24-h-ahead usage predictions.
The case study serves as a comprehensive validation of the seven-phase framework, illustrating how the structured methodology enhances reproducibility, performance, and governance in energy forecasting applications. By systematically implementing each phase and decision gate, this case study demonstrates the framework’s effectiveness in managing complex forecasting projects while maintaining quality control and stakeholder alignment. While it provides concrete validation, broader applicability will require further testing across other building types and operational contexts.

5.1. Phase 0—Project Foundation

The forecasting scope was defined as daily building-level electricity load prediction using historical and contextual data. Stakeholders specified the need for usage-trend forecasting to support building-level operational planning. The project specification was established following the framework’s structured approach, as shown in Table 3. The technical success criteria for the project are defined in Table 4.
As this project was conducted for internal scientific research purposes, formal compliance requirements were minimal. Data privacy considerations were addressed through prior approval for the use of internal building datasets, ensuring that no regulatory restrictions applied. The system constraints were also modest, with standard computational resources deemed sufficient to support the forecasting workloads. In terms of technical feasibility, all the necessary software libraries and frameworks required for data processing, modeling, and visualization were confirmed to be available and compatible with the project’s development environment.
The project successfully passed Gate 0, satisfying all technical and governance requirements established for the foundation phase. Key artifacts generated included Doc01_ProjectSpecification, detailing the project scope, use case definitions, and system constraints; Doc02_RequirementsCriteria, establishing measurable technical and functional objectives; and Doc04_SystemTechFeasibilityReport, summarizing the resource availability and infrastructure readiness. As this constituted an internal research project with no external regulatory obligations, formal compliance documentation was not required. Doc05_Phase0Report documented the comprehensive review process and confirmed successful gate passage. Following risk assessment and issue evaluation, the project received approval to advance to Phase 1.

5.2. Phase 1—Data Foundation

The data sources included hourly electricity usage as the target variable, complemented by weather features (temperature, humidity, pressure, etc.). Raw electricity consumption data was sourced from the building management system, and external weather data was retrieved via trusted APIs. Initial quality checks revealed high data completeness: 33,957 records (96.80%) for electricity data and 34,406 records (98.00%) for weather data over the 2021–2024 time range. Figure 10 presents a visualization of the electricity data; to enhance clarity, extreme outliers have been removed from this figure. Figure 11 shows the key weather variables, including time-series data for temperature, humidity, wind speed, and precipitation. A data quality assessment confirmed that missing values were randomly distributed, with no systematic gaps, allowing imputation strategies to be deferred to the preprocessing phase. Each dataset was separately aligned to a common hourly time resolution, and no critical schema or structural issues were detected.
In the Data Ingestion and Validation workflow, the datasets were loaded, integrated, and consolidated into a unified time series. All sources were synchronized to a shared hourly timestamp index, and value ranges and data types were validated to confirm their integrity. The data span from 00:00 on 1 January 2021 to 23:00 on 31 December 2024. In the integrated dataset, 94.90% of the records were available from both sources. The resulting integrated dataset formed a coherent foundation for downstream exploratory analysis and feature generation. A further data quality validation confirmed that coverage matched that of the original sources.
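A condensed sketch of this ingestion and alignment step is given below, assuming pandas DataFrames and illustrative file and column names (electricity_hourly.csv, weather_hourly.csv, usage_kwh, temperature); it is not the project’s actual ingestion code.

```python
import pandas as pd

# Load the two raw sources (file names are placeholders).
elec = pd.read_csv("electricity_hourly.csv", parse_dates=["timestamp"], index_col="timestamp")
weather = pd.read_csv("weather_hourly.csv", parse_dates=["timestamp"], index_col="timestamp")

# Align both sources to a shared hourly timestamp index covering 2021-2024.
full_index = pd.date_range("2021-01-01 00:00", "2024-12-31 23:00", freq="h")
elec = elec.reindex(full_index)
weather = weather.reindex(full_index)

# Integrate into a single hourly time series and report joint coverage.
integrated = elec.join(weather, how="left")
joint_coverage = (elec["usage_kwh"].notna() & weather["temperature"].notna()).mean()
print(f"Records available from both sources: {joint_coverage:.2%}")
```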
The project successfully passed Gate 1—Data Acquisition Ready, with all internal decision points validated against predetermined criteria. Phase deliverables comprised raw electricity and weather datasets, along with a fully integrated dataset structured as a unified hourly time series. Supporting documentation included Doc11_DataSourceInventory, which provided comprehensive dataset provenance tracking; Doc12_InitialDataQualityReport, which documented the initial quality assessments; Doc13_DataQualityValidationReport, which validated the integrated data quality; and Doc14_IntegrationSummary, which detailed the complete integration methodology. Doc15_Phase1Report documented the comprehensive phase completion and stakeholder validations. With a complete, traceable, and auditable data foundation established, and no unacceptable risks or outstanding compliance requirements identified, the project received authorization to proceed to Phase 2.

5.3. Phase 2—Data Understanding and Preprocessing

Phase 2 focuses on transforming the raw and integrated datasets into analysis-ready inputs through systematic exploratory analysis, domain knowledge validation, and structured preprocessing. In the EDA (2.1) workflow, initial investigations revealed recognizable consumption patterns that were partially consistent with university building operations. Daily cycles showed peak usage during working hours (peaking around 2 p.m.) and reduced loads during nighttime periods (reaching a minimum around 5 a.m.). Weekly patterns indicated a clear dip in usage over weekends, with a 0.79 weekend usage ratio. However, seasonal patterns in the electricity data were not clearly observable, which may be due to a combination of factors, including the variable occupancy typical of a campus office, nearby construction activities, and data collection issues. Correlations with weather variables such as solar radiation and humidity could be observed but were weak and inconsistent, limiting their predictive value at this stage. Figure 12 visualizes the EDA output.
To ensure reproducibility and traceability, these findings were documented in Doc21_EDAReport, which consolidated the statistical summaries, correlation analyses, and validation checks. This documentation provided an auditable link between observed patterns (e.g., peak usage at 14:00, weekend ratio of 0.79) and domain-expert validation, demonstrating how the framework maintains transparency and repeatability across analysis phases.
In the Data Cleaning and Preprocessing (2.2) phase, systematic procedures were implemented to address data quality issues while preserving critical temporal and energy system patterns. A multi-method approach was employed for comprehensive outlier identification, incorporating range-based methods, Z-score analysis [40], and interquartile range (IQR) detection [41]. This systematic approach identified 24,416 anomalous data points across the electricity and weather-related datasets. These values were set to NaN to facilitate subsequent imputation.
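The multi-method outlier screening can be sketched as follows; the value range, Z-score cut-off, and IQR multiplier shown here are illustrative defaults, not the exact thresholds used in the case study.

```python
import numpy as np
import pandas as pd


def flag_outliers(series: pd.Series, valid_range=(0.0, 100.0), z_max=4.0, iqr_k=1.5) -> pd.Series:
    """Flag values outside a physical range, with |z| > z_max, or beyond the IQR fences."""
    out_of_range = (series < valid_range[0]) | (series > valid_range[1])

    z_scores = (series - series.mean()) / series.std()
    z_outlier = z_scores.abs() > z_max

    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_outlier = (series < q1 - iqr_k * iqr) | (series > q3 + iqr_k * iqr)

    return out_of_range | z_outlier | iqr_outlier


# Flagged points are set to NaN so the imputation step described below can handle them, e.g.:
# mask = flag_outliers(integrated["usage_kwh"], valid_range=(0.0, 50.0))
# integrated.loc[mask, "usage_kwh"] = np.nan
```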
Missing data was addressed through a hierarchical imputation framework designed to maintain temporal consistency. Short-term gaps of less than 3 consecutive hours were treated using extended linear interpolation to preserve local trends, while extended gaps utilized prior-period pattern imputation that leveraged historical seasonal patterns. Specialized edge-case handling was applied for dataset temporal boundaries. Additionally, seasonal fallback procedures were employed to address other residual data gaps. The imputation process addressed a total of 41,232 missing data points across 15 features. Specifically, 58 values were filled using edge fill, 2960 through extended linear interpolation, 33,878 via the prior-period filling method, and 4336 using a seasonal fallback approach. The resulting dataset achieved 100% completeness, with no range or consistency violations detected. Furthermore, it successfully passed pattern validation checks, confirming the preservation of key characteristics of the underlying energy system. Figure 13 illustrates representative examples of the gap-filling methodology and the validation results.
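A simplified version of the hierarchical imputation logic is shown below, assuming an hourly pandas Series with a DatetimeIndex; the exact gap-classification rules and fallback ordering used in the case study may differ.

```python
import pandas as pd


def hierarchical_impute(series: pd.Series, short_gap_hours: int = 3, period_hours: int = 168) -> pd.Series:
    """Hierarchical gap filling for an hourly series with a DatetimeIndex."""
    filled = series.copy()

    # 1. Short gaps (< 3 consecutive hours): linear interpolation preserves local trends.
    filled = filled.interpolate(method="linear", limit=short_gap_hours, limit_area="inside")

    # 2. Extended gaps: reuse the value observed one period earlier (same hour in the prior week).
    still_missing = filled.isna()
    filled[still_missing] = filled.shift(period_hours)[still_missing]

    # 3. Seasonal fallback: hour-of-day mean of the observed values covers residual gaps.
    hourly_mean = filled.groupby(filled.index.hour).transform("mean")
    filled = filled.fillna(hourly_mean)

    # 4. Edge handling: forward/backward fill covers gaps at the dataset boundaries.
    return filled.ffill().bfill()
```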
The project successfully passed Gate 2—Preprocessing OK, with all internal decision points validated against established criteria. The final cleaned dataset demonstrated complete alignment with domain understanding and quality requirements, confirming the preservation of critical energy system patterns. Phase deliverables comprised cleaned datasets with verified temporal integrity, along with comprehensive supporting documentation: Doc21_EDAReport, which provided the EDA findings and domain knowledge validation summaries confirming energy system alignment, and Doc22_DataCleaningReport, which documented the data-cleaning procedures and validated the final data quality. Doc23_Phase2Report documented the complete preprocessing phase activities and domain-expert validations. With all outputs validated, documented, and appropriately versioned for traceability, the project received authorization to advance to Phase 3.

5.4. Phase 3—Feature Engineering

The Feature Engineering and Creation (3.1) workflow transformed 16 input variables into 140 comprehensive features. The process employed domain-aware design principles tailored for university office building energy patterns, incorporating Danish localization for holidays and cultural patterns, weather–energy physics relationships, and multi-scale temporal pattern recognition. Weather feature engineering utilized threshold-based categorization for hot/cold and humidity classifications, physics-based interactions for apparent temperature and comfort indices, and change detection for day-to-day weather variations. Temporal features employed cyclical sin/cos transformations to preserve temporal continuity across hourly, daily, monthly, and seasonal cycles. Lag features exploited strong temporal autocorrelation through usage, moving averages, and trend comparisons. Interaction features captured context-dependent effects such as temperature impacts during different occupancy periods. The intelligent missing-value handling strategy eliminated gaps in derived features through forward-fill techniques and same-hour substitutions, achieving 100% feature completeness without requiring imputation.
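The cyclical encodings and lag features described above can be sketched as follows; the lag/rolling column names mirror those reported in the feature analysis (e.g., usage_lag_1h, usage_ma_3h), while the target column name and the selected lags are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def add_temporal_and_lag_features(df: pd.DataFrame, target: str = "usage_kwh") -> pd.DataFrame:
    """Add cyclical sin/cos encodings plus lag and rolling features; expects a DatetimeIndex."""
    out = df.copy()
    hours = out.index.hour.to_numpy()
    days = out.index.dayofweek.to_numpy()
    months = out.index.month.to_numpy()

    # Cyclical encodings preserve continuity across period boundaries (23 h -> 0 h, Dec -> Jan).
    out["hour_sin"], out["hour_cos"] = np.sin(2 * np.pi * hours / 24), np.cos(2 * np.pi * hours / 24)
    out["dow_sin"], out["dow_cos"] = np.sin(2 * np.pi * days / 7), np.cos(2 * np.pi * days / 7)
    out["month_sin"], out["month_cos"] = np.sin(2 * np.pi * (months - 1) / 12), np.cos(2 * np.pi * (months - 1) / 12)

    # Lag and moving-average features exploit the strong temporal autocorrelation of usage.
    out["usage_lag_1h"] = out[target].shift(1)
    out["usage_lag_24h"] = out[target].shift(24)
    out["usage_ma_3h"] = out[target].rolling(3).mean()

    # Back-fill the leading NaNs introduced by shifting at the start of the dataset.
    lag_cols = ["usage_lag_1h", "usage_lag_24h", "usage_ma_3h"]
    out[lag_cols] = out[lag_cols].bfill()
    return out
```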
The Feature Analysis and Selection (3.2) process evaluated 133 features (after removing redundant features with >0.99 correlation) using a comprehensive multi-methodology approach. Features were categorized into five groups: interaction features, weather features, temporal features, calendar features, and lag features. To enable appropriate correlation analysis, they were further classified as either continuous (71 features) or categorical (62 features) based on predefined unique-value thresholds. Continuous features were assessed using Pearson correlation [42] and mutual information [43], while categorical features were evaluated using effect size [44] measures and mutual information. All metrics were normalized to a [0, 1] range using min–max scaling to enable a fair comparison, with composite scores calculated as 0.5 × primary_metric + 0.5 × mutual_information for both feature types. The analysis revealed that lag features demonstrated the highest predictive power, with usage_seasonal_deviation achieving the top composite score of 0.97, followed by usage_ma_3h (0.78) and usage_lag_1h (0.71). The overall feature analysis is shown in Figure 14.
The category-wise feature ranking is shown in Figure 15 and Figure 16. The feature-selection strategy adopted a balanced approach by selecting the top 10 features from each category, except for interaction features, which included only 9. This resulted in a total of 49 final features, ensuring comprehensive domain representation while enhancing the predictive performance.
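For the continuous-feature branch, the composite scoring and balanced top-k selection can be sketched as below; the categorical branch (effect size plus mutual information) follows the same pattern and is omitted. The function names and the category mapping are illustrative, not the project’s actual code.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import minmax_scale


def score_continuous_features(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Composite score = 0.5 * |Pearson r| + 0.5 * mutual information, both min-max scaled."""
    pearson = X.corrwith(y).abs()
    mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
    composite = 0.5 * minmax_scale(pearson) + 0.5 * minmax_scale(mi)
    return pd.Series(composite, index=X.columns).sort_values(ascending=False)


def select_top_k_per_category(scores: pd.Series, categories: dict, k: int = 10) -> list:
    """Keep the k best-scoring features from each category for balanced domain coverage."""
    selected = []
    for _, features in categories.items():
        ranked = scores.reindex(features).dropna().sort_values(ascending=False)
        selected.extend(ranked.head(k).index.tolist())
    return selected
```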
Gate 3 Decision: The project successfully passed Gate 3—Features OK, with all internal decision points validated against established criteria. The feature-engineering process achieved comprehensive domain coverage while maintaining statistical rigor. Phase deliverables comprised the complete feature-engineered dataset with verified temporal integrity, along with comprehensive supporting documentation, including Doc31_FeatureCreationReport, detailing the engineering methodology and feature derivation logic, and Doc32_FeatureSelectionReport, providing statistical analysis and selection rationale. Doc33_Phase3Report documented the complete feature-engineering phase activities and validation confirmations. With all outputs validated, documented, and appropriately versioned for traceability, and strong feature importance patterns confirmed across all domain categories, the project received authorization to advance to Phase 4.

5.5. Phase 4—Model Development and Evaluation

The model scoping (4.1) workflow identified candidate models spanning statistical, ML, and DL categories according to established project specifications. The structured model library encompassed four models: recurrent neural network (RNN) [45], long short-term memory (LSTM) [46], Transformer [47], and extreme gradient boosting (XGBoost) [48]. Doc41_CandidateExperiments documented the comprehensive experimental design, including 124 feature-exploration experiments across all combinations of five feature categories, and 312 hyperparameter tuning experiments focusing on the most promising model–feature combinations.
The final data preparation (4.2) workflow transformed the selected 49 features into model-ready formats through the EnergyDataset class, which provided automatic column detection for datetime and target variables, configurable time-series windowing, and PyTorch 2.6.0 [49]-compatible tensor generation. Data transformation employed time-aware splitting with configurable ratios (70% training, 20% validation, 10% test) while maintaining strict temporal ordering to prevent data leakage. Scaling strategies were differentiated by model architecture: MinMaxScaler [50] was fitted exclusively on training data for neural networks, while unscaled data was used for tree-based models. Comprehensive validation checks confirmed no data leakage between splits, proper temporal ordering, feature consistency across datasets, and scaling parameter isolation to training data only. The process generated persistent, reproducible splits with JavaScript Object Notation (JSON)-based configuration storage, enabling consistent experiment tracking and model comparison. Doc42_DataPreparationSummary documented the complete transformation pipeline, scaling parameters, and model-specific adaptations for the 35,064 hourly observations spanning 2021–2024.
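The time-aware splitting and windowing logic can be sketched as follows; the EnergyDataset class itself is not reproduced here, and the helper names, feature_df, feature_cols, and usage_kwh are placeholders rather than the project’s actual identifiers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def time_aware_split(df, train_frac=0.7, val_frac=0.2):
    """Chronological 70/20/10 split; no shuffling, so later periods never leak into training."""
    n = len(df)
    train_end, val_end = int(n * train_frac), int(n * (train_frac + val_frac))
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]


def make_windows(values: np.ndarray, target: np.ndarray, window: int = 168, horizon: int = 24):
    """Slide a 168-h input window over the series and pair it with the next 24-h target block."""
    X, y = [], []
    for start in range(len(values) - window - horizon + 1):
        X.append(values[start:start + window])
        y.append(target[start + window:start + window + horizon])
    return np.stack(X), np.stack(y)


# Scaling is fitted on the training split only and reused unchanged on validation/test data
# for the neural networks; tree-based models receive the unscaled feature matrix, e.g.:
# train, val, test = time_aware_split(feature_df)
# scaler = MinMaxScaler().fit(train[feature_cols])
# X_train, y_train = make_windows(scaler.transform(train[feature_cols]), train["usage_kwh"].to_numpy())
```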
The Model Training and Evaluation workflow (4.3) conducted exhaustive feature-combination exploration, testing all four models against all combinations of five-group feature sets, including lag features, time features, weather features, calendar features, and interaction features. This comprehensive exploration revealed that lag+time emerged as the most consistently successful combination (appearing among the top three combinations in four of the models). The systematic testing identified model-specific feature preferences: RNN and LSTM excelled with simple lag+time combinations, tree-based models demonstrated superior weather-feature integration, and attention-based models showed unique affinity for calendar features. The results of the trained model artifacts were stored, and the model training report Doc43_ModelTrainingReport was generated.
Model Refinement (4.4) focused exclusively on hyperparameter optimization using only the proven top-three feature combinations identified for each model, executing 312 targeted experiments to maximize the performance potential. The refinement process optimized model-specific architectural parameters: neural networks explored window sizes (72, 168 h), hidden dimensions (128, 256), layer configurations (2, 3), learning rates (1 × 10−4, 5 × 10−5), and dropout rates (0.2, 0.3); tree-based models optimized estimator counts (100, 200), maximum depths (3, 6), and learning rates (0.01, 0.1). Each model was optimized using its optimal feature sets from step 4.3. This targeted approach achieved meaningful gains: Transformer 10.45% (1.21 kWh → 1.08 kWh), LSTM 5.82% (1.22 kWh → 1.15 kWh), and RNN 4.96% (1.10 kWh → 1.05 kWh) in terms of the root mean square error (RMSE) [51]. The tree model XGBoost improved by 0.16% (1.14 kWh → 1.13 kWh), indicating that the feature choices in the previous step were already near-optimal for trees.
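The neural-network search space reported above corresponds to a small grid, sketched below for the RNN; enumerating it in this way is illustrative and does not reproduce the project’s experiment-tracking setup.

```python
from itertools import product

# Hyperparameter grid mirroring the ranges reported above for the neural networks.
rnn_grid = {
    "window": [72, 168],
    "hidden_dim": [128, 256],
    "num_layers": [2, 3],
    "learning_rate": [1e-4, 5e-5],
    "dropout": [0.2, 0.3],
}

configs = [dict(zip(rnn_grid, values)) for values in product(*rnn_grid.values())]
print(f"{len(configs)} RNN configurations to evaluate")  # 2 * 2 * 2 * 2 * 2 = 32
```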
The final consolidation workflow (4.5) conducted comprehensive statistical analysis and visual comparisons across all 436 experiments to identify the optimal deployment model. Performance analysis revealed RNN as the top-performing model, representing 6 of the top 10 experiments. The best RNN configuration achieved an RMSE of 1.05 kWh and a mean absolute error (MAE) [52] of 0.78 kWh (lag+time: window = 168, hidden = 256, layers = 2, dropout = 0.2, lr = 0.1), outperforming the best non-RNN model (Transformer, RMSE 1.08 kWh) by approximately 3.0%. Stability analysis on the top-200 subset showed Transformer as most stable by RMSE dispersion, while RNN provided the best accuracy overall. Detailed statistical validation and comparisons are provided in Doc43_ModelTrainingReport and Doc44_PerformanceAnalysisReport, while Doc45_FinalModelDocumentation specifies the final production-ready RNN (lag+time, H = 24). Visualization of representative multi-step forecasts is included in Figure 17.
Gate 4 Decision: The project successfully passed Gate 4—Model Valid, with all internal decision points validated against established performance criteria. The document Doc41_CandidateExperiments was generated to define the experimental plan, and model training proceeded accordingly. Data was successfully loaded, with appropriate preprocessing techniques applied to different model types. No data leakage was detected during the process. The full data-loading workflow is documented in Doc42_DataPreparationSummary. Model development met all predefined accuracy thresholds. The evaluation systematically compared multiple architectures, including RNN, LSTM, Transformer, and XGBoost. Among these, RNN achieved the lowest RMSE and MAE, justifying its selection for deployment. Results from the training runs were recorded, and a summary report was compiled in Doc43_ModelTrainingReport. Deployment-ready configurations and analyses are captured in Doc44_PerformanceAnalysisReport and Doc45_FinalModelDocumentation. Doc46_Phase4Report documented the complete model development phase with stakeholder validations and performance confirmations. With demonstrated model validity, comprehensive documentation through a transparent model-selection process, identified and accepted risks, and no unresolved technical issues, the project received authorization to advance to Phase 5 deployment activities.

5.6. Phase 5—Deployment and Implementation

The deployment preparation workflow (5.1) began with the validation of the selected RNN model using lag+time features, confirming that the expected RMSE of 1.05 kWh was maintained consistently in production environments. The infrastructure configuration established cloud deployment with an 8-core central processing unit (CPU), 16 gigabytes of random-access memory (RAM), and optional graphics processing unit (GPU) support, achieving target processing times of under 5 s for daily forecast generation. Basic authentication and API access were implemented for 7-day rolling forecast operations using the past 168 h to predict the next 24 h. Integration testing validated the model compatibility with 21 input features and confirmed the system’s readiness for operational deployment.
The interface development (5.2) implemented a web dashboard displaying daily 24-h consumption forecasts, with sample hourly values ranging from 8 to 13 kWh. The interface featured time-series visualizations showing individual forecast RMSEs over time and key-period comparisons between forecasts and actual values. Functionality validation confirmed correct rendering across browsers and daily data-refresh capabilities for 7-day rolling forecast cycles.
In the Data Connection (5.3) step, data integration established connections to sensor networks providing continuous 168-h historical data windows for daily forecast generation. Historical energy consumption was sourced primarily from internal databases. Basic data quality checks ensured input consistency with the model’s 21-feature specification, including time-based features (hour, day, week, and year cycles) and lag-based consumption history.
In the System Deployment (5.4) step, a containerized forecasting service was launched, generating seven daily rolling forecasts, each producing 24-h predictions at 00:00. Initial production runs generated sample forecasts with values such as 9.68 kWh and 9.77 kWh for Day 1, demonstrating operational capability across the 8–13 kWh energy usage range. System validation confirmed zero runtime errors and successful generation of rolling forecast visualizations. Basic deployment documentation included API specifications and operational procedures for the 7-day forecast cycle.
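A stripped-down sketch of the daily forecast job is shown below; model.predict, the scaler object, and feature_cols stand in for the deployed components and are assumptions rather than the actual service code.

```python
from datetime import timedelta

import numpy as np
import pandas as pd


def run_daily_forecast(model, history: pd.DataFrame, scaler, feature_cols, horizon: int = 24):
    """Generate the 24-h-ahead forecast from the most recent 168-h feature window."""
    window = history.tail(168)
    if len(window) < 168 or window[feature_cols].isna().any().any():
        raise ValueError("Incomplete 168-h input window; skipping this forecast cycle")

    # Shape (1, 168, n_features): one batch of the scaled input window.
    X = scaler.transform(window[feature_cols]).astype(np.float32)[None, ...]
    y_pred = model.predict(X).reshape(horizon)  # hypothetical wrapper returning hourly kWh

    start = window.index[-1] + timedelta(hours=1)
    return pd.Series(y_pred, index=pd.date_range(start, periods=horizon, freq="h"), name="forecast_kwh")
```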
Gate 5 Decision: The project successfully passed Gate 5—Deployment OK, with all operational readiness criteria satisfied and comprehensive system validation completed. The forecasting system demonstrated functional UIs with confirmed usability testing, stable data pipeline connections, and validated prediction outputs meeting accuracy requirements within the production constraints. Phase deliverables comprised production-ready RNN model deployment with verified performance metrics and comprehensive deployment documentation, including Doc51_DataConnectionReport and Doc52_DeploymentReport. Doc53_Phase5Report was generated to fully record the evaluation of Gate 5. With the deployment phase completion validated through systematic testing and stakeholder acceptance, the project received authorization to advance to Phase 6 automated monitoring operations.

5.7. Phase 6—Automated Monitoring and Update

Automated Monitoring (6.1): The monitoring system initiated daily operations that collected 168-h data windows from sensor networks every morning at 00:00. Data preprocessing maintained consistency with the 21-feature model specification, generating 24-h forecasts integrated into dashboard visualizations. Quality assessment evaluated each daily prediction against the rolling forecast RMSE of 1.28 kWh and MAE of 0.99 kWh. Predictions meeting the quality criteria were archived, while outputs exceeding the RMSE degradation threshold of 25% from baseline triggered alert mechanisms and diagnostic logging for investigation. Figure 18 presents the forecast RMSE trend over time alongside key forecast periods compared with the corresponding actual values for the deployed model.
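Assuming the 1.28 kWh rolling-forecast RMSE serves as the baseline, the 25% degradation rule reduces to a one-line check, sketched below with illustrative values.

```python
def check_degradation(current_rmse: float, baseline_rmse: float = 1.28, tolerance: float = 0.25) -> bool:
    """Return True (trigger alert) when the daily RMSE exceeds the baseline by more than 25%."""
    return current_rmse > baseline_rmse * (1.0 + tolerance)


# Example: a daily forecast scoring 1.70 kWh RMSE breaches the 1.60 kWh alert threshold.
assert check_degradation(1.70)
assert not check_degradation(1.40)
```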
Gate 6 Decision (6.2): The project successfully passed Gate 6—Monitoring Governance Check. System metrics confirmed sustained accuracy within the RMSE tolerance threshold. Phase deliverables comprised stored results with performance analytics, including individual forecast RMSE visualizations, and governance documentation, namely Doc61_AutomaticMonitoringReport. Doc63_Phase6Report was generated to record the successful passage of Decision Gate 6. The automated monitoring system received approval to continue 7-day rolling forecast operations, with potential future retraining based on performance triggers ensuring sustained accuracy for energy management applications in step 6.3—Wait for Next Update Cycle.

6. Conclusions

This paper presents a comprehensive energy forecasting framework that addresses critical methodological gaps in existing data science and MLOps approaches. The framework systematically integrates energy-domain expertise, formal governance mechanisms, and comprehensive traceability requirements to enhance the reliability and trustworthiness of energy forecasting deployments in critical infrastructure contexts.
Building upon CRISP-DM’s structured methodology while incorporating MLOps automation principles, our framework introduces systematic decision gates, mandatory domain-expert validation, and embedded compliance management. The seven-phase architecture (Phases 0–6) provides end-to-end coverage from project foundation through to automated monitoring, with multi-stakeholder governance ensuring quality and regulatory alignment at each transition point.
The framework’s formal validation mechanisms and standardized documentation requirements reduce project risks while ensuring audit readiness for regulatory compliance. Systematic energy-domain-expert involvement at critical checkpoints (EDA alignment, feature interpretability, model plausibility validation) ensures continuous alignment between technical implementations and energy system realities. The tool-agnostic, modular design enables consistent implementation across diverse technical environments while maintaining methodological rigor.
The achieved forecasting performance in the case study demonstrates the effectiveness of the framework’s governance and traceability mechanisms. By enforcing documentation at each phase and requiring domain-expert validation, the framework reduces the likelihood of spurious correlations and ensures alignment between model outputs and physical system behaviors. These methodological safeguards explain the performance observed and underscore the framework’s contribution to trustworthy energy forecasting.
Future research directions include developing implementation support tools (standardized gate checklists, pipeline orchestration templates), conducting empirical validation studies comparing framework adoption outcomes across delivery metrics, and extending the methodology to broader energy informatics applications through systematic case study analysis.
As AI and data science become increasingly integrated into energy system operations and planning, robust process frameworks will become as important as powerful algorithms. The proposed framework provides a balanced approach that ensures technical excellence while maintaining energy-domain oversight and understanding throughout the development lifecycle.

Author Contributions

Conceptualization, B.N.J. and Z.G.M.; methodology, X.Z. and B.N.J.; software, X.Z.; validation, X.Z., B.N.J. and Z.G.M.; formal analysis, X.Z.; investigation, X.Z.; resources, B.N.J. and Z.G.M.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, B.N.J. and Z.G.M.; visualization, X.Z.; supervision, B.N.J. and Z.G.M.; project administration, B.N.J.; funding acquisition, Z.G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is part of the project titled “Automated Data and Machine Learning Pipeline for Cost-Effective Energy Demand Forecasting in Sector Coupling” (jr. Nr. RF-23-0039; Erhvervsfyrtårn Syd Fase 2), which is supported by The European Regional Development Fund.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
APIs: Application Programming Interfaces
CI/CD: Continuous Integration/Continuous Delivery
CPU: Central Processing Unit
CRISP-DM: Cross-Industry Standard Process for Data Mining
CRISP-ML(Q): Cross-Industry Standard Process for Machine Learning with Quality assurance
DAG: Directed Acyclic Graph
DevOps: Development and Operations
DL: Deep Learning
EDA: Exploratory Data Analysis
EU: European Union
EU AI Act: European Union Artificial Intelligence Act
FIN-DM: Financial Industry Business Data Model
GDPR: General Data Protection Regulation
GPU: Graphics Processing Unit
IDAIC: Industrial Data Analysis Improvement Cycle
ISO: International Organization for Standardization
IoT: Internet of Things
IQR: Interquartile Range
JSON: JavaScript Object Notation
LSTM: Long Short-Term Memory
MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
ML: Machine Learning
MLOps: Machine Learning Operations
NaN: Not a Number
QA: Quality Assurance
RAM: Random Access Memory
RMSE: Root Mean Squared Error
RNN: Recurrent Neural Network
SARIMA: Seasonal Autoregressive Integrated Moving Average
SDU: University of Southern Denmark
SHAP: Shapley Additive Explanations
TFX: TensorFlow Extended
UI: User Interface
XGBoost: Extreme Gradient Boosting

Appendix A. Summary of Key Phase Artifacts in the Energy Forecasting Framework

The energy forecasting framework generates comprehensive documentation and datasets across seven phases. To ensure end-to-end traceability and transparency, Table A1 presents a structured summary of all phase-specific outputs, including the document ID and title, a concise description of each document’s main content, and any resulting datasets.
Table A1. Comprehensive phase artifacts and outputs.

Phase 0: Project Foundation (output datasets: none)
  • Doc01_ProjectSpecification: Defines the overall project scope, objectives, forecasting purpose, energy type, and domain. It outlines technical requirements, infrastructure readiness, and modeling preferences to guide downstream phases.
  • Doc02_RequirementsCriteria: Details the functional, technical, and operational requirements across all project phases. It specifies accuracy targets, data quality standards, model validation criteria, and system performance thresholds.
  • Doc03_ComplianceCheckReport: Assesses the completeness and compliance of foundational documents with institutional, technical, and regulatory standards. It evaluates workflow adherence, documentation quality, and privacy/security compliance prior to Phase 1 launch.
  • Doc04_SystemTechFeasibilityReport: Evaluates the project’s technical feasibility across the system architecture, data pipelines, model infrastructure, UI requirements, and resource needs.
  • Doc05_Phase0Report: Summarizes all Phase 0 activities, confirms completion of key artifacts, and consolidates stakeholder approvals. It serves as the official checkpoint for Gate 0 and authorizes the transition to Phase 1.
  • Doc06_LogIssues: Records issues identified during Phase 0, categorized by severity and type (e.g., compliance, technical). It documents unresolved items, required actions before re-evaluation, and the feedback loop for rework.

Phase 1: Data Foundation (output datasets: raw energy datasets; supplementary datasets; integrated datasets)
  • Doc11_DataSourceInventory: Catalogs all available internal and external data sources, assesses their availability, and verifies data-loading success.
  • Doc12_InitialDataQualityReport: Conducts an initial assessment of raw data completeness, timestamp continuity, and overall timeline coverage for each source. Provides pass/fail evaluations against quality thresholds.
  • Doc13_DataQualityValidationReport: Validates the integrated datasets against predefined coverage and overlap thresholds. Confirms ingestion completeness, assesses quality across sources, and provides a go/no-go recommendation for proceeding to the next step.
  • Doc14_IntegrationSummary: Documents the data integration process, including merge strategies, record preservation logic, and overlaps between datasets.
  • Doc15_Phase1Report: Summarizes all Phase 1 activities, including data acquisition, quality assessment, integration, and documentation. Confirms completion of Gate 1 requirements with stakeholder approvals and formally transitions the project to Phase 2.
  • Doc16_LogIssues: Records issues encountered during Phase 1, especially data quality-related problems and integration anomalies. Lists unresolved items, assigns responsibilities, and documents actions required before re-evaluation of Gate 1.

Phase 2: Data Understanding and Preprocessing (output datasets: cleaned datasets)
  • Doc21_EDAReport: Presents exploratory analysis of the cleaned and integrated datasets, highlighting temporal and seasonal trends, correlations with external variables, and deviations from domain expectations. Includes validation of findings against expert knowledge and system behavior.
  • Doc22_DataCleaningReport: Documents all data-cleaning procedures, including outlier detection, gap filling, and context-aware imputation. Provides validation results for completeness, accuracy, and statistical consistency to confirm feature-engineering readiness.
  • Doc23_Phase2Report: Summarizes all Phase 2 activities, including exploratory analysis, cleaning, and preprocessing. Confirms all quality checks and domain alignments were met, with stakeholder sign-offs authorizing transition to Phase 3.
  • Doc24_LogIssues: Logs all unresolved issues identified, including those affecting data completeness and alignment with domain knowledge. Lists required corrective actions prior to re-evaluation of Gate 2.

Phase 3: Feature Engineering (output datasets: feature-engineered datasets; selected feature subsets; training/validation/test splits; transformed model-ready datasets)
  • Doc31_FeatureCreationReport: Details the engineering of a comprehensive feature set from raw and cleaned data using domain-informed strategies. Features are organized into interpretable categories with documented coverage, correlation structure, and quality validation.
  • Doc32_FeatureSelectionReport: Summarizes the feature-selection process involving redundancy elimination, normalization, and scoring. Features are evaluated by category and ranked based on statistical performance and domain relevance, resulting in a final set optimized for model training.
  • Doc33_Phase3Report: Consolidates all Phase 3 activities, including feature generation, selection, and data preparation. Confirms dataset readiness for modeling through rigorous validation checks and stakeholder sign-offs, supporting transition to Phase 4.
  • Doc34_LogIssues: Logs issues and challenges encountered during feature engineering, such as problems with derived variable consistency, transformation logic, or dataset leakage risks. It identifies required fixes before re-evaluation of Gate 3.

Phase 4: Model Development and Evaluation (output datasets: trained model artifacts; deployment-ready model packages)
  • Doc41_CandidateExperiments: Provides a comprehensive record of all candidate modeling experiments, including feature exploration and hyperparameter optimization. Summarizes tested models, feature combinations, and experimental designs across forecasting horizons.
  • Doc42_DataPreparationSummary: Describes the transformation of selected feature datasets into model-ready formats, including scaling, splitting, and structure adaptation for different model types. Validates reproducibility, leakage prevention, and data consistency across training and test sets.
  • Doc43_ModelTrainingReport: Details the end-to-end training process for all model–feature combinations, including experiment configurations, validation strategy, and training metrics. Confirms successful execution of all planned runs with results stored and traceable.
  • Doc44_PerformanceAnalysisReport: Presents evaluation of model performance, identifying the top-performing models based on RMSE, MAE, and efficiency metrics. Includes rankings, model category comparisons, and stability assessments.
  • Doc45_FinalModelDocumentation: Documents the architecture, configuration, hyperparameters, and feature usage of the final selected model. Highlights performance, robustness, and production suitability based on comprehensive testing.
  • Doc46_Phase4Report: Summarizes all Phase 4 activities, including model scoping, training, refinement, evaluation, and documentation. Confirms compliance, readiness for deployment, and fulfillment of Gate 4 acceptance criteria.
  • Doc47_LogIssues: Records issues related to model configuration, performance discrepancies, and documentation gaps. Specifies actions required for Gate 4 re-evaluation and areas for improvement in modeling or validation.

Phase 5: Deployment and Implementation (output datasets: production-ready datasets)
  • Doc51_DataConnectionReport: Summarizes the successful establishment and validation of data connections between the system and external sources (e.g., sensors, databases, APIs). Confirms input quality, system access, and the generation of production-ready datasets for real-time or scheduled operations.
  • Doc52_DeploymentReport: Documents the final deployment activities, including system launch, interface rollout, initial prediction testing, and API/configuration packaging. Confirms operational readiness and system stability through validation of predictions, logs, and interface behavior.
  • Doc53_Phase5Report: Provides a consolidated summary of Phase 5 milestones, including deployment validation across system, data, and UI layers. Captures stakeholder approvals and confirms satisfaction of Gate 5 acceptance criteria for transition into live operation.
  • Doc54_LogIssues: Captures issues encountered during deployment, such as interface errors, system downtime, or misconfigured outputs. Lists root causes, assigned actions, and the resolution status for any system deployment failure points or rollback triggers.

Phase 6: Automated Monitoring and Update (output datasets: real-time monitoring results, e.g., predictions, logs, and archives; hard samples for future retraining)
  • Doc61_AutomaticMonitoringReport: Reports on the system’s automated monitoring activities, including real-time data ingestion, prediction updates, and UI refreshes. Confirms performance metrics, logs the operational status, and verifies ongoing model effectiveness in production.
  • Doc62_UnqualifiedReport: Identifies and documents cases where prediction outputs fall below acceptable quality thresholds. Flags performance degradation, specifies unqualified data samples, and archives cases for potential future retraining or manual review.
  • Doc63_Phase6Report: Summarizes all monitoring results and update-cycle performance, including model consistency, drift detection, and operational stability. Confirms Gate 6 completion through governance checks and outlines next steps for continuous improvement or retraining cycles.
  • Doc64_LogIssues: Tracks incidents such as data stream interruptions, interface failures, or prediction anomalies. Serves as the operational log for Phase 6, capturing diagnostics, recovery actions, and escalation records.

References

  1. Hong, T.; Pinson, P.; Wang, Y.; Weron, R.; Yang, D.; Zareipour, H. Energy forecasting: A review and outlook. IEEE Open Access J. Power Energy 2020, 7, 376–388. [Google Scholar] [CrossRef]
  2. Mystakidis, A.; Koukaras, P.; Tsalikidis, N.; Ioannidis, D.; Tjortjis, C. Energy forecasting: A comprehensive review of techniques and technologies. Energies 2024, 17, 1662. [Google Scholar] [CrossRef]
  3. Im, J.; Lee, J.; Lee, S.; Kwon, H.-Y. Data pipeline for real-time energy consumption data management and prediction. Front. Big Data 2024, 7, 1308236. [Google Scholar] [CrossRef]
  4. Kilkenny, M.F.; Robinson, K.M. Data quality: “Garbage in–garbage out”. Health Inf. Manag. J. 2018, 47, 1833358318774357. [Google Scholar] [CrossRef]
  5. Amin, A.; Mourshed, M. Weather and climate data for energy applications. Renew. Sustain. Energy Rev. 2024, 192, 114247. [Google Scholar] [CrossRef]
  6. Bansal, A.; Balaji, K.; Lalani, Z. Temporal Encoding Strategies for Energy Time Series Prediction. arXiv 2025, arXiv:2503.15456. [Google Scholar] [CrossRef]
  7. Gonçalves, A.C.R.; Costoya, X.; Nieto, R.; Liberato, M.L.R. Extreme Weather Events on Energy Systems: A Comprehensive Review on Impacts, Mitigation, and Adaptation Measures. Sustain. Energy Res. 2024, 11, 4. [Google Scholar] [CrossRef]
  8. Chen, G.; Lu, S.; Zhou, S.; Tian, Z.; Kim, M.K.; Liu, J.; Liu, X. A Systematic Review of Building Energy Consumption Prediction: From Perspectives of Load Classification, Data-Driven Frameworks, and Future Directions. Appl. Sci. 2025, 15, 3086. [Google Scholar] [CrossRef]
  9. Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. IEEE Access 2023, 11, 31866–31879. [Google Scholar] [CrossRef]
  10. Nguyen, K.; Koch, K.; Chandna, S.; Vu, B. Energy Performance Analysis and Output Prediction Pipeline for East-West Solar Microgrids. J 2024, 7, 421–438. [Google Scholar] [CrossRef]
  11. Studer, S.; Bui, T.B.; Drescher, C.; Hanuschkin, A.; Winkler, L.; Peters, S.; Müller, K.-R. Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. Mach. Learn. Knowl. Extr. 2021, 3, 392–413. [Google Scholar] [CrossRef]
  12. Tripathi, S.; Muhr, D.; Brunner, M.; Jodlbauer, H.; Dehmer, M.; Emmert-Streib, F. Ensuring the Robustness and Reliability of Data-Driven Knowledge Discovery Models in Production and Manufacturing. Front. Artif. Intell. 2021, 4, 576892. [Google Scholar] [CrossRef] [PubMed]
  13. EverythingDevOps.dev. A Brief History of DevOps and Its Impact on Software Development. Available online: https://www.everythingdevops.dev/blog/a-brief-history-of-devops-and-its-impact-on-software-development (accessed on 13 August 2025).
  14. Steidl, M.; Zirpins, C.; Salomon, T. The Pipeline for the Continuous Development of AI Models–Current State of Research and Practice. In Proceedings of the 2023 IEEE International Conference on Software Architecture Companion (ICSA-C), L’Aquila, Italy, 13–17 March 2023; pp. 75–83. [Google Scholar]
  15. Saltz, J.S. CRISP-DM for Data Science: Strengths, Weaknesses and Potential Next Steps. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 2337–2344. [Google Scholar]
  16. Pei, Z.; Liu, L.; Wang, C.; Wang, J. Requirements Engineering for Machine Learning: A Review and Reflection. In Proceedings of the 30th IEEE International Requirements Engineering Conference Workshops (RE 2022 Workshops), Melbourne, VIC, Australia, 15–19 August 2022; pp. 166–175. [Google Scholar]
  17. Andrei, A.-V.; Velev, G.; Toma, F.-M.; Pele, D.T.; Lessmann, S. Energy Price Modelling: A Comparative Evaluation of Four Generations of Forecasting Methods. arXiv 2024, arXiv:2411.03372. [Google Scholar] [CrossRef]
  18. European Parliament; Council of the European Union. Regulation (EU) 2016/679–General Data Protection Regulation (GDPR). Available online: https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng (accessed on 13 August 2025).
  19. European Parliament; Council of the European Union. Regulation (EU) 2024/1689–Artificial Intelligence Act (EU AI Act). Available online: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng (accessed on 13 August 2025).
  20. Mora-Cantallops, M.; Pérez-Rodríguez, A.; Jiménez-Domingo, E. Traceability for Trustworthy AI: A Review of Models and Tools. Big Data Cogn. Comput. 2021, 5, 20. [Google Scholar] [CrossRef]
  21. Kubeflow Community. Introduction to Kubeflow. Available online: https://www.kubeflow.org/docs/started/introduction/ (accessed on 13 August 2025).
  22. Apache Software Foundation. Apache Airflow–Workflow Management Platform. Available online: https://airflow.apache.org/ (accessed on 13 August 2025).
  23. Google. TensorFlow Extended (TFX): Real-World Machine Learning in Production. Available online: https://blog.tensorflow.org/2019/06/tensorflow-extended-tfx-real-world_26.html (accessed on 13 August 2025).
  24. Miller, T.; Durlik, I.; Łobodzińska, A.; Dorobczyński, L.; Jasionowski, R. AI in Context: Harnessing Domain Knowledge for Smarter Machine Learning. Appl. Sci. 2024, 14, 11612. [Google Scholar] [CrossRef]
  25. Oakes, B.J.; Famelis, M.; Sahraoui, H. Building Domain-Specific Machine Learning Workflows: A Conceptual Framework for the State-of-the-Practice. ACM Trans. Softw. Eng. Methodol. 2024, 33, 1–50. [Google Scholar] [CrossRef]
  26. Kale, A.; Nguyen, T.; Harris, F.C., Jr.; Li, C.; Zhang, J.; Ma, X. Provenance Documentation to Enable Explainable and Trustworthy AI: A Literature Review. Data Intell. 2023, 5, 139–162. [Google Scholar] [CrossRef]
  27. Theusch, F.; Heisterkamp, P. Comparative Analysis of Open-Source ML Pipeline Orchestration Platforms. 2024. Available online: https://www.researchgate.net/profile/Narek-Grigoryan-3/publication/382114154_Comparative_Analysis_of_Open-Source_ML_Pipeline_Orchestration_Platforms/links/668e31d6b15ba559074d9a4b/Comparative-Analysis-of-Open-Source-ML-Pipeline-Orchestration-Platforms.pdf (accessed on 13 August 2025).
  28. Schröer, C.; Kruse, F.; Gómez, J.M. A Systematic Literature Review on Applying CRISP-DM Process Model. Procedia Comput. Sci. 2021, 181, 526–534. [Google Scholar] [CrossRef]
  29. Chatterjee, A.; Ahmed, B.; Hallin, E.; Engman, A. Quality Assurance in MLOps Setting: An Industrial Perspective. arXiv 2022, arXiv:2211.12706. [Google Scholar] [CrossRef]
  30. ISO 9001:2015; Quality Management Systems—Requirements. ISO: Geneva, Switzerland, 2015. Available online: https://www.iso.org/standard/62085.html (accessed on 13 August 2025).
  31. Jayzed Data Models Inc. Financial Industry Business Data Model (FIB-DM). Available online: https://fib-dm.com/ (accessed on 13 August 2025).
  32. Shimaoka, A.M.; Ferreira, R.C.; Goldman, A. The evolution of CRISP-DM for Data Science: Methods, processes and frameworks. SBC Rev. Comput. Sci. 2024, 4, 28–43. [Google Scholar] [CrossRef]
  33. Ahern, M.; O’Sullivan, D.T.J.; Bruton, K. Development of a Framework to Aid the Transition from Reactive to Proactive Maintenance Approaches to Enable Energy Reduction. Appl. Sci. 2022, 12, 6704. [Google Scholar] [CrossRef]
  34. Oliveira, T. Implement MLOps with Kubeflow Pipelines. Available online: https://developers.redhat.com/articles/2024/01/25/implement-mlops-kubeflow-pipelines (accessed on 13 August 2025).
  35. Apache Software Foundation. Apache Beam—Orchestration. Available online: https://beam.apache.org/documentation/ml/orchestration/ (accessed on 13 August 2025).
  36. Boettiger, C. An Introduction to Docker for Reproducible Research, with Examples from the R Environment. arXiv 2014, arXiv:1410.0846. [Google Scholar] [CrossRef]
  37. Amershi, S.; Begel, A.; Bird, C.; DeLine, R.; Gall, H.; Kamar, E.; Nagappan, N.; Nushi, B.; Zimmermann, T. Software Engineering for Machine Learning: A Case Study. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Montreal, QC, Canada, 25–31 May 2019; pp. 291–300. [Google Scholar]
  38. High-Level Expert Group on Artificial Intelligence. Ethics Guidelines for Trustworthy Artificial Intelligence; High-Level Expert Group on Artificial Intelligence: Brussels, Belgium, 2019. [Google Scholar]
  39. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 4768–4777. [Google Scholar]
  40. Stigler, S.M. Statistics on the Table: The History of Statistical Concepts and Methods; Harvard University Press: Cambridge, MA, USA, 2002. [Google Scholar]
  41. Tukey, J.W. Exploratory Data Analysis; Addison-Wesley: Boston, MA, USA, 1977. [Google Scholar]
  42. Encyclopædia Britannica, Inc. Pearson’s Correlation Coefficient. Available online: https://www.britannica.com/topic/Pearsons-correlation-coefficient (accessed on 13 August 2025).
  43. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  44. OpenStax. Cohen’s Standards for Small, Medium, and Large Effect Sizes. Available online: https://stats.libretexts.org/Bookshelves/Introductory_Statistics/Introductory_Statistics_2e_(OpenStax)/10%3A_Hypothesis_Testing_with_Two_Samples/10.03%3A_Cohen%27s_Standards_for_Small_Medium_and_Large_Effect_Sizes (accessed on 13 August 2025).
  45. Elman, J.L. Finding Structure in Time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  46. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  47. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  48. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  49. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]
  50. Scikit-Learn Documentation. MinMaxScaler—Scikit-Learn Preprocessing. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html (accessed on 13 August 2025).
  51. DataCamp. RMSE Explained: A Guide to Regression Prediction Accuracy. Available online: https://www.datacamp.com/tutorial/rmse (accessed on 13 August 2025).
  52. Frost, J. Mean Absolute Error (MAE). Available online: https://statisticsbyjim.com/glossary/mean-absolute-error/ (accessed on 13 August 2025).
Figure 1. High-level overview of the energy forecasting pipeline framework.
Figure 2. Phase 0—Project Foundation workflow.
Figure 3. Phase 1—Data Foundation workflow.
Figure 4. Phase 1.1—Data Quality Verification logic.
Figure 5. Phase 2—Data Understanding and Preprocessing workflow.
Figure 6. Phase 3—Feature Engineering workflow.
Figure 7. Phase 4—Model Development and Evaluation workflow.
Figure 8. Phase 5—Deployment and Implementation workflow.
Figure 9. Phase 6—Automated Monitoring and Update workflow.
Figure 10. Case study—electricity data visualization.
Figure 11. Case study—key weather data visualization.
Figure 12. Case study—EDA visualizations.
Figure 13. Case study—gap-filling examples.
Figure 14. Case study—top 30 features.
Figure 15. Case study—category-wise feature ranking (derivation features).
Figure 16. Case study—category-wise feature ranking (external features).
Figure 17. Case study—top model (RNN) forecast samples.
Figure 18. Case study—deployed model (RNN) performance.
Table 1. Stakeholder responsibility matrix by phase.

Phase | Activity | Data Scientists & ML Engineers | Energy-Domain Experts | Project Managers | DevOps & MLOps Engineers | Compliance Officers & QA
Phase 0 | Scenario Understanding and Alignment | A | A | P | C | A
Phase 0 | Compliance and Technical Checks | A | A | P | A | P
Phase 0 | Gate 0: Foundation Check | A | A | P | A | P
Phase 1 | Data Identification and Quality Check | P | A | G | C | C
Phase 1 | Data Ingestion & Validation | P | C | G | C | A
Phase 1 | Gate 1: Data Acquisition Ready | P | A | P | C | A
Phase 2 | EDA | P | A | G | C | C
Phase 2 | Data Cleaning and Preprocessing | P | A | G | C | C
Phase 2 | Gate 2: Preprocessing OK | P | A | P | C | A
Phase 3 | Feature Engineering and Creation | P | A | G | C | C
Phase 3 | Feature Analysis and Selection | P | A | G | C | C
Phase 3 | Gate 3: Features OK | P | A | P | C | A
Phase 4 | Model Scoping | P | A | G | C | C
Phase 4 | Input Data Preparation | P | C | G | C | C
Phase 4 | Model Training and Evaluation | P | A | G | C | C
Phase 4 | Model Refinement | P | A | G | C | C
Phase 4 | Result Consolidation | P | A | G | C | C
Phase 4 | Gate 4: Model Valid | P | A | P | C | A
Phase 5 | Deployment Preparation | A | C | G | P | C
Phase 5 | UI Setup | C | A | G | P | C
Phase 5 | Data Connection | A | C | G | P | C
Phase 5 | System Deployment | A | C | G | P | C
Phase 5 | Gate 5: Deployment OK | A | A | P | P | A
Phase 6 | Automated Monitoring | C | A | G | P | C
Phase 6 | Gate 6: Monitoring Governance | C | A | P | P | A
Phase 6 | Update Cycle Management | A | C | G | P | C
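
For teams that automate gate reviews, the responsibility matrix can also be kept in machine-readable form. The following Python sketch is purely illustrative and not part of the published framework: the role list, dictionary layout, and helper function are hypothetical, and the single-letter codes are simply copied verbatim from Table 1.

```python
# Illustrative sketch (not the authors' implementation): a slice of Table 1 as a
# machine-readable responsibility matrix, so a gate review can list which
# stakeholder groups carry a given responsibility code for an activity.
ROLES = [
    "Data Scientists & ML Engineers",
    "Energy-Domain Experts",
    "Project Managers",
    "DevOps & MLOps Engineers",
    "Compliance Officers & QA",
]

# (phase, activity) -> responsibility codes per role, in ROLES order (from Table 1)
RESPONSIBILITY_MATRIX = {
    ("Phase 1", "Gate 1: Data Acquisition Ready"): ["P", "A", "P", "C", "A"],
    ("Phase 4", "Gate 4: Model Valid"):            ["P", "A", "P", "C", "A"],
    ("Phase 5", "Gate 5: Deployment OK"):          ["A", "A", "P", "P", "A"],
}

def roles_with_code(phase: str, activity: str, code: str) -> list[str]:
    """Return the stakeholder groups assigned a given code for one activity."""
    codes = RESPONSIBILITY_MATRIX[(phase, activity)]
    return [role for role, c in zip(ROLES, codes) if c == code]

if __name__ == "__main__":
    print(roles_with_code("Phase 4", "Gate 4: Model Valid", "A"))
    # -> ['Energy-Domain Experts', 'Compliance Officers & QA']
```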
Table 2. Comparative analysis of process frameworks for energy forecasting applications.

Phases/Coverage
- CRISP-DM: Six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment. No explicit monitoring phase (left implicit in Deployment).
- Modern MLOps Pipeline: Varies, but generally covers data ingestion, preparation, model training, evaluation, and deployment, and adds monitoring through continuous integration and continuous deployment (CI/CD) for ML [14]. Business understanding is often assumed but not formally part of the pipeline; the focus is on technical steps.
- Proposed Energy Framework: Seven phases: Project Foundation (Phase 0), Data Foundation (Phase 1), Data Understanding and Preprocessing (Phase 2), Feature Engineering (Phase 3), Model Development and Evaluation (Phase 4), Model Deployment and Implementation (Phase 5), and Automated Monitoring and Update (Phase 6). Each phase contains 2–4 sequential workflows with explicit decision gates.

Iteration and Feedback Loops
- CRISP-DM: Iterative in theory (waterfall with backtracking): teams can loop back if criteria are not met [11], but there are no fixed decision points except, perhaps, after Evaluation. Iteration is ad hoc and relies on team judgment.
- Modern MLOps Pipeline: Continuous triggers for iteration (e.g., retrain the model when drift is detected, or on a schedule) are often implemented [14], and pipelines can be re-run automatically. However, the decision logic is usually encoded in monitoring scripts rather than formalized as "gates"; it is event-driven.
- Proposed Energy Framework: Systematic iteration through formal decision gates (Gates 0–6) with predetermined validation criteria. Each gate requires multi-stakeholder review and explicit sign-off. Controlled loopback mechanisms enable returns to earlier phases with documented rationale and tracked iterations.

Domain-Expert Involvement
- CRISP-DM: Implicitly present at the start (Business Understanding) and the end (Evaluation). Domain knowledge guides initial goals and final evaluation, but there is no dedicated phase for domain feedback mid-process; domain experts' role in model development is often informal.
- Modern MLOps Pipeline: Minimal integration within automated pipelines. Domain experts typically provide initial requirements and review final outputs but lack formal validation checkpoints within the technical workflows. Manual review stages are often implemented separately from the core pipeline.
- Proposed Energy Framework: Structured domain-expert involvement across all phases with mandatory validation checkpoints. Energy-domain expertise is formally required for EDA alignment verification (Phase 2), feature interpretability assessment (Phase 3), and model plausibility validation (Phase 4). A five-stakeholder governance model ensures continuous domain–technical alignment.

Documentation and Traceability
- CRISP-DM: Emphasized conceptually but not enforced. Artifacts such as data or models are managed as project assets, but there is no standard format. Traceability largely depends on the team's practices; there is no specific guidance on versioning or provenance.
- Modern MLOps Pipeline: Strong tool support for technical traceability: version control, experiment tracking, and data lineage through pipeline metadata; every run can be recorded. However, the focus is on data and model artifacts; business context and rationale often sit outside these tools, and documentation is usually kept separately from the pipeline (e.g., Confluence pages).
- Proposed Energy Framework: Comprehensive traceability built in: each phase produces artifacts with recorded metadata, and documentation is required at each gate transition. Combines tool-based tracking (for data/model versions) with process documentation (decision logs, validation reports). Satisfies traceability requirements for trust and compliance by maintaining provenance from business goal to model deployment.

Regulatory Compliance
- CRISP-DM: No explicit compliance framework. Regulatory considerations are addressed ad hoc by practitioners without systematic integration into the methodology structure.
- Modern MLOps Pipeline: Compliance capabilities are added as optional pipeline components (bias audits, model cards) when domain requirements dictate. No native compliance framework is integrated into the core methodology.
- Proposed Energy Framework: Embedded compliance management throughout the pipeline lifecycle. Phase 0 establishes a comprehensive compliance framework through jurisdictional assessment, privacy evaluation, and regulatory constraint identification. All subsequent phases incorporate compliance validation at decision gates, with dedicated compliance-officer oversight and audit-ready documentation.

Modularity and Reusability
- CRISP-DM: High-level phases are conceptually modular, but CRISP-DM does not define technical modules. Reusability depends on how the project is executed. Not tied to any tooling.
- Modern MLOps Pipeline: Highly modular in implementation: each pipeline step is a distinct component (e.g., a container) that can be reused in other pipelines, which encourages code reuse and standardized components. However, each pipeline is specific to an organization's stack.
- Proposed Energy Framework: Explicitly modular at the concept level, with energy-specific phases that correspond to pipeline components. The tool-agnostic design allows implementation in any workflow orchestration environment, and clear phase inputs/outputs promote reusable energy forecasting templates for each phase.

Quality Assurance (QA)
- CRISP-DM: Relies on the final Evaluation; quality is mostly measured as model performance and a qualitative judgment of whether the business objective is met. No explicit risk management steps in each phase.
- Modern MLOps Pipeline: Quality control is often automated via CI/CD tests that fail the pipeline if the data schema changes or performance drops. However, these tests are set up by engineers and are not defined in a high-level methodology.
- Proposed Energy Framework: Multi-layered QA with phase-specific success criteria and validation protocols. Energy-specific quality metrics are defined for each phase outcome (data completeness thresholds, feature interpretability requirements, model performance benchmarks). Structured reviews and formal validation artifacts ensure comprehensive quality coverage from foundation through deployment.
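
To make the formal decision-gate concept from Table 2 concrete, the sketch below shows one possible way to record a gate decision with predetermined criteria, multi-stakeholder sign-off, and a documented loopback. It is an assumption-laden illustration rather than the authors' implementation; all class and field names are hypothetical.

```python
# Illustrative sketch (assumed, not from the paper): a decision gate record with
# predetermined validation criteria, explicit stakeholder sign-off, and a
# documented loopback target when the gate fails.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class GateDecision:
    gate: str
    criteria: dict[str, bool]          # criterion name -> passed?
    signoffs: dict[str, bool]          # stakeholder group -> approved?
    loopback_to: str | None = None     # phase to return to if the gate fails
    rationale: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def passed(self) -> bool:
        # The gate passes only if every criterion is met and every group signs off.
        return all(self.criteria.values()) and all(self.signoffs.values())

decision = GateDecision(
    gate="Gate 2: Preprocessing OK",
    criteria={"temporal cycles visible": True, "outliers < 5% post-cleaning": False},
    signoffs={"Energy-Domain Experts": True, "Compliance Officers & QA": True},
    loopback_to="Phase 2",
    rationale="Outlier rate above threshold; repeat cleaning with revised filters.",
)
assert not decision.passed  # failed gate -> controlled loopback to Phase 2 is recorded
```

Persisting such records at every gate transition is one lightweight way to obtain the decision logs and tracked iterations that the proposed framework calls for.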
Table 3. Case study—project specification.

Category | Specification Details
Use Case Context | Electricity forecasting for office building optimization.
Installation and Infrastructure | Existing smart-meter infrastructure with hourly data collection; automated hourly readings from the building management system.
Factor Impact Analysis | Weather conditions, academic calendar, occupancy patterns.
Model Scoping | Tree-based models and neural network-based models.
Table 4. Case study—phase-based requirements and success criteria for energy forecasting.

Phase | Requirements | Success Criteria
Overall | Define the forecasting value; establish an appropriate forecasting horizon; define the automatic update interval | Forecasting value: electricity usage; forecast horizon: 24 h; automatic update interval: 24 h
Phase 1 | Establish data quality metrics; ensure key factors (e.g., weather, school calendar) are covered; confirm data-source availability; define data-ingestion success; set overall data quality standards | ≥85% data completeness; all primary factors represented; accessible data sources; ≥98% successful ingestions; missing values <15%, consistent time intervals, correct data types
Phase 2 | Align data patterns with domain knowledge; establish data-cleaning success benchmarks | Temporal cycles visible in the data; outliers <5% post-cleaning
Phase 3 | Generate sufficient, relevant features (lag, calendar, temporal, weather, and interaction features) | Feature set covers key drivers
Phase 4 | Validate the data preparation pipeline; model meets performance metrics | No data leakage after transformation; MAPE <15% on validation
Phase 5 | Validate the model in production; confirm data quality and connectivity; verify system-level integration | Live model predictions match dev-stage metrics within ±25%; data quality good, uptime >99%, API response time <2 s; data pipeline integrated and producing outputs without error
Phase 6 | Define output quality acceptance | Forecast degradation <25% from baseline
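
As an illustration of how the Phase 1 criteria in Table 4 could be checked automatically at Gate 1, the sketch below evaluates completeness, missing-value share, interval consistency, and data types on an hourly pandas DataFrame. Only the thresholds come from the table; the column name, function, and example data are assumptions.

```python
# Minimal sketch (assumptions: an hourly DataFrame with a DatetimeIndex and a
# numeric "electricity_kwh" column) of the Phase 1 quality checks behind Gate 1
# in Table 4: >=85% completeness, missing values <15%, consistent time
# intervals, and correct data types. Thresholds are taken from the table;
# everything else is hypothetical.
import pandas as pd

def phase1_quality_report(df: pd.DataFrame, value_col: str = "electricity_kwh") -> dict:
    missing_share = df[value_col].isna().mean()
    steps = df.index.to_series().diff().dropna()
    return {
        "completeness_ok": (1 - missing_share) >= 0.85,
        "missing_ok": missing_share < 0.15,
        "intervals_consistent": steps.nunique() == 1,   # a single, regular time step
        "dtype_ok": pd.api.types.is_numeric_dtype(df[value_col]),
    }

# Example: one week of hourly readings with a short gap
idx = pd.date_range("2024-01-01", periods=7 * 24, freq="h")
df = pd.DataFrame({"electricity_kwh": 1.0}, index=idx)
df.iloc[10:14, 0] = None
report = phase1_quality_report(df)
print(report, "-> Gate 1 passes" if all(report.values()) else "-> loop back to Phase 1")
```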