Towards a Reference Architecture for Machine Learning Operations

Mateo-Casalí, Miguel Ángel; Boza, Andrés; Fraile, Francisco

doi:10.3390/computers15040218

by Miguel Ángel Mateo-Casalí *, Andrés Boza and Francisco Fraile *

Research Centre on Production Management and Engineering (CIGIP), Universitat Politècnica de València (UPV), Camino de Vera S/N, 46022 Valencia, Spain

* Authors to whom correspondence should be addressed.

Computers 2026, 15(4), 218; https://doi.org/10.3390/computers15040218

Submission received: 19 February 2026 / Revised: 25 March 2026 / Accepted: 26 March 2026 / Published: 1 April 2026

(This article belongs to the Special Issue Machine Learning: Innovation, Implementation, and Impact)

Abstract

Industrial organisations increasingly rely on machine learning (ML) to improve quality, maintenance, and planning in Industry 4.0/5.0 ecosystems. However, turning experimental models into reliable services on the production floor remains complex due to the heterogeneity of operational technologies (OTs) and information technologies (ITs), including implementation constraints, latency in edge-fog-cloud scenarios, governance requirements, and continuous performance degradation caused by data drift. Although Machine Learning Operations (MLOps) provides lifecycle practices for deployment, monitoring, and retraining, the evidence is fragmented across tool-centric descriptions, case-specific pipelines, and conceptual architectures, offering limited guidance on which industrial constraints should inform architectural decisions and how to evaluate solutions. This work addresses that gap through a PRISMA-guided systematic review of 49 studies on industrial MLOps (with the search and screening primarily targeting Industry 4.0/IIoT operationalisation contexts, as reflected in the search strategy and corpus) and an evidence-based synthesis of principles, challenges, lifecycle practices, and enabling technologies. From this synthesis, industrial requirements are derived that encompass OT/IT integration, edge-fog-cloud orchestration, security and traceability, and observability-based lifecycle control. On this basis, a reference architecture is proposed that maps these requirements to functional layers, data and control flows, and verifiable responsibilities. To support reproducibility and practical inspectability, the article also presents an open-source architectural instantiation aligned with the proposed decomposition. 
Finally, the evaluation is illustrated through a predictive maintenance use case (tool breakage) in a single CNC machining cell, where the objective is to demonstrate end-to-end feasibility under realistic operational constraints rather than cross-scenario superiority or broad industrial generalisability.

Keywords:

machine learning operations (MLOps); Industry 4.0; Industry 5.0; edge computing; fog computing; cloud computing; reference architecture; ISO/IEC/IEEE 42010

1. Introduction

The integration of machine learning (ML) into industrial operations is becoming increasingly important as organisations seek to improve efficiency, reliability, and decision-making capabilities [1]. In many sectors, ML is already used to support production line optimisation, predictive maintenance, material handling and other operational activities [2]. However, the ability to extract sustained value from ML rarely depends solely on model accuracy. Industrial environments are dynamic and constrained: data distributions drift due to wear and tear, retrofits, calibration changes, and operating regimes; plant availability requirements restrict deployment windows; and the heterogeneity of legacy systems complicates integration between Information Technology (IT) and Operational Technology (OT) [3]. These realities make it difficult for many industrial ML initiatives to advance beyond pilot phases or to maintain performance once deployed, unless lifecycle governance, monitoring, and controlled change management are treated as top-level engineering concerns.

The adoption of Machine Learning Operations (MLOps) is particularly attractive in this context because it integrates principles of software engineering, DevOps, and data science into a unified operational discipline that reflects the specific characteristics of ML systems in production [1]. Beyond simplifying deployment, MLOps provides structured mechanisms for monitoring, retraining, staged promotion, and lifecycle management, which are critical for sustaining performance under changing operational conditions. Furthermore, industrial use cases are diverse and impose different operational objectives: production may prioritise throughput and defect reduction, while maintenance may focus on early fault detection and risk mitigation. Aligning these objectives with evolving business data and constraints requires repeatable processes and explicit roles, as structured process models improve an organisation’s ability to manage risk throughout the ML lifecycle [4]. In this sense, the use of MLOps in industrial environments cannot be decoupled from distributed computing paradigms. The integration of Cloud, Edge, and Fog architectures enables computing power to be shifted closer to the data source, addressing critical latency and resource-efficiency challenges that traditional centralised models cannot address in real-time scenarios [3].

In real industrial environments, operationalising ML also requires robust infrastructures and appropriate architectural decisions. Containerisation technologies such as Docker facilitate consistent execution environments and simplify dependency management [5], directly addressing reproducibility challenges that would otherwise hinder scaling across locations or platforms. They also enable more secure updates, as model versions and system parameters can be promoted in controlled stages and rolled back when necessary. At the same time, a continuous link between IT layers and OT nodes is essential for deploying ML in Industry 4.0 scenarios. Edge computing enables local decision-making when latency requirements or bandwidth constraints prevent reliance solely on cloud processing [6]. This locally focused approach is relevant for anomaly detection tasks close to the data source [7], where rapid responses can prevent defects or bottlenecks in production. Consequently, industrial MLOps must reconcile edge responsiveness with cloud-scale learning and lifecycle governance, ensuring that artefacts, configurations, and monitoring signals remain consistent across distributed deployments.
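The edge-versus-cloud trade-off described above can be made concrete with a small placement rule. The following is a minimal sketch, not a method taken from the reviewed studies: the names `LinkEstimate`, `cloud_latency_ms` and `choose_placement`, and the simple additive latency model, are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkEstimate:
    """Hypothetical description of the plant-to-cloud link."""
    round_trip_ms: float   # measured network round-trip time
    uplink_mbit_s: float   # usable uplink bandwidth


def cloud_latency_ms(payload_kb: float, link: LinkEstimate, cloud_infer_ms: float) -> float:
    """Estimate the end-to-end latency of one cloud inference call."""
    # kB * 8 gives kbit; 1 Mbit/s equals 1 kbit/ms, so the quotient is in ms.
    transfer_ms = (payload_kb * 8.0) / link.uplink_mbit_s
    return link.round_trip_ms + transfer_ms + cloud_infer_ms


def choose_placement(latency_budget_ms: float, payload_kb: float,
                     link: LinkEstimate, cloud_infer_ms: float) -> str:
    """Return 'cloud' if the estimated cloud path fits the budget, else 'edge'."""
    if cloud_latency_ms(payload_kb, link, cloud_infer_ms) <= latency_budget_ms:
        return "cloud"
    return "edge"
```

Under these assumptions, a 100 kB payload over a 10 Mbit/s uplink with a 40 ms round trip and 15 ms of cloud inference is estimated at 135 ms, so a 50 ms control-loop budget would push inference to the edge. A real deployment would also need to account for jitter, queueing and availability, which this sketch ignores.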

However, several limitations must be recognised. Practical implementations face both technical obstacles, such as drift detection under variable loads, privacy and data governance, and system resource constraints, as well as non-technical factors, including staff training needs and infrastructure licensing costs [8]. Although technical optimisations, such as hyperparameter tuning and containerised deployments, can improve predictive accuracy and ensure deployment consistency [5], the increasing complexity of operational datasets may continue to hinder generalisation beyond defined conditions. These difficulties are often compounded by the fragmentation of evidence in the literature: many studies describe conceptual tools or pipelines but provide little operational detail, little architectural justification, or insufficient criteria for evaluating whether a proposed stack is adequate in the face of industrial constraints such as integration with legacy systems, availability, security/privacy, and costs/operations. As a result, researchers and practitioners lack a consolidated, evidence-based reference framework that connects industrial requirements, architectural design, and implementation choices in a reusable way.

To address this need, this article combines a systematic review of industrial MLOps studies with the derivation of architectural requirements and patterns to inform reusable designs in Industry 4.0- and Industry 5.0-aligned environments. The following research questions guide the analysis:

RQ1: How has the use of ML evolved in industrial environments, and what role does MLOps play in relation to Industry 4.0/5.0?
RQ2: What MLOps principles and best practices are reported in the literature for industrial applications?
RQ3: What are the most frequently cited technical and organisational challenges when adopting MLOps in industrial plants, and how do they affect scalability, legacy system integration, security/privacy, and costs/operations?
RQ4: What architectures, platforms, and technological tools are used in industrial environments to implement MLOps?
RQ5: How is the lifecycle of ML models managed in production environments, from development to monitoring and retraining?
RQ6: What strategies and industrial use cases demonstrate the integration of MLOps with emerging technologies?

While the review strategy focuses on studies that explicitly address the implementation of ML/MLOps in Industry 4.0/IIoT environments, this paper uses the Industry 5.0 perspective to highlight emerging requirements that influence architectural design. These requirements include human-centred oversight, sustainability considerations and explainability, even though empirical evidence of MLOps continues to be predominantly collected under Industry 4.0 terminology. Consequently, references to Industry 5.0 are employed to inform discussions and requirements for future architecture, rather than to demonstrate symmetrical coverage of Industry 5.0-focused empirical studies in the reviewed corpus.

Contributions and Novelty

In response to these questions and the gaps identified, the article focuses on four contributions. First, it provides a PRISMA-guided systematic review of the industrial MLOps literature and consolidates the evidence base around contexts, principles, challenges, lifecycle practices, and enabling technologies. Second, it synthesises recurring industrial requirements and decision factors, focusing on OT/IT integration, edge-fog-cloud operation, governance/traceability, and monitoring-driven lifecycle control. Third, based on this synthesis, it proposes a reference architecture that structures MLOps capabilities from data ingestion and orchestration to deployment and monitoring, explicitly oriented towards industrial constraints and Industry 4.0/5.0 paradigms. Fourth, it presents a vendor-neutral, open-source implementation stack that instantiates the architecture for brownfield environments and makes the proposed evaluation pathway inspectable and reproducible. At the same time, the empirical section is intentionally scoped as a bounded single-cell feasibility demonstration rather than as a broad cross-scenario benchmark.

The novelty of this work does not lie in introducing MLOps as a new discipline in the lifecycle, but in providing a traceable bridge between industrial constraints (as summarised in the literature) and the views of the reference architecture and an implementable, vendor-neutral stack. Specifically, the article explicitly connects: (i) recurring OT/IT and distributed computing requirements; (ii) responsibilities and interfaces, as captured through the reference architecture structure; and (iii) operational evaluation criteria that go beyond predictive performance, accounting for implementation, monitoring and change management issues in realistic industrial environments.

In line with these research questions and contributions, the article is organised as follows. Section 2 presents the background on industrial MLOps, Industry 4.0/5.0, and reference architectures. Section 3 details the PRISMA-guided literature review methodology, including descriptive analysis, content analysis, and validity considerations. Section 4 presents the results of the systematic review, including the maturity of the evidence, the requirements taxonomy, lifecycle capability analysis, deployment patterns, and the resulting research gap. Section 5 introduces the proposed hybrid reference architecture and implementation stack. Section 6 presents the CNC single-cell implementation and feasibility-oriented validation, demonstrating end-to-end operationalisation under realistic industrial constraints. Finally, Section 7 discusses the findings in relation to the research questions, positions the proposal against the state of the art, and outlines limitations and future work.

2. Background

The integration of machine learning into industrial systems has evolved from isolated pilot projects to more complex production-level solutions that directly influence operational decisions. Initially, implementations relied on static (offline) datasets and controlled environments. These models had limited reusability, as real production environments present more volatile data patterns, changing workloads, and anomalies that are difficult to detect [9,10]. The turning point came when machine learning began to align with large-scale data platforms capable of supporting a high volume of continuous data from various physical and digital assets. This transformation created the conditions for real-time data analysis, enabling almost instantaneous insights. The technological infrastructure integrates Internet of Things (IoT) devices with big data architectures and machine learning models to achieve more adaptable, efficient, and human-centred operations [11]. This evolution is sustained not only by algorithmic advances but also by the maturity of the infrastructure orchestration layers associated with MLOps, which provide stability during iterative model updates.

Pioneering initiatives often experienced ‘proof of concept (PoC) fatigue’, in which models performed well in the laboratory but failed to translate into long-term operational benefits [12]. Many projects stalled in the prototype phase due to difficulties maintaining accuracy amid data changes, a lack of transparency in predictions, and insufficient oversight processes [7], which, in turn, required periodic modifications. In addition, the ability to explain decisions has become increasingly important, as stakeholders demanded confidence before integrating automated results into safety-critical processes [13,14]. In this context, operational reliability depends not only on the ML model but also on the surrounding lifecycle controls (data governance, oversight, traceability, and controlled deployment) that determine whether industrial stakeholders can trust, audit, and maintain ML-based decisions over time.

As production machinery and supply networks become increasingly complex, these limitations have led organisations to implement MLOps architectures. These architectures define clear stages: data collection; pre-processing, including feature selection; model development in isolated environments; staged validation; and incremental deployments accompanied by continuous performance monitoring [15,16]. The introduction of containerisation enabled portability between edge nodes and cloud nodes, improving resilience to environment-specific failures. Advances in lifecycle tracking tools, such as MLflow, introduced mechanisms for experiment reproducibility and version control, not only for models but also for training data subsets and hyperparameter configurations. This ensured that predictive behaviour could be analysed throughout the analytical process if quality degradation or deviation was subsequently detected [17]. Thanks to these tracking systems, it was possible to run parallel experiments using historical datasets while maintaining tight control over deployments. However, industrial environments posed limitations in terms of security and data location. Pre-processing operational signals close to the source not only reduced latency but also minimised transmission overheads, while addressing regulatory compliance issues associated with outsourced monitoring [18]. At the same time, distributed deployments introduce additional engineering concerns, such as artefact synchronisation, consistent observability across tiers, and controlled releases, which are less important in purely cloud-centric ML deployments but become critical when inference and data acquisition are combined with the time constraints of the production plant.
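The traceability idea above, where a run is tied to its training data subset and hyperparameter configuration, can be illustrated without any specific tracking tool. The sketch below uses only the Python standard library; it does not reproduce the MLflow API, and the function name `fingerprint_run` and the record layout are hypothetical.

```python
import hashlib
import json


def fingerprint_run(training_rows, hyperparameters, code_version):
    """Return a record tying a run to its data subset, settings and code.

    Assumes rows and hyperparameters are JSON-serialisable; row order is
    treated as significant, so a reordered subset gets a different digest.
    """
    data_digest = hashlib.sha256()
    for row in training_rows:
        # sort_keys makes the digest independent of dict key order.
        data_digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    data_sha256 = data_digest.hexdigest()

    params_json = json.dumps(hyperparameters, sort_keys=True)
    combined = (data_sha256 + params_json + code_version).encode("utf-8")
    run_id = hashlib.sha256(combined).hexdigest()[:12]

    return {
        "run_id": run_id,
        "data_sha256": data_sha256,
        "hyperparameters": dict(hyperparameters),
        "code_version": code_version,
    }
```

Because the identifier is derived from the inputs, two runs with identical data, settings and code version receive the same `run_id`, while any change to one of them yields a different one. This is one possible way to make later degradation analysis traceable to a specific configuration.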

The evolution was also influenced by engineering cultures such as DevOps. Practices such as continuous integration/continuous deployment (CI/CD), agile iteration on software artefacts, and cross-team collaboration proved beneficial in MLOps contexts [19]. Applying these principles helped solve a common problem: poor coordination between process experts and ML engineers managing complex workflows. What can be observed in this trajectory is an interaction between technical enablers (containerised deployments, orchestrated workflows, traceable experiments) and organisational shifts towards cultures of continuous improvement [20]. Unlike previous one-off improvements, where new models replaced old ones infrequently, the current implementation of ML favours more frequent updates, validated through testing or phased deployments before full adoption [21]. This approach reduces deployment problems while improving prediction accuracy and system responsiveness.

Automation is central to the definition of MLOps. Continuous integration and deployment serve as pillars that enable models to be quickly integrated into production environments and updated with minimal disruption [22]. This automation is most effective when combined with containerisation technologies such as Docker or orchestration platforms such as Kubernetes, as these not only enable consistent execution but also simplify dependency management [5]. In addition, these automation workflows are often complemented by tools such as MLflow, Airflow, or TensorFlow, which provide logging, tracking, and experiment management for machine learning-centric operations [23]. The scope also includes lifecycle management capabilities in AI/machine learning contexts. Some platforms offer complete environments that bundle multiple libraries for tasks ranging from pre-processing to deployment. Others specialise in coordinating distributed training jobs. Model registries are an essential component: they store trained artefacts along with metadata for reproducibility, audit trails for compliance, and rollback points in case updated models degrade system performance [8,24].
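The three registry responsibilities named above (artefacts with metadata, an audit trail, and rollback points) can be sketched as an in-memory structure. This is an illustrative assumption rather than the interface of any particular registry product; `ModelRegistry`, `ModelVersion` and their methods are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ModelVersion:
    version: int
    artefact_uri: str
    metadata: Dict[str, str]


@dataclass
class ModelRegistry:
    versions: List[ModelVersion] = field(default_factory=list)
    production_history: List[int] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def register(self, artefact_uri: str, metadata: Dict[str, str]) -> int:
        """Store an artefact reference with its metadata; return the new version."""
        version = len(self.versions) + 1
        self.versions.append(ModelVersion(version, artefact_uri, dict(metadata)))
        self.audit_log.append(f"register v{version}")
        return version

    def promote(self, version: int) -> None:
        """Mark a registered version as the current production model."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        self.production_history.append(version)
        self.audit_log.append(f"promote v{version}")

    def rollback(self) -> int:
        """Restore the previously promoted version and return its number."""
        if len(self.production_history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        demoted = self.production_history.pop()
        restored = self.production_history[-1]
        self.audit_log.append(f"rollback v{demoted} -> v{restored}")
        return restored

    @property
    def production(self) -> Optional[int]:
        """Current production version, or None if nothing was promoted."""
        return self.production_history[-1] if self.production_history else None
```

Keeping the promotion history as a list is a deliberate simplification: it makes the rollback point explicit and keeps every state change visible in `audit_log`. A production registry would persist this state and restrict who may promote.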

From a tooling perspective, MLOps relies on a broad ecosystem: knowledge spaces such as GitLab Wiki for sharing documentation; repository platforms such as GitHub or GitLab for version control; build tools such as Maven to ensure reproducible environments; continuous integration servers such as Jenkins for running automated tests; deployment automation frameworks such as Spinnaker; container orchestrators such as Kubernetes; AI workflow managers such as Apache Airflow; and monitoring solutions such as Prometheus and Grafana, as well as communication channels that facilitate coordination [22,23]. In essence, MLOps combines technical enablement and organisational synchronisation: it aligns people, processes and platforms to continuously deliver adaptable ML solutions in environments with high downtime costs.

MLOps applications span a wide range of industrial sectors, and each domain has its own operational constraints, data characteristics, and performance requirements.

Predictive maintenance scenarios: MLOps coordinates the lifecycle of time-series models that predict machinery failure points using vibration signals, acoustic patterns, sensor readings, and historical maintenance records. Synchronising IoT sensor inputs through data pipelines ensures integrity before these datasets trigger automated training jobs [25,26]. The models are then deployed as online services, periodically queried by plant control systems, or as components integrated into the machine controllers themselves, depending on the latency tolerances. One of the advantages here is maintaining synchronisation between a system’s changing conditions and its digital representation, especially when supported by Digital Twin infrastructures that extract real-time data streams to recalibrate model parameters [27]. By structuring this within an MLOps framework, organisations can avoid situations in which models silently become obsolete and then fail without any prior operational signal.
Logistics and supply chain: This sector offers a different context in which MLOps provides measurable strategic value. In this case, the complexity stems not so much from latency constraints as from handling fluctuating demand signals, transport delays, and multi-level supplier dependencies [28]. Predictive models can incorporate data streams ranging from sales forecasts to weather predictions to anticipate potential bottlenecks. The implementation of these models in decision-support layers requires strict governance to ensure that only validated versions with known performance limits are promoted to real-time route optimisation systems [7,29].
Quality control: This is another manufacturing-focused area where MLOps adoption is accelerating. Automated feedback loops allow thresholds or classifier weights to be recalibrated without manual intervention when processes change batch properties or raw material composition. This responsiveness depends on containerised deployments that preserve execution consistency across edge devices distributed across multiple facilities [5].
Advanced robotics: Especially in Industry 5.0 contexts involving human–machine collaboration and the use of cobots, the deployment of machine learning-based control architectures under an MLOps regime adds reliability to autonomy functions while enabling the explainability mechanisms required by safety regulators [25,26]. Robotic arms equipped with force-sensitive grippers can leverage reinforcement learning policies refined offline and continuously evaluated in actual operation via a monitoring interface that provides feedback on reward function fulfilment. If environmental changes degrade success rates beyond acceptable levels, pipelines automatically activate restricted retraining routines with updated real-world simulation datasets.
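Several of the scenarios above rely on a monitoring signal that triggers retraining once performance falls below an acceptable level. A minimal sketch of such a trigger follows, assuming a sliding window of binary success outcomes; the class name `SuccessRateMonitor`, the window size and the threshold are hypothetical and would need to be set per use case.

```python
from collections import deque


class SuccessRateMonitor:
    """Sliding-window check that signals when retraining should be triggered."""

    def __init__(self, window: int, min_success_rate: float):
        self.outcomes = deque(maxlen=window)  # keeps only the latest outcomes
        self.min_success_rate = min_success_rate

    def record(self, success: bool) -> bool:
        """Store one outcome; return True when the rate is below the minimum."""
        self.outcomes.append(bool(success))
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet to judge the model
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.min_success_rate
```

Waiting for a full window before raising a trigger avoids retraining on a handful of early failures. In practice the returned flag would start a controlled pipeline run rather than replace the deployed model directly.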

The common denominator across these sectors is the importance of deployment architectures, which must be reliable and capable of supporting operational loads, while taking into account domain-specific constraints such as regulatory compliance or safety tolerances [30]. Variations arise primarily in how feedback loops are structured: latency-sensitive contexts prioritise edge inference with local retraining capabilities; global supply chain optimisations favour cloud-centric batch retraining based on longer-term historical trends. It is important to note that while core MLOps practices provide a common structure, their parameterisation reflects sector-dependent trade-offs between accuracy, stability, speed of adaptation, stakeholder-mandated transparency levels, and infrastructure cost optimisation goals [1]. By aligning these sector-specific demands with a mature MLOps workflow that structures experiment tracking alongside deployment orchestration and active monitoring, industrial operators transform ML initiatives from isolated prototypes into sustained operational assets, directly integrated into the fabric of their core activities.

The transition from Industry 4.0 to Industry 5.0 marks a conceptual shift in how artificial intelligence is integrated into complex manufacturing ecosystems. While Industry 4.0 drove the digitisation of production through cyber-physical systems, advanced robotics, IoT connectivity, and large-scale data analytics, it logically required an operational backbone capable of supporting the machine learning components embedded in these infrastructures. MLOps emerges as a central pillar, enabling repeatable, transparent, and adaptable ML workflows that align with Industry 4.0’s demand for automated decision-making and real-time responsiveness [31].

The transition to Industry 5.0 adds additional layers of complexity to this paradigm, as it also includes cooperation between humans and machines, sustainability goals, and resilience to disruptions [32]. Industry 5.0 introduces a more balanced interaction in which humans remain the central actors in decision loops, leveraging the advantages of machines without relinquishing oversight. Within this paradigm shift, MLOps must evolve from a technical enabler into a coordination layer that mediates between algorithmic precision and human interpretability. For example, contextual explanations tailored to the operator’s knowledge enable faster override actions when real-time ML model decisions clash with situational awareness on the production floor [25]. From an operational standpoint, this means that workflows must manage two kinds of inference destination: some geared toward autonomous action at the edge, as is common in Industry 4.0 architectures, and others designed for collaborative analysis sessions where humans review results before committing [33]. This duality forces MLOps architectures to be flexible: containerised deployments can be sent directly to embedded controllers for low-latency environments, while parallel instances run on centralised dashboards with explanation modules for operators to review [5]. Consistent management of both paths ensures that AI-assisted decisions are reliable, whether automated or subject to human approval. In addition, automated resource utilisation monitoring integrated into CI/CD enables teams to adjust architectural choices, thereby reducing the environmental footprint without significant performance compromises [29,34].
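The two inference paths described above, autonomous action and human review, imply some routing rule between them. The sketch below shows one possible rule; the criteria (a confidence threshold and a safety-critical flag), the default value of 0.95, and the names `Prediction` and `route` are assumptions made here and are not prescribed by the reviewed literature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    label: str
    confidence: float        # model confidence in [0, 1]
    safety_critical: bool    # whether acting on it affects a safety function


def route(prediction: Prediction, auto_threshold: float = 0.95) -> str:
    """Return the inference destination for one prediction."""
    # Safety-critical or low-confidence results always go to an operator.
    if prediction.safety_critical or prediction.confidence < auto_threshold:
        return "human_review"
    return "autonomous_edge"
```

For example, `route(Prediction("tool_ok", 0.99, False))` is sent to the autonomous path, whereas the same prediction flagged as safety-critical is held for operator approval.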

MLOps also naturally aligns with the sustainability goals embedded in Industry 5.0 thinking, as it introduces lifecycle assessment tools capable of quantifying energy consumption and carbon impact during the training and inference phases [35]. From a collaboration perspective, the move towards Industry 5.0 underscores the need for alignment among roles within MLOps teams, as addressing the quality of human–machine interaction requires interdisciplinary input [19]. The introduction of business analysts ensures that model outputs serve operational purposes; domain engineers provide contextual constraints; data scientists tune algorithms under those constraints; and MLOps engineers ensure that these adjustments translate smoothly into production deployments without disrupting ongoing work cycles [36]. Agile methodologies, such as Scrum, already provide a framework for iterative development that resonates with the industrial shift towards rapid responsiveness while maintaining reliability standards [1]. Through limited-scope sprints that incorporate feedback from human operators who interact directly with AI and ML-driven equipment, MLOps processes become more sensitive to practical nuances that are often overlooked in purely automated loops.

In both Industry 4.0 efficiency-driven strategies and Industry 5.0 collaborative environments, poor input data creates systemic vulnerabilities regardless of model performance. Another point of convergence lies in deployment strategies, which have been extensively explored in Industry 4.0 but are gaining renewed importance within Industry 5.0’s focus on resilience. Hybrid workloads between the edge and the cloud allow critical functions to persist even when connectivity is low, while less urgent analyses are aggregated centrally for broader optimisation objectives [5]. Standardising these implementations through common serialisation formats further ensures version compatibility across mixed hardware fleets deployed across multiple sites.
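One common way to let edge functions persist through connectivity loss is a store-and-forward buffer: records are queued locally and flushed to the central tier when the link returns. The following is a minimal sketch under that assumption; `StoreAndForwardBuffer` and the `send` callable are hypothetical, and the sketch assumes the uplink reports failure by raising `ConnectionError`.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Queue records at the edge and forward them when the uplink works."""

    def __init__(self, capacity: int):
        # With maxlen set, the oldest record is dropped once the buffer is full.
        self.pending = deque(maxlen=capacity)

    def submit(self, record, send) -> bool:
        """Queue a record, then try to flush; return True if nothing is pending."""
        self.pending.append(record)
        return self.flush(send)

    def flush(self, send) -> bool:
        """Send queued records in order; stop at the first connection failure."""
        while self.pending:
            try:
                send(self.pending[0])
            except ConnectionError:
                return False  # keep the record for the next attempt
            self.pending.popleft()
        return True
```

A record is removed only after `send` returns without error, so an interrupted flush does not lose data, although it may resend a record if the failure occurred after delivery. Dropping the oldest entries when full is a design choice that favours recent data; a use case that must retain every record would need persistent storage instead.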

Reference architectures offer an additional way of organising these practices and technologies into reusable design guidelines. In industrial and software systems, a reference architecture is commonly understood as an abstract, generalised blueprint that captures the key responsibilities, interfaces, and trade-offs among quality attributes for a class of systems. It serves as a reusable template rather than a design for a single, specific system. It is important to note that a reference architecture is not a reference implementation; it does not prescribe a specific implementation, but rather provides a structured description, typically in the form of viewpoints, that helps stakeholders to reason consistently about design decisions, constraints, and evaluation criteria. In the context of industrial MLOps, a reference architecture can therefore act as a bridge between industrial requirements (such as OT/IT integration, latency, availability, and governance) and specific implementation stacks and operational processes that must meet these requirements within the constraints of real-world implementation.

This paper frames its contribution from the perspective of ISO/IEC/IEEE 42010 [37], which emphasises aligning architecture descriptions with stakeholder concerns through explicit viewpoints. This is particularly relevant for industrial MLOps because key concerns span multiple stakeholder groups, including operations, maintenance, IT/OT security, and data science. In addition, the system-of-systems nature of edge–fog–cloud deployments makes it difficult to assess architectural adequacy solely from isolated component descriptions. Accordingly, the contribution of this paper is not a single deployment design, but a reusable architectural blueprint that links industrial requirements, stakeholder concerns, architectural responsibilities, and evaluation criteria in a consistent and inspectable manner.

To avoid overstating the evidence base, it is important to distinguish between (i) what the reviewed literature empirically demonstrates, often reported under Industry 4.0/IIoT terminology, and (ii) the forward-looking requirements associated with Industry 5.0 (e.g., explainability, sustainability and human-centric oversight), which motivate architectural design choices even when the empirical MLOps corpus remains uneven. This distinction ensures a consistent scope interpretation is maintained throughout the background, methodology, and later discussion sections.

3. Literature Review Methodology

To ensure a rigorous process, the literature review was conducted following PRISMA guidelines. The objective was to identify, select, and critically analyse studies on the adoption of MLOps in industrial environments, focusing on ML lifecycle operational practices (e.g., deployment, monitoring, orchestration, CI/CD, and drift management) within Industry 4.0/IIoT contexts.

3.1. Review Design and Descriptive Scope

The review was designed as a systematic review based on a structured search of reference bibliographic indexes, complemented by (i) descriptive analysis of the resulting corpus and (ii) content analysis supported by standardisation and keyword coding. The search and selection language was English. The last queries were run on 28 August 2025.

No explicit time filter was applied. Nevertheless, the retrieved records are concentrated between 2022 and 2025, reflecting the emergence of the term “MLOps” as an explicit keyword in indexed publications. This methodological decision limits the scope of the analysis to works that self-identify their approach as MLOps (or a direct synonym), which favours comparability but introduces a potential topicality bias, discussed as a validity threat in Section 3.5. In addition, this scope choice implies that relevant industrial operationalisation studies that apply equivalent practices under adjacent labels (e.g., AIOps, DevOps/DataOps for ML) may be under-represented in the final corpus; this is therefore treated explicitly as a selection-bias threat in Section 3.5.

3.2. Bibliographic Sources, Search Strategy and Restrictions

The search was conducted in two international databases widely used in engineering, systems and operations: Scopus and Web of Science (WoS). The strategy was defined to locate works that simultaneously: (i) used the term MLOps or its direct synonym, (ii) were situated in an industrial/Industry 4.0/IIoT context, and (iii) incorporated concepts associated with the operational life cycle of ML.

In Scopus, the TITLE-ABS-KEY fields (title, abstract and keywords) were used, employing the following equation:

TITLE-ABS-KEY ((“MLOps” OR “Machine Learning Operations”) AND (“Industrial” OR Manufactur* OR “Industry 4.0” OR “IIoT” OR “smart factory” OR “Factory” OR “production Line” OR “process industr*”) AND (“Deployment” OR “Model Serving” OR “Monitoring” OR “Model Registry” OR “Feature store” OR “CI/CD” OR “Orchestration” OR “concept Drift” OR “Data Drift” OR “Governance” OR “Lifecycle”)).

An equivalent strategy was applied in Web of Science, adapting the syntax and fields to the database’s own search engine.
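The three-block structure of the query (anchor term, industrial context, lifecycle construct) can be made explicit by assembling the string programmatically. The sketch below is illustrative only: the helper names are hypothetical, straight quotes replace the typographic ones, and the `TS=` wrapper shown for Web of Science is an assumption about the adapted syntax, which the article does not report.

```python
# Term blocks copied from the Scopus equation reported above.
ANCHOR = ['"MLOps"', '"Machine Learning Operations"']
CONTEXT = ['"Industrial"', 'Manufactur*', '"Industry 4.0"', '"IIoT"',
           '"smart factory"', '"Factory"', '"production Line"',
           '"process industr*"']
LIFECYCLE = ['"Deployment"', '"Model Serving"', '"Monitoring"',
             '"Model Registry"', '"Feature store"', '"CI/CD"',
             '"Orchestration"', '"concept Drift"', '"Data Drift"',
             '"Governance"', '"Lifecycle"']


def or_block(terms):
    """Join alternative terms into one parenthesised OR group."""
    return "(" + " OR ".join(terms) + ")"


def build_query(field_prefix, blocks):
    """AND the OR groups together and wrap them in a field tag."""
    body = " AND ".join(or_block(block) for block in blocks)
    return f"{field_prefix}({body})"


blocks = [ANCHOR, CONTEXT, LIFECYCLE]
scopus_query = build_query("TITLE-ABS-KEY ", blocks)
# Assumed Web of Science adaptation (topic field); not taken from the article.
wos_query = build_query("TS=", blocks)

print(scopus_query)
```

Keeping the three blocks as separate lists makes it easy to see that a record must match at least one term from each block, which is the conjunction described in the search strategy.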

The search strategy prioritised ‘MLOps’ (or a direct synonym) as the anchor term to ensure that retrieved works were explicitly positioned within the MLOps construct, thereby improving comparability across the corpus. Consequently, broader umbrella terms such as ‘DevOps’ or the generic ‘CI/CD’ were not used as primary anchors in isolation, as these would substantially broaden the search space towards general software delivery studies that do not address the operationalisation of the ML lifecycle in industrial contexts. Instead, the operational intent associated with DevOps/CI/CD was captured by including concrete ML lifecycle constructs in the query (e.g., deployment, monitoring, registry, orchestration, and drift). While this reduces noise, it can also exclude adjacent industrial evidence that uses alternative terminology. This limitation is therefore explicitly acknowledged as a selection bias threat (Section 3.5).

The selection process was carried out in two stages (Figure 1). In the first stage, the Scopus query returned 64 records and the WoS query returned 45. Due to restrictions on access to full text, 50 articles from Scopus and 38 from WoS were retrieved for screening and reading. Duplicates were then removed across both sources, leaving 61 unique articles. In the second stage, a relevance screening based on title, abstract, and keywords was applied, discarding papers that, although they matched the search terms, did not explicitly address the integration of MLOps (or ML lifecycle management in MLOps) in industrial contexts. The final sample consisted of 49 articles, which were analysed in depth (Table 1).
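The record flow can be checked with simple arithmetic. In the sketch below, the input counts are those reported in the text; the three derived quantities (records without full-text access, duplicates removed, and records excluded at relevance screening) are computed from them and are not figures stated explicitly in the article.

```python
# Counts reported in the selection process (Figure 1).
identified = {"Scopus": 64, "WoS": 45}
retrieved = {"Scopus": 50, "WoS": 38}   # full text accessible
unique_after_dedup = 61
final_sample = 49

# Derived quantities (computed here, not quoted from the article).
not_retrieved = {db: identified[db] - retrieved[db] for db in identified}
duplicates_removed = sum(retrieved.values()) - unique_after_dedup
excluded_at_screening = unique_after_dedup - final_sample

print(not_retrieved)          # {'Scopus': 14, 'WoS': 7}
print(duplicates_removed)     # 27
print(excluded_at_screening)  # 12
```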

To ensure methodological consistency and alignment with the study objective, inclusion and exclusion criteria were established to guide the selection and refinement process of the corpus. These criteria enabled systematic filtering of the works identified in the initial searches, ensuring that the studies ultimately analysed substantively addressed the operationalisation of the machine learning lifecycle in industrial environments. In terms of inclusion criteria, studies that explicitly mentioned MLOps or its direct synonym, “Machine Learning Operations,” or that explicitly addressed the operational management of the ML lifecycle within a framework equivalent to MLOps were considered. This includes aspects such as deployment, monitoring, retraining, CI/CD or CT, and pipeline and service orchestration. Finally, inclusion was restricted to publications in English with full-text access to allow for detailed and consistent analysis.

Regarding exclusion criteria, works outside the industrial context and studies in which MLOps appeared only tangentially, without providing relevant content on lifecycle operation, were discarded. Similarly, conceptual articles that did not present practices, architectures, tools, or processes that could be analysed in relation to the research questions were eliminated.

3.3. Descriptive Analysis

The descriptive analysis characterises the evolution over time, the type of contribution and the authorship patterns. In the final set (49 articles), 7 papers were identified in 2022, 10 in 2023, 22 in 2024, and 10 in 2025, suggesting a recent acceleration of academic and industrial interest in MLOps in Industry 4.0/IIoT contexts (Figure 2). In terms of document type, conference contributions predominate (32, ≈65%), followed by journal articles (16, ≈33%) and a book chapter (1, ≈2%). This distribution is consistent with an emerging field that is still consolidating, in which conferences serve as the primary channel for initial dissemination.

Scientific collaboration was also analysed based on the number of authors per article (Figure 3). The average observed is 4.71 authors per paper, with a concentration between 3 and 4 authors and the presence of extended collaborations, indicating an active and expanding community.

3.4. Content Analysis

Content analysis was performed on the final sample of 49 records by normalising keywords (e.g., ml ops → mlops; industrial internet of things → iiot; CI/CD variants → ci/cd) and assigning them to six categories aligned with the structure of the article: Background/Context, MLOps Principles, Challenges, Technological Foundations, Lifecycle Management, and Future Directions. To prevent the repetition of terms within the same article from skewing the results, two indicators were combined: (i) the total frequency of occurrences and (ii) coverage per article, understood as the percentage of articles that include ≥1 keyword associated with each category.
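The difference between the two indicators can be illustrated with a small sketch. The normalisation map, the category assignment and the toy keyword lists below are hypothetical and serve only to show how repeated terms inflate frequency but not coverage; they do not reproduce the coding scheme or the data used in the review.

```python
from collections import Counter

# Hypothetical normalisation map and category assignment (illustrative only).
NORMALISE = {
    "ml ops": "mlops",
    "industrial internet of things": "iiot",
}
CATEGORY = {
    "mlops": "MLOps Principles",
    "devops": "MLOps Principles",
    "iiot": "Background/Context",
    "industry 4.0": "Background/Context",
    "concept drift": "Lifecycle Management",
    "edge computing": "Future Directions",
}

# Toy corpus: one keyword list per article (not the reviewed studies).
articles = [
    ["ML Ops", "DevOps", "Industrial Internet of Things"],
    ["MLOps", "concept drift", "mlops"],
    ["Industry 4.0", "edge computing"],
    ["machine vision"],   # no categorised keyword
]


def normalise(keyword):
    """Lower-case, trim and map spelling variants to a canonical form."""
    key = keyword.strip().lower()
    return NORMALISE.get(key, key)


frequency = Counter()   # (i) total occurrences per category
coverage = Counter()    # (ii) number of articles with >= 1 keyword per category
for keywords in articles:
    cats = [CATEGORY[k] for k in map(normalise, keywords) if k in CATEGORY]
    frequency.update(cats)
    coverage.update(set(cats))   # each article counts once per category

coverage_pct = {c: round(100 * n / len(articles), 1) for c, n in coverage.items()}
print(dict(frequency))
print(coverage_pct)
```

In this toy corpus, "MLOps Principles" occurs four times but is covered by only two of the four articles, which is the kind of distortion the coverage indicator is meant to control.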

The most frequent keywords are “mlops” and “machine learning,” accompanied by terms that reflect ML operations practice (DevOps, pipelines) and by markers of the industrial context (Industry 4.0, IIoT, big data), which is consistent with the study’s objective (Figure 4).

In terms of keyword-based category coverage, the resulting distribution is Background/Context 38.8%, Future Directions 18.4%, MLOps Principles 77.6%, Lifecycle 12.2% and Challenges 10.2%; the presence of Technological Foundations is residual, as tool names are rarely recorded as keywords, even when used in implementations (Figure 5). This is interpreted as an indexing effect: many papers discuss challenges and tools in the body of the text, but do not necessarily declare them as keywords.

The keyword co-occurrence network shown in Figure 6, created with VOSviewer, complements this reading. In the figure, the size of the nodes corresponds to the frequency of the terms, the thickness of the edges to co-occurrence, and the colour identifies the communities. From this, five clusters consistent with our taxonomy can be distinguished:

Core (mlops-machine learning), which is linked to the rest of the concepts and explains their centrality.
Operational cluster (devops, dataops, pipeline, with lighter CI/CD), which supports coverage in Principles (20.5%) by reflecting engineering practices focused on traceability and repeatability.
Industrial/IIoT cluster (Industry 4.0, IIoT, manufacturing, big data, with connections to edge computing), responsible for the highest percentage in Background/Context (34.1%) due to its systematic co-occurrence with the core.
Lifecycle and quality cluster, visible in deployment and with weaker links to monitoring and concept drift, in line with moderate coverage of Lifecycle (11.4%) and the fact that some maintenance practices (retraining, model registry, validation/testing) are not explicitly labelled as keywords.
Future directions cluster (edge computing, federated learning, digital twin, Industry 5.0, big data), which justifies the high coverage in Future (22.7%) and points to distributed architectures and collaborative learning that preserve local data.

This bibliometric analysis is used solely to support structural interpretation (communities of terms) and not as causal evidence. Figure 6 suggests an “mlops–machine learning” core connected to operational (devops/dataops/pipeline), industrial (Industry 4.0/IIoT/manufacturing), and trend (edge, federated learning, digital twin, Industry 5.0) clusters, reinforcing the qualitative reading and the adopted taxonomy.
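The quantities encoded in such a network (node size as term frequency, edge thickness as pairwise co-occurrence) can be computed directly from keyword sets. The following is a minimal sketch on a toy corpus; it does not reproduce the normalisation or clustering performed by VOSviewer, and the keyword sets are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy keyword sets, one per article (already normalised; not the real corpus).
articles = [
    {"mlops", "machine learning", "devops"},
    {"mlops", "machine learning", "iiot"},
    {"mlops", "edge computing", "federated learning"},
    {"mlops", "machine learning", "industry 4.0", "iiot"},
]

node_size = Counter()    # term frequency -> node size
edge_weight = Counter()  # pair co-occurrence -> edge thickness
for keywords in articles:
    node_size.update(keywords)
    # sorted() gives each unordered pair a single canonical key
    edge_weight.update(combinations(sorted(keywords), 2))

strongest_pair, weight = edge_weight.most_common(1)[0]
print(strongest_pair, weight)
```

A term that co-occurs with most others, as "mlops" does here, ends up at the centre of the network, which is why co-occurrence is read as structure rather than as causal evidence.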

3.5. Quality Assessment, Risk of Bias and Threats to Validity

Given that the literature on industrial MLOps combines conceptual proposals, partial implementations and cases with varying levels of evidence, a quality assessment was defined to characterise the methodological maturity of the research and thus avoid treating all sources as equivalent. Each study was evaluated using a checklist of five criteria, scored from 0 to 2:

Clarity of the industrial context and problem formulation.
Study design and evaluation method (e.g., case study, experiment, deployment, analytical evaluation).
Adequacy of data and infrastructure description (sources, pre-processing, execution environment).
Operational detail and lifecycle governance (deployment, observability, retraining, rollback).
Replicability and transparency (code/artefacts, parameters, sufficient procedures).

The resulting quality scores were aggregated and used as a weighting signal for subsequent interpretation of the corpus. This makes it possible to distinguish patterns supported by empirical deployments from those supported mainly by conceptual or partial-implementation evidence, without turning the quality checklist into a strict exclusion mechanism. In this way, studies with stronger operational detail, evaluation evidence, and replicability contribute more to the interpretation of architectural patterns and requirement salience. At the same time, conceptual proposals remain visible without being over-weighted.
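One simple way to turn the checklist into a weighting signal is to normalise the summed scores to the interval [0, 1]. The sketch below assumes this aggregation rule, which the article does not specify; the criterion identifiers and the two example studies are hypothetical.

```python
# Hypothetical identifiers for the five checklist criteria.
CRITERIA = ("context", "design", "data_infrastructure", "operations",
            "replicability")
MAX_TOTAL = 2 * len(CRITERIA)   # five criteria, each scored 0-2


def quality_weight(scores):
    """Aggregate the five 0-2 scores into a weight in [0, 1] (assumed rule)."""
    if set(scores) != set(CRITERIA):
        raise ValueError("exactly one score per criterion is required")
    if any(scores[c] not in (0, 1, 2) for c in CRITERIA):
        raise ValueError("scores must be 0, 1 or 2")
    return sum(scores[c] for c in CRITERIA) / MAX_TOTAL


# Hypothetical studies: one empirical deployment, one conceptual proposal.
studies = {
    "deployment_study": {"context": 2, "design": 2, "data_infrastructure": 2,
                         "operations": 2, "replicability": 1},
    "conceptual_study": {"context": 2, "design": 1, "data_infrastructure": 0,
                         "operations": 0, "replicability": 0},
}

weights = {name: quality_weight(scores) for name, scores in studies.items()}
print(weights)   # {'deployment_study': 0.9, 'conceptual_study': 0.3}
```

Under this rule, no study is excluded: the conceptual proposal still contributes, but with a third of the weight of the deployment study.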

The risk of bias was considered along three principal axes: (i) publication bias, due to the predominance of conference literature in an emerging field; (ii) selection bias, by explicitly privileging the construct “MLOps” over adjacent labels such as AIOps or DevOps/DataOps applied to ML; and (iii) coverage bias, by restricting the search to English-language records indexed in the selected bibliographic sources. Additionally, a terminology bias may occur because industrial practices equivalent to MLOps are sometimes reported under alternative labels, such as AIOps, ML lifecycle management, or platform-specific “production ML” practices. This can reduce recall even when concrete lifecycle terms are included in the search string. This limitation is therefore taken into account when interpreting gaps, such as the lack of evidence for closed-loop retraining or reproducible implementations.

Finally, threats to validity are analysed in terms of internal, construct, conclusion, and external validity. Internal validity may be affected by inconsistencies in the extraction and coding of evidence between studies; to mitigate this, an explicit category scheme and systematic standardisation of terms were applied. Construct validity depends on how categories such as “governance”, “observability”, or “edge–fog–cloud parity” are operationalised; therefore, labels were anchored in widely accepted technical terms (CI/CD, drift, serving, monitoring, registry). Conclusion validity is limited by the largely observational nature of the corpus and by the fact that keyword co-occurrence does not imply causal relationships; consequently, the results are interpreted as patterns and trends rather than as effect estimates. External validity is limited by the fact that the experimental validation reported later in the manuscript is restricted to a single CNC machining cell. Accordingly, the architectural and technological sections should be interpreted as reference tools for assessing requirements coverage and transferability. In contrast, generalisation across devices, sites, and industrial scenarios is treated as a limitation and a direction for future work rather than as an established property of the evaluation.

4. Results of the Systematic Review: State of the Art Analysis

The analysis of the 49 selected primary studies (2022–2025) allows us to trace the evolution of MLOps from a set of emerging practices to an established engineering discipline. Unlike previous reviews that list tools, this section assesses the maturity of the evidence (Section 4.1), synthesises critical industrial requirements (Section 4.2), analyses actual lifecycle coverage (Section 4.3), identifies dominant architectural patterns and their limitations (Section 4.4), and, finally, identifies the gaps found (Section 4.5). Throughout Section 4, frequency-based evidence is used as a proxy for salience in the corpus under review. It is not interpreted as a direct measure of real-world importance, but rather as an indicator of the areas in which challenges, solutions and architectural focus are currently concentrated in the literature.

4.1. Assessment of the Quality and Maturity of the Evidence

The application of quality criteria shows that, although a large proportion of the studies manage to define the industrial context and the problem to be solved, the replicability and availability of technical artefacts remain the main structural deficit in the field. A detailed analysis reveals that only a minority of the works, notably those by von Hahn and Mechefske, Manickam, Sood, and Wewer [12,38,39,40], provide open access to the source code, configuration scripts, or datasets used. This opacity limits the scientific community’s ability to verify results or build on previous findings, reinforcing the perception of “proof-of-concept fatigue,” in which proposed architectures rarely progress beyond the conceptual design or simulated validation phase (Table 2).

From an operational maturity perspective, we observe a clear difference in the reviewed literature. On the one hand, there is a predominance of conceptual-level studies, represented by works such as those by Raffin, Kreuzberger, or Bodor [8,22,41], which are limited to proposing reference frameworks, taxonomies, or theoretical reviews without providing a verifiable physical implementation. On the other hand, we identify a subset of research that reaches production level, reporting plant deployments with continuous operation metrics, such as the wind energy prediction system by Oyucu and Aksöz [5] or the virtual sensor for biofermentation by Metcalfe [21]. These latter works provide a more solid evidence base, demonstrating that the viability of MLOps depends on the effective integration of closed-loop feedback cycles. However, because such deployments represent only a small share of the corpus, closed-loop MLOps cannot yet be considered widely mature in industry. This heterogeneity justifies using quality assessment as a weighting signal in the synthesis. Claims derived from the corpus are interpreted based on whether the supporting works report empirical deployments, partial implementations or conceptual proposals.

4.2. Requirements Taxonomy for Industrial MLOps

Through thematic coding of the challenges and solutions reported in the corpus, twelve functional and non-functional requirements were identified that capture the practical demands of Industry 4.0/5.0. Weighted frequency analysis makes it possible to move beyond a generic enumeration of characteristics and to highlight the technical friction points that current architectures must resolve as a priority. Here, ‘weighted frequency’ is used to prioritise recurring requirements within the reviewed corpus and to distinguish frequently reported friction points from those reported less frequently but potentially more strategic (e.g., sustainability and explainability). The need for Adaptability and Drift Management (R01) stands out above the rest, as it is the technical challenge cited most frequently in the literature [21,30,41,42]. Unlike traditional software, industrial models suffer continuous degradation not because of code errors, but because of changes in the physical environment, such as tool wear or variations in raw materials, which require the implementation of systems capable of detecting deviations in data distribution and activating retraining without constant human intervention. Closely linked to adaptability is the requirement for CI/CD automation (R06), whose comprehensive automation from code to deployment is considered non-negotiable for scaling solutions [4,40] (Table 3).
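To illustrate how a weighted frequency differs from a raw count, the sketch below sums a per-study quality weight for each requirement a study reports. Both the weighting rule and the coded studies are assumptions made for illustration; the article does not give the exact formula, and the values do not come from the reviewed corpus.

```python
from collections import defaultdict

# Hypothetical coded studies: quality weight in [0, 1] and requirements reported.
coded_studies = [
    {"weight": 0.9, "requirements": {"R01", "R06", "R08"}},
    {"weight": 0.5, "requirements": {"R01", "R05"}},
    {"weight": 0.3, "requirements": {"R01", "R11"}},
    {"weight": 0.8, "requirements": {"R04", "R05"}},
]

raw = defaultdict(int)         # plain count of studies citing a requirement
weighted = defaultdict(float)  # the same count, scaled by evidence quality
for study in coded_studies:
    for req in study["requirements"]:
        raw[req] += 1
        weighted[req] += study["weight"]

# Rank by weighted frequency; ties are broken by requirement label.
ranking = sorted(weighted, key=lambda r: (-weighted[r], r))
for req in ranking:
    print(f"{req}: raw={raw[req]} weighted={weighted[req]:.1f}")
```

In this toy example, R08 and R11 are each reported by a single study, yet R08 ranks higher because the study reporting it carries stronger evidence. A low rank does not make a requirement unimportant, which is why less frequent but strategic requirements are discussed separately.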

However, analysis reveals that many solutions labelled under the umbrella of MLOps are limited to automating initial deployment, neglecting continuous training (CT) and automatic rollback mechanisms in the event of failures (R08), which are critical for avoiding production downtime and are only explicitly addressed in a small fraction of studies, such as Oyucu and Aksöz [5] or Wewer [12]. Likewise, there remains an unresolved architectural tension between integration with legacy or OT systems (R04) [7,26] and the requirement for real-time latency (R05) [1,27], suggesting that cloud-based architectures are insufficient for controlling high-speed processes and that edge computing is required. Finally, Industry 5.0 requirements such as Sustainability (R11) [43] and Explainability (R09) [30,35] are beginning to take hold, conditioning design towards more efficient and transparent models for the human operator.

Beyond individual requirements, the coding also suggests interdependencies that influence architectural design. For instance, R01 (drift/adaptability) is closely linked to R06 (CI/CD automation) and R08 (rollback), as uncontrolled deployment and recovery pathways are insufficient in contexts with high downtime costs. Similarly, R05 (real-time latency) often co-occurs with R04 (legacy/OT integration) in industrial settings, as the most stringent latency constraints are typically at the OT layer. Here, protocol heterogeneity and certification constraints limit the adoption of purely cloud-centric delivery models. These relationships inform the subsequent architectural rationale, in which requirements are treated as a coupled set of constraints that must be satisfied jointly, rather than as a flat list.

4.3. Life Cycle Capability Analysis and Tools

By mapping the tools and processes described in the literature across the different phases of the MLOps life cycle, we identified uneven coverage that accurately pinpoints the current technological bottlenecks. The Data Ingestion and Model Training phases appear to be well-covered. They can be considered technologically mature, with a high percentage of studies [23,44,45] detailing robust solutions based on consolidated stacks such as Kafka, Spark, and TensorFlow (Table 4).

In contrast, the transition to production deployment and ongoing maintenance is notably fragmented. Although several studies address model serving [7,13,15], there is a lack of standardisation between the use of generic containers and specialised inference frameworks, which hinders interoperability. The situation is even more critical in the areas of governance and drift detection, where only a minority of studies, notably Leest, Garrone and Metcalfe [21,30,42], implement automated statistical monitoring mechanisms. This implies that most of the architectures reviewed lack the capacity for “deep observability”, operating blindly in the face of model degradation once deployed and lacking the feedback loop necessary to ensure long-term reliability. This gap aligns with the quality assessment results (Section 4.1), which indicate that operational detail and replicability are structurally weak. Even when tools are mentioned, many studies do not provide sufficient implementation artefacts or evaluation procedures to support claims about the lifecycle’s reproducibility.

4.4. Deployment Architectural Patterns

One of the most relevant findings of this systematic review is the identification of five distinct architectural patterns that dictate how training and inference workloads are distributed between the IT infrastructure and the OT plant. Quantitative analysis of these patterns reveals a structural disconnect between prevailing academic practices and the industrial requirements identified above. The dominant pattern is Centralised Deployment (P-01), adopted by roughly two-thirds of the studies (65.3%), including Raffin, Amirkhanova, Martínez-Arellano, and Ratchev (2024) [8,46,47]. While this approach simplifies management by unifying the technology stack across the cloud and on-premises servers, it systematically fails to meet latency and resilience requirements during network outages, making it unviable for controlling critical processes in real time (Table 5).

Given the limitations of the centralised model, the Hybrid Cloud-Fog-Edge pattern (P-02) emerges as the most promising alternative, documented in advanced works such as those by Colombi, Paul and Liao [4,7,48]. This approach decouples the life cycle, leveraging cloud computing for large-scale training while delegating light inference to edge devices, thereby satisfying both the need for complex retraining and low operational latency. However, its implementation involves significant orchestration complexity that most current frameworks do not natively address. Hybrid patterns introduce additional coordination overhead across tiers (e.g., synchronising model artefacts and metadata, ensuring consistent observability and managing safe rollouts), which can become a limiting factor, even when the latency benefits are clear. The reviewed evidence suggests that, while this overhead is widely recognised, it is less often addressed through reproducible mechanisms. This reinforces the need for a reference architecture that explicitly defines these cross-tier responsibilities. Other patterns, such as TinyML (P-03) [41,46] or Federated Learning (P-04) [17,48], occupy specific niches for ultra-low-power or high-privacy scenarios, respectively, but lack the versatility required for a general-purpose reference architecture. The evidence, therefore, suggests an imperative to consolidate a hybrid architecture that standardises synchronisation between the cloud and the edge, closing the gap between computing capacity and the immediacy of physical action.

4.5. Research Gap

Considering the findings presented, there is evidence of a structural fracture in the state of the art that prevents the widespread adoption of MLOps in industry. While mature tools exist for isolated phases such as training (e.g., TensorFlow) or orchestration (e.g., Airflow), cross-analysis of the 49 studies reveals that no current proposal simultaneously meets the requirements necessary for Industry 4.0/5.0: deterministic latency at the edge (R05), autonomous adaptability to drift (R01), and open technical replicability (C5). Specifically, we identify three critical gaps that this research aims to address:

The Architectural Gap (Centralisation vs. Physical Reality): As demonstrated by the prevalence of pattern P-01 (65.3%), academic inertia continues to favour centralised architectures inherited from web development, where inference occurs in the cloud. This approach is incompatible with cyber-physical systems that require millisecond decisions and offline operation. Although the hybrid pattern P-02 is theoretically identified as the optimal solution, it lacks a standardised reference implementation that resolves the complexity of synchronising models and metadata between the cloud and a fleet of heterogeneous devices.
The Closed-Loop Gap (Passive vs. Active Monitoring): There is a disconnect between anomaly detection and corrective action. Most of the architectures reviewed implement monitoring as a passive dashboard for human operators, failing to close the automatic retraining loop (CT). Without a mechanism that connects drift detection on-site directly to the retraining pipeline in the cloud, industrial models become static assets that depreciate rapidly, raising maintenance costs to unsustainable levels.
The Replicability Gap (Black Boxes vs. Open Artefacts): Finally, the literature is polarised between proprietary “black box” solutions (hyperscaler platforms) that generate vendor lock-in and simplistic academic proofs of concept that do not scale. In 81.6% of the studies there is no accessible open-source technology stack that integrates industrial-grade components (such as Kubernetes, Kafka, or MLflow) into a coherent architecture, which forces practitioners to reinvent system integration from scratch for each project.

Therefore, this research is not limited to describing tools but proposes a hybrid reference architecture and an open reference implementation specifically designed to address these tensions. The proposal aims to demonstrate that a complete industrial MLOps lifecycle can be orchestrated by combining cloud-side training and governance capabilities with edge-side low-latency inference, while maintaining traceability, controlled adaptation, and operational transparency. At the same time, the validation reported later is intentionally scoped to a single CNC machining cell, and cross-scenario generalisation is treated as a limitation and future work rather than as a concluded property of the evaluation.

5. Hybrid Reference Architecture and Implementation Stack

In response to the gap identified in the literature, where centralisation predominates at the expense of operational latency, this research proposes a Hybrid Reference Architecture (Cloud-Fog-Edge) designed specifically for industrial environments. Unlike traditional monolithic approaches, this proposal is based on the MLOps principles identified in the systematic review. It implements strict decoupling between the computation-intensive training cycle and the latency-sensitive inference cycle. This structural separation is not arbitrary but responds to the need to orchestrate Continuous Integration, Delivery, and Training (CI/CD/CT) processes across two distinct physical planes, ensuring that the rigidity of industrial control systems does not compromise the flexibility required for the evolution of machine learning models. This hybrid rationale directly addresses the trade-off observed in the reviewed corpus between centralised manageability and shop-floor latency/resilience requirements, and it provides a reusable blueprint for aligning lifecycle automation with OT/IT constraints.

To provide the proposal with the methodological rigour required in systems engineering and to facilitate its adoption by various actors, the architecture documentation has been structured in accordance with the international standard ISO/IEC/IEEE 42010:2022. Specifically, a multi-viewpoint approach is adopted, allowing the system’s complexity to be decomposed into complementary perspectives. Thus, the description is divided into a Logical View (Section 5.1) that defines the design principles; a Development View (Section 5.2) that details the technology stack; a Process View (Section 5.3) focused on data flow and drift management; and finally, an Evaluation View (Section 5.4) that establishes the quality criteria. This structure of viewpoints also facilitates traceability from the requirements identified in Section 4 to the associated architectural responsibilities and evaluation criteria. This reduces ambiguity regarding the role of each layer and component. In addition to this architectural description, the proposal is accompanied by an open-source architectural instantiation that operationalises the same logical decomposition through two aligned deployment flavours: a Docker-based environment for local replication and a K3s/Rancher-oriented environment for more production-like deployment. This artefact is not introduced as a separate software contribution, but as a traceable implementation layer that strengthens the reproducibility and inspectability of the reference architecture. In this way, the architecture is documented not only as a conceptual model but also as a reusable, vendor-neutral operational blueprint.

5.1. Logical View: Dual Loop and Lifecycle Automation

The proposed architecture is structured around a Dual-Loop Strategy that synthesises the hybrid deployment patterns identified in the systematic review, translating the need for OT/IT integration into a specific physical topology in which two loops run asynchronously and at different speeds while remaining interconnected (Figure 7). The first loop, called Inference, runs on edge infrastructure close to the assets and is designed to meet latency requirements by applying Continuous Deployment (CD) principles. At this operational level, deployment follows the DevOps paradigm, i.e., model artefacts are injected automatically into inference containers operating in real time, in line with the reference architectures for manufacturing described by Raffin [8] and the orchestration strategies on constrained devices validated by Antonini [43]. To ensure the operational robustness required in the plant, this loop is autonomous: in the event of a network interruption with the upper level, the inference system continues to operate with the latest valid version of the model, reducing the risks of network instability highlighted by Marinova [15] and ensuring business continuity. Above the Edge node sits the Fog Computing layer. Fog offers the flexibility to run applications close to the Edge layer with minimal data transmission delay and the ability to train complex machine learning models [49]. In this architecture, the fog is positioned as an intermediate tier for near-plant coordination and workload placement, bridging the gap between strict edge constraints and cloud-scale governance.
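The autonomy property of the Inference loop (continuing to serve with the latest valid model when the upper tier is unreachable) can be sketched as follows. This is a minimal illustration under stated assumptions: the class, the exception and the `fetch_latest` callable are hypothetical names, and the simulated outage stands in for a real network failure.

```python
class RegistryUnavailable(Exception):
    """Raised when the upper tier cannot be reached (hypothetical)."""


class EdgeInferenceNode:
    """Serves predictions with the last model that was successfully synced."""

    def __init__(self, fetch_latest):
        # fetch_latest: callable returning (version, predict_fn); a stand-in
        # for whatever synchronisation mechanism a deployment actually uses.
        self._fetch_latest = fetch_latest
        self.version = None
        self._predict = None

    def sync(self):
        """Try to pull a newer model; keep the current one on network failure."""
        try:
            version, predict_fn = self._fetch_latest()
        except RegistryUnavailable:
            return False          # stay on the latest valid version
        self.version, self._predict = version, predict_fn
        return True

    def predict(self, features):
        if self._predict is None:
            raise RuntimeError("no model has been deployed to this node yet")
        return self._predict(features)


# Simulated upper tier: the first call succeeds, the second fails (outage).
responses = iter([("v1", lambda x: sum(x) > 1.0), RegistryUnavailable()])


def fake_fetch():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


node = EdgeInferenceNode(fake_fetch)
first_sync = node.sync()     # model v1 is deployed
second_sync = node.sync()    # outage: the sync fails...
print(first_sync, second_sync, node.version, node.predict([0.7, 0.6]))
# ...but v1 keeps serving: True False v1 True
```

The design choice illustrated here is that synchronisation and inference are decoupled: a failed sync never removes the model already in service.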

Simultaneously and in coordination with the edge, the Learning loop operates in the centralised infrastructure (Cloud), where the computational load for Continuous Training (CT) is executed. Unlike traditional practices where changes in the source code restart the cycle, this loop assumes that model degradation is inevitable due to the physical environment, as documented by Venanzi [6]. Therefore, its reactivation must be governed by triggers based on information evolution, not solely by human intervention. It is at this level that the massive intake of historical data, schema validation, and hyperparameter re-optimisation are centralised, so that the cloud acts as the authority for system governance. No model generated in this controlled environment is sent to the edge without first passing formal registration and exhaustive validation in the Model Registry, thus establishing a quality firewall that prevents degraded models from impacting physical processes, in line with the industry’s traceability and versioning guidelines. This governance barrier aligns with the requirements for controlled promotion and rollback outlined in the systematic review and facilitates industrial risk management in scenarios with high downtime costs.

At the implementation level, this separation is operationalised through explicit lifecycle control points rather than implicit deployment assumptions. In the accompanying open-source artefact, the learning plane is represented by versioned orchestration workflows for bootstrap, retraining and rollback governance. In contrast, the inference plane is represented by dedicated edge-side deployment and scoring services. The coordination between both planes is therefore treated as a governed promotion path, in which candidate models are registered, validated, promoted, and, if necessary, rolled back through auditable control actions rather than ad hoc manual replacement. This operational interpretation is important because it clarifies that the dual-loop strategy is not only a conceptual separation of concerns, but a controlled lifecycle mechanism for industrial deployment.
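The governed promotion path described above (register, validate, promote, roll back, with every action audited) can be sketched as a small state machine. The class below is an in-memory stand-in with hypothetical names and an illustrative metric threshold; it is not the accompanying artefact and does not use the API of any registry product.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """In-memory stand-in for a model registry with an audit trail."""
    stages: dict = field(default_factory=dict)      # version -> stage
    production: list = field(default_factory=list)  # promotion history
    audit_log: list = field(default_factory=list)   # (action, version)

    def _log(self, action, version):
        self.audit_log.append((action, version))

    def register(self, version):
        self.stages[version] = "candidate"
        self._log("register", version)

    def validate(self, version, metric, threshold):
        """Gate: only candidates meeting the threshold become 'validated'."""
        if self.stages.get(version) != "candidate":
            raise ValueError(f"{version} is not a registered candidate")
        passed = metric >= threshold
        self.stages[version] = "validated" if passed else "rejected"
        self._log("validate" if passed else "reject", version)
        return passed

    def promote(self, version):
        if self.stages.get(version) != "validated":
            raise ValueError(f"{version} has not passed validation")
        if self.production:
            self.stages[self.production[-1]] = "superseded"
        self.stages[version] = "production"
        self.production.append(version)
        self._log("promote", version)

    def rollback(self):
        """Return to the previously promoted version, if there is one."""
        if len(self.production) < 2:
            raise ValueError("no earlier production version to roll back to")
        retired = self.production.pop()
        self.stages[retired] = "rolled_back"
        restored = self.production[-1]
        self.stages[restored] = "production"
        self._log("rollback", retired)
        return restored


registry = ModelRegistry()
# Hypothetical candidates with an illustrative validation metric.
for version, score in [("v1", 0.91), ("v2", 0.94), ("v3", 0.70)]:
    registry.register(version)
    if registry.validate(version, metric=score, threshold=0.85):
        registry.promote(version)

restored = registry.rollback()   # v2 misbehaves in the field: back to v1
print(restored, registry.stages)
```

Because `promote` refuses anything that has not been validated, an unvalidated or rejected model cannot reach the inference plane, and the audit log records each control action in order.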

The element that links these two universes and closes the life cycle is the Continuous Monitoring (CM) mechanism, which evolves from classic IT resource monitoring to deep statistical observability. Following the findings of Leest [30] and Garrone [42] on change management in data distribution, the system does not wait for catastrophic model failure; instead, it implements data drift sensors that continuously compare, in real time, the distribution of input data with baseline distributions. When a significant deviation is detected, the monitoring system acts as an automatic trigger, connecting both loops and sending an alert signal to the central plane to initiate immediate retraining. This integration embodies the concept of “self-adaptive systems” proposed by Liao [48] and validated in biofermentation by Metcalfe [21], transforming model maintenance from a manual reactive task into an autonomous industrial process that effectively closes the operational gap detected in the state of the art. For the sake of transparency, the specific statistical method(s) used for these drift sensors, the basis for defining the threshold, and how false positives are handled are treated as implementation details that must be specified and validated in the use case (Section 6). If these details are not fully available, they must be framed as limitations alongside a minimal experimental plan.
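As an illustration of how a drift sensor could compare a window of live inputs with a baseline and trigger retraining, the sketch below uses the two-sample Kolmogorov-Smirnov statistic with its large-sample critical value. This particular test, the significance level and the synthetic data are assumptions chosen for the example; as stated above, the actual method and thresholds are implementation details of the use case.

```python
import bisect
import math


def ks_statistic(baseline, current):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def ks_critical(n, m, alpha=0.01):
    """Large-sample critical value for rejecting 'same distribution'."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def drift_detected(baseline, current, alpha=0.01):
    threshold = ks_critical(len(baseline), len(current), alpha)
    return ks_statistic(baseline, current) > threshold


def monitor(baseline, window, trigger_retraining, alpha=0.01):
    """Compare a live window with the baseline; fire the trigger on drift."""
    if drift_detected(baseline, window, alpha):
        trigger_retraining()
        return True
    return False


# Synthetic, deterministic data for one input feature (illustrative only).
baseline = [i / 100 for i in range(1000)]       # reference distribution
stable = [i / 100 for i in range(0, 1000, 5)]   # same range, fewer points
shifted = [x + 3.0 for x in stable]             # simulated process shift

requests = []
monitor(baseline, stable, lambda: requests.append("retrain"))
monitor(baseline, shifted, lambda: requests.append("retrain"))
print(requests)   # ['retrain'] (only the shifted window fires)
```

A lower `alpha` reduces false alarms at the cost of slower detection; in practice this trade-off, and any debouncing across consecutive windows, would need to be validated on the target process.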

5.2. Development View: Open Implementation Technology Stack

To implement the double-loop conceptual architecture and address the replicability deficit (Criterion C5) identified in the systematic review, in which 81.6% of studies lack accessible artefacts, an implementation stack has been designed based entirely on industrial-grade open-source components. This selection is not arbitrary: it follows a strict traceability matrix against the requirements identified in the literature and rejects proprietary “black box” solutions that generate vendor dependency and limit interoperability with legacy systems (R04). The core of the lifecycle orchestration is entrusted to MLflow, selected for its Model Registry management capabilities. This tool ensures that each deployable artefact has complete traceability, from training data and hyperparameters to validation metrics, thus addressing the governance and version control challenges that are critical to industrial safety when models are deployed in manufacturing.

Workflow management and Continuous Training (CT) automation are implemented using Apache Airflow. Unlike simple task schedulers (cron) seen in less mature studies, Airflow allows the complex dependencies of the retraining process to be defined using directed acyclic graphs (DAGs), ensuring that the Edge side is activated only after data ingestion and schema validation have been completed. This robust orchestration capability is essential to meet the Continuous Integration requirement (R06), enabling reproducible, auditable data pipelines, as suggested by Metcalfe [21] in their proposal for self-adaptive soft sensors. To support the persistence layer and manage large volumes of unstructured data, MinIO, an S3-compatible object storage service that decouples storage from computation, is integrated. This architectural decision facilitates the horizontal scalability of the industrial Data Lake without incurring the licensing costs of traditional databases, aligning with the massive data management needs described in Cheng and Long’s work on federated architectures [50].
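The dependency ordering that a DAG enforces can be sketched without an Airflow installation. The example below uses only the Python standard library (`graphlib`) and is not Airflow code; the task names are hypothetical. It shows the property relied upon above, namely that downstream steps run only after data ingestion and schema validation have completed.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
retraining_dag = {
    "ingest_data": set(),
    "validate_schema": {"ingest_data"},
    "train_candidate": {"validate_schema"},
    "evaluate_candidate": {"train_candidate"},
    "register_model": {"evaluate_candidate"},
}


def run(dag, tasks):
    """Execute tasks in dependency order and return the order used."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        tasks[task]()  # a failure here stops every downstream task
        executed.append(task)
    return executed


# Placeholder callables stand in for the real pipeline steps.
tasks = {name: (lambda: None) for name in retraining_dag}
order = run(retraining_dag, tasks)
print(order)
```

A workflow engine adds scheduling, retries, and run history on top of this ordering, which is what makes the resulting pipelines auditable.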

Finally, to resolve the critical tension between computing capacity and latency at the edge (R05), the execution infrastructure is based on containerisation with Docker, orchestrated by Kubernetes, and explicitly uses the lightweight K3s distribution for edge nodes. This choice allows the same inference container to be validated in the cloud and deployed identically to industrial devices with limited resources, in line with Cloud-Fog-Edge deployment patterns (P-02). The adaptation cycle closure (R01) is implemented through an observability system based on Prometheus and Grafana, configured not only to collect system metrics, but also to ingest custom model quality metrics exposed by the inference microservices. This “deep observability” mechanism allows statistical deviations to be detected in real time and alerts to be triggered that activate retraining, thus materialising the principle of Continuous Monitoring (CM) necessary to maintain operational reliability in dynamic environments, a critical shortcoming highlighted in the review by Bayram and Ahmed [51]. From a Green AI perspective, selecting K3s as a lightweight orchestrator and using optimised Docker containers supports resource-aware deployment on constrained industrial hardware by reducing the memory and CPU footprint at the edge. Although this design choice is consistent with the sustainability requirement identified in the review, the extent to which it reduces energy consumption relative to standard Kubernetes deployments remains an empirical question and should therefore be interpreted cautiously within the present validation scope. To reinforce replicability, this stack is not presented only as a narrative selection of technologies. It is instantiated as a vendor-neutral reference environment in which the same architectural roles are preserved across a Docker-based replication setting and a K3s-based deployment path. 
This alignment is methodologically relevant because it reduces the risk that the proposed architecture will be interpreted as tied to a single local setup or to a specific proprietary platform. Instead, the stack should be understood as a single reproducible instantiation of the architectural responsibilities defined in Sections 5.1 to 5.4.

5.3. Process View: Data Flow and Continuous Adaptation Mechanism

The architecture’s operability is defined by a sequential data flow designed to minimise friction between OT systems and IT infrastructure, ensuring transactional integrity from the sensor signal to model reconfiguration. The process begins at the physical layer, where industrial connectors based on lightweight protocols such as MQTT ingest high-frequency time series from the machinery. Following the decoupling patterns recommended by Raffin [8], this data undergoes immediate bifurcation at the edge: a “hot path” that feeds the local inference container to generate predictions in milliseconds, and a “cold path” that serialises and asynchronously transmits the data to the centralised Data Lake in MinIO. This segregation of flows is critical to ensure that network latency or storage saturation does not interfere with the real-time control loop, a common vulnerability in centralised architectures.
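The hot-path and cold-path bifurcation can be sketched as follows. This is an illustrative assumption rather than the artefact's code: the `score` rule, the field names, and the in-memory queue are hypothetical stand-ins for the inference container and for the asynchronous transfer to the Data Lake.

```python
import json
import queue

# Bounded buffer towards the Data Lake (hypothetical stand-in for the cold path).
cold_path = queue.Queue(maxsize=10_000)


def score(sample: dict) -> float:
    """Stand-in for the local inference container (hypothetical rule)."""
    return 1.0 if sample["motor_load"] > 0.9 else 0.0


def on_message(payload: bytes) -> float:
    """Handle one sensor message: score it now, archive it later."""
    sample = json.loads(payload)
    prediction = score(sample)          # hot path: synchronous and local
    try:
        cold_path.put_nowait(payload)   # cold path: never blocks scoring
    except queue.Full:
        pass  # a real system would count or spill these samples instead
    return prediction


result = on_message(b'{"motor_load": 0.95, "spindle_speed": 1200}')
```

The design point is that the prediction is returned before any storage or network operation is attempted, so saturation on the cold path cannot delay the control loop.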

The continuous adaptation mechanism, designed to meet the drift management requirement (R01), does not operate on fixed schedules, but rather on statistical triggers. The edge monitoring system constantly evaluates the distribution of incoming data using divergence metrics. When a significant deviation exceeding the defined tolerance thresholds is detected, an alert event is generated and sent to the central control plane. In the open-source reference instantiation, drift is operationalised through a multi-criterion feature-level check that combines a two-sample Kolmogorov–Smirnov test, Population Stability Index (PSI), and a normalised mean-shift indicator. A feature is flagged when at least one of the following conditions is met: KS p-value < 0.05, PSI ≥ 0.20, or normalised mean shift ≥ 1.0. These feature-level signals are then aggregated into an overall drift-severity score, and retraining is triggered only when the aggregated condition reaches medium or high severity. This design reduces the risk of overreacting to isolated fluctuations and frames drift monitoring as a governed operational trigger rather than as a claim about the universal superiority of a specific statistical test.
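The multi-criterion check described above can be sketched in plain Python. The three thresholds are taken from the text; everything else is an assumption made for illustration: the KS p-value uses a standard asymptotic approximation, PSI uses ten equal-width bins over the baseline range, the mean shift is normalised by the baseline standard deviation, and the severity rule (share of flagged features with cut-offs at 0.25 and 0.5) is hypothetical, since the aggregation used in the reference instantiation is not specified here.

```python
import math
from statistics import mean, stdev

KS_ALPHA, PSI_LIMIT, SHIFT_LIMIT = 0.05, 0.20, 1.0  # thresholds from the text


def ks_p_value(a, b):
    """Two-sample KS test: asymptotic p-value approximation (assumption)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))  # largest gap between the two ECDFs
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.1:  # the series below is unreliable here; p is effectively 1
        return 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return min(max(p, 0.0), 1.0)


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index over equal-width baseline bins."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0

    def shares(xs):
        counts = [0] * bins
        for x in xs:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(xs), eps) for c in counts]

    return sum((c - b) * math.log(c / b)
               for b, c in zip(shares(baseline), shares(current)))


def feature_drifted(baseline, current):
    """Flag a feature when at least one of the three criteria is met."""
    shift = abs(mean(current) - mean(baseline)) / (stdev(baseline) or 1.0)
    return (ks_p_value(baseline, current) < KS_ALPHA
            or psi(baseline, current) >= PSI_LIMIT
            or shift >= SHIFT_LIMIT)


def severity(baselines, currents):
    """Hypothetical aggregation: share of flagged features."""
    share = sum(feature_drifted(baselines[f], currents[f])
                for f in baselines) / len(baselines)
    return "high" if share >= 0.5 else "medium" if share >= 0.25 else "low"


base = [i / 100 for i in range(200)]
stable = {"motor_load": base, "spindle_speed": base}
shifted = {"motor_load": [x + 5 for x in base], "spindle_speed": base}
print(severity(stable, stable), severity(stable, shifted))
```

Under these assumptions, retraining would be requested only when `severity` returns "medium" or "high", so a single isolated fluctuation in one of many features does not by itself start a retraining run.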

The cycle is closed via a controlled OTA (Over-the-Air) deployment path rather than opaque in-place replacement. Once a candidate model is validated and promoted in MLflow, the edge synchronisation agent retrieves a signed deployment manifest, verifies its signature, downloads the corresponding model artefact, and validates its checksum before updating the local deployment state. At the edge, the inference service continues to operate with the last accepted deployment generation until a newer, validated generation becomes available and can be loaded. Version consistency is enforced through a monotonic generation policy: the edge only accepts manifests whose generation is strictly newer than the locally applied one and whose model version has already been promoted in the control plane; stale, duplicated, or non-validated manifests are ignored. If validation of the downloaded artefact fails, the edge remains pinned to the last accepted deployment generation, thereby turning rollback into a controlled state-preservation mechanism rather than an ad hoc manual recovery action. In this way, model evolution is handled as a state transition under governance, with auditable deployment events rather than an implicit overwrite.
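The acceptance rules above (signature, monotonic generation, promotion status, checksum) can be sketched with the standard library. The signature scheme is not specified in the text, so HMAC-SHA256 with a shared key is an assumption here, as are the manifest field names and the `accept` function itself.

```python
import hashlib
import hmac
import json

SHARED_KEY = b"demo-key"  # hypothetical; real deployments need managed keys


def sign(manifest: dict) -> str:
    """Sign a canonical JSON encoding of the manifest (assumed scheme)."""
    body = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()


def accept(manifest, signature, artefact, applied_generation, promoted_versions):
    """Return the generation the edge should run after seeing one manifest."""
    if not hmac.compare_digest(sign(manifest), signature):
        return applied_generation            # signature does not verify
    if manifest["generation"] <= applied_generation:
        return applied_generation            # stale or duplicated manifest
    if manifest["model_version"] not in promoted_versions:
        return applied_generation            # not promoted in the control plane
    if hashlib.sha256(artefact).hexdigest() != manifest["sha256"]:
        return applied_generation            # artefact failed checksum validation
    return manifest["generation"]            # newer, validated generation


artefact = b"model-bytes"
manifest = {
    "generation": 3,
    "model_version": "7",
    "sha256": hashlib.sha256(artefact).hexdigest(),
}
signature = sign(manifest)
print(accept(manifest, signature, artefact, 2, {"7"}))
```

Every rejection path returns the generation that is already applied, which is the sense in which the edge remains pinned to the last accepted deployment when validation fails.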

In the present instantiation, this control path is additionally protected through signed control messages, checksum validation of the downloaded model artefact, and persisted deployment-state records, thereby providing basic guarantees of deployment authenticity, integrity, and auditability across the Cloud-to-Edge promotion path. The present study, therefore, addresses the authenticity, integrity, and auditability of the model-promotion path. Still, it does not claim to provide a full benchmark of encrypted-tunnel overheads or a complete OT-network cybersecurity evaluation. These broader aspects remain relevant for future multi-site validation in heterogeneous industrial environments. This is particularly relevant in CNC environments, where operational continuity matters more than aggressive update frequency. Accordingly, the contribution is framed as demonstrating controlled deployment continuity and deployment integrity at the architectural level, not as a formally benchmarked hard real-time guarantee in the microsecond range. This interpretation is also supported by a controlled local OTA experiment in which a newly promoted generation was applied during an active inference stream. In that run, one OTA command was accepted, the applied generation changed cleanly from 2 to 3, no sync failures or schema mismatches were observed, and the persisted inference sequence showed a single monotonic generation transition rather than unstable interleaving between versions. The observed OTA application latency was 167.20 ms. These results strengthen the claim of governed deployment continuity at the architectural level, while still falling short of proving strict hard real-time continuity or zero interruption at inference-path timestamp resolution.

5.4. Evaluation View: Evaluation Framework and Industrial Alignment

To validate the suitability of the proposed architecture against real-world constraints, a multidimensional evaluation framework has been established that goes beyond traditional machine learning metrics (such as accuracy or F1-score) to incorporate operational key performance indicators (KPIs). This framework addresses the need to quantify the solution’s systemic impact, as demanded by Watson and Larson [17] in their critique of the current literature’s lack of business metrics. The first evaluation dimension is end-to-end (E2E) inference latency, measured from the capture of the event in the PLC to the reception of the control signal; its objective is to certify compliance with machinery cycle times (R05).

Secondly, resource efficiency is treated as an important evaluation dimension for edge-side deployment, particularly in terms of container CPU and memory consumption on constrained industrial hardware, in line with the sustainability requirement (R11) highlighted by Li [42]. To strengthen this dimension beyond purely conceptual framing, a controlled edge-profiling run was executed in the local validation environment using 300 valid synthetic events routed through the deployed edge-inference path. This additional profiling run produced repeated latency observations alongside container-level CPU and memory measurements, thereby complementing the operational indicators reported in Section 6. These results should be interpreted as auditable environment-specific evidence of edge-side viability in the local setup, rather than as cross-device benchmarks or energy-efficiency claims.

In the reference instantiation, these operational KPIs are not treated solely as abstract evaluation categories. The edge layer exports inference latency histograms, current drift score indicators, prediction counters, and schema-mismatch counters via Prometheus-compatible telemetry. In contrast, the governance layer summarises recent health metrics as average latency, false-alarm rate, missed-alarm rate, and rollback recommendation status. This instrumentation is methodologically relevant because it makes the proposed evaluation directly auditable and reproducible, even when the present article does not position the results as a cross-device benchmark.
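As an illustration of what Prometheus-compatible telemetry looks like on the wire, the sketch below renders the four kinds of edge indicators in the Prometheus text exposition format. The metric names and bucket boundaries are hypothetical, and in practice a client library would normally generate this output; the function is hand-written only to keep the example self-contained.

```python
def render_metrics(latencies_s, drift_score, predictions, schema_mismatches,
                   buckets=(0.01, 0.05, 0.1, 0.5)):
    """Render edge indicators as Prometheus text (hypothetical metric names)."""
    name = "edge_inference_latency_seconds"
    lines = [f"# TYPE {name} histogram"]
    for le in buckets:
        # Histogram buckets are cumulative: count every observation <= le.
        count = sum(1 for x in latencies_s if x <= le)
        lines.append(f'{name}_bucket{{le="{le}"}} {count}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(latencies_s)}')
    lines.append(f"{name}_sum {sum(latencies_s)}")
    lines.append(f"{name}_count {len(latencies_s)}")
    lines += [
        "# TYPE edge_drift_score gauge",
        f"edge_drift_score {drift_score}",
        "# TYPE edge_predictions_total counter",
        f"edge_predictions_total {predictions}",
        "# TYPE edge_schema_mismatches_total counter",
        f"edge_schema_mismatches_total {schema_mismatches}",
    ]
    return "\n".join(lines) + "\n"


# Illustrative values only; these are not measurements from the study.
text = render_metrics([0.009, 0.02, 0.2], drift_score=0.12,
                      predictions=3, schema_mismatches=0)
print(text)
```

Exposing latency as a histogram rather than a single average is what allows percentiles to be derived later on the monitoring side.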

The third critical dimension is Drift Recovery Speed (MTTR-D), defined as the time elapsed from the detection of a degradation in model quality to the effective deployment of a corrected version. This KPI captures the agility of the proposed CI/CD/CT cycle at the architectural level, even though the current experimental scope does not yet provide a broad comparative benchmark across devices or sites. Finally, interoperability is evaluated through integration testing with standard protocols (OPC-UA), verifying that the architecture does not act as a technological silo isolated from existing SCADA systems. Overall, the evaluation framework should be read as a structured basis for operational validation under industrial constraints. In line with the scope of Section 6, the observed indicators are interpreted as evidence of end-to-end feasibility and lifecycle-governance operability in a single CNC setting, rather than as definitive performance estimates for heterogeneous industrial scenarios. Extending these KPIs across devices, sites, and industrial contexts remains future work.
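The MTTR-D definition above can be made concrete with a short calculation. The sketch assumes that each drift incident is logged with a detection timestamp and a deployment timestamp (the field names are hypothetical) and averages the elapsed time over incidents; the averaging step is an assumption, since the definition above refers to a single elapsed time.

```python
from datetime import datetime
from statistics import mean


def mttr_d(events):
    """Mean seconds between drift detection and deployment of the fix."""
    durations = [
        (datetime.fromisoformat(e["deployed_at"])
         - datetime.fromisoformat(e["detected_at"])).total_seconds()
        for e in events
    ]
    return mean(durations)


# Hypothetical timestamps, for illustration only (not measured values).
events = [
    {"detected_at": "2026-01-10T08:00:00", "deployed_at": "2026-01-10T08:30:00"},
    {"detected_at": "2026-01-12T14:00:00", "deployed_at": "2026-01-12T14:10:00"},
]
print(mttr_d(events))  # 1200.0 seconds for these two example incidents
```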

6. Implementation and Feasibility-Oriented Validation: Predictive Maintenance Use Case

To validate the hybrid cloud–fog–edge architecture proposed in this work, an industrial use case of predictive maintenance for tool breakage in a CNC machining cell is presented. The objective of the validation is not to evaluate the performance of a classification model, but to demonstrate an end-to-end industrial MLOps deployment (capture, processing, inference, observability, and lifecycle governance) under typical operational constraints: low latency, resilience to network outages, traceability, and data sovereignty.

Accordingly, the validation is framed as an end-to-end demonstration of the architecture’s and stack’s feasibility, rather than as a cross-scenario benchmarking study. The implications of validating on a single cell are explicitly discussed as a limitation and a topic for future work. To strengthen reproducibility and reviewer inspectability, the validation is accompanied by an open-source operational instantiation of the proposed architecture. This artefact provides a traceable implementation layer for the same Cloud–Fog–Edge decomposition described in Section 5, together with governance and enterprise supervision services, and is released in two aligned deployment flavours: a Docker-based environment for local replication and a K3s-based environment for more production-like execution. The purpose of this release is not to shift the contribution from architecture to software, but to make the validation pathway auditable, reusable and extendable.

Before introducing the target architecture (to-be), it is necessary to describe the organisation’s current state (as-is), as these conditions constrain decisions and the scope of changes. As summarised in Figure 8, the previous system was centralised on a Windows server that hosted several virtual machines (Hyper-V). Data acquisition from the CNC machine was performed using FANUC MT-LINKi, storing signals and events in a MongoDB database. On this basis, operational and analytical components were also deployed in VMs: a Node.js Express access service for queries and APIs; an orchestration/ETL layer with Node-RED; analytical/visualisation tools such as Apache Druid and Apache Superset; and Python scripts for processing and model development. In parallel, there was an analytics/orchestration layer with Apache Airflow and a React interface for internal consumption.

Although this approach enabled initial digitisation, it presented structural limitations for an on-site MLOps scenario: (i) dependence on a single central node for ingestion and serving, (ii) higher latency for near-machine inference, (iii) lower fault tolerance in the event of connectivity interruptions or server saturation, and (iv) difficulties in operationalising MLOps practices in a homogeneous manner (controlled model promotion, reproducible deployments, edge inference observability, and layered governance). Therefore, it evolved towards a hierarchical Cloud-Fog-Edge pattern. This “as-is → to-be” transition demonstrates how hybrid deployment addresses the centralisation–latency trade-off identified in the systematic review (Section 4), while preserving governance and traceability through explicit lifecycle tooling.

The proposed hybrid cloud-fog-edge architecture is validated through an industrial MLOps deployment derived and generalised from a predictive maintenance case in CNC machining (Figure 9). In this industrial machining cell, a multi-axis centre with monitoring of multiple internal signals was implemented. Specifically, key process variables were captured (spindle speed, cutting feed, motor load, servo consumption, cutting time, temperature, etc.). These six signals per turret (upper and lower) are synchronised with the machine’s own alarm records, which indicate failure events or abnormal conditions (e.g., servo overload or tool breakage). Each breakage alarm is identified by a code and time stamp, allowing it to be correlated with the immediately preceding multi-signal pattern. In this way, the training dataset was constructed by associating signal sequences with the occurrence (or non-occurrence) of a tool breakage, providing labelled instances for the predictive model. This dataset construction step is central to traceability in the learning loop: it links raw OT signals and machine alarms to the labelled instances used in model training and subsequent model governance.

From the standpoint of validation design, the implementation deliberately distinguishes between two complementary experimental lines. The first is a reproducible synthetic CNC line used to exercise the end-to-end orchestration path, including ingestion, monitoring, drift-aware retraining, promotion, rollback and OTA-style deployment. The second is a company-derived CNC reference line used to verify that the same architectural backbone can ingest, register, deploy and supervise an industrially grounded signal schema. This separation is methodologically important: the synthetic line supports reproducibility, whereas the company reference line supports industrial realism. Together, they strengthen the architectural validation while avoiding over-claiming cross-scenario generalisability.

The deployed architecture follows a hierarchical Cloud-Fog-Edge pattern, designed to balance computational load and ensure operational resilience. At the plant, near the machine, the Edge Tier was established using an industrial gateway (an industrial PC) that serves as an inference node at the edge. This level is responsible for data ingestion using a stack composed of RabbitMQ, Mosquitto (MQTT) and Kafka, enabling decoupled, high-frequency communication with the sensors and the CNC. The equipment runs a lightweight container environment (Docker on K3s) with dedicated microservices: one for real-time data acquisition and another for ML inference. The Python-based inference service loads the trained predictive model (a binary LSTM network). It continuously calculates the probability of imminent tool breakage from the most recent time window of multivariate signals. When the probability exceeds an alert threshold, the system generates a local notification to the operator, enabling preventive action before actual breakage. This local inference layer operates autonomously, ensuring minimal latency (on the order of milliseconds) and high availability, even during network interruptions. In addition, a “store-and-forward” mechanism is enabled in the Edge Tier to ensure robustness in the event of connectivity failures: queues (MQTT/RabbitMQ/Kafka) act as high-frequency local buffers, enabling controlled retries and preventing data loss while decoupling production from Fog/Cloud delivery. On the other hand, the Edge publishes telemetry and inference metrics (such as latency) for subsequent centralised observability. This edge autonomy and buffering behaviour operationalises the resilience requirement under intermittent connectivity: inference continues with the last valid deployed artefact while delivery to upper tiers is temporarily deferred without losing high-frequency signals.
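The store-and-forward behaviour can be sketched with an in-memory buffer. In the deployment described above this role is played by the message queues, so the `StoreAndForward` class, its capacity, and the `send` callable are hypothetical and only illustrate ordered buffering with retries while the uplink is unavailable.

```python
from collections import deque


class StoreAndForward:
    """Buffer messages locally and deliver them in order when possible."""

    def __init__(self, send, capacity=100_000):
        self._send = send  # callable returning True on successful delivery
        # Note: a full deque silently drops its oldest entry, so capacity
        # must be sized for the longest outage that has to be tolerated.
        self._buffer = deque(maxlen=capacity)

    def publish(self, message):
        self._buffer.append(message)
        self.flush()

    def flush(self):
        """Deliver buffered messages in order; stop at the first failure."""
        while self._buffer:
            if not self._send(self._buffer[0]):
                return False  # keep the message for a later retry
            self._buffer.popleft()
        return True


link = {"up": False}   # simulated connectivity state
delivered = []


def send(message):
    if link["up"]:
        delivered.append(message)
    return link["up"]


buffer = StoreAndForward(send)
buffer.publish("sample-1")   # link down: kept locally
buffer.publish("sample-2")
link["up"] = True
buffer.flush()               # link restored: delivered in original order
```

A message is removed from the buffer only after delivery succeeds, which is the property that prevents data loss during an outage within the buffer's capacity.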

As an intermediate layer in the distributed architecture, the Fog Tier is deployed, with the primary function of data validation and transformation. At this level, Node-RED is used to orchestrate pre-processing flows, ensuring that information is filtered and normalised before being persisted or sent to higher levels. This Fog node acts as a hub that manages traffic using an Ingress/Load Balancer, optimising bandwidth usage to the cloud. In addition, the Fog Tier assumes the role of an “on-premise data plane” (close to the plant) where operational persistence is consolidated: Node-RED routes the validated flows to the data layer (time series, objects and metrics), allowing historical signals, inference results and event logs to be stored locally. This approach allows sensitive data to be retained on-site, cleaning and adjustment policies to be applied before synchronising with the cloud, and analytics to be performed close to the operation. At the same time, a centralised management and retraining plane was implemented covering the Cloud Tier and the Enterprise Tier. The Cloud Tier focuses on experimentation, training and orchestration, hosting the following main components:

MLflow: used as an experiment tracking and model logging system. Each trained model version is stored with its metrics and parameters, facilitating traceability and retrieval.
Apache Airflow: responsible for orchestrating data pipelines and machine learning through automated DAGs.
Python: used as the base language for large-scale training and advanced analytics.

The entire data infrastructure (Data Layer) is centralised to offer polyglot persistence:

MinIO: deployed as S3-compatible object storage for historical data and artefacts.
Prometheus: integrated as a non-relational time-series store for real-time monitoring of infrastructure metrics and model performance at the Edge.
TimescaleDB/PostgreSQL: used for efficient time series storage, enabling fast analytical queries on historical machinery behaviour.

In this deployment, the Data Layer is preferably located in the Fog/On-Premise environment (close to the plant) to guarantee data sovereignty and low latency in operational queries; the Cloud then consumes this data from there. Finally, the Enterprise Tier is incorporated as the solution’s operation, governance and observability layer (oriented towards users and administrators), separating exploitation from training in the Cloud. At this level, the following are deployed: (i) a web interface (React) for operators, where statuses, alarms and inference results are queried; (ii) Grafana as a monitoring console consuming Prometheus metrics; and (iii) the cluster management layer using Rancher (UI + API/Controller) to manage K3s nodes, deployment policies, updates and rollbacks. Thus, the Enterprise Tier acts as an “entry point” (Ingress/Load Balancer) and as a control layer for the lifecycle: model promotion (from staging to production), threshold configuration, version control, and end-to-end health monitoring (Edge–Fog–Cloud). This separation clarifies the dual-loop operationalisation: (i) inference and immediate alerting remain at the edge, (ii) data validation and persistence remain on-premise in the fog, and (iii) experimentation and training orchestration are handled in the cloud, while the enterprise tier provides lifecycle governance (promotion, rollback, and supervisory observability).

Finally, the use of Docker and K3s ensures portability and scalability at all levels, enabling seamless model updates through continuous deployment (CD) strategies. This hybrid Cloud-Fog-Edge architecture ensures that the bulk of production data remains on-premises for privacy and latency reasons, while leveraging central computing resources to continuously improve the predictive model. At the operational level, this tiered separation also facilitates compliance with industrial requirements (resilience, traceability and governance): the Edge guarantees continuity and latency; the Fog consolidates and normalises data; the Cloud optimises and retrains; and the Enterprise enables supervision, control and exploitation by end users (Figure 10).

This implementation demonstrates an end-to-end industrial MLOps deployment in a single CNC machining cell, including ingestion, edge inference, buffering under connectivity disruption (“store-and-forward”), and lifecycle tooling that supports traceability and operational governance. The validation, therefore, supports feasibility claims at the architectural and operational levels, namely that the proposed Cloud–Fog–Edge decomposition can be instantiated using the selected open-source stack while preserving low-latency edge inference, on-premises data handling, and lifecycle governance.

Table 6 summarises the operational indicators directly observed in the CNC single-cell validation environment. The evidence confirms that the instantiated architecture was exercised across ingestion, edge inference, governance-side drift evaluation, deployment-state control, and routing decisions for retraining and rollback. Company-derived reference data were loaded, company-mode sensor ingestion was persisted, one edge inference was recorded with an observed latency of 99.6613 ms, and repeated drift evaluations remained in the low-severity regime without triggering retraining or rollback. These values should be interpreted as operational evidence of end-to-end architectural execution in the local CNC setting, rather than as cross-device performance benchmarks. It should also be noted that some indicators, particularly the reported edge inference latency, were observed over a very limited number of persisted prediction events in the current local validation snapshot and are therefore presented as auditable operational observations rather than as statistically representative performance estimates.

Table 7 complements Table 6 by reporting a controlled edge-profiling run performed on the local Docker validation environment. Unlike the company-derived operational snapshot in Table 6, this experiment was designed to generate a sufficient number of valid inference events to characterise repeated edge latency and container-level resource usage under a reproducible synthetic load. The results provide directly observed latency percentiles, CPU, and memory measurements for the deployed edge-inference path, thereby strengthening the evaluation of resource-constrained deployment. However, these values remain environment-specific observations obtained in a local setup and should not be interpreted as cross-device benchmarks, energy measurements, or hard real-time guarantees.
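For readers who wish to reproduce this kind of summary from their own latency logs, the sketch below computes the mean and the p50, p95 and p99 percentiles with the Python standard library. The input series is synthetic and purely illustrative; it is not the data behind Table 7, and the interpolation method is the library default rather than a documented choice of the study.

```python
from statistics import mean, quantiles


def latency_summary(latencies_ms):
    """Return mean and selected percentiles of a latency series."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points; cuts[k-1] is pk
    return {
        "mean": mean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic series for illustration only (not measured values).
samples = [float(i) for i in range(1, 101)]
summary = latency_summary(samples)
print(summary)
```

Reporting the upper percentiles next to the mean matters for edge inference because occasional slow events, rather than the average case, are what threaten machine cycle times.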

Table 8 reports the results of a controlled OTA continuity experiment performed during an active inference stream in the local Docker validation environment. The evidence shows that a newly promoted generation was applied with one accepted OTA command, zero observed sync failures, zero schema mismatches, and a clean monotonic transition from generation 2 to generation 3 in the persisted inference sequence. Although the persisted inference sequence supports the claim of operational continuity at the architectural level, the available timestamps do not provide exact inference-path commit times and therefore do not justify stronger claims such as zero interruption or hard real-time continuity at millisecond or microsecond resolution.

However, validation in a single cell limits transferability across machines, operating regimes, device classes, and industrial sites. For this reason, the present study does not claim cross-scenario generalisability, nor does it position the reported implementation as a definitive benchmark for drift engineering, secure model distribution, OTA continuity under hard real-time constraints, or heterogeneous edge-device performance across OT settings. Although the local validation now includes controlled edge profiling and OTA transition evidence, broader comparisons across devices, sites, energy conditions, disruption scenarios, and timestamp-precise update continuity remain relevant directions for future quantitative validation.

7. Discussion: Summary and Positioning of the Proposal

The central objective of this research has been to bridge the gap between MLOps theoretical frameworks and their effective implementation in industrial environments subject to physical constraints. Through a systematic review of 49 recent studies and experimental validation in a CNC machining use case, the six research questions posed have been addressed. This section discusses the implications of the findings, contrasting the literature with the results obtained, to position the proposed architecture against the limitations of the state of the art. In line with the validation scope stated in Section 6, the discussion positions the experimental part as an end-to-end feasibility demonstration of operationalisation, rather than cross-scenario benchmarking; accordingly, the conclusions are framed in terms of architectural adequacy and operational plausibility under industrial constraints.

7.1. From Experimentation to Resilient Operation (RQ1, RQ2, RQ3)

The evolutionary analysis of the literature (RQ1) confirms that MLOps has evolved from a mere extension of DevOps to a strategic backbone for Industry 4.0 and 5.0, as Watson and Larson [19] suggested in their description of the field’s maturity. However, the data reveal a worrying asymmetry: while 85% of the studies analysed correctly contextualise the industrial problem, only 18.4% (Criterion C5) offer replicable artefacts or open-source code. This finding corroborates the existence of an “implementation gap” noted by Wewer [12], in which academia produces conceptual architectures that, lacking field validation, fail to withstand real-world conditions.

Our research addresses this deficit by complementing the reference architecture with an open-source architectural instantiation that can be inspected, replicated and extended. Rather than being presented as a standalone software product, this repository functions as a traceable implementation layer for the claims advanced in Section 5 and Section 6, including lifecycle governance, tiered deployment responsibilities and closed-loop operationalisation. To avoid making overly broad claims, the repository is not interpreted as proof of cross-scenario transferability, but as a reproducible artefact that strengthens transparency, inspectability and methodological reuse. The corresponding access details are provided in the Data Availability Statement. Further deployments remain necessary to assess transferability across sites, device classes and industrial contexts.

When examining adoption challenges (RQ3), the review ranked Drift Management (R01, 16.3%) and Latency (R05, 8.2%) as the most critical technical vectors, often in tension with organisational constraints. Our experimental validation with the CNC machine demonstrates that these requirements are architecturally opposed and require compromise solutions. While drift management requires cloud computing power to retrain complex models, control latency demands strict local execution. This trade-off was also reflected in the additional local edge-profiling run, where 300 valid synthetic events were processed through the deployed inference path, yielding a mean edge latency of 19.03 ms (p50 = 9.71 ms; p95 = 75.60 ms; p99 = 85.76 ms), zero observed schema mismatches, and stable container-level memory usage around 127.46 MiB. These values do not constitute a cross-device benchmark, but they do strengthen the claim that the proposed decomposition can sustain repeated low-latency inference within the local validation environment. The experimental results confirmed that a centralised architecture (P-01), which is dominant in 65.3% of the literature reviewed by Metcalfe and Amirkhanova [21,47], can introduce unacceptable latency in high-speed manufacturing processes.

In contrast, the implementation presented in this article adopts a Cloud-Fog-Edge computing strategy following a hybrid pattern (P-02), reducing latency by decentralising inference deployment. As Antonini et al. [52] argue, this is not a design choice but a physical requirement of modern cyber-physical systems. This outcome directly supports both the hybrid rationale set out in Section 5 and the ‘as-is’ to ‘to-be’ transition outlined in Section 6.

7.2. Comparative Analysis with Consolidated Frameworks (RQ4, RQ5)

This comparison is intentionally requirement-driven rather than benchmark-driven. In line with the evidence synthesised in Section 4, the discussion contrasts the proposed architecture with consolidated alternatives along four dimensions that recur in the reviewed literature: (i) placement of training and inference workloads across tiers, (ii) suitability for latency-sensitive industrial execution, (iii) support for lifecycle governance and controlled adaptation, and (iv) openness and replicability of the implementation stack. Table 9 summarises this requirement-oriented positioning. Unlike Table 5, which reports the architectural patterns identified in the reviewed corpus, Table 9 synthesises how those patterns compare against these four dimensions.

One of the central contributions of this work is the definition of a technology stack that challenges monolithic solutions. Unlike previous reviews that catalogue tools in isolation, this study positions the proposed stack against consolidated alternatives from the perspective of the industrial constraints identified in RQ4 and RQ5. First, compared to standard centralised deployments (P-01), which dominate the reviewed state of the art with 65.3%, the proposed stack is explicitly designed as a hybrid system that separates edge inference from cloud-side training and governance. This distinction is not presented as a universal performance superiority claim, but as an architectural response to the recurrent mismatch observed in the literature between centralised manageability and shop-floor latency/resilience requirements.

Second, compared to lightweight proof-of-concept deployments that address only isolated phases of the lifecycle, the present work’s contribution lies in making cross-tier responsibilities explicit, including promotion, rollback, monitoring, on-premises persistence, and versioned orchestration between learning and inference loops. This point is especially relevant in industrial settings where hybrid patterns can deliver latency benefits, but also introduce orchestration and synchronisation overhead across tiers. In this sense, the contribution of the proposed architecture is not merely to combine tools, but to define how those responsibilities are distributed and coordinated in a traceable way.
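To make the promotion and rollback responsibilities named above more concrete, the following minimal sketch models them as operations on a small in-memory registry. The `ModelRegistry` class, its method names, and the version labels are hypothetical and introduced only for illustration; they are not the API of MLflow or of the repository accompanying this article.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry used only to illustrate the semantics."""
    versions: list = field(default_factory=list)  # registered versions, oldest first
    history: list = field(default_factory=list)   # versions promoted to production, in order

    def register(self, version: str) -> None:
        if version in self.versions:
            raise ValueError(f"version {version!r} already registered")
        self.versions.append(version)

    @property
    def production(self):
        """Currently promoted version, or None if nothing has been promoted."""
        return self.history[-1] if self.history else None

    def promote(self, version: str, validation_passed: bool) -> bool:
        """Promote only registered versions that passed validation."""
        if version not in self.versions or not validation_passed:
            return False
        self.history.append(version)
        return True

    def rollback(self):
        """Revert to the previously promoted version and return it."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.history.pop()
        return self.production


registry = ModelRegistry()
registry.register("v1")
registry.register("v2")
registry.promote("v1", validation_passed=True)
registry.promote("v2", validation_passed=True)
rolled_back_to = registry.rollback()  # production returns to "v1"
```

Keeping the promotion history as an ordered list is one simple way to make rollback traceable: every production change leaves a record, and reverting never requires guessing which version ran before.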

Third, compared to standard Kubernetes-centric deployments, the proposed architecture addresses resource constraints at the edge (R03) that many general-purpose stacks do not explicitly address in brownfield industrial environments. While native Kubernetes deployments often require cluster resources beyond the capacity of a typical industrial PC, the proposed solution relies on lightweight orchestration with K3s and decoupled workflow management with Airflow, making the architecture better suited for deployment on constrained nodes near the process. As resource footprints depend on both workload and device, this is framed as an argument of architectural suitability rather than as a universal performance claim. In this regard, the local edge profiling reported in Section 6 is used only to support bounded feasibility within the validation environment.

Fourth, compared to proprietary “black box” platforms, the proposed solution, based on MLflow and MinIO, is designed to preserve data sovereignty. More precisely, the architecture keeps the data layer in the fog/on-premises environment by default, while allowing cloud-side experimentation and orchestration to consume this data under controlled policies. This design helps reduce vendor lock-in and supports OT/IT integration (R04) by acting as a junction point between plant-level systems and higher-level lifecycle services. In environments with legacy protocols and strong governance requirements, this role is particularly relevant.

Finally, in contrast with more specialised deployment patterns such as TinyML (P-03) or Federated Learning (P-04), the proposed architecture is positioned as a general-purpose hybrid baseline for industrial environments where low-latency inference, lifecycle governance, and reproducible open deployment must coexist. TinyML remains more appropriate for ultra-low-power or microcontroller-based execution, while Federated Learning is better suited to privacy-preserving multi-site collaboration. However, neither of these patterns currently offers the same degree of general-purpose architectural coverage for the type of brownfield industrial MLOps scenario addressed in this study.

Accordingly, the comparison offered in this manuscript should be read as a structured, evidence-based positioning exercise grounded in the reviewed architectural patterns, the industrial requirements extracted in Section 4, and the bounded feasibility results reported in Section 6. Its purpose is not to claim numerical superiority over any single framework, but to make explicit where the proposed architecture offers a more suitable configuration for industrial environments that combine OT/IT integration, low-latency inference, lifecycle governance, and reproducible open deployment.

7.3. MLOps as an Enabler of Industry 5.0 and Emerging Technologies (RQ6)

Finally, the discussion on RQ6 suggests that the proposed architecture can be interpreted as a conceptual enabler for integrating emerging technologies beyond stand-alone prediction. In this respect, the current contribution provides the lifecycle and governance backbone that could support future extensions such as Digital Twin synchronisation, Federated Learning, or explainability-oriented services. Still, these integrations should not be interpreted as validated capabilities of the present implementation.

By standardising data intake and implementing immutable model versioning, the system could function as a service that keeps a Digital Twin synchronised with its physical counterpart, thereby addressing the model obsolescence problem identified by Kruschinski et al. [27]. This positioning is conceptual: the current contribution provides the backbone for lifecycle and governance, which can support such synchronisation. Specific Digital Twin integrations, however, depend on the implementation of the target twin and are therefore considered an extension rather than a demonstrated result of the present CNC case study.

Furthermore, the physical separation between the training and inference planes lays the technical groundwork for future Federated Learning implementations (P-04). In scenarios where data privacy (R10) prevents centralisation in the cloud, the proposed architecture allows Fog computing nodes to evolve into local participants in a federated network. The federated pattern (P-04) is still emerging, with only 6.1% adoption, but it is important for cross-factory collaboration. In this way, the Fog layer would not only perform inference but could also compute local gradients without sharing raw data, as proposed by Cheng and Long for collaborative environments, aligning with the privacy and sustainability requirements of Industry 5.0 [51]. Such an extension raises practical feasibility challenges, including cross-site data heterogeneity, communication overheads, and secure aggregation. These challenges are not validated in the current CNC case study and therefore remain part of future work.
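The aggregation step behind this idea can be sketched briefly. The example below uses sample-weighted averaging of model parameters in the style of federated averaging, which is a common choice but only one option; whether gradients or parameters are exchanged is not fixed by this article. The function name `federated_average`, the two sites, and their values are hypothetical, and secure aggregation and communication handling are deliberately omitted.

```python
def federated_average(updates):
    """Sample-weighted average of parameter vectors.

    `updates` is a list of (num_samples, parameters) pairs, one per site.
    Only parameters and sample counts are exchanged, never raw records.
    """
    if not updates:
        raise ValueError("at least one site update is required")
    total = sum(n for n, _ in updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(updates[0][1])
    if any(len(params) != dim for _, params in updates):
        raise ValueError("all sites must report vectors of equal length")
    return [
        sum(n * params[i] for n, params in updates) / total
        for i in range(dim)
    ]


# Two hypothetical fog nodes holding different amounts of local data.
site_a = (100, [1.0, 2.0])
site_b = (300, [3.0, 6.0])
global_params = federated_average([site_a, site_b])
```

Weighting by sample count means that the site with more local data has proportionally more influence on the shared model, which is also where cross-site data heterogeneity starts to matter in practice.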

In the current implementation, observability is primarily used for operational monitoring and lifecycle triggering; therefore, explainability (XAI) should be considered a possible architectural extension rather than a validated module. From the perspective of human-centred Industry 5.0, the current observability layer provides a plausible basis for building such an explainability service later. At the architectural level, such an extension would operate as an explainability service connected to the monitoring layer and operator UI, generating operator-oriented summaries that relate monitored variables, model version, degradation status, and recent inference behaviour. The expected outputs would therefore not be raw feature-importance vectors alone, but concise operational explanations that support supervision, validation, and override decisions on the shop floor. However, this explainability layer is not validated in the present CNC case study and therefore remains part of future work.

7.4. Academic and Managerial Implications

This work contributes to industrial MLOps research by providing a traceable bridge from evidence-based requirements (Section 4) to a reference architecture organised according to ISO/IEC/IEEE 42010 viewpoints (Section 5), together with an open-source stack that supports bounded end-to-end validation in an industrial CNC setting (Section 6). Rather than merely cataloguing tools, the contribution is a reusable architecture description that clarifies cross-tier responsibilities, especially the separation and coordination of inference and learning loops, and the role of governance and traceability in environments with high downtime costs. It also identifies areas where current evidence remains weak, such as replicability, closed-loop automation, and quantified edge constraints, thereby helping future studies focus on measurable validation rather than purely conceptual proposals.

For practitioners, the study provides a structured basis for deciding how MLOps responsibilities should be distributed across cloud, fog/on-premises, and edge layers under brownfield industrial constraints. In particular, it highlights the managerial relevance of lifecycle governance, version traceability, controlled promotion paths, and bounded local autonomy when deploying ML in settings where latency, continuity, and auditability matter. From this perspective, the reference architecture can support communication between technical and operational stakeholders by clarifying which architectural decisions are driven by business continuity, infrastructure constraints, and plant-level change-management requirements.

Finally, the discussion reiterates that the validation is limited to a single CNC machining cell. Consequently, cross-scenario generalisability and comparative benchmarking remain to be explored. Section 6 outlines the future validation plan, which defines the minimum additional instrumentation and reporting required to strengthen conclusiveness without altering the core architecture. This includes a dual-loop synchronisation protocol, drift statistics and thresholds, edge profiling, OTA continuity checks, and security/integrity mechanisms.
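One of the items listed above, drift statistics and thresholds, can be illustrated with a deliberately simple monitoring rule. The sketch below is an assumption-laden example, not the statistic used in the study: it scores drift as the shift of a recent window mean measured in reference standard deviations, and the function names, the threshold of 2.0, and the sample values are all hypothetical.

```python
from statistics import mean, pstdev


def drift_score(reference, window):
    """Absolute shift of the window mean, in units of the reference standard deviation."""
    sigma = pstdev(reference)
    if sigma == 0:
        raise ValueError("reference data has zero variance")
    return abs(mean(window) - mean(reference)) / sigma


def should_retrain(reference, window, threshold=2.0):
    """Return True when the drift score exceeds the (assumed) threshold."""
    return drift_score(reference, window) > threshold


reference = [10.0, 12.0, 14.0, 16.0, 18.0]  # synthetic training-time feature values
stable = [13.0, 14.0, 15.0]                 # recent window close to the reference
shifted = [24.0, 25.0, 26.0]                # recent window after a process change

retrain_when_stable = should_retrain(reference, stable)
retrain_when_shifted = should_retrain(reference, shifted)
```

In a closed-loop setting, a `True` result would be reported as a retraining trigger to the learning loop rather than acted on locally, so that promotion still passes through the governed path.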

8. Conclusions and Future Work

8.1. Conclusions

The effective integration of MLOps in the manufacturing industry represents a paradigm shift that transcends mere software tool adoption, constituting a systemic challenge that requires reconciling the agility of data science with the determinism of operational technology (OT) systems. This research has addressed the current fragmentation of the field through a mixed-methods approach combining an evidence-based systematic review with a feasibility-oriented operational validation within a single CNC machining cell. The bibliometric analysis revealed a structural dependence on centralised architectures in most of the reviewed cases (as reported in Section 4). In response, the main contribution of this work is the formalisation of a Hybrid Reference Architecture (Cloud-Fog-Edge) and a reproducible, open-source instantiation aligned with it. The scope of the validation is intentionally limited to demonstrating that the proposed decomposition can be operationalised in a bounded industrial setting while preserving near-machine inference autonomy, on-premises persistence for sovereignty, and centrally managed lifecycle governance.

From a methodological perspective, the study aligns the architectural design with the ISO/IEC/IEEE 42010 standard [37] and releases an open-source reference implementation that addresses the replicability deficit (Criterion C5) present in 81.6% of the previous literature. By integrating components such as MLflow, Airflow, and K3s into a coherent stack, the implementation operationalises model maintenance workflows with lifecycle governance (traceability, controlled promotion, and rollback) and monitoring-driven retraining triggers, reducing reliance on purely manual and reactive interventions. This finding supports the position that operational resilience depends not only on predictive performance but also on controlled lifecycle mechanisms (monitoring, traceability, promotion, and rollback) that enable timely, auditable model updates within industrial constraints. In line with the scope statement in Section 6, this conclusion is framed in terms of architectural feasibility and governance adequacy rather than as a claim of quantified superiority across machines or sites.

From an academic perspective, the work combines evidence-based requirement synthesis with a multi-view reference architecture description and an open-source, implementable stack, enabling reproducible discussion of cross-tier responsibilities in industrial MLOps. From an industrial standpoint, the proposed hybrid operationalisation offers a practical migration path from centralised deployments to edge–fog–cloud separation, better aligning with shop-floor latency and resilience constraints while maintaining sensitive data on-premises.

8.2. Future Research Directions

In this article, we present a reference architecture for operationalising models in resource-constrained environments and validate its bounded feasibility in a single CNC machining cell. The following directions therefore describe architectural extensions and additional validation requirements that fall outside the scope of the current implementation. In particular, the rapid evolution of Industry 5.0 paradigms raises new challenges that go beyond the present single-cell, centrally trained setting. First, future work should reduce the reliance on centralised cloud training and explore multi-factory scenarios where data sovereignty is critical.

Future work should extend the functionality of Fog Computing nodes, currently dedicated to inference, so that they evolve into active participants in Federated Learning networks (P-04) on existing edge infrastructure, allowing multiple factories to collaborate on model training without sharing sensitive data. In this setting, the Airflow-based orchestration layer would need to coordinate the secure exchange of gradients or model updates rather than raw data. A realistic extension of the present architecture, therefore, requires explicit treatment of cross-site data heterogeneity, communication overheads, and secure aggregation, none of which are validated in the current single-cell study.

Additional future work should focus on deployment patterns for the controlled distribution of artefacts across multiple systems and factories. This includes formalising the synchronisation protocol between the learning and inference loops at fleet scale, including version-consistency rules, conflict-handling policies, rollback triggers, and promotion semantics across heterogeneous edge nodes. It also includes systematic edge profiling, such as latency distributions, memory footprints, CPU utilisation, and, where feasible, energy proxies, as well as OTA continuity checks during orchestrated updates under realistic production conditions.
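A version-consistency rule of the kind mentioned above could take the following shape. This is a hypothetical sketch only: the function names, the node identifiers, the three actions (`noop`, `resync`, `halt`), and the 50% stale-node limit are illustrative assumptions, not a protocol defined or validated in this article.

```python
def find_inconsistent_nodes(approved_version, node_versions):
    """Return nodes (sorted by name) whose deployed version differs from the approved one."""
    return sorted(
        node for node, version in node_versions.items()
        if version != approved_version
    )


def plan_fleet_action(approved_version, node_versions, max_stale_fraction=0.5):
    """Choose 'noop', 'resync' of stale nodes, or 'halt' for manual review (assumed policy)."""
    stale = find_inconsistent_nodes(approved_version, node_versions)
    if not stale:
        return "noop", stale
    if len(stale) / len(node_versions) > max_stale_fraction:
        return "halt", stale
    return "resync", stale


# Hypothetical fleet state reported by three edge nodes.
fleet = {"edge-01": "v2", "edge-02": "v2", "edge-03": "v1"}
action, stale_nodes = plan_fleet_action("v2", fleet)
```

Separating detection (`find_inconsistent_nodes`) from policy (`plan_fleet_action`) keeps the consistency check reusable while allowing each site to tune how much divergence it tolerates before requiring manual review.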

Finally, the next architectural evolution would be to move from diagnostic monitoring to increasingly autonomous action. This would require adapting the edge/fog infrastructure to support agents capable of making constrained, real-time decisions based on the monitored state, for example, by integrating lightweight simulation environments and distributed experience buffers. In this article, this agent-based approach is framed as a future architectural evolution rather than as a feature validated in the current implementation.

Author Contributions

M.Á.M.-C., F.F. and A.B. were involved in the whole process of producing this paper, including conceptualisation, methodology, modelling, validation, visualisation, and manuscript preparation. All authors have read and agreed to the published version of the manuscript.

Funding

The research that led to these findings received funding from the European Union Horizon Europe research and innovation programme under the Grant Agreement 101177368 “Agile Manufacturing as a Service through AI Autonomous Agents (MaaSAI)”.

Data Availability Statement

The architectural instantiation, deployment artefacts, and replication-oriented implementation materials supporting this study are openly available at the public repository: MLOPs-Reference-Architecture (https://github.com/CIGIP-UPV/MLOPs-Reference-Architecture (accessed on 17 March 2026)). Due to industrial confidentiality constraints, not all raw production data from the company-derived CNC reference line can be released in full; however, the repository provides the open implementation context and the reproducibility-oriented materials used to support the architectural validation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Papageorgiou, A.V.; Symeonidis, G.; Nerantzis, E.; Papakostas, G.A. Agile MLOps: Bridging the Gap Between Agility and Machine Learning Operations. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Limassol, Cyprus, 26–29 June 2025; Springer Nature: Cham, Switzerland, 2025; pp. 15–27.
Mateo-Casalí, M.A.; Boza, A.; Fraile, F. Digital assets in zero-defect manufacturing: Literature review and proposed framework. Int. J. Prod. Res. 2025, 1–28.
Mohammed, W.M.; Ferrer, B.R.; Martinez, J.L.; Sanchis, R.; Andres, B.; Agostinho, C. A Multi-Agent Approach for Processing Industrial Enterprise Data. In Proceedings of the 2017 International Conference on Engineering, Technology and Innovation (ICE/ITMC), Funchal, Portugal, 27–29 June 2017; IEEE: New York, NY, USA, 2018; pp. 1209–1215.
Paul, A.; Son, R.Y.; Balodi, S.A.; Crooks, K. MLOps FMEA: A proactive & structured approach to mitigate failures and ensure success for machine learning operations. In Proceedings of the 2024 Annual Reliability and Maintainability Symposium (RAMS), Albuquerque, NM, USA, 22–25 January 2024; IEEE: New York, NY, USA, 2024; pp. 1–7.
Oyucu, S.; Aksöz, A. Integrating machine learning and MLOps for wind energy forecasting: A comparative analysis and optimization study on Türkiye’s wind data. Appl. Sci. 2024, 14, 3725.
Venanzi, R.; Dahdal, S.; Solimando, M.; Campioni, L.; Cavalucci, A.; Govoni, M.; Tortonesi, M.; Foschini, L.; Attana, L.; Tellarini, M.; et al. Enabling adaptive analytics at the edge with the Bi-Rex Big Data platform. Comput. Ind. 2023, 147, 103876.
Colombi, L.; Gilli, A.; Dahdal, S.; Boleac, I.; Tortonesi, M.; Stefanelli, C.; Vignoli, M. A machine learning operations platform for streamlined model serving in industry 5.0. In Proceedings of the NOMS 2024–2024 IEEE Network Operations and Management Symposium, Seoul, South Korea, 6–10 May 2024; IEEE: New York, NY, USA, 2024.
Raffin, T.; Reichenstein, T.; Werner, J.; Kühl, A.; Franke, J. A reference architecture for the operationalization of machine learning models in manufacturing. Procedia CIRP 2022, 115, 130–135.
Zimelewicz, E.; Kalinowski, M.; Mendez, D.; Giray, G.; Santos Alves, A.P.; Lavesson, N.; Azevedo, K.; Villamizar, H.; Escovedo, T.; Lopes, H.; et al. Ml-enabled systems model deployment and monitoring: Status quo and problems. In Proceedings of the International Conference on Software Quality, Vienna, Austria, 23–25 April 2024; Springer Nature: Cham, Switzerland, 2024; pp. 112–131.
Schreier, U.; Reimann, P.; Mitschang, B. A Kanban-based approach to manage machine learning projects in manufacturing. Procedia CIRP 2025, 134, 109–114.
Dahdal, S.; Tortonesi, M. Enabling Big Data and Machine Learning Applications in High-Stakes Environments. In Proceedings of the NOMS 2024–2024 IEEE Network Operations and Management Symposium, Seoul, South Korea, 6–10 May 2024; IEEE: New York, NY, USA, 2024.
Wewer, C.R.; Mahapatra, H.; Esterle, L.; Larsen, P.G. Using FactoryML for Deployment of Machine Learning Models in Industrial Production. In Proceedings of the 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA), Padova, Italy, 10–13 September 2024; IEEE: New York, NY, USA, 2024.
Faubel, L.; Woudsma, T.; Methnani, L.; Ghezeljhemeidan, A.G.; Buelow, F.; Schmid, K.; Van Driel, W.D.; Kloepper, B.; Theodorou, A.; Nosratinia, M.; et al. A mlops architecture for XAI in industrial applications. In Proceedings of the 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA), Padova, Italy, 10–13 September 2024; IEEE: New York, NY, USA, 2024.
Bachinger, F.; Zenisek, J.; Affenzeller, M. Automated machine learning for industrial applications–challenges and opportunities. Procedia Comput. Sci. 2024, 232, 1701–1710.
Marinova, S.; Tian, Y.; Leon-Garcia, A. E2E network slice assurance for B5G/6G: Realizing data collection and management, MLOps, and closed-loop control. IEEE Open J. Commun. Soc. 2025, 6, 759–774.
Martínez-Arellano, G.; Ratchev, S. Towards Frugal Industrial AI: A framework for the development of scalable and robust machine learning models in the shop floor. Int. J. Adv. Manuf. Technol. 2025, 138, 169–191.
Rigas, S.; Tzouveli, P.; Kollias, S. An end-to-end deep learning framework for fault detection in marine machinery. Sensors 2024, 24, 5310.
Ruf, P.; Reich, C.; Ould-Abdeslam, D. Aspects of module placement in machine learning operations for cyber physical systems. In Proceedings of the 2022 11th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 7–11 June 2022; IEEE: New York, NY, USA, 2022; pp. 1–6.
Watson, H.J.; Larson, D. MLOps: From a Cottage Industry to a Factory Approach. Int. J. Bus. Intell. Res. IJBIR 2024, 15, 1–22.
Antonini, M.; Pincheira, M.; Vecchio, M.; Antonelli, F. An adaptable and unsupervised TinyML anomaly detection system for extreme industrial environments. Sensors 2023, 23, 2344.
Metcalfe, B.; Acosta-Pavas, J.C.; Robles-Rodriguez, C.E.; Georgakilas, G.K.; Dalamagas, T.; Aceves-Lara, C.A.; Daboussi, F.; Koehorst, J.J.; Corrales, D.C. Towards a machine learning operations (MLOps) soft sensor for real-time predictions in industrial-scale fed-batch fermentation. Comput. Chem. Eng. 2025, 194, 108991.
Kreuzberger, D.; Kühl, N.; Hirschl, S. Machine learning operations (mlops): Overview, definition, and architecture. IEEE Access 2023, 11, 31866–31879.
Grilo, A.; Figueiras, P.; Rêga, B.; Lourenço, L.; Khodamoradi, A.; Costa, R.; Jardim-Gonçalves, R. Data analytics environment: Combining visual programming and mlops for ai workflow creation. In Proceedings of the 2024 IEEE International Conference on Engineering, Technology, and Innovation (ICE/ITMC), Funchal, Portugal, 24–28 June 2024; IEEE: New York, NY, USA, 2024; pp. 1–9.
Maier, R.; Schlattl, A.; Guess, T.; Mottok, J. CausalOps—Towards an industrial lifecycle for causal probabilistic graphical models. Inf. Softw. Technol. 2024, 174, 107520.
Hegedűs, C.; Varga, P. Tailoring mlops techniques for industry 5.0 needs. In Proceedings of the 2023 19th International Conference on Network and Service Management (CNSM), Niagara Falls, ON, Canada, 30 October–2 November 2023; IEEE: New York, NY, USA, 2023; pp. 1–7.
Varga, P.; Kővári, Á.; Herkules, M.; Hegedűs, C. MLOps in CPS–a use-case for image recognition in changing industrial settings. In Proceedings of the NOMS 2024–2024 IEEE Network Operations and Management Symposium, Seoul, South Korea, 6–10 May 2024; IEEE: New York, NY, USA, 2024.
Kruschinski, D.; Ngassam, D.T.; Durak, U.; Hartmann, S. An MLOps Framework to Data-Driven Modelling of Digital Twins with an Application to Virtual Test Rigs. In Proceedings of the International Conference on Conceptual Modeling, Pittsburgh, PA, USA, 28–31 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 71–86.
Mateo-Casalí, M.Á.; Gil, F.F.; Boza, A.; Nazarenko, A. An Industry Maturity Model for Implementing Machine Learning Operations in Manufacturing. Int. J. Prod. Manag. Eng. 2023, 11, 179–186.
Chakraborty, A.; Das, S.; Gary, K. Machine Learning Operations: A Mapping Study. In Proceedings of the World Congress in Computer Science, Computer Engineering & Applied Computing, Las Vegas, NV, USA, 22–25 July 2024; Springer Nature: Cham, Switzerland, 2024; pp. 3–21.
Leest, J.; Gerostathopoulos, I.; Raibulet, C. Evolvability of machine learning-based systems: An architectural design decision framework. In Proceedings of the 2023 IEEE 20th International Conference on Software Architecture Companion (ICSA-C), L’Aquila, Italy, 13–17 March 2023; IEEE: New York, NY, USA, 2023; pp. 106–110.
Faubel, L.; Woudsma, T.; Kloepper, B.; Eichelberger, H.; Buelow, F.; Schmid, K.; Ghezeljehmeidan, A.G.; Methnani, L.; Theodorou, A.; Bång, M. MLOps for Cyberphysical Production Systems: Challenges and Solutions. IEEE Softw. 2024, 42, 65–73.
Andres, B.; Diaz-Madronero, M.; Soares, A.L.; Poler, R. Enabling Technologies to Support Supply Chain Logistics 5.0. IEEE Access 2024, 12, 43889–43906.
Rani, F.; Chollet, N.; Vogt, L.; Urbas, L. Industrial Edge MLOps: Overview and Challenges. Comput. Aided Chem. Eng. 2024, 53, 3019–3024.
Chadli, K.; Botterweck, G.; Saber, T. The environmental cost of engineering machine learning-enabled systems: A mapping study. In Proceedings of the 4th Workshop on Machine Learning and Systems, Athens, Greece, 22 April 2024; ACM: New York, NY, USA, 2024; pp. 200–207.
Raffin, T.; Reichenstein, T.; Klier, D.; Kühl, A.; Franke, J. Qualitative assessment of the impact of manufacturing-specific influences on machine learning operations. Procedia CIRP 2022, 115, 136–141.
Safdar, M.; Paul, P.P.; Lamouche, G.; Wood, G.; Zimmermann, M.; Hannesen, F.; Bescond, C.; Wanjara, P.; Zhao, Y.F. Fundamental requirements of a machine learning operations platform for industrial metal additive manufacturing. Comput. Ind. 2024, 154, 104037.
ISO/IEC/IEEE 42010:2022; Software, Systems and Enterprise Architecture Description. ISO: Geneva, Switzerland, 2022.
von Hahn, T.; Mechefske, C.K. Machine learning in cnc machining: Best practices. Machines 2022, 10, 1233.
Manickam, D.D.; Mohamed, S.; Jain, V.; Goswami, D.; Lensink, L. A structured inference optimization approach for vision-based DNN deployment on legacy systems. In Proceedings of the 2023 IEEE 28th International Conference on Emerging Technologies and Factory Automation (ETFA), Sinaia, Romania, 12–15 September 2023; IEEE: New York, NY, USA, 2023; pp. 1–8.
Sood, I.; Kaushik, A.; Bulgerin, T.; Kumar, P.; Rath, S.; Khemiri, A.; Chang, J.; Hsu, S.; Bedorf, J. Supporting fab operations using multi-agent reinforcement learning. In Proceedings of the 2024 35th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC), Albany, NY, USA, 13–16 May 2024; IEEE: New York, NY, USA, 2024; pp. 1–6.
Bodor, A.; Hnida, M.; Najima, D. MLOps: Overview of current state and future directions. In Proceedings of the International Conference on Smart City Applications, Castelo Branco, Portugal, 19–21 October 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 156–165.
Garrone, A.; Minisi, S.; Oneto, L.; Dambra, C.; Borinato, M.; Sanetti, P.; Vignola, G.; Papa, F.; Mazzino, N.; Anguita, D. Simple non regressive informed machine learning model for prescriptive maintenance of track circuits in a subway environment. In Proceedings of the International Conference on System-Integrated Intelligence, Genova, Italy, 7–9 September 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 74–83.
Li, P.; Mavromatis, I.; Khan, A. Past, present, future: A comprehensive exploration of ai use cases in the umbrella iot testbed. In Proceedings of the 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and Other Affiliated Events (PerCom Workshops), Biarritz, France, 11–15 March 2024; IEEE: New York, NY, USA, 2024; pp. 787–792.
Chen, H.; Liu, C.T.; Hsu, H.Y.; Hsieh, J.Y. A Federated implementation for MLOps framework based on non-intrusive load monitoring. In Proceedings of the 2023 IEEE 5th Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 27–29 October 2023; IEEE: New York, NY, USA, 2023; pp. 284–289.
Luley, P.P.; Deriu, J.M.; Yan, P.; Schatte, G.A.; Stadelmann, T. From concept to implementation: The data-centric development process for AI in industry. In Proceedings of the 2023 10th IEEE Swiss Conference on Data Science (SDS), Zurich, Switzerland, 22–23 June 2023; IEEE: New York, NY, USA, 2023; pp. 73–76.
Amirkhanova, G.; Amirkhanov, B.; Tyulepberdinova, G.; Ishmurzin, T. Application of Machine Learning Algorithms in Digital Twin Monitoring Systems: An Overview of Approaches, Methods, and Prospects. In Proceedings of the 2024 International Conference on Intelligent Computing and Next Generation Networks (ICNGN), Bangkok, Thailand, 23–25 November 2024; IEEE: New York, NY, USA, 2024; pp. 1–5.
Martínez-Arellano, G.; Ratchev, S. Improving the development and reusability of Industrial AI through Semantic Models. In Proceedings of the Conference on Learning Factories, Enschede, The Netherlands, 17–19 April 2024; Springer Nature: Cham, Switzerland, 2024; pp. 179–186.
Liao, Q.; Kesters, M.; Landuyt, D.V.; Joosen, W. Data Chameleon: A Self-adaptive Synthetic Data Management System. In Proceedings of the IFIP Annual Conference on Data and Applications Security and Privacy, Gjøvik, Norway, 23–24 June 2025; Springer Nature: Cham, Switzerland, 2025; pp. 44–56.
Kukkaro, A.; Moreschini, S.; Hästbacka, D. Continuous Training vs. Transfer Learning on Edge and Fog Environments: A Steam Detection use Case. In Proceedings of the 2024 50th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), Paris, France, 28–30 August 2024; IEEE: New York, NY, USA, 2024; pp. 138–141.
Cheng, Q.; Long, G. Federated learning operations (flops): Challenges, lifecycle and approaches. In Proceedings of the 2022 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), Tainan, Taiwan, 1–3 December 2022; IEEE: New York, NY, USA, 2022; pp. 12–17.
Bayram, F.; Ahmed, B.S. Towards trustworthy machine learning in production: An overview of the robustness in mlops approach. ACM Comput. Surv. 2025, 57, 1–35.
Antonini, M.; Pincheira, M.; Vecchio, M.; Antonelli, F. Tiny-MLOps: A framework for orchestrating ML applications at the far edge of IoT systems. In Proceedings of the 2022 IEEE international Conference on Evolving and Adaptive Intelligent Systems (EAIS), Larnaca, Cyprus, 25–27 May 2022; IEEE: New York, NY, USA, 2022; pp. 1–8.

Figure 1. Review methodology (The asterisk (*) in the search string denotes truncation; therefore, Manufact* captures lexical variants such as manufacture, manufacturing, manufacturer, and manufactured).

Figure 2. Publications by year and type.

Figure 3. Authors per paper distribution.

Figure 4. The 10 main keywords.

Figure 5. Category coverage based on keywords.

Figure 6. Co-occurring words in VOSviewer (v1.6.20).

Figure 7. MLOps architecture.

Figure 8. “As-is” architecture (starting point) based on a central server and VMs (Hyper-V).

Figure 9. Cloud-fog-edge architecture.

Figure 10. Equipment distribution.

Table 1. Selected final articles.

Paper ID	Title	Year
P01	A reference architecture for the operationalisation of machine learning models in manufacturing	2022
P02	Federated Learning Operations (FLOps): Challenges, Lifecycle and Approaches	2022
P03	Machine Learning in CNC Machining: Best Practices	2022
P04	Tiny-MLOps: a framework for orchestrating ML applications at the far edge of IoT systems	2022
P05	Qualitative assessment of the impact of manufacturing-specific influences on Machine Learning Operations	2022
P06	A Federated implementation for the MLOps framework based on non-intrusive load monitoring	2023
P07	An Adaptable and Unsupervised TinyML Anomaly Detection System for Extreme Industrial Environments	2023
P08	An Industry Maturity Model for Implementing Machine Learning Operations in Manufacturing	2023
P09	Data Analytics Environment Combining Visual Programming and MLOps for AI workflow creation	2023
P10	Enabling adaptive analytics at the edge with the Bi-Rex Big Data platform	2023
P11	Evolvability of Machine Learning-based Systems: An Architectural Design Decision Framework	2023
P12	From Concept to Implementation: The Data-Centric Development Process for AI in Industry	2023
P13	Machine Learning Operations (MLOps): Overview, Definition, and Architecture	2023
P14	MLOps: Overview of Current State and Future Directions	2023
P15	Tailoring MLOps Techniques for Industry 5.0 Needs	2023
P16	Simple Non-Regressive Informed Machine Learning Model for Prescriptive Maintenance of Track Circuits…	2023
P17	A Machine Learning Operations Platform for Streamlined Model Serving in Industry 5.0	2024
P18	An MLOps Architecture for XAI in Industrial Applications	2024
P19	A Structured Inference Optimisation Approach for Vision-Based DNN Deployment on Legacy Systems	2024
P20	An End-to-End Deep Learning Framework for Fault Detection in Marine Machinery	2024
P21	CausalOps—Towards an industrial lifecycle for causal probabilistic graphical models	2024
P22	Application of Machine Learning Algorithms in Digital Twin Monitoring Systems: An Overview of Approaches, Methods, and Prospects	2024
P23	Aspects of Module Placement in Machine Learning Operations for Cyber-Physical Systems	2024
P24	An MLOps Framework to Data-Driven Modelling of Digital Twins with an Application to Virtual Test Rigs	2024
P25	Automated Machine Learning for Industrial Applications—Challenges and Opportunities	2024
P26	Continuous Training vs. Transfer Learning on Edge and Fog Environments: A Steam Detection Use Case	2024
P27	Deploying a Sustainable Deep Learning Pipeline for Poison Ivy Image Classification	2024
P28	E2E Network Slice Assurance for B5G/6G: Realising Data Collection and Management, MLOps and Closed Loop Control	2024
P29	Enabling Big Data and Machine Learning Applications in High-Stakes Environments	2024
P30	Fundamental requirements of a machine learning operations platform for industrial metal additive manufacturing	2024
P31	Improving the Development and Reusability of Industrial AI Through Semantic Models	2024
P32	Industrial Edge MLOps: Overview and Challenges	2024
P33	Integrating Machine Learning and MLOps for Wind Energy Forecasting	2024
P34	ML-Enabled Systems Model Deployment and Monitoring: Status Quo and Problems	2024
P35	MLOps FMEA: A Proactive & Structured Approach to Mitigate Failures	2024
P36	MLOps for Cyberphysical Production Systems: Challenges and Solutions	2024
P37	MLOps in CPS: a use-case for image recognition in changing industrial settings	2024
P38	MLOps: From a cottage industry to a factory approach	2024
P39	Past, Present, Future: A Comprehensive Exploration of AI Use Cases in the UMBRELLA IoT Testbed	2024
P40	The Environmental Cost of Engineering Machine Learning-Enabled Systems: A Mapping Study	2024
P41	Using FactoryML for the Deployment of Machine Learning Models in Industrial Production	2024
P42	Supporting Factory Operations Using Multi-Agent Reinforcement Learning	2024
P43	A Kanban-based Approach to Managing Machine Learning Projects in Manufacturing	2025
P44	Agile MLOps: Bridging the Gap Between Agility and Machine Learning Operations	2025
P45	Machine Learning Operations: A Mapping Study	2025
P46	Towards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach	2025
P47	Towards Frugal Industrial AI: a framework for the development of scalable and robust machine learning models	2025
P48	Towards a machine learning operations (MLOps) soft sensor for real-time predictions in industrial-scale fed-batch fermentation	2025
P49	Data Chameleon: A Self-adaptive Synthetic Data Management System	2025

Table 2. Quality assessment of articles.

Paper ID	C1	C2	C3	C4	C5	Brief Justification
P01	1	0	1	1	0	Conceptual proposal for reference architecture. Describes necessary components (Docker, MQTT) but does not present an experimental implementation or validation with real data.
P02	1	0	0	1	0	Methodological proposal. Coins the term “FLOps”. Defines the life cycle and challenges of MLOps in federated (cross-silo) environments. Purely theoretical/conceptual, with no case study or technical implementation.
P03	2	2	2	1	2	Excellent replicability. Real-world case of CNC tool wear + public dataset. Shares code and data. Focuses on model construction “best practices,” although it acknowledges that continuous deployment (C4) is a future task.
P04	2	2	1	2	1	TinyML/Edge. Extends MLOps to microcontrollers (Far Edge). Evaluates deployment and inference latencies on limited devices. Addresses the challenge of orchestration on low-power hardware.
P05	1	0	0	1	0	Qualitative study. Cross-references MLOps capabilities with manufacturing requirements through literature review. Identifies semantic gaps between OT and IT but does not present implementation or data.
P06	2	2	2	2	1	Clear technical case (NILM/Smart Grid) with public dataset (AMPds). Federated implementation with GitHub Actions and Docker. The code is mentioned, but there is no direct link to the specific repo.
P07	2	2	2	1	1	Proposes an unsupervised TinyML system for anomaly detection in extreme industrial environments running on a microcontroller (e.g., ESP32), reporting “real” edge metrics (memory, inference/training times, footprint), with a practical approach to deployment.
P08	1	1	1	2	0	Maturity model (IMM-MLOps) validated by experts (interviews). Very strong in defining operational practices (C4), but without technical validation through actual deployment.
P09	2	2	2	2	1	No-Code platform for SMEs. Use case: Injection moulding (real data). Integrates Node-RED with MLflow and Docker. Supports full cycle but does not include public repo.
P10	2	1	2	1	1	It presents an OT/IT industrial platform for adaptive analysis, articulating an OT layer at the industrial edge (close to the machine) and an IT layer with services and storage (MQTT/Kafka-type connectivity), useful as a reference architecture for interoperability and deployment in the plant.
P11	1	1	0	1	0	Architectural Framework. Proposes a decision-making framework for managing evolvability (drift). Focuses on design strategy (“when to retrain”), illustrated with examples, but without detailed actual deployment.
P12	2	2	2	2	1	Data-Centric AI for SMEs. Describes a specific process applied to manufacturing/machining. Detailed implementation using Airflow, DVC, and MLflow. Excellent description of the data lifecycle and management.
P13	1	0	0	0	0	Fundamental reference. Defines MLOps through a mixed-method review and expert interviews. Establishes principles and roles. Being theoretical/definitional, it scores low on technical implementation but is key to the theoretical framework.
P14	1	0	0	1	0	Overview. Introduces basic concepts, tools (MLflow, Kubeflow) and the lifecycle. It is introductory, with no novel contributions in terms of design, data or implementation.
P15	1	1	1	1	0	Architectural Proposal. Proposes the “Olympics Model” to integrate MLOps with Systems Engineering (CPS). It is based on the requirements of the AIMS 5.0 project, but the article is conceptual/propositional without detailed experimental validation.
P16	2	2	2	2	1	Real case (Hitachi Rail). Metro track maintenance. It stands out in C4 for addressing a critical operational problem: ensuring that weekly retraining is not regressive (does not introduce new errors), using constraints in XGBoost.
P17	2	2	2	2	2	Real case study at a gear manufacturing company (Bonfiglioli). Complete MLOps infrastructure (K8s, Jenkins) and comparative performance evaluation (BentoML vs. TorchServe). Code available.
P18	1	1	1	2	2	Architecture focused on XAI and feedback loops. General industrial context (EXPLAIN project). It excels in operation (C4) and has a repository in GitLab, although validation is preliminary.
P19	2	2	2	1	2	Specific industrial case: deployment of vision on legacy hardware. Clear technical methodology (OpenVINO, quantisation). Repository available.
P20	2	2	2	1	2	Clear naval context with real ship data. Complete (end-to-end) fault detection pipeline. The focus is on modelling, with less detail on continuous operation (feedback loops).
P21	2	1	1	2	1	Defines an “industrial” lifecycle for causal models (causal PGMs) with an emphasis on roles, phases, artefacts and governance, supported by practical experience (e.g., automotive/safety), which makes it very robust for arguing organisational maturity.
P22	1	0	0	0	0	Review. Analyses ML methods for Digital Twins from a theoretical perspective. Proposes a general monitoring scheme but does not present experimental implementation, proprietary data or deployment.
P23	2	2	1	1	1	Addresses distributed deployment (Edge vs. Cloud) in CPS. Validated through Proof of Concept (PoC) in a factory simulation (Fischertechnik). Infrastructure described (K3s, Zenoh), but synthetic simulation data.
P24	2	2	2	2	1	Proposes an MLOps framework for data-driven Digital Twins with a very explicit stack (e.g., Kubeflow Pipelines/Katib/MinIO + MQTT/ETL-type ingestion + REST/gRPC serving), incorporates monitoring and triggers for degradation or distribution changes to update/retrain the model, and also evaluates operational aspects (including serving load/latency tests).
P25	2	1	1	1	1	Work focused on the requirements and challenges of industrialising AutoML/ML (data heterogeneity, drift, monitoring, interpretability, functional safety, traceability), with support from industrial experience/partners.
P26	2	2	2	2	1	Clear industrial case (detection of steam leaks in sterilisation). Experimentally compares retraining strategies (CT vs. TL) in Edge. Detailed infrastructure (K3s), but no link to the repo.
P27	1	2	2	1	2	Green AI approach and quantisation at the edge (Jetson/Rpi). Non-industrial-manufacturing context (environmental), therefore C1 = 1. Notable for having code available.
P28	2	2	2	2	1	Telecom sector (5G slicing). Implements a closed loop to ensure SLAs through automatic retraining. Very comprehensive architecture (ZSM, Kafka, MinIO).
P29	1	0	0	0	0	Doctoral Symposium. Describes the research plan (PhD journey) on data management in critical environments (Humanitarian/Industry 5.0). It is a conceptual proposal with no implementation or validation reported yet.
P30	2	1	1	2	0	Requirements Engineering. Defines the functional requirements and roles for a specific MLOps platform in Metal Additive Manufacturing (MAM). Very valuable for defining the operational architecture (C4), although it does not implement the final platform.
P31	1	1	1	1	0	Proposes a semantic framework (ontology) to capture context and facilitate reuse. Validation through a preliminary conceptual “scenario,” with no reported full industrial deployment.
P32	1	0	0	0	0	Review and survey. Analyses tools and defines a base architecture for Edge MLOps. Useful for a theoretical framework, but does not present experimentation or proprietary data.
P33	2	2	2	2	1	Real case with SCADA data from a wind turbine in Turkey. Implements End-to-End MLOps pipeline (Docker, Jenkins) and measures inference latencies and accuracy (RMSE). Very comprehensive from a technical standpoint.
P34	1	1	1	1	2	Survey of 188 professionals. Identifies actual practices and problems (e.g., legacy integration). Does not implement a system but stands out for sharing the dataset of responses and scripts (Open Science).
P35	2	2	1	2	1	Risk Management. Adapts FMEA (Failure Mode and Effects Analysis) to the CRISP-ML(Q) life cycle. Validated with a Predictive Maintenance use case (maintenance text classification).
P36	2	1	1	2	1	Multi-sector experience. Describes challenges and solutions based on three real scenarios (electronics manufacturing, metallurgy, chemistry). Proposes the “oktoflow” platform and discusses Edge/On-premises architectures.
P37	2	2	2	2	1	Technical Implementation. Security use case (geofencing of humans/forklifts). Automated pipeline with Jenkins, Docker, and YOLOv5. Details the retraining and deployment flow (CI/CD/CT).
P38	1	1	1	1	0	Book chapter/tutorial. Uses an analogy (“Craft Industry vs. Factory”) and an e-commerce scenario to illustrate concepts. Describes roles and processes well, but is not a real industrial manufacturing case study and does not present technical experimentation.
P39	2	2	2	2	1	Real Implementation (Testbed). Describes four use cases (smart lighting, digital twin, etc.) in a real IoT network with 200 nodes. Deployment with Kubernetes and Federated Learning. Dataset published, but does not provide a direct link to the system code.
P40	1	0	0	0	0	Systematic Mapping (SMS). Analyses 52 studies on the energy cost of MLOps. Useful for identifying sustainability metrics (Green AI), but as it is a secondary review, it scores low on its own technical implementation.
P41	2	2	1	2	2	Open-Source Framework. Proposes “FactoryML” to deploy models in PLCs and air-gapped environments. Real-world case study. Notable for C5 = 2 (code available) and focus on rigid industrial infrastructure.
P42	2	2	2	2	0	Production deployment (Micron). Use of RL for scheduling in semiconductor factory. Detailed MLOps section: automatic retraining, acceptance tests (UAT) and cluster deployment. Actual throughput improvement results (+2%).
P43	1	1	0	1	0	Project management approach (adapted Kanban). Validation based on a simulated “use case” rather than on a technical implementation with real data or infrastructure.
P44	1	1	0	1	0	Theoretical methodological proposal based on literature review. Analyses the integration of Scrum/Kanban in MLOps but lacks technical implementation or empirical validation.
P45	1	0	0	0	0	Mapping Study. Classifies 32 studies into Data, Modelling and Deployment pipelines. Useful for understanding research trends and taxonomy, but does not provide technical implementation or proprietary data.
P46	1	0	0	0	0	Review (Survey). Explores the intersection between “Trustworthy AI” and MLOps. Very theoretical, organises concepts of robustness, but does not present implementation or use cases of its own.
P47	2	1	1	1	0	Semantic Approach. Proposes an ontology-based framework for “Frugal AI” (little data) and reusability. Validation through conceptual monitoring scenarios, with no reported continuous operational deployment.
P48	2	2	2	2	1	Industrial Case (Biotech). Complete pipeline for a penicillin “Soft Sensor” (IndPenSim). Implements automatic retraining based on drift detection (PSI). Very strong in design and operation.
P49	1	2	1	2	1	Adaptive architecture for synthetic data management. Implements a MAPE-K loop to detect drift and retrain generators. Evaluation through simulation with a retail dataset (not a physical industrial one).

Table 3. Identified industrial requirements (R01–R12) and their frequency.

ID	Constraint/Requirement	Operational Definition	Articles	Coverage (%)	Architectural Implication
R01	Adaptability and Drift	Management of changes in data distribution (drift) or physical processes over time.	P11, P14, P15, P16, P22, P46, P48, P49	16.3	Continuous monitoring pipeline and automatic retraining triggers (CT).
R02	Data Quality	Challenges of costly labelling, sample scarcity (Small Data) or synthetic data.	P15, P24, P29, P43, P46	10.2	Integration of Data Engineering (ETL) and Data-Centric tools into the MLOps loop.
R03	Edge/TinyML	Deployment on devices with severe computing and power constraints.	P06, P18, P20, P27, P45	10.2	Model optimisation (quantisation) and lightweight remote orchestration (OTA).
R04	OT/IT integration	Interoperability with plant systems (PLCs, SCADA) and legacy hardware.	P06, P08, P20, P34, P40, P42	12.2	Use of industrial connectors (OPC-UA, MQTT) and containerised gateways (Docker) at the fog tier.
R05	Real-time latency	Strict response time requirement for inference in process control.	P01, P28, P44, P45	8.2	On-device inference, hardware acceleration, and low-latency architectures.
R06	Continuous cycle (CI/CD)	Complete automation of the flow from code to deployment and validation.	P04, P14, P17, P28, P35, P44	12.2	Complex orchestration (Jenkins, Kubeflow) and strict model version control.
R07	Scalability/Federated	Distributed fleet management or collaborative training without data sharing.	P04, P14, P15, P23, P38, P49	12.2	Decentralised architectures, fleet configuration management and secure aggregation.
R08	Robustness and reliability	Guaranteed safe operation in the event of failures and regression prevention.	P33, P41, P48	6.1	Failure mode analysis (FMEA), rollback strategies and regression tests.
R09	Explainability (XAI)	Need for human operators to understand and validate decisions.	P05, P14, P15, P30, P36	10.2	User interfaces for explainability and interpretability metrics in monitoring.
R10	Security and Privacy	Protection of sensitive data, model IP, and defence against attacks.	P04, P21, P46	6.1	Encryption, differential privacy, access control (RBAC) and security auditing.
R11	Sustainability	Minimisation of energy consumption for training and inference.	P18, P39	4.1	Energy efficiency metrics and selection of green hardware/algorithms.
R12	Governance and Methodology	Organisational frameworks, roles, maturity and alignment with business.	P03, P07, P10, P11, P25, P30, P47	14.3	Definition of standards, business KPIs and IT/OT collaboration structures.
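
The Coverage (%) column in Table 3 is consistent with a simple ratio: the number of articles listed for a requirement divided by the 49 studies in the corpus. The following is a minimal sketch of that calculation for three rows of the table; the function name `coverage` and the dictionary layout are illustrative assumptions, not part of the reviewed tooling.

```python
# Size of the reviewed corpus (49 studies, as stated in the abstract).
CORPUS_SIZE = 49

# Article lists copied from three rows of Table 3.
requirements = {
    "R01": ["P11", "P14", "P15", "P16", "P22", "P46", "P48", "P49"],
    "R05": ["P01", "P28", "P44", "P45"],
    "R11": ["P18", "P39"],
}


def coverage(articles, corpus_size=CORPUS_SIZE):
    """Return the percentage of the corpus that addresses a requirement."""
    # set() guards against an article ID being listed twice in a row.
    return round(100 * len(set(articles)) / corpus_size, 1)


for req_id, articles in requirements.items():
    print(req_id, coverage(articles))
```

Because one article can address several requirements, the coverage values are not expected to sum to 100.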

Table 4. Mapping of tools and processes.

Phase/Capability	Articles (IDs)	Coverage (%)	Most Cited Tools
Data Ingestion	P01, P04, P06, P08, P09, P13, P14, P16, P17, P19, P20, P24, P28, P29, P35, P38, P42, P44, P45, P46, P48, P49	44.9%	Kafka, MQTT, Spark, Node-RED, OPC-UA
Validation and Quality	P11, P14, P15, P24, P29, P43, P46	14.3%	DVC, Synthetic Generators, Manual Scripts
Feature Engineering	P09, P14, P28, P29, P44	10.2%	Pandas/Python (Custom), Feature Stores (rare)
Training	P01, P04, P06, P08, P09, P14, P15, P16, P18, P19, P24, P28, P29, P35, P38, P42, P44, P45, P46, P48, P49	42.9%	TensorFlow, PyTorch, Scikit-learn, XGBoost
Model Registry	P01, P14, P15, P17, P24, P28, P35	14.3%	MLflow Registry, DVC, Git-based
CI/CD and Orchestration	P01, P04, P14, P15, P17, P19, P28, P35, P44, P49	20.4%	Jenkins, GitHub Actions, Airflow, Kubeflow
Deployment/Serving	P01, P06, P08, P13, P14, P15, P17, P18, P20, P28, P35, P42, P45, P49	28.6%	Docker, K8s, BentoML, TorchServe, OpenVINO
Monitoring	P01, P06, P14, P15, P16, P19, P20, P28, P34, P35, P38, P44, P48, P49	28.6%	Prometheus, Grafana, ELK Stack
Drift Detection	P14, P16, P22, P44, P46, P48	12.2%	Alibi Detect, Evidently, Tests
Retraining	P11, P14, P16, P19, P35, P44, P46, P49	16.3%	Airflow DAGs, Jenkins Triggers, Custom Loops
Governance	P03, P07, P10, P11, P25, P30, P33, P47	16.3%	Excel, Kanban Boards, FMEA, Manual
Rollback/Assurance	P01, P34, P35, P48	8.2%	K8s Rollouts, Manual Scripts

Table 5. Architecture patterns.

Pattern ID	Deployment Architecture	Coverage (%)	Pros (+) and Cons (−)	Representative Studies
P-01	Centralised (Cloud-Centric): training and inference in cloud/server.	65.3	(+) Simplifies management and scaling. (−) High latency, dependency on internet connection, privacy risks.	P01, P02, P03, P05, P07, P09, P10, P11, P12, P14, P15, P17, P19, P21, P22, P24, P25, P26, P30, P31, P32, P33, P36, P37, P39, P40, P41, P43, P44, P46, P47, P48
P-02	Hybrid (Cloud-Fog-Edge): training in cloud, deployment to edge.	16.3	(+) Real-time response, operational autonomy. (−) Complex orchestration and synchronisation challenges.	P08, P13, P16, P20, P28, P34, P35, P49
P-03	TinyML/Far Edge: inference on microcontrollers (MCUs).	8.2	(+) Ultra-low power consumption. (−) Severe hardware constraints, difficult to update.	P06, P18, P27, P45
P-04	Federated Learning: training is distributed across nodes.	6.1	(+) Maximum data privacy (data never leaves the plant). (−) High network overhead, slow convergence.	P04, P23, P38
P-05	Air-Gapped/Isolated: manual deployment via physical media.	4.1	(+) Critical infrastructure security. (−) No monitoring, obsolete models, slow updates.	P29, P42
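
The percentage column of Table 5 can be read as each pattern's share of the 49 reviewed studies, with every study assigned to exactly one pattern. The sketch below checks that reading against the study lists in the table; the variable names are illustrative assumptions.

```python
# All paper IDs in the corpus: P01 ... P49.
corpus = {f"P{i:02d}" for i in range(1, 50)}

# Representative studies per pattern, copied from Table 5.
patterns = {
    "P-01": ("P01 P02 P03 P05 P07 P09 P10 P11 P12 P14 P15 P17 P19 P21 P22 P24 "
             "P25 P26 P30 P31 P32 P33 P36 P37 P39 P40 P41 P43 P44 P46 P47 P48").split(),
    "P-02": "P08 P13 P16 P20 P28 P34 P35 P49".split(),
    "P-03": "P06 P18 P27 P45".split(),
    "P-04": "P04 P23 P38".split(),
    "P-05": "P29 P42".split(),
}

# Flatten the assignment to check that it partitions the corpus.
assigned = [paper for ids in patterns.values() for paper in ids]
no_duplicates = len(assigned) == len(set(assigned))  # each study appears once
full_cover = set(assigned) == corpus                 # no study is left out

# Share of the corpus per pattern, rounded as in the table.
shares = {pid: round(100 * len(ids) / len(corpus), 1) for pid, ids in patterns.items()}

print(no_duplicates, full_cover)
print(shares)
```

Unlike the requirement coverage in Table 3, these shares describe a partition, so they add up to the whole corpus.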

Table 6. Operational indicators observed in the CNC single-cell validation.

Indicator	Observed Value	Unit	Evidence Interpretation
Seeded company reference events	4848	events	Company-derived reference dataset loaded into the validation environment
Persisted company sensor events	21	events	Company-mode ingestion path was exercised and persisted
Persisted edge predictions	1	predictions	Edge inference was exercised and recorded in the operational database
Observed edge inference latency	99.6613	ms	Latency of the persisted edge inference event
Closed-loop drift score	0.178127	score	Repeatedly observed in governance-side drift evaluation
Closed-loop drift severity/drifted features	low/1	categorical/features	Repeated across observed drift runs

Table 7. Controlled edge-profiling results in the local validation environment.

Indicator	Observed Value	Unit	Evidence Interpretation
Valid synthetic events sent	300	events	Controlled profiling load injected through the deployed edge-inference path
Persisted inference events	300	events	All profiling inputs produced persisted inference records in the experimental window
Prediction counter	300	predictions	The metrics prediction counter matched the persisted inference count
Schema mismatches	0	events	No schema mismatches were observed during the profiling run
Edge inference latency (mean)	19.0321	ms	Mean latency from persisted inference_events.latency_ms
Edge inference latency (p50/p95/p99)	9.713/75.6004/85.7648	ms	Percentile summary from persisted edge inference events
Edge inference latency (min/max)	2.6077/133.7507	ms	Observed latency range in the profiling run
Container CPU usage (mean/p95/max)	293.75/421.76/425.44	%	Container-level CPU from docker stats on a multi-core host
Container memory usage (mean/p95/max)	127.46/127.74/127.80	MiB	Container-level memory usage from docker stats
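
In Table 7 the mean latency (19.0321 ms) is roughly twice the median (9.713 ms), which indicates a right-skewed distribution in which a few slow inferences dominate the tail. The sketch below shows one way such a summary could be derived from persisted latency values. It is an illustration only: the sample values are hypothetical, and the nearest-rank percentile definition is an assumption, since the article does not state which percentile method was used.

```python
import math
import statistics


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds (not the measured data).
latencies_ms = [3.1, 8.9, 9.4, 9.7, 10.2, 11.0, 12.5, 40.3, 76.0, 130.2]

summary = {
    "mean": statistics.mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
    "min": min(latencies_ms),
    "max": max(latencies_ms),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f} ms")
```

With only a few hundred samples, p99 is determined by a handful of observations, so tail percentiles from a short profiling run are best treated as indicative.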

Table 8. Controlled OTA continuity results in the local validation environment.

Indicator	Observed Value	Unit	Evidence Interpretation
Initial accepted generation/version	2/2	generation/version	Edge started from an already accepted deployment state
Applied promoted generation/version	3/3	generation/version	A newly promoted model generation was successfully applied
Persisted inferences during OTA run	1200	events	Inference stream remained active throughout the experimental window
Predictions before/after generation switch	288/912	predictions	Persisted sequence shows a single clean transition from generation 2 to generation 3
First cycle served by new generation	289	cycle	Generation switch became visible at a single transition point
OTA application latency	167.20	ms	Observed from persisted edge_sync_status
OTA commands accepted	1	commands	Prometheus-compatible metrics confirmed one accepted OTA command
Sync failures	0	failures	No failed synchronisation was observed during the experiment
Schema mismatches during OTA run	0	events	No schema mismatches were observed during the generation transition
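
The continuity claim in Table 8 (a single clean transition from generation 2 to generation 3, first visible at cycle 289) can be expressed as a check over the per-cycle sequence of serving generations. The following sketch rebuilds that sequence from the counts reported in the table and verifies the property; the function name `generation_runs` and the list representation are illustrative assumptions and do not reflect the schema of the persisted records.

```python
from itertools import groupby


def generation_runs(generations):
    """Collapse a per-cycle generation sequence into (generation, run_length) pairs."""
    return [(gen, sum(1 for _ in group)) for gen, group in groupby(generations)]


# Sequence rebuilt from Table 8: 288 predictions on generation 2, then 912 on generation 3.
served = [2] * 288 + [3] * 912

runs = generation_runs(served)

# A single clean transition means exactly two runs, old generation first.
clean_transition = len(runs) == 2 and runs[0][0] == 2 and runs[1][0] == 3

# First cycle (1-indexed) served by the new generation.
first_new_cycle = runs[0][1] + 1

print(runs)
print(clean_transition, first_new_cycle, len(served))
```

A sequence that flapped between generations would produce more than two runs, so the same check would flag an unstable rollout.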

Table 9. Comparative positioning of the proposed architecture.

Pattern	Placement of Training/Inference Workloads	Suitability for Latency-Sensitive Industrial Execution	Lifecycle Governance and Controlled Adaptation	Openness/Replicability	Positioning in This Study
P-01 Centralised (Cloud-Centric)	Training and inference are mainly concentrated in cloud/server environments	Low to medium. Easy to manage centrally, but vulnerable to network dependency and latency in shop-floor settings	Medium. Can support orchestration and monitoring, but often with weaker local continuity and limited OT-side autonomy	Variable. Depends strongly on the chosen platform and deployment model	Treated as the dominant baseline in the literature, but potentially misaligned with low-latency and continuity requirements in brownfield industrial environments
P-02 Hybrid (Cloud–Fog–Edge)	Training and higher-level governance remain cloud-side, while inference is deployed close to the process, with fog/on-premises support	High. Better aligned with local inference, continuity, and timing constraints near the machine	High, but more complex. Enables separation of inference and learning loops, promotion, rollback, monitoring, and on-premises persistence, at the cost of cross-tier orchestration complexity	High when implemented with open tools and explicit cross-tier responsibilities	Positioned in this work as the most suitable general-purpose pattern for bounded industrial MLOps operationalisation when low-latency inference, OT/IT integration, and lifecycle governance must coexist
P-03 TinyML/Far Edge	Inference is pushed to ultra-constrained devices such as MCUs	Very high for extreme edge proximity	Low to medium. Strongly constrained updateability, observability, and lifecycle tooling	Medium. Often constrained by hardware-specific deployment choices	Better suited to ultra-constrained execution scenarios than to the broader CNC brownfield setting addressed here
P-04 Federated Learning	Training is distributed across multiple nodes/sites; inference placement may vary	Variable. Useful for privacy-preserving collaboration, but not primarily designed to solve near-machine latency in single-cell operation	Medium. Strong for decentralised collaboration, but introduces coordination, aggregation, and synchronisation complexity	Medium to high depending on implementation maturity	Interpreted here as a promising future extension for privacy-preserving multi-site collaboration, not as a validated capability of the present implementation
P-05 Air-Gapped/Isolated	Manual deployment via physically isolated infrastructure	Variable. Can support isolated execution, but with poor flexibility and slow update cycles	Low. Closed-loop adaptation, online monitoring, and fast retraining are severely limited	Low to medium. High isolation reduces operational openness and reuse	Relevant for highly isolated infrastructures, but misaligned with the closed-loop lifecycle-governance objective pursued in this work
Proposed architecture	Cloud-side training and governance; fog/on-premises persistence and integration; edge-side low-latency inference	High within the validated scope. Designed to support bounded local low-latency execution under industrial constraints	High within the validated scope. Explicitly structures promotion, rollback, monitoring, versioning, and separation between learning and inference loops	High. Vendor-neutral and open-source oriented by design	Positioned as a reproducible hybrid reference instantiation for industrial environments where OT/IT integration, low latency, and controlled lifecycle management must coexist

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mateo-Casalí, M.Á.; Boza, A.; Fraile, F. Towards a Reference Architecture for Machine Learning Operations. Computers 2026, 15, 218. https://doi.org/10.3390/computers15040218

