Article

Hybrid AI and LLM-Enabled Agent-Based Real-Time Decision Support Architecture for Industrial Batch Processes: A Clean-in-Place Case Study

by Apolinar González-Potes 1,*, Diego Martínez-Castro 2, Carlos M. Paredes 3, Alberto Ochoa-Brust 1, Luis J. Mena 4, Rafael Martínez-Peláez 4,5, Vanessa G. Félix 4 and Ramón A. Félix-Cuadras 1

1 Facultad de Ingeniería Mecánica y Eléctrica, Universidad de Colima, Colima 28040, Mexico
2 Facultad de Ingeniería y Ciencias Básicas, Universidad Autónoma de Occidente, Cali 760030, Colombia
3 Laboratorio de Investigación y Desarrollo en Inteligencia de Software (LIDIS), Facultad de Ingeniería, Universidad de San Buenaventura, Cali 760030, Colombia
4 Unidad Académica de Computación, Universidad Politécnica de Sinaloa, Mazatlan 82199, Mexico
5 Departamento de Ingeniería de Sistemas y Computación, Universidad Católica del Norte, Antofagasta 1270709, Chile
* Author to whom correspondence should be addressed.
AI 2026, 7(2), 51; https://doi.org/10.3390/ai7020051
Submission received: 15 December 2025 / Revised: 14 January 2026 / Accepted: 22 January 2026 / Published: 1 February 2026

Abstract

A hybrid AI and LLM-enabled architecture is presented for real-time decision support in industrial batch processes, where supervision still relies heavily on human operators and ad hoc SCADA logic. Unlike algorithmic contributions proposing novel AI methods, this work addresses the practical integration and deployment challenges arising when applying existing AI techniques to safety-critical industrial environments with legacy PLC/SCADA infrastructure and real-time constraints. The framework combines deterministic rule-based agents, fuzzy and statistical enrichment, and large language models (LLMs) to support monitoring, diagnostic interpretation, preventive maintenance planning, and operator interaction with minimal manual intervention. High-frequency sensor streams are collected into rolling buffers per active process instance; deterministic agents compute enriched variables, discrete supervisory states, and rule-based alarms, while an LLM-driven analytics agent answers free-form operator queries over the same enriched datasets through a conversational interface. The architecture is instantiated and deployed in the Clean-in-Place (CIP) system of an industrial beverage plant and evaluated following a case study design aimed at demonstrating architectural feasibility and diagnostic behavior under realistic operating regimes rather than statistical generalization. Three representative multi-stage CIP executions—purposively selected from 24 runs monitored during a six-month deployment—span nominal baseline, preventive-warning, and diagnostic-alert conditions. The study quantifies stage-specification compliance, state-to-specification consistency, and temporal stability of supervisory states, and performs spot-check audits of numerical consistency between language-based summaries and enriched logs. 
Results in the evaluated CIP deployment show high time within specification in sanitizing stages (100% compliance across the evaluated runs), coherent and mostly stable supervisory states in variable alkaline conditions (state-to-specification consistency Γs ≥ 0.98), and data-grounded conversational diagnostics in real time (median numerical error below 3% in audited samples), without altering the existing CIP control logic. These findings suggest that the architecture can be transferred to other industrial cleaning and batch operations by reconfiguring process-specific rules and ontologies, though empirical validation in other process types remains future work. The contribution lies in demonstrating how to bridge the gap between AI theory and industrial practice through careful system architecture, data transformation pipelines, and integration patterns that enable reliable AI-enhanced decision support in production environments, offering a practical path toward AI-assisted process supervision with explainable conversational interfaces that support preventive maintenance decision-making and equipment health monitoring.

1. Introduction

Industrial batch processes in food and beverage, pharmaceuticals, and wastewater treatment depend critically on Clean-in-Place (CIP) operations and similar multi-stage procedures to guarantee hygiene, product quality, and regulatory compliance [1,2]. These cycles typically execute multi-stage programs—for example, pre-rinse, alkaline wash, intermediate rinse, acid wash, sanitizing and final rinse—under tight constraints on temperature, flow, conductivity and contact time [2,3]. In current practice, such sequences are orchestrated by programmable logic controllers (PLCs) and supervised through SCADA systems, where operators monitor trends, acknowledge alarms and manually interpret complex process conditions. While this architecture is robust and widely adopted, it offers limited support for proactive decision-making, root-cause analysis or flexible what-if exploration over historical and real-time data—limitations that become increasingly problematic as plants grow in complexity and product portfolios diversify [3,4].
To address the need for flexibility and scalability, recent Industry 4.0 frameworks advocate modular, component-based and microservice-oriented architectures for industrial automation [5,6]. These approaches promote loose coupling between control, monitoring, data management and higher-level applications, enabling incremental deployment and technology heterogeneity. Previous work on component-based microservices for bioprocess automation has shown that containerized services and publish/subscribe communication can decouple control and supervision across heterogeneous equipment, including bioreactors and CIP systems, while meeting industrial real-time and robustness requirements [6]. Complementary contributions in bioprocess monitoring and control have proposed advanced observers, model predictive controllers and fault-detection schemes, but typically focus on algorithmic performance rather than on how human operators interact with increasingly complex automation stacks in day-to-day operation [1].
In parallel, large language models (LLMs) and conversational agents have emerged as powerful tools for making complex data and models more accessible to domain experts, enabling natural-language querying, explanation and guidance [7,8]. Early studies in industrial settings indicate that LLM-based assistants can help operators explore process histories, retrieve relevant documentation and reason about abnormal situations using free-form queries [8]. However, integrating LLMs into safety-critical environments remains challenging: LLM outputs are non-deterministic, may hallucinate and must coexist with hard safety constraints, deterministic interlocks and real-time requirements [7,8]. Existing architectures rarely combine deterministic rule-based supervision, continuous analytics and LLM-based conversational interfaces in a way that preserves safety while providing meaningful assistance for CIP operations and other multi-stage cleaning or batch processes.

1.1. From Reactive Alarms to Diagnostic Intelligence

Modern food and beverage facilities equipped with advanced control strategies—including model-based flow optimization and automated parameter regulation—have achieved significant process stability. In such environments, catastrophic process failures are rare, and traditional alarm systems designed to detect imminent faults often generate excessive nuisance alerts that operators learn to ignore or dismiss [9]. The supervisory challenge has consequently evolved from detecting failures to interpreting operational signals: distinguishing between acceptable process variability and subtle patterns indicating emerging maintenance needs before they impact production [10,11].
Consider a CIP execution where flow rates are 10% below optimal but still within regulatory acceptance bounds. A traditional threshold-based alarm system remains silent, yet this pattern—if consistent across multiple cycles—may signal gradual pump wear requiring scheduled maintenance. Similarly, slight temperature deviations that do not compromise product safety may indicate boiler efficiency degradation, and conductivity variations within specification may reflect dosing pump calibration drift. In all cases, individual executions complete successfully by specification, but trend analysis across execution cycles reveals actionable maintenance opportunities that can prevent unplanned downtime and extend equipment lifespan [11].
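This kind of cross-cycle reasoning can be illustrated with a minimal sketch (hypothetical variable names and thresholds, not the plant's actual rules): a preventive-maintenance candidate is flagged when a variable remains within specification on every individual run, yet sits consistently below its optimal band across recent cycles.

```python
def preventive_flag(flow_ratios, warn_below=0.92, min_runs=5):
    """Flag gradual drift: no single-run alarm fires, but the last
    `min_runs` cycles all sit below the optimal band."""
    recent = flow_ratios[-min_runs:]
    if len(recent) < min_runs:
        return False
    # Each run individually passes its spec; the trend is what matters.
    return all(r < warn_below for r in recent)

# Flow relative to optimal over nine CIP runs: a slow downward drift.
ratios = [1.01, 0.99, 0.97, 0.95, 0.91, 0.90, 0.90, 0.89, 0.88]
print(preventive_flag(ratios))  # -> True: candidate for scheduled pump maintenance
```

The same pattern applies to temperature drift (boiler efficiency) or conductivity variation (dosing calibration); only the monitored ratio and warning band change.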
Extracting these diagnostic insights currently depends on expert operator knowledge and time-intensive manual log analysis—a process that scales poorly as facilities expand and product portfolios diversify. Moreover, the high dimensionality of sensor data (dozens of variables sampled at sub-second intervals) makes it difficult for operators to recognize subtle patterns amid normal process variability [9].
In well-optimized plants where CIP executions routinely meet specifications, failures are rare but emerging equipment degradation must be identified through longitudinal trend analysis rather than catastrophic fault detection. A system that monitors 100 consecutive successful CIP runs provides no evidence of diagnostic capability; conversely, analyzing executions that span the operational spectrum—from nominal baseline to preventive warning patterns to diagnostic alert regimes—demonstrates the architecture’s ability to distinguish acceptable process variability from systematic equipment drift requiring maintenance intervention before operational impact occurs.

1.2. Proposed Architecture and Evaluation Approach

This paper proposes a generic, process-agnostic multi-agent architecture for AI-assisted monitoring and decision support in industrial batch processes. The architecture is instantiated and evaluated following a case study design in the Clean-in-Place (CIP) system of an industrial beverage plant, aimed at demonstrating architectural feasibility and diagnostic capabilities under realistic operating conditions rather than statistical generalization across populations of plants. While tailored here to CIP, the architectural components—process-aware context management, hybrid deterministic/LLM supervision, and incremental agent deployment—are designed to be transferable to other multi-stage batch operations (fermentation, distillation, pasteurization) by reconfiguring process-specific rules and ontologies, though empirical validation in these contexts remains future work [6,12].
Unlike existing LLM-based assistants that typically operate as generic chatbots over manuals, historical databases or SCADA tags, the proposed architecture treats CIP executions as first-class contexts. A set of CIP-aware agents load their context on demand according to the active programme and stage, subscribe to enriched data streams and produce supervisory states, alerts and explanations that are aligned with the lifecycle of real CIP runs. This process-centric view of context enables new decision-support agents (LLM-based or otherwise) to be added incrementally, without modifying the underlying CIP programmes. This process-centric abstraction—treating batch executions as first-class contexts—constitutes the main architectural innovation, enabling heterogeneous AI components to operate over a shared process lifecycle without coupling to specific control logic.
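The notion of treating a batch execution as a first-class context can be sketched as follows (illustrative class and field names, not the deployed implementation): agents load the active programme/stage context on demand and interpret the shared enriched stream against it, so new agents can be added without touching the CIP programmes themselves.

```python
from dataclasses import dataclass

@dataclass
class BatchContext:
    """First-class context for one batch execution (e.g., a CIP run)."""
    run_id: str
    programme: str   # e.g., "CIP-3-tank"
    stage: str       # e.g., "alkaline_wash"
    specs: dict      # per-stage specification bands

class SupervisoryAgent:
    def __init__(self, name):
        self.name = name
        self.ctx = None

    def load_context(self, ctx: BatchContext):
        # Context is loaded on demand when a run/stage becomes active.
        self.ctx = ctx

    def on_sample(self, variable, value):
        # Evaluate a sample against the spec band of the current stage.
        lo, hi = self.ctx.specs[self.ctx.stage][variable]
        return "OK" if lo <= value <= hi else "ALERT"

ctx = BatchContext("run-042", "CIP-3-tank", "alkaline_wash",
                   specs={"alkaline_wash": {"temp_C": (70.0, 80.0)}})
agent = SupervisoryAgent("temp-supervisor")
agent.load_context(ctx)
print(agent.on_sample("temp_C", 76.2))  # -> OK
```

In this sketch, adding a new decision-support agent means instantiating another subscriber over the same `BatchContext`; the control layer is untouched.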
The architecture is instantiated and evaluated on real CIP cycles executed in an industrial beverage plant over a six-month deployment period. From 24 complete executions monitored during this period, three representative cases are purposively selected to provide architectural and operational validation evidence across the diagnostic spectrum observed during deployment: (i) a nominal baseline execution demonstrating routine equipment health, (ii) a preventive-warning case exhibiting subtle operational signals (flow reduction, temperature drift) that do not violate safety thresholds but indicate emerging equipment degradation requiring scheduled maintenance, and (iii) a diagnostic-alert regime capturing multiple concurrent deviations (pump wear, boiler efficiency loss, dosing drift) requiring prioritized maintenance review.
Rather than pursuing large-scale statistical generalization—which would be infeasible given the operational frequency of CIP cycles (1–3 executions per day) and the stable baseline achieved through prior control optimization—this case study evaluation demonstrates how the decision-support layer interprets operational signals, issues contextualized alerts, and generates natural-language diagnostic summaries that inform preventive maintenance decisions while preserving the determinism required for safety-critical alarm handling. The focus is on proving architectural feasibility and showing that the diagnostic patterns identified by the system align with maintenance needs confirmed by plant operators and maintenance logs.

1.3. Contributions and Scope

This work addresses the practical integration and deployment of existing AI techniques—large language models, fuzzy logic, statistical process control, and anomaly detection—into safety-critical industrial environments with legacy PLC/SCADA infrastructure and real-time constraints. The contribution lies in demonstrating how to bridge the gap between AI theory and industrial practice through system architecture, data transformation pipelines, and integration patterns that enable reliable AI-enhanced decision support in production environments. This aligns with the objectives of applied industrial AI research: integrating AI into legacy systems, addressing technical challenges such as data sparsity and model robustness, and demonstrating tangible improvements in industrial performance [5,6].
Four AI-specific deployment challenges are addressed: (i) temporal domain separation to preserve real-time safety guarantees despite non-deterministic LLM behavior, (ii) signal enrichment pipelines to bridge the semantic gap between numerical sensor data and natural language LLM inputs, (iii) retrieval-augmented generation (RAG) to prevent hallucination in safety-critical contexts, and (iv) resource-efficient deployment under edge computing constraints. These integration challenges, while critical for industrial AI adoption, are distinct from algorithmic contributions proposing novel machine learning methods or model architectures. The focus is on making existing AI techniques operationally viable in industrial contexts where data sovereignty, resource limitations, timing guarantees, and fail-safe behavior are non-negotiable requirements.
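Challenge (ii), the semantic gap between numerical streams and natural-language LLM inputs, can be illustrated with a minimal enrichment step (hypothetical variable names and specification bands): a rolling buffer of raw readings is condensed into a compact, language-ready summary before any prompt is constructed.

```python
def enrich_window(samples, spec_lo, spec_hi, name, unit):
    """Turn a rolling buffer of raw readings into an LLM-ready summary."""
    n = len(samples)
    in_spec = sum(spec_lo <= s <= spec_hi for s in samples)
    mean_v = sum(samples) / n
    # The LLM receives this grounded sentence, never the raw stream.
    return (f"{name}: mean {mean_v:.1f} {unit}, "
            f"{100 * in_spec / n:.0f}% of {n} samples within "
            f"[{spec_lo}-{spec_hi}] {unit}")

temps = [74.8, 75.1, 75.6, 76.0, 69.4, 75.3]
print(enrich_window(temps, 70.0, 80.0, "Alkaline wash temperature", "°C"))
```

Because the summary carries the numbers it was computed from, language-based answers can later be audited against the enriched logs, as done in the spot-check evaluation.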
The main contributions of this paper are:
  • A batch-process-aware, multi-agent decision-support architecture that treats batch executions (such as CIP cycles) as first-class contexts. Agents load their context on demand according to the active programme and stage, subscribe to enriched data streams and produce supervisory states, alerts and explanations aligned with the lifecycle of real batch runs. The architecture addresses AI-specific integration challenges, including (i) temporal domain separation to preserve real-time safety guarantees despite non-deterministic LLM components, (ii) signal enrichment pipelines to bridge the semantic gap between numerical sensor data and natural language LLM inputs, (iii) retrieval-augmented generation (RAG) to prevent hallucination in safety-critical contexts, and (iv) resource-efficient deployment under edge computing constraints (Section 3.6).
  • A process-centric context management approach that allows heterogeneous AI components (rule-based agents, fuzzy logic, neural networks, LLM-based assistants) to be added incrementally. New agents can reuse the same process context and message bus without modifying the underlying control programmes or restarting the supervision layer.
  • A set of process-level evaluation metrics that quantify the behavior of the decision-support layer over real executions, including compliance with stage specifications, consistency with state specifications, and stability of state labeling, complemented by spot checks of the numerical consistency between language-based summaries and enriched logs.
  • An experimental study on three complete CIP runs that instantiates the architecture in a real Clean-in-Place application, demonstrating its ability to maintain high specification compliance across nominal, preventive and diagnostic scenarios, provide coherent and stable supervisory states, and generate data-grounded natural-language explanations in real time without altering the existing CIP control logic. The case studies collectively illustrate how diagnostic interpretation of alert patterns across execution cycles can inform preventive maintenance scheduling—addressing pump wear, boiler degradation and dosing system calibration—before operational impact occurs. These contributions are validated within the CIP deployment context; transferability to other batch processes (fermentation, distillation) is demonstrated conceptually through the generic architectural design (Table 1) but requires empirical validation in future work.
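The process-level metrics listed among the contributions can be computed directly from enriched logs; a minimal sketch follows (illustrative log shapes and thresholds, not the deployed metric definitions):

```python
def stage_compliance(values, lo, hi):
    """Fraction of samples within the stage specification band
    (the paper's time-within-specification measure, per stage)."""
    return sum(lo <= v <= hi for v in values) / len(values)

def state_stability(states):
    """Fraction of consecutive samples where the supervisory state
    label does not change (1.0 = perfectly stable labeling)."""
    changes = sum(a != b for a, b in zip(states, states[1:]))
    return 1 - changes / (len(states) - 1)

cond = [4.1, 4.2, 4.0, 4.3, 4.2]                 # conductivity samples, mS/cm
print(stage_compliance(cond, 3.8, 4.5))           # -> 1.0
labels = ["NOMINAL"] * 8 + ["WARNING"] * 2        # one legitimate transition
print(round(state_stability(labels), 2))          # -> 0.89
```

Spot-check audits of LLM summaries then reduce to comparing numbers quoted in generated text against the same logs these metrics are computed from.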
Collectively, these contributions advance the state of knowledge by demonstrating—through a real industrial deployment—that hybrid deterministic/LLM architectures can be integrated into safety-critical batch supervision to support maintenance-oriented decision-making and by providing a reproducible architectural pattern and evaluation methodology that can guide future implementations in other batch process domains.
The architecture is implemented and deployed in a real industrial environment, supporting CIP operations at the VivaWild Beverages plant over a six-month validation period, which provides the basis for the experimental evaluation reported in this paper.

1.4. Paper Organization

The remainder of this paper is organized as follows. Section 2 reviews related work on bioprocess monitoring, Industry 4.0 architectures, LLMs in industrial settings, and task planning. Section 3 presents the generic batch process supervision architecture, including the rationale for a layered, agent-based design, AI-specific integration challenges and solutions (Section 3.6), and the process-agnostic instantiation pattern. Section 4 details real-time data management and LLM integration strategies. Section 5 describes the experimental methodology, including the industrial deployment, selection rationale for representative execution cases, data collection procedures and evaluation metrics. Section 6 presents results from three representative CIP executions, quantifying compliance, state consistency, stability and LLM fidelity. Section 7 discusses the findings, positions the work relative to existing approaches, addresses limitations and outlines directions for future longitudinal analysis. Section 8 concludes the paper.

2. Related Work

2.1. Bioprocess and CIP Monitoring and Control

Monitoring and control of industrial bioprocesses have been extensively studied, with a strong emphasis on state estimation, soft sensors, fault detection and advanced control strategies such as model predictive control [1]. Recent advances in soft sensor modeling have explored multiple paradigms: interval type-2 fuzzy logic systems for uncertainty handling [13], deep learning approaches for feature representation in high-dimensional data [14], fuzzy hierarchical architectures that combine rule-based reasoning with adaptive learning for improved interpretability [15], and deep neural network-based virtual sensors for retrofitting industrial machinery without additional physical instrumentation [16]. Within this broader context, Clean-in-Place (CIP) systems are recognized as critical operations that ensure hygienic conditions and prevent cross-contamination between batches [2], but also as major contributors to water, chemical and energy consumption [4].
Reported approaches for CIP focus mainly on sequence design, parameter optimization and endpoint detection, including improved scheduling of cleaning cycles, optimization of temperature and flow profiles, and rule-based supervision to guarantee that each stage meets predefined set-points [2,3,4]. In most cases, the supervisory logic is implemented directly in PLCs and SCADA systems, with alarm rules defined as static thresholds or simple logical combinations; beyond trend displays and alarm lists, such systems provide limited support in production environments for richer situation awareness, multi-variable diagnostics, cross-cycle trend analysis or operator guidance for preventive maintenance.
While these approaches successfully detect threshold violations and ensure regulatory compliance, they typically operate on a per-execution basis and do not explicitly support the interpretation of subtle operational signals that may indicate emerging equipment degradation (for example, gradual pump wear, boiler efficiency drift or dosing system calibration errors) before they impact process outcomes. Recent soft sensor frameworks have demonstrated improved handling of uncertainty [13], nonlinear pattern recognition [14], and rule-based feature processing with fuzzy reasoning [15], yet their integration with conversational interfaces for maintenance-oriented decision support remains unexplored. The supervisory challenge has consequently shifted from detecting imminent failures to identifying preventive maintenance opportunities through longitudinal trend analysis and diagnostic pattern recognition across multiple CIP cycles.

2.2. Industry 4.0 Architectures and Microservice-Based Automation

The move towards Industry 4.0 has motivated a variety of reference architectures and frameworks that aim to increase modularity, interoperability and scalability in industrial automation [5]. These frameworks typically distinguish between physical assets, edge computing resources and higher-level information systems, and promote the use of standard communication protocols and service-oriented designs as enablers of flexible, connected production environments [17]. Building on these ideas, previous work has proposed component-based microservice frameworks for bioprocess automation, in which control, monitoring, HMI, data logging and higher-level coordination are implemented as independent, containerized components interconnected through a message bus [6,12]. These frameworks have demonstrated how microservices can satisfy real-time and robustness requirements in bioprocess applications while supporting flexible reconfiguration and reuse across equipment such as bioreactors and CIP units. Other approaches have explored multi-agent reinforcement learning for optimizing manufacturing system yields through coordinated agent contributions [18], demonstrating the value of distributed intelligence in complex production environments. However, these and other microservice-oriented approaches largely focus on structural and communication concerns; they do not detail how hybrid deterministic and LLM-based agents can be embedded into the automation stack to support real-time supervision, diagnostic interpretation and preventive maintenance decision-making for concrete operations such as CIP.
The present paper retains a microservice-style architecture but shifts the focus from structural aspects to decision support, diagnostic interpretation and explanation: the agents, reasoning layer and conversational interface are implemented as loosely coupled services that subscribe to enriched data streams and publish supervisory states, diagnostic alerts, maintenance-oriented reports and trend analyses. The evaluation presented here therefore complements prior architectural studies by quantifying how such a service-based, multi-agent layer behaves when deployed over real CIP executions, and by demonstrating its ability to differentiate nominal equipment health, preventive warning patterns and diagnostic alert regimes that are meaningful for maintenance planning.

2.3. LLMs and Conversational Assistants in Industrial and CPS Settings

The emergence of large language models (LLMs) has motivated a growing body of work on industrial and cyber-physical applications. Recent surveys discuss cross-sector industrial use cases of LLMs, including maintenance support, incident analysis and internal knowledge management, and highlight both opportunities and risks in terms of robustness and governance [19,20,21]. Several authors propose architectures where on-premise LLMs are integrated with IoT and cyber-physical systems in Industry 4.0, typically acting as a semantic layer or conversational front-end for heterogeneous data sources [22,23,24,25].
Lim et al. propose frameworks in which LLMs are embedded into industrial automation systems as intelligent orchestration components, enabling multi-agent coordination, natural-language task interpretation and adaptive manufacturing execution [26,27]. These contributions demonstrate a clear interest in bringing LLMs closer to operational technology, but most remain at the level of high-level frameworks or generic use cases rather than detailed, domain-specific instantiations that operate continuously on live process data, track equipment health trends across multiple execution cycles and interact directly with plant operators for maintenance-oriented decision-making.
Existing work on LLM-based assistants for industrial systems largely focuses on providing conversational access to documentation, historical data or SCADA/IoT tags in real time, often evaluating performance in terms of question answering precision and response generation [28]. Recent applications have demonstrated how LLMs can facilitate natural language queries over complex manufacturing process data and real-time sensor streams, enabling production personnel to interact with high-dimensional datasets without requiring specialized analytical expertise [29]. Retrieval-augmented generation (RAG) approaches in industrial settings have shown significant improvements in domain-specific knowledge retrieval, achieving recall rates above 85% for technical service and regulatory documentation [30]. Complementary work on multimodal LLM-based fault detection has shown how GPT-4-based frameworks can improve scalability and generalizability across diverse fault scenarios in Industry 4.0 settings [31], though these approaches primarily target fault classification rather than continuous process supervision and preventive maintenance planning. However, existing LLM-based approaches for manufacturing data exploration typically focus on real-time visualization and anomaly detection in continuous production environments [29], rather than providing lifecycle-aligned supervision for multi-stage batch processes with preventive maintenance decision support. In contrast, this work targets a concrete batch process (CIP) and evaluates a decision-support architecture that integrates rule-based supervision, enriched data streams and language-based interaction within a process-centric context aligned with the lifecycle of CIP executions.

2.4. LLMs for Robotics, Task Planning and Control Logic

A substantial portion of applied LLM research in industry-related domains comes from robotics and task planning. Concrete implementations show LLM-based task and motion planning for construction robots [32], as well as frameworks such as DELTA that decompose long-term robot tasks into sub-problems and translate them into formal planning languages [33]. Recent work on embodied intelligence in manufacturing has demonstrated how LLM agents such as GPT-4 can autonomously design tool paths and execute manufacturing tasks, achieving success rates above 80% in industrial robotics simulations [34], while parallel work on LLM-based mobile robot navigation has shown that models such as Llama 3.1 can dynamically generate collision-free waypoints in response to natural language commands and environmental obstacles [35]. Complementary work on LLM-based digital intelligent assistants in assembly manufacturing has demonstrated significant improvements in operator performance, user experience, and cognitive load reduction [36].
Recent work on human-centric smart manufacturing has shown how LLM-based conversational interfaces can lower the digital literacy barrier for operators by enabling natural-language access to real-time machine data, as demonstrated by the ChatCNC framework for CNC monitoring [37]. Complementarily, the authors of [38] propose an LLM-driven industrial automation framework where multi-agent structures, structured prompting, and event-driven information models enable end-to-end control, from interpreting real-time events to generating production plans and issuing control commands. These works focus primarily on code generation, configuration, and verification of control logic prior to deployment, rather than on continuous supervision, diagnostic interpretation, and operator support during runtime. Critically, none of these contributions address the integration of LLM-based analytics with deterministic supervisory logic for preventive maintenance planning through cross-cycle trend analysis and equipment health monitoring—the focus of the present work.

2.5. Positioning of the Present Work

Overall, existing literature shows intense activity around LLMs for documentation retrieval, maintenance support, robotics and control-code generation in industrial and CPS environments. However, there is a noticeable gap regarding batch processes and safety-critical cleaning operations such as CIP, particularly in terms of architectures that combine deterministic supervision with LLM-based diagnostic interpretation to support preventive maintenance decision-making. To the best of our knowledge, no prior work reports an architecture that simultaneously (i) integrates deterministic agents for state estimation and safety monitoring with LLM-based analytics operating over live process buffers; (ii) supports cross-cycle trend analysis for preventive maintenance through enriched data streams; (iii) combines structured process rules with LLM reasoning to ensure process adherence while minimizing hallucinations [39]; and (iv) quantitatively evaluates the system’s ability to differentiate nominal equipment health, preventive warning patterns and diagnostic alert regimes across real industrial executions.
The present work addresses this gap by proposing, implementing and evaluating such an architecture in a real CIP deployment, showing that it is feasible to combine rule-based supervision and language-based assistance in safety-critical batch processes while maintaining coherent supervisory states, data-grounded explanations and actionable diagnostic patterns that support preventive maintenance planning and equipment health monitoring. The case study evaluation demonstrates that the architecture can distinguish between nominal baseline operation, preventive maintenance signals and diagnostic alert conditions, providing operators with contextualized, maintenance-oriented decision support that goes beyond traditional threshold-based alarm systems.

3. Generic Batch Process Supervision Architecture

3.1. Architectural Evolution from Prior Work

Existing industrial automation systems require component-based architectures to achieve scalability, interoperability, and reliability in complex manufacturing environments. Recent work [6] demonstrated that component-based microservices architecture effectively addresses these requirements in batch bioprocess automation, establishing a proven foundation for flexible system composition, real-time coordination and cross-facility scalability.
This work extends that well-established architectural foundation to address a novel integration challenge: incorporating large language models (LLMs) into safety-critical industrial batch process supervision. Rather than proposing a fundamentally different architecture, this approach applies explicit temporal domain separation to the proven component-based microservices pattern: maintaining the microservices scalability, integration capability and reliability properties demonstrated in prior work while introducing LLM-based analytics in a non-deterministic layer carefully isolated from real-time safety-critical control.
The key architectural innovation is not the overall structure (which inherits principles from prior work) but rather the explicit separation of temporal domains—deterministic supervision (occupying the safety-critical layer) and non-deterministic analytics (occupying the LLM reasoning layer)—which enables safe integration of non-deterministic AI into systems requiring bounded real-time guarantees, a challenge not previously addressed in industrial batch process supervision.
Process-Agnostic Design: While the architecture is instantiated and evaluated here in the context of Clean-in-Place (CIP) operations in a beverage plant, its design is process-agnostic: the same component-based pattern applies to fermentation, distillation, pasteurisation, chemical synthesis, and other multi-stage batch procedures by reconfiguring process-specific rules, ontologies, and enrichment logic. This composability—the ability to adapt component-based architectures to new contexts through parameter configuration rather than architectural redesign—is inherited directly from prior work and preserved in this extension.
Presentation Structure: The architecture is presented through two complementary perspectives that build on this foundational understanding:
  • Conceptual Layered View (Figure 1): Illustrates how temporal domain separation is achieved by decomposing the supervision problem into layers with distinct temporal constraints—deterministic safety-critical supervision (<100 ms), soft real-time coordination (200–500 ms), non-deterministic analytics (1–2 s), and asynchronous persistence. This view shows the separation between cyber (software) and physical (PLC/SCADA) domains, event-driven communication patterns inherited from the microservices foundation, and the role of the orchestration layer in routing operator intents to specialized agents while maintaining safety barriers.
  • Detailed Component View (Figure 2): Emphasizes the technical depth required for production deployment by specifying real-time constraints for each layer, data transformation pipelines implementing semantic bridging from industrial signals to natural language, temporal domain separation between deterministic and non-deterministic processing, and the three-tier data persistence strategy that scales from high-frequency raw sensor buffers to long-term execution summaries. This view demonstrates how the generic component-based approach is instantiated with concrete technologies and performance characteristics.
Together, these views provide both a high-level understanding of the architectural pattern inherited from prior work and the technical depth required for production deployment in safety-critical industrial environments.

3.2. Conceptual Architecture: Extending Microservices with Temporal Domain Separation

Building on the component-based microservices foundation established in [6], this architecture extends those principles to address AI integration by introducing explicit temporal domain separation. The original microservices approach demonstrated scalability and integration through loosely coupled, independently deployable components. This work applies the same separation-of-concerns principle to time-domain constraints: creating distinct layers optimized for different temporal requirements—from sub-100 ms deterministic safety-critical supervision to 1–2 s non-deterministic LLM analytics—while preserving the composability and flexibility that made the original microservices pattern valuable.
Figure 1 illustrates this extension as a layered, event-driven architecture spanning from operator-facing interfaces down to physical process equipment. At its core, an Event Bus exposes a global data space where agents, user interfaces and the control layer exchange events and enriched data streams, while an intermediate orchestration layer routes operator requests to specialised agents according to user intention and process context. Critically, the temporal domain separation shown in this figure—with separate paths for deterministic supervision (<100 ms) and non-deterministic analytics (1–2 s)—represents the key architectural innovation extending the microservices foundation to safely integrate AI into safety-critical industrial environments.
The main architectural layers visible in this conceptual view are:
  • User Interaction Layer: Operators interact through a web interface providing natural-language chat, structured alarm panels and visual summaries of process status. Operators can ask questions in free text (e.g., about current batch progress, recent anomalies or historical comparisons), request specific functions (e.g., remaining time for a stage, list of active batch executions) or issue management actions (e.g., creating or starting batch requests). The interface normalizes these inputs and forwards them to the orchestration layer as high-level events containing the user query, role and context. This interface design follows the principle established in prior work: separating operator concerns (intent expression) from system concerns (intent fulfillment).
  • Orchestration Layer: Acts as a mediator between user intents and the underlying agents, a key component-based microservices pattern from prior work. It receives normalized requests from the interaction layer and classifies them into intent categories such as control (start/stop batch, interact with the master process), configuration (programmes, equipment, permissions), deterministic analysis (current state, diagnostics, time estimates), real-time free-form analysis (ad hoc queries over the live buffer) and offline analysis (post-run reports over stored data). Critically, the orchestrator routes safety-critical requests to the Deterministic Supervisor (enforcing <100 ms response time) and advisory requests to the LLM Assistant (permitting 1–2 s soft real-time responses). Based on this intent and the user role, the orchestrator dispatches the request to the appropriate agent via the data and event hub and later aggregates or reformats the response into a coherent message for the operator interface. This routing decision—the core of temporal domain separation—enables the system to leverage LLM capabilities for diagnostics while guaranteeing deterministic safety properties.
  • Agent Control Layer: Implements specialized microservices for process supervision, configuration management, and decision support, directly inheriting the loosely coupled component philosophy of prior work. Agents include:
    • Master Controller: Coordinates batch execution lifecycles at the physical layer, implementing deterministic state machines proven reliable in industrial environments.
    • Administration Service: Manages configuration and governance (recipes, equipment definitions, permissions), enabling the flexibility through parameterization demonstrated as critical in prior work.
    • Deterministic Supervisor: Performs rule-based supervision with hard safety alerts operating within <100 ms constraints. This layer preserves all traditional supervision logic, ensuring that AI integration does not compromise safety guarantees established through decades of PLC/SCADA programming.
    • Real-time Monitoring Service: Provides flexible operator-driven analysis over live process data using LLM-backed analytics. This new agent (not present in prior work) bridges the gap between deterministic supervision and LLM-based reasoning by consuming enriched semantic representations rather than raw sensor streams.
    • Offline Analysis Service: Generates post-run reports and comparative analyses over historical data, operating asynchronously without impacting real-time constraints.
    The independence of these agents—each handling distinct concerns—reflects the scalability and composability principles demonstrated in [6].
  • Event Bus (Global Data Space): Underpins communication between all agents and the physical layer, implementing the inter-component communication pattern proven essential in microservices environments. The Event Bus is realized using Redis 7.2 with:
    Publish/subscribe channels for real-time events enabling decoupled agent-to-agent communication;
    Redis Streams for time-series storage of per-batch sensor trajectories, optimized for append-heavy industrial workloads with O(1) latency;
    Hash structures for configuration and resource status, enabling efficient lookups for context management.
    This hub interface decouples producers (PLCs, sensors) from consumers (agents, UI), supporting multiple concurrent batch processes and allowing analytical capabilities to be extended without changes to control hardware—preserving the non-invasive integration pattern central to industrial safety. The choice of Redis (rather than traditional message brokers like RabbitMQ or Kafka) reflects the single-node, brownfield deployment scenario typical of industrial plants: Redis provides stream data optimization and sub-millisecond append latency without the distributed consensus overhead inappropriate for on-premise industrial computing.
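The decoupling described above can be made concrete with a minimal sketch. Everything below is illustrative, not the deployed implementation: the `EventBus` class stands in for the Redis instance (an append-only list plays the role of a Stream, a callback list plays the role of a pub/sub channel), and the channel name and sample payload are assumptions for demonstration.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-memory stand-in for the Redis-backed Event Bus:
    publish/subscribe channels decouple producers (PLC gateways) from consumers (agents)."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> list of callbacks
        self._streams = defaultdict(list)      # channel -> appended events (Stream analogue)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, event):
        self._streams[channel].append(event)   # O(1) append, as with XADD
        for cb in self._subscribers[channel]:  # fan-out to decoupled consumers
            cb(event)

# A PLC gateway publishes a raw reading; a supervisor agent consumes it.
bus = EventBus()
received = []
bus.subscribe("cip.sensia", received.append)
bus.publish("cip.sensia", {"sensor": "temp", "value": 67.3, "ts": 1712000000})
print(received[0]["value"])  # 67.3
```

Because producers never hold references to consumers, new agents can subscribe to the same channel without any change on the PLC side, which is the non-invasive integration property the architecture relies on.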
  • Physical Layer: Encompasses PLC/SCADA systems executing deterministic control loops (50–100 ms cyclic execution) and industrial process equipment (reactors, pumps, valves, sensors, actuators). The architecture integrates non-invasively through passive monitoring (read-only sensor access via MODBUS/Serial) and safety-validated command injection (Deterministic Supervisor issues emergency stops after constraint validation). This non-invasive approach—proven in prior work—remains unchanged, ensuring that existing PLC programs and certifications are unaffected by the addition of LLM-based decision support.
  • Composability and Extensibility Preserved from Prior Work:
A key advantage of adopting an orchestrator-based, service-oriented architecture—established as essential in prior work—is that new decision-support agents can be added incrementally, without modifying the underlying batch programmes or restarting the supervision layer. Each agent subscribes to the same enriched data streams and loads its context according to the active batch execution and stage, producing additional supervisory states, diagnostics, preventive maintenance recommendations or trend analyses. This design makes it possible to combine heterogeneous AI techniques (for example, rule-based agents, fuzzy logic, neural networks, LLM-based assistants) within a common framework and to evolve the decision-support and diagnostic capabilities over time as new agents are deployed. The temporal domain separation introduced here enables this heterogeneity while maintaining safety: deterministic agents always retain veto power over LLM-generated recommendations.
The architecture does not assume a single monolithic AI component, but rather a distributed set of process-aware agents coordinated by an orchestrator—precisely the pattern proven scalable in industrial bioprocess automation [6]. The evaluation presented in this work focuses on one such configuration, where rule-based supervision (Deterministic Supervisor) and a language model assistant (Real-time Monitoring Service and LLM Assistant) share the same batch context for CIP operations, with the Deterministic Supervisor maintaining ultimate control through command validation. However, the same pattern can be used to integrate further agents (for example, predictive models for equipment degradation, optimization modules for chemical consumption, yield forecasting models) without disrupting existing services or batch control programmes. This extensibility through agent addition (rather than architectural redesign) exemplifies the flexibility that made the original microservices approach valuable for industrial environments.
  • Temporal Domain Separation: The Key Innovation:
The temporal domain separation visible in Figure 1 represents this work’s core extension of the microservices foundation. Whereas traditional microservices focus on separating concerns through function specialization (Master Controller vs. Administration Service), this architecture adds temporal separation: routing requests bound by real-time constraints to the Deterministic Supervisor (enforcing <100 ms worst-case response time) and advisory requests to the LLM Assistant (permitting 1–2 s soft real time). This separation enables the system to leverage non-deterministic AI for operator decision support without compromising the deterministic safety guarantees that industrial plants depend on. The orchestrator implements this routing decision transparently—operators request information or diagnostics through natural language, but the system automatically routes safety-critical information through determinism-preserving paths and advisory information through LLM-capable paths.

Technology Stack Justification and Design Principles

The specific technologies selected for implementation express the architectural principles inherited from prior work. Rather than arbitrary choices, each technology implements a principle proven in component-based microservices for industrial systems. Table 1 provides the mapping from architectural principles to technological instantiation:
  • Redis Streams for Event Bus:
Redis Streams are selected for the Event Bus specifically because they implement the inter-component communication pattern proven essential in microservices environments. Key design rationale:
  • Microservices efficiency without distributed complexity: Traditional message brokers (RabbitMQ, Kafka) optimize for fault tolerance across distributed clusters through consensus protocols and persistent replication—appropriate for cloud-scale systems but introducing unnecessary latency in single-node industrial deployments. Redis Streams provide efficient time-series append operations (O(1) latency) without distributed coordination overhead.
  • Stream data optimization: Industrial systems generate continuous time-series data (10–1000 Hz sensor sampling, 10,000–100,000 measurements per batch cycle). Redis Streams’ representation optimizes for append-heavy workloads typical of sensor data, providing memory-efficient ring buffers and sub-millisecond append latency critical for real-time agent communication.
  • Single-node brownfield deployment: Most existing industrial facilities operate on-premise with dedicated server hardware (not cloud infrastructure). Redis eliminates distributed consensus overhead, aligning with prior work’s emphasis on operational simplicity—a single Redis instance runs locally, operators understand local-running processes, data sovereignty regulations are satisfied.
  • Preserves component composability: Each microservice component (Orchestration Agent, Deterministic Supervisor, Enrichment Pipeline) communicates through Redis, maintaining the loose coupling and independent deployability demonstrated as essential in prior work.
  • Qwen 2.5 (7B parameters) for LLM Inference:
Qwen 2.5 is selected for embedded LLM inference, recognizing that edge deployment represents a fundamental constraint in industrial environments—a principle demonstrated throughout prior work on bioprocess automation. Key selection criteria:
  • Code generation capability for downstream task automation: Unlike base language models (LLaMA 2, Mistral) designed primarily for fine-tuning, Qwen 2.5 includes instruction-tuning optimized for task-specific code generation. This enables integration with downstream analytics tools (e.g., PandasAI for exploratory analysis of enriched time-series) and operator procedure synthesis from natural language requests—capabilities beyond pure conversational reasoning.
  • Instruction-tuned for industrial domain tasks: Mistral and LLaMA 2 represent base models optimized for general-purpose language understanding. Qwen 2.5’s instruction-tuning targets task-specific reasoning—exactly the diagnostic and predictive analytics required in batch process supervision—improving reliability without requiring domain-specific fine-tuning data often unavailable in brownfield facilities.
  • Embedded deployment optimization: Memory footprint (7B parameters require ∼14 GB VRAM) and inference latency (<2 s on consumer-grade NVIDIA RTX GPUs typical in industrial settings) compatible with edge hardware constraints. This eliminates dependency on cloud API calls that prior work identified as problematic due to data sovereignty regulations (particularly strict in pharmaceutical manufacturing), network reliability constraints (industrial plants often operate on low-bandwidth networks), and latency requirements for real-time operator decision support.
  • Multilingual reasoning capability: Global beverage and pharmaceutical manufacturing facilities operate with operators across Spanish, English, and German linguistic communities. Qwen 2.5’s multilingual training ensures diagnostic reasoning remains consistent across operator language preferences—important for international production networks.
  • Fuzzy Logic for Signal Enrichment (exemplifying semantic bridging):
The data enrichment pipeline implements semantic bridging—the core transformation that enables LLMs to reason about industrial numerical signals. Fuzzy logic is chosen as an exemplar of this transformation:
  • Semantic representation over numerical classification: The enrichment goal is not classification accuracy (is this alarm state HIGH or MEDIUM?) but semantic representation: converting temperature = 67.3 °C into “Temperature: Optimal”, a form suitable for natural-language LLM reasoning. Fuzzy logic and other interpretable enrichment methods provide this semantic transformation through linguistic variable assignment.
  • Interpretability critical in safety-sensitive decision support: Industrial decision support requires human oversight and override capability. Unlike black-box methods (SVM, neural networks), fuzzy logic produces explicit linguistic outputs (High, Rising, Concerning) operators can validate, understand, and override when necessary—essential for maintaining operator trust in LLM-assisted supervision where errors have economic (unplanned downtime: USD 10k–USD 100k/hour) and safety implications.
  • Compatibility with limited historical data: Brownfield facilities rarely maintain extensive historical sensor datasets suitable for neural network training. Fuzzy logic and comparable interpretable methods (rule-based statistical process control, isolation forest anomaly detection) require minimal historical tuning, making them appropriate for facilities where machine learning approaches cannot be trained due to data availability constraints.
  • Flexibility of enrichment approach: While fuzzy logic exemplifies the semantic enrichment concept, alternative methods providing comparable interpretability and data efficiency (symbolic rule-based classifiers, Bayesian networks for uncertainty quantification, one-class SVM for equipment degradation detection) could provide equivalent enrichment. The specific technology choice is less critical than the semantic bridging principle—converting sensor streams into natural language representations suitable for LLM reasoning.

3.3. Detailed Component Architecture: Real-Time Constraints and Data Flow

While Figure 1 presents the conceptual organization as layered cyber-physical separation, Figure 2 provides a complementary component-level perspective that emphasizes real-time constraints, data transformation pipelines, and the explicit separation between deterministic and non-deterministic processing domains. This view is essential for understanding how the architecture maintains safety guarantees while integrating LLM-based analytics, and for supporting production deployment in environments with bounded computational resources and hard real-time supervision requirements.

3.3.1. Orchestration Layer: Timing and Intent Classification

The Orchestration Layer operates under soft real-time constraints (200–500 ms), implementing three core components:
  • Context Manager: Maintains process-aware execution state, loading and unloading batch-specific configurations (programmes, equipment mappings, stage definitions) dynamically as executions start and complete. Ensures that each agent operates within the correct process context (CIP cycle, fermentation batch, distillation run) without requiring global state synchronization.
  • Intent Classifier: Analyzes incoming operator queries and UI events using pattern matching and lightweight natural language understanding (Alert Ack for alarm acknowledgments, user query text for conversational requests) to determine intent categories: control commands, configuration requests, safety queries, diagnostic analysis, or conversational analytics.
  • Query Router: Dispatches classified intents to specialized agents via named routing channels over the Event Bus (cip.cmd, cip.admin, cip.sensia, cip.Real-time, cip.assistant), ensuring that safety-critical commands bypass non-deterministic components and reach the Deterministic Supervisor directly.
This timing specification ensures that operator interactions receive responses within human perception thresholds (500 ms for interactive applications) while maintaining sufficient decoupling to prevent user interface load from affecting deterministic supervision performance.
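A minimal sketch of the classification-and-routing step follows. Only the channel names come from the architecture; the keyword lists and the `classify_intent`/`route` helpers are hypothetical simplifications of the pattern-matching and lightweight NLU described above.

```python
def classify_intent(query):
    """Hypothetical keyword-based intent classifier (a stand-in for the
    pattern matching / lightweight NLU used by the Intent Classifier)."""
    q = query.lower()
    if any(w in q for w in ("start", "stop", "abort", "emergency")):
        return "control"
    if any(w in q for w in ("configure", "recipe", "permission")):
        return "configuration"
    if any(w in q for w in ("why", "explain", "compare", "trend")):
        return "conversational"
    return "diagnostic"

def route(intent):
    """Dispatch to named Event Bus channels; safety-critical intents
    bypass non-deterministic components entirely."""
    channels = {
        "control": "cip.cmd",              # Master Controller (deterministic)
        "configuration": "cip.admin",      # Administration Service
        "diagnostic": "cip.sensia",        # Deterministic Supervisor (<100 ms)
        "conversational": "cip.assistant", # LLM Assistant (1-2 s soft real time)
    }
    return channels[intent]

print(route(classify_intent("Emergency stop batch 12")))        # cip.cmd
print(route(classify_intent("Why is flow lower than usual?")))  # cip.assistant
```

The point of the sketch is the routing invariant, not the classifier itself: whatever NLU technique is used, a control intent can only ever resolve to a deterministic channel.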

3.3.2. Agent Control Layer: Temporal Domain Separation

The Agent Control Layer enforces strict separation between deterministic and non-deterministic processing domains to maintain safety guarantees while supporting flexible AI-enhanced analytics.
Deterministic Domain (Hard Real-Time: <100 ms)
Safety-critical agents execute in bounded deterministic time under hard real-time constraints:
  • Master Controller: Orchestrates batch execution lifecycles, issuing start, stop, pause, and abort commands to the PLC/SCADA layer (cip.cmd channel). Enforces programme sequencing and stage transitions according to predefined recipes and interlock conditions, publishing lifecycle events (batch.start, stage.transition, batch.complete) that other agents subscribe to for context synchronization.
  • Administration Service: Manages configuration persistence and retrieval (cip.admin channel), including batch programmes, equipment definitions, sensor mappings, and alarm rule specifications. Provides versioning and audit trails for regulatory compliance (FDA 21 CFR Part 11, ISO standards), ensuring traceability of all configuration changes with timestamps, operator identity, and approval workflows.
  • Deterministic Supervisor (Safety Barrier): Implements the critical safety validation layer that separates non-deterministic AI components from physical equipment control (cip.sensia channel). This agent:
    Monitors enriched process variables against hard safety limits defined in the batch recipe (e.g., maximum temperature thresholds, minimum flow rate requirements, conductivity envelopes for sanitization stages).
    Computes discrete supervisory states (NORMAL/WARNING/CRITICAL) using rule-based logic executing in O(1) time per evaluation cycle.
    Issues emergency stop commands directly to the PLC/SCADA layer when safety thresholds are violated, bypassing all other agents to ensure fail-safe behavior.
    Logs all safety events (threshold violations, emergency stops, interlock triggers) to the Data Persistence Layer for post-execution forensic analysis.
    This component enforces the safety barrier principle: LLM-generated recommendations and diagnostic insights produced by non-deterministic agents can inform operator decisions and trigger alerts, but cannot directly command actuators or override safety interlocks. All control actions must pass through deterministic validation (Safety Validation DETERMINISTIC path in Figure 2) before reaching the Physical Layer.
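The rule-based state computation can be sketched as follows. The threshold values, variable names, and the `supervisory_state`/`evaluate` helpers are illustrative stand-ins for the recipe-defined limits; the deployed supervisor reads its bands from the batch recipe.

```python
def supervisory_state(value, warn_band, crit_band):
    """Classify one variable in O(1): outside the critical band -> CRITICAL,
    outside the warning band -> WARNING, otherwise NORMAL.
    Bands are (low, high) tuples taken from the batch recipe (illustrative here)."""
    lo_c, hi_c = crit_band
    lo_w, hi_w = warn_band
    if value < lo_c or value > hi_c:
        return "CRITICAL"
    if value < lo_w or value > hi_w:
        return "WARNING"
    return "NORMAL"

def evaluate(readings, limits):
    """Per-cycle evaluation; the worst per-variable state becomes the overall state.
    A CRITICAL result would trigger an emergency stop on the cip.cmd channel."""
    states = {k: supervisory_state(v, *limits[k]) for k, v in readings.items()}
    order = {"NORMAL": 0, "WARNING": 1, "CRITICAL": 2}
    overall = max(states.values(), key=order.__getitem__)
    return states, overall

# Illustrative limits: (warning band, critical band) per enriched variable.
limits = {"temp_C": ((70.0, 82.0), (65.0, 85.0)),
          "flow_lpm": ((110.0, 160.0), (90.0, 180.0))}
states, overall = evaluate({"temp_C": 84.0, "flow_lpm": 130.0}, limits)
print(states["temp_C"], overall)  # WARNING WARNING
```

Note that the LLM Assistant never calls anything like `evaluate`; it only reads the resulting states from the enriched stream, which is exactly the safety-barrier separation described above.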
Non-Deterministic Domain (Soft Real-Time: 1–2 s)
Analytical agents operate under relaxed timing constraints, enabling more computationally intensive AI processing:
  • Real-time Monitoring Service: Maintains a rolling window of enriched process data in a context buffer suitable for ad hoc queries (cip.Real-time channel). Answers operator queries requiring correlations, aggregate statistics, and trend analysis during the run without strict real-time constraints. Implements deterministic data frame operations (Pandas, Polars) for numerical queries and provides diagnostic pattern summaries for the LLM Assistant.
  • LLM Assistant Service: Provides conversational analytics over live process buffers and historical execution logs using locally deployed large language models (Qwen 2.5 via Ollama, cip.assistant channel). Operates in a retrieval-augmented generation (RAG) architecture:
    Maintains enriched process data (linguistic variables, supervisory states, trend indicators) in a context buffer.
    Answers free-form operator queries (“Why is flow lower than usual?”, “Is this temperature profile normal for Stage 3?”) by grounding responses in enriched time-series data and execution summaries, minimizing hallucination.
    Operates in read-only mode: cannot issue control commands or modify batch parameters, serving purely as a conversational analytics interface.
    This separation ensures that LLM inference latency (typically 1–2 s for 7B parameter models on NVIDIA GPU infrastructure) cannot affect deterministic supervision or safety-critical decision paths.
All agents communicate exclusively through the Event Bus using publish–subscribe channels, enabling horizontal scalability and independent lifecycle management (agents can be started, stopped, or upgraded without disrupting other services or batch executions).
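The grounding step of the LLM Assistant's RAG loop can be sketched as prompt assembly over the enriched context buffer. This is a hedged illustration: the `build_grounded_prompt` helper, its field names, and the instruction wording are assumptions, and the actual inference call to the local Qwen 2.5 model via Ollama is omitted.

```python
def build_grounded_prompt(question, context_buffer, summary):
    """Assemble a retrieval-augmented prompt: the free-form operator question is
    answered only against enriched buffer rows, which bounds hallucination.
    (Sketch only; field names are illustrative, and the prompt would be sent to a
    locally deployed Qwen 2.5 model via Ollama in the real system.)"""
    rows = "\n".join(
        f"- t={r['t']} s temp={r['temp']} ({r['temp_ling']}) flow={r['flow']} ({r['flow_ling']})"
        for r in context_buffer[-5:]  # ground on the last few enriched samples only
    )
    return (
        "You are a read-only CIP monitoring assistant. Answer strictly from the "
        "data below; if it is insufficient, say so.\n"
        f"Execution summary: {summary}\n"
        f"Recent enriched samples:\n{rows}\n"
        f"Operator question: {question}\n"
    )

buf = [{"t": 120, "temp": 78.2, "temp_ling": "Optimal",
        "flow": 118.0, "flow_ling": "SlightlyLow"}]
prompt = build_grounded_prompt("Why is flow lower than usual?",
                               buf, "Stage 3 of CIP run, 40% complete")
print("SlightlyLow" in prompt)  # True
```

Because the prompt contains linguistic variables rather than raw samples, the model reasons over the same semantic representations the operators see, and its read-only role is enforced structurally: nothing in this path can emit a control command.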

3.3.3. Signal Enrichment Pipeline

The Signal Enrichment Pipeline (visible in the right side of Figure 2) transforms raw sensor streams into semantically meaningful process representations suitable for both deterministic supervision and LLM-based reasoning. This pipeline operates between the Physical Layer and the Event Bus, implementing a three-stage transformation.
Stage 1: Raw Sensor Data Acquisition
High-frequency measurements (temperature, flow, conductivity, pressure, level) are sampled at 100–1000 ms intervals from PLC/SCADA systems via industrial protocols (MODBUS/TCP, Serial RS-232/RS-485, TTY) [40]. The Physical Layer publishes raw sensor readings (“Raw Sensor Data: Temp, Flow, Cond” in Figure 2) to the Event Bus, providing a decoupled interface that isolates the enrichment pipeline from specific PLC vendors or communication protocols.
Stage 2: AI-Driven Enrichment Processing
The Enrichment Agent (green box in Figure 2) applies multiple AI techniques in parallel:
  • Fuzzy Logic Systems: Transform numerical sensor readings into linguistic variables using trapezoidal membership functions (e.g., temperature ∈ {TooLow, Optimal, SlightlyHigh, Critical}). These linguistic assessments provide intuitive process state descriptions that operators can interpret directly and LLMs can reason over without requiring domain-specific fine-tuning.
  • Statistical Analysis: Compute rolling statistics (mean, standard deviation, rate of change) over sliding time windows (30 s, 5 min, 15 min) to detect process trends, oscillations, and drift patterns. Generate trend indicators (Stable, Increasing, Decreasing, Oscillating) and deviation scores (normalized distance from expected trajectory based on historical execution profiles).
These enrichment techniques execute in soft real time (processing latency typically 200–500 ms per sensor update), balancing computational cost with diagnostic value.
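Both enrichment techniques can be sketched in a few lines. The membership breakpoints, set names, and trend threshold below are illustrative assumptions; in deployment they are recipe- and stage-specific.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear ramps between."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

# Illustrative CIP temperature sets; real breakpoints come from the stage recipe.
TEMP_SETS = {
    "TooLow":       (-1e9, -1e9, 60.0, 68.0),
    "Optimal":      (60.0, 68.0, 80.0, 84.0),
    "SlightlyHigh": (80.0, 84.0, 86.0, 90.0),
    "Critical":     (86.0, 90.0, 1e9, 1e9),
}

def linguistic(value, sets=TEMP_SETS):
    """Assign the linguistic label with the highest membership degree."""
    return max(sets, key=lambda k: trapezoid(value, *sets[k]))

def trend(window, eps=0.05):
    """Rolling-trend indicator over a sliding window of samples (illustrative eps)."""
    slope = (window[-1] - window[0]) / max(len(window) - 1, 1)
    return "Increasing" if slope > eps else "Decreasing" if slope < -eps else "Stable"

print(linguistic(67.3), trend([66.8, 67.0, 67.1, 67.3]))  # Optimal Increasing
```

The outputs are exactly the kind of linguistic variable and trend indicator published to the Event Bus in Stage 3, ready for both rule evaluation and LLM reasoning.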
Stage 3: Enriched Data Publication
Enriched variables (“Enriched Data Streams: Linguistic Variables, Supervisory States” in Figure 2) are published to the Event Bus at 1–5 s intervals via the Publish channel, providing decision-support agents with high-level process representations. The enrichment frequency is deliberately lower than raw sensor sampling to reduce message throughput while still supporting real-time supervision.
This pipeline architecture enables LLM-based agents to reason over semantically meaningful process concepts (“Temperature is slightly high and increasing”) without requiring domain-specific model fine-tuning on raw numerical signals.

3.3.4. Data Persistence Layer: Three-Tier Storage Strategy

The Data Persistence Layer (bottom of Figure 2) implements a hierarchical storage architecture optimized for distinct query patterns, retention policies, and performance requirements.
Layer 1: Raw Time-Series (High-Frequency Operational Buffer)
  • Storage: Redis Streams with bounded in-memory circular buffers (10⁴ records per active batch, ∼5 MB).
  • Retention: Expired immediately upon batch completion.
  • Frequency: 100–1000 ms (native sensor sampling).
  • Usage: Real-time anomaly detection, sub-second diagnostic queries, Deterministic Supervisor lookback windows.
  • Performance: O(1) memory growth, O(log N) time-range queries, deterministic worst-case latency.
Layer 2: Enriched Time-Series (Medium-Term Trend Analysis)
  • Storage: Redis Streams with configurable retention policies (AOF persistence for warm data recovery).
  • Retention: Configurable based on real-time window definitions (activity-based: batch completion + TTL, time-based: rolling window with MAXLEN auto-eviction, event-based: context window around alarms).
  • Frequency: 1–5 s (AI-augmented variables: fuzzy states, trend indicators, anomaly scores).
  • Usage: Cross-cycle correlation queries (Monitoring Service), historical context queries (LLM Assistant), equipment health dashboards.
  • Performance: O(1) append (XADD), O(log N) time-range queries (XRANGE), bounded memory footprint via MAXLEN and TTL policies.
Layer 3: Execution Summaries (Long-Term KPI and Compliance Records)
  • Storage: Redis Hashes (per-batch summaries) and Sorted Sets (time-indexed batch archive), persisted via AOF/RDB snapshots.
  • Retention: Configurable TTL policies (30–365 days typical, indefinite for regulatory compliance via selective key persistence).
  • Data: Batch metadata (programme, circuit, timestamps), stage-level KPIs (duration, compliance metrics, set-point tracking error), diagnostic summary (alarm counts, supervisory state distribution), maintenance flags linked to equipment assets.
  • Usage: Longitudinal trend analysis (6–12 months), equipment lifecycle assessment (1000+ cycles), regulatory audit trails.
  • Performance: O(1) hash field access (HGETALL), O(log N + M) batch range queries (ZRANGEBYSCORE), efficient aggregations over time ranges without requiring relational joins.
This three-tier persistence strategy (visible as three blue boxes in Figure 2, connected via “Store raw data”, “Store enriched data”, and “Aggregate on completion” paths from the Event Bus) addresses a fundamental tension in industrial AI deployment: real-time supervision requires low-latency access to recent high-resolution data, while long-term analytics and compliance require efficient storage and querying of aggregated historical records.
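The tiering logic can be made concrete with a pure-Python stand-in. The `ThreeTierStore` class and its field names are illustrative only; in production, Tiers 1 and 2 are Redis Streams bounded by MAXLEN/TTL, and Tier 3 is a Hash plus a time-indexed Sorted Set.

```python
from collections import deque

class ThreeTierStore:
    """Illustrative stand-in for the three persistence tiers: a bounded raw ring
    buffer (Tier 1), an enriched stream (Tier 2), and per-batch KPI summaries (Tier 3)."""
    def __init__(self, raw_maxlen=10_000):
        self.raw = deque(maxlen=raw_maxlen)  # Tier 1: auto-evicting circular buffer
        self.enriched = []                   # Tier 2: 1-5 s enriched samples
        self.summaries = {}                  # Tier 3: batch_id -> KPI summary

    def append_raw(self, sample):            # O(1), like XADD with MAXLEN
        self.raw.append(sample)

    def append_enriched(self, sample):
        self.enriched.append(sample)

    def complete_batch(self, batch_id):
        """On batch completion: aggregate KPIs, then drop high-frequency data."""
        temps = [s["temp"] for s in self.enriched]
        self.summaries[batch_id] = {
            "samples": len(self.raw),
            "temp_avg": sum(temps) / len(temps),
            "warnings": sum(1 for s in self.enriched if s["state"] != "NORMAL"),
        }
        self.raw.clear()       # analogue of stream expiry on completion
        self.enriched.clear()

store = ThreeTierStore(raw_maxlen=3)
for t in range(5):
    store.append_raw({"t": t, "temp": 75 + t})
store.append_enriched({"temp": 77.0, "state": "NORMAL"})
store.append_enriched({"temp": 79.0, "state": "WARNING"})
store.complete_batch("CIP-2024-001")
print(store.summaries["CIP-2024-001"]["samples"],   # 3 (ring buffer capped)
      store.summaries["CIP-2024-001"]["warnings"])  # 1
```

The sketch shows the resolution of the tension named above: high-frequency data is memory-bounded and short-lived, while only compact summaries survive into long-term storage.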

3.3.5. Configurable Real-Time Windows: Application-Dependent Memory Management

The three-layer persistence strategy implements a more general principle: configurable real-time windows that define what data must reside in memory for fast deterministic decisions versus what can be loaded on-demand from disk-backed persistence. Unlike fixed temporal aggregation schemes (minute/hour/day hierarchies), the architecture allows window definitions to be configured based on process semantics, supporting three complementary patterns (Figure 3):
Pattern A: Activity-Based Windows
For batch processes with discrete execution lifecycles (CIP cleaning cycles, fermentation batches, maintenance tasks), the real-time window corresponds to the active activity duration:
  • Window Definition: Data during active batch execution (e.g., CIP cycle: ∼2 h, 1 Hz sampling, 7200 samples).
  • Memory Residency: HOT (in-memory Redis Streams, 5–7 MB per batch).
  • Transition Trigger: Batch completion event.
  • Post-Transition: Compute batch summary (HSET to Hash), expire raw stream (TTL 24 h), archive summary to Sorted Set (ZADD with completion timestamp as score).
  • Query Behavior: Active batch queries use deterministic path (XRANGE, <10 ms). Historical batch queries load summaries from Hashes (HGETALL, <10 ms if cached) or warm data from AOF/RDB (50–200 ms if evicted).
Pattern B: Time-Based Windows
For continuous monitoring applications (cold chain temperature tracking, inventory levels, ambient conditions), the real-time window is a rolling time duration:
  • Window Definition: Fixed rolling window (e.g., last 24 h, 1 min sampling, 1440 samples).
  • Memory Residency: HOT (in-memory Redis Stream with MAXLEN 1440, auto-evicting oldest entries, ∼500 KB per sensor).
  • Transition Trigger: Time-based eviction (oldest sample > 24 h old).
  • Post-Transition: Compute daily summary (min/max/avg), persist to Hash, expire raw data.
  • Query Behavior: Queries within 24 h use deterministic path (XRANGE, <10 ms). Queries beyond 24 h load daily/monthly summaries from Hashes or warm data from persistence.
Pattern C: Event-Based Windows
For alarm systems and fault detection (equipment failures, safety events, anomaly tracking), the real-time window is defined by active event context:
  • Window Definition: Context buffer around active alarms (e.g., last 100 critical events and ±15 min of sensor data).
  • Memory Residency: HOT (in-memory Redis Stream with MAXLEN 100 events, ∼200 KB).
  • Transition Trigger: Alarm resolution and cooldown period.
  • Post-Transition: Persist alarm summary with context snippet, expire full context data.
  • Query Behavior: Active alarm queries use deterministic path. Forensic analysis of resolved alarms loads context from warm storage.
Hybrid Multi-Pattern Operation
A single application can maintain multiple window patterns simultaneously. For example, the CIP deployment uses:
  • Activity windows: Active batch monitoring (5 MB per batch, transitions on batch completion).
  • Time windows: Equipment health trends (500 KB for last 24 h, rolling eviction).
  • Event windows: Critical alarm tracking (200 KB for last 100 events, eviction on resolution).
Total real-time memory footprint: ∼5.7 MB for concurrent supervision of activity, time, and event patterns, yet supporting multi-year historical queries through transparent warm data loading from Redis AOF/RDB persistence (50–200 ms latency).
Adaptive Memory Tiering
Data residency transitions through three tiers based on window membership, as illustrated in Figure 3:
  • HOT (In-Memory): Data within active real-time windows, accessed via Redis Streams/Hashes in <10 ms, supports deterministic queries without LLM overhead (e.g., “current batch temperature” → direct XRANGE query).
  • WARM (Disk-Backed, Auto-Load): Recently transitioned data (e.g., yesterday’s batches, previous week’s equipment health), persisted via Redis AOF (1 s durability) and RDB snapshots (5 min intervals), loaded on-demand in 50–200 ms and cached for subsequent queries.
  • COLD (Archived): Data beyond retention policy (e.g., >365 days), expired via TTL, manual export for regulatory purposes only.
The query router automatically detects whether a query targets HOT data (deterministic path, <10 ms, no LLM) or WARM data (analytical path, auto-load + LLM, 1–2 s total latency). Application code is unaware of this distinction; the system transparently handles data loading based on window membership checks.
Figure 3 illustrates the complete configurable window architecture, showing the three window patterns (activity-based, time-based, and event-based), the three data residency tiers (HOT, WARM, COLD), and the query routing mechanism that automatically selects between deterministic and analytical paths based on data location.
This configurable window architecture addresses a fundamental design challenge: what constitutes real-time is application-dependent. CIP batches require second-by-second monitoring during active execution, while inventory tracking may only need hourly snapshots. By decoupling window definitions from storage implementation, the architecture supports diverse industrial use cases (batch processing, continuous monitoring, alarm management) within a unified Redis-native persistence framework, achieving bounded memory consumption (5–20 MB typical) regardless of operational history depth while maintaining queryability over multi-year time ranges through automatic warm data loading.
Table 2 summarizes the window pattern configurations for different industrial applications.

3.3.6. Physical Layer Integration

The Physical Layer (bottom-right of Figure 2) encompasses:
  • PLC/SCADA Systems (Cyclic Execution: 50–100 ms): Execute deterministic control loops governing valve actuation, pump operation, heating/cooling regulation, and safety interlocks. Maintain real-time control authority over physical equipment according to existing plant standards.
  • Industrial Process Equipment: Physical assets including reactors, heat exchangers, pumps, valves, sensors, and actuators. In the CIP deployment, this comprises cleaning circuits, chemical dosing systems, and sanitization equipment.
The architecture integrates non-invasively through:
  • Passive Monitoring (read-only): Signal Enrichment Pipeline subscribes to sensor data via MODBUS/Serial (“MODBUS / SERIAL / TTY / RS-232 / RS-485” path in Figure 2), without modifying PLC logic.
  • Safety-Validated Command Injection (write-restricted): Deterministic Supervisor issues emergency stops via “Safety Validation DETERMINISTIC” path after constraint validation, bypassing non-deterministic agents.
Critically, no PLC firmware modifications or control logic rewriting are required for deployment, enabling AI-enhanced supervision in brownfield plants without triggering regulatory recertification or production downtime.

3.4. Process-Agnostic Design and Instantiation Examples

Table 3 illustrates how the same architectural components map to different batch process types, with process-specific parameters configured at deployment. The CIP case study discussed in this paper corresponds to the first instantiation column; analogous deployments for fermentation or distillation would reuse the same components with different recipes, variables and rules.
This abstraction enables the same codebase and agent logic to supervise multiple process types, with only rules, ontologies and variable mappings reconfigured per application. The CIP evaluation in Section 6 demonstrates this pattern in production; fermentation and distillation deployments would follow the same architecture with domain-specific enrichment and diagnostic rules.

3.5. Rationale for a Layered, Agent-Based Design

Instead of deploying a single, monolithic large language model (LLM) with access to all plant data and control interfaces, the proposed architecture adopts a layered, agent-based design. This choice is motivated by several considerations:
  • Context management and efficiency: Industrial environments produce high-volume, heterogeneous data across multiple batch lines and equipment. Concentrating all information into a single LLM context would be inefficient and difficult to control. By separating operator interaction, orchestration and specialized agents, each component only handles the subset of data and functions relevant to its role, enabling smaller, faster contexts and more predictable behavior.
  • Safety and decentralization of intelligence: Safety-critical logic (e.g., interlocks, sequence enforcement, hard alarms) remains in deterministic agents and PLC/SCADA systems, while LLM-based components are confined to explanatory and analytical roles. This decentralization avoids granting a single LLM direct authority over control actions and supports explicit validation paths for any recommendation before it affects the process.
  • Flexibility and evolution: Food, beverage and chemical plants often operate under maquila-like conditions, with frequent product changes, contract manufacturing and evolving cleaning or processing requirements. A layered architecture with loosely coupled agents and a generic data hub allows new analysis functions, additional lines or updated recipes to be introduced without redesigning the entire system, supporting continuous adaptation and incremental deployment.
  • Scalability across services and sites: As production scales to multiple lines, services or sites, the same pattern can be replicated: interaction and orchestration remain similar, while additional agent instances are deployed per line or plant. This aligns with microservice and Industry 4.0 principles, enabling horizontal scale-out and reuse of components across different customers and service contracts.
In summary, the layered architecture decentralizes intelligence across specialized agents optimized for specific intentions (control, configuration, deterministic supervision, real-time analytics and offline analysis), reduces the need for large, global LLM contexts, and preserves safety and scalability in settings where processes, products and cleaning or conversion requirements evolve continuously.

3.6. Process-Aware Context Management

The decision-support layer is implemented as a set of loosely coupled agents coordinated by an orchestrator. Each agent is process-aware: upon receiving a request or detecting a new batch execution (for example, a CIP cycle, fermentation batch or distillation run), it loads the corresponding context (programme, equipment configuration, current stage, enriched variables and rule-based specifications) and subscribes to the relevant data streams. Within this context, the agent produces supervisory states, notifications, reports or language-based explanations that are specific to the active batch run.
In the CIP deployment, the context includes cleaning circuit, tank, programme, stage and key variables such as temperature, flow and conductivity. In a fermentation deployment, it would instead comprise vessel, recipe, inoculum, growth phase and variables such as pH, dissolved oxygen and biomass proxies. In distillation, it would include column configuration, feed composition and reflux schedule. This process-centric abstraction enables agents to be added or updated incrementally: new agents—whether based on rules, fuzzy logic, neural networks or large language models—can subscribe to the same process contexts and publish their outputs to the common message bus, without requiring changes to the underlying control logic or to other agents.

3.7. AI-Specific Technical Challenges and Solutions

The integration of LLM-based analytics into industrial batch supervision requires addressing several technical challenges that are specific to AI deployment in safety-critical, real-time operational environments with legacy infrastructure. This section explicitly identifies these challenges and presents the architectural solutions implemented in the proposed system to enable reliable AI-enhanced decision support in production environments.

3.7.1. Challenge 1: Non-Deterministic AI in Real-Time Safety-Critical Contexts

Problem
Large language models exhibit non-deterministic behavior due to temperature-based sampling, context-dependent inference latency (1–5 s), and potential for hallucination. Industrial batch processes require deterministic supervision with bounded response times (<100 ms for safety-critical state evaluation) and guaranteed fail-safe behavior. This timing incompatibility prevents direct integration of LLMs into control loops, as variable inference latency would violate real-time guarantees required by industrial safety standards.
Solution
The architecture enforces strict temporal domain separation (Section 3.3) to isolate safety-critical functions from non-deterministic AI processing:
  • Deterministic Domain (<100 ms): Safety-critical agents (Master Controller, Deterministic Supervisor) execute rule-based logic with O(1) worst-case complexity, maintaining real-time guarantees independent of LLM availability.
  • Non-Deterministic Domain (1–2 s): LLM Assistant and Monitoring Service operate under relaxed timing constraints, providing advisory analytics without affecting control loops.
  • Safety Barrier: The Deterministic Supervisor validates all commands before actuator execution, ensuring that LLM-generated recommendations cannot trigger unsafe operations even in the event of model misbehavior or inference failure.
This architectural pattern enables LLM-based decision support to coexist with safety-critical control without compromising real-time guarantees or requiring modifications to existing PLC/SCADA logic, addressing a key deployment barrier for AI in brownfield industrial facilities.
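The safety-barrier idea can be sketched as a deterministic validator with O(1) checks through which every command, including any LLM-suggested action, must pass before reaching actuators. The constraint values and command names below are invented for illustration.

```python
# Hedged sketch of the safety barrier: unknown commands are rejected
# fail-safe; known set-points are clamped to hard limits; the emergency
# stop always passes.
HARD_LIMITS = {"temp_setpoint": (5.0, 90.0), "flow_setpoint": (0.0, 50.0)}
SAFE_COMMANDS = {"EMERGENCY_STOP"}  # always allowed

def validate_command(cmd: dict) -> bool:
    if cmd.get("name") in SAFE_COMMANDS:
        return True
    lo, hi = HARD_LIMITS.get(cmd.get("name"), (None, None))
    if lo is None:
        return False  # unknown command: reject fail-safe
    return lo <= cmd["value"] <= hi
```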

3.7.2. Challenge 2: Bounded Context Windows for Industrial Time-Series

Problem
Industrial batch processes generate high-frequency time-series data (100–1000 ms sensor sampling) over extended execution periods (30–120 min per cycle). A typical CIP execution produces 10^4–10^5 raw data points, far exceeding the context window limits of current LLMs (4k–128k tokens). Direct consumption of raw sensor streams is computationally infeasible and semantically inappropriate: LLMs trained on natural language lack domain-specific understanding of numerical process signals (e.g., distinguishing a 2 °C temperature spike from normal thermal inertia, interpreting flow oscillations as pump cavitation vs. control valve hunting).
Solution
The Signal Enrichment Pipeline (Section 3.3) transforms raw sensor data into semantically meaningful representations suitable for LLM reasoning:
  • Fuzzy Logic Transformation: Converts numerical readings into linguistic variables (“Temperature: Optimal”, “Flow: Slightly Low”) that align with natural language semantics.
  • Statistical Aggregation: Computes trend indicators (Stable/Increasing/Decreasing) and deviation scores over sliding windows, reducing data dimensionality by 10–100× while preserving diagnostic information.
  • Supervisory State Abstraction: Rule-based agents pre-compute discrete process states (NORMAL/WARNING/CRITICAL) that provide high-level context summaries, enabling LLMs to reason over process trajectories without accessing raw sensor buffers.
This enrichment strategy addresses the fundamental mismatch between industrial data formats (high-frequency numerical time-series) and LLM input representations (tokenized natural language text), enabling conversational analytics without requiring domain-specific model fine-tuning or expensive retraining on plant-specific sensor data.

3.7.3. Challenge 3: Hallucination Prevention in Safety-Critical Environments

Problem
LLMs are prone to generating plausible-sounding but factually incorrect responses when queried beyond their training distribution or when context is insufficient. In industrial settings, hallucinated diagnostic recommendations (e.g., false equipment failure predictions, incorrect procedural advice) can erode operator trust, trigger unnecessary maintenance interventions, or delay appropriate responses to genuine anomalies. Unlike consumer applications where hallucination is an inconvenience, operational decision support requires factual accuracy: incorrect diagnostics have economic consequences (unplanned downtime costs of USD 10k–USD 100k per hour in beverage/pharma plants) and safety implications.
Solution
The architecture implements a Retrieval-Augmented Generation (RAG) pattern with deterministic grounding:
  • Context Grounding: All LLM queries are augmented with retrieved execution data (enriched time-series, supervisory states, KPI summaries) from the Data Persistence Layer, ensuring that responses reference actual process observations rather than model priors.
  • Deterministic Fallback: For structured queries (remaining time, stage progress, alarm counts), the Real-time Monitoring Service provides deterministic numerical answers computed via database queries, bypassing the LLM entirely when factual precision is required.
  • Source Attribution: LLM responses include explicit references to data sources (timestamp ranges, variable names, threshold values), enabling operators to validate claims against raw data and reject unsupported recommendations.
This grounding mechanism enables the architecture to deploy generative AI in operational contexts while mitigating the risk of confabulation in diagnostic workflows. The spot-check evaluation (Section 6) demonstrates median absolute error below 3% between LLM-reported values and ground-truth logs, validating the effectiveness of this approach in practice.

3.7.4. Challenge 4: Resource Constraints in Edge Deployment

Problem
Industrial plants require on-premise AI deployment on edge computing infrastructure (local servers, industrial PCs) with constrained computational resources (16–64 GB RAM, consumer-grade GPUs) due to data sovereignty requirements, network reliability concerns, and regulatory restrictions that preclude cloud API access. Large-scale LLMs (70B+ parameters) requiring distributed inference or cloud API access are incompatible with these constraints. This contrasts with academic AI research and commercial applications that typically assume access to high-performance computing clusters or cloud-hosted inference services.
Solution
The architecture targets locally deployable, resource-efficient LLMs:
  • Model Selection: Qwen 2.5 (7 billion parameters) running via Ollama provides sub-2 s inference on NVIDIA RTX GPUs while maintaining sufficient reasoning capability for diagnostic analytics.
  • Bounded Memory Footprint: The three-layer data persistence strategy (Section 3.3) ensures O(1) memory consumption per active batch (∼5 MB raw buffer + 2–10 MB enriched context), enabling concurrent supervision of multiple lines on shared infrastructure.
  • Asynchronous Inference: LLM queries execute in parallel with deterministic supervision, allowing model inference latency to be absorbed during operator wait times without affecting real-time control performance.
This resource-aware design demonstrates that effective LLM-based decision support does not require frontier model scale, enabling deployment in resource-constrained industrial environments typical of brownfield facilities.

3.7.5. Summary: AI-Specific Contributions

The proposed architecture addresses four fundamental technical challenges that arise specifically from the integration of non-deterministic, context-limited, hallucination-prone generative AI into safety-critical, real-time, resource-constrained industrial environments:
  • Temporal domain separation to preserve real-time safety guarantees despite non-deterministic AI components.
  • Signal enrichment pipelines to bridge the semantic gap between numerical process data and natural language LLM inputs.
  • RAG-based grounding to prevent hallucination and ensure factual accuracy in diagnostic recommendations.
  • Resource-efficient deployment to enable local inference under edge computing constraints.
These solutions constitute contributions to the intersection of industrial automation and AI, demonstrating how LLM-based decision support can be deployed in production environments with legacy infrastructure, safety requirements, and resource constraints that distinguish industrial contexts from the controlled laboratory or cloud-based settings typical of AI research.

4. Real-Time Data Management and LLM Integration

The proposed architecture explicitly balances memory usage, query latency and quality of assistance by combining compact, enriched data buffers with decentralized analysis agents that support both immediate supervisory decisions and longitudinal trend analysis for preventive maintenance.

4.1. Bounded In-Memory Buffers and Latency

Each active CIP instance maintains an in-memory buffer with a configurable maximum number of records (MAXLEN policy), corresponding to a few hours of operation at the given sampling rate, instead of persisting the entire plant history in the LLM context. This rolling dataset is exposed through multiple logical windows: a short horizon (e.g., last 2 min) for real-time state estimation, a medium horizon (e.g., last 5–10 min) for periodic diagnostics and preventive warning detection, and a full-cycle view for exploratory real-time queries, cross-cycle trend analysis and offline post-run analysis. By keeping buffer size bounded, deterministic statistics and LLM prompts can be computed with near-constant time and memory, even as the plant scales to more CIP circuits or higher sampling frequencies.

4.2. Enriched, Decentralized Data Views

Incoming sensor records are enriched in real time with additional attributes such as CIP program and stage labels, elapsed and remaining time, progression percentage, fuzzy quality indices and short textual descriptors of detected issues, recommended actions and maintenance implications. These enriched records are partitioned per CIP instance and per agent, so that each agent only receives the subset of variables needed for its function (e.g., deterministic supervision windows for immediate alerts, real-time analytics buffers for diagnostic interpretation, offline archives for longitudinal trend correlation with maintenance records), avoiding a single centralized, monolithic dataset. This decentralization reduces contention and enables real-time support for diagnostics, preventive warning generation and maintenance-oriented decisions at the agent level, while still allowing higher-level aggregation when required for cross-cycle trend analysis.

4.3. Compact LLM Contexts and Efficient Queries

To avoid high latency and cost, the LLM never receives raw, unfiltered streams; instead, the real-time analytics agent constructs compact prompts that combine aggregate statistics over the relevant window, a small set of representative samples (e.g., most recent records or identified outliers) and contextual metadata such as program, stage, circuit and recent alert patterns. This strategy keeps context size small and stable while preserving the information required for meaningful explanations, diagnostic pattern recognition and preventive maintenance recommendations, making it feasible to answer natural-language queries with response times compatible with operator workflows in real plants. The integration layer treats the LLM as a pluggable component accessed through a uniform API, so different on-premise or cloud models can be used without changing the surrounding data management strategy.

4.4. Lowering the Expertise Barrier for Real-Time Decisions and Maintenance Planning

Because diagnostic windows and enriched attributes are computed deterministically and exposed in structured form, operators receive interpretable summaries (e.g., state class, quality grade, main issues, suggested actions and maintenance implications) without having to interpret raw trends or write complex analytical queries. The LLM layer then builds on these structured diagnostics to provide natural-language explanations, cross-cycle trend comparisons and preventive maintenance recommendations, allowing less specialized staff to understand CIP performance, identify emerging equipment degradation patterns and take timely decisions without requiring deep training in control theory, statistical analysis or maintenance planning. This combination of bounded, enriched buffers and layered LLM integration directly supports real-time analysis, diagnostic interpretation and maintenance-oriented decision support in complex industrial environments while keeping computational and training costs under control.

4.5. Illustrative Data Views and Query Profiles

Table 4 summarizes the main data windows maintained per CIP instance and their intended use while Table 5 illustrates typical query types, underlying sources and expected response times.

4.6. Fuzzy Enrichment of Real-Time CIP Data

Raw CIP sensor readings alone (e.g., temperature, flow, conductivity, volume) are often difficult to interpret directly in the control room, especially when multiple variables must be considered simultaneously or when subtle degradation patterns must be distinguished from normal process variability. To provide operators and downstream agents with more actionable information, the system applies a fuzzy evaluation layer in real time, which maps continuous variables and their deviations from nominal profiles into linguistic assessments, quality indices and maintenance-oriented diagnostic labels.
For each time-stamped record, the enrichment pipeline:
  • normalizes the raw measurements with respect to the target CIP program and current stage (e.g., expected temperature, flow and conductivity ranges for an alkaline wash);
  • evaluates a set of fuzzy rules that capture expert knowledge, such as “temperature slightly low but within tolerance” or “flow persistently below minimum, suggesting pump wear”, combining multiple variables and short-term trends; and
  • outputs a discrete state label (e.g., NORMAL, WARNING, CRITICAL), a continuous quality grade in [0, 1], and short textual descriptors summarizing the main issues, recommended actions and maintenance implications.
The resulting enriched record contains, in addition to the raw sensor values and timestamps, fields such as stage, progress_percent, remaining_time, state, quality_grade, motives and actions, as well as equipment and circuit identifiers and auxiliary counters (e.g., number of recent warnings or critical samples). Table 6 shows a simplified example of such a record.
From a resource perspective, the bounded-buffer design keeps the memory footprint per CIP instance relatively small: for example, a buffer of 10^3–10^4 records with a dozen numeric and categorical fields typically remains in the order of a few megabytes in memory, even when multiple CIP lines are monitored concurrently. This is modest compared to the memory requirements of the LLM itself and allows deterministic statistics and prompt construction to execute with predictable latency. At the same time, the enriched representation and pre-computed diagnostics provide enough information for operators to take online decisions directly from the generated summaries and explanations, including identification of preventive maintenance opportunities and equipment health trends, without resorting to external tools or manual data export. In practice, this combination of low in-memory cost and high decision support value—spanning immediate supervisory needs and longer-term diagnostic interpretation—is essential for deploying AI-assisted supervision in industrial environments where hardware resources and real-time constraints are non-negotiable.

4.7. Computational Resource Profile

Table 7 summarizes the computational footprint of the main system components in the deployed configuration. The memory requirements per CIP instance remain modest (order of a few megabytes for the enriched buffer), while the LLM service (Qwen 2.5 7B via Ollama) operates as a shared resource across all active CIP lines, with inference latencies compatible with interactive operator workflows and real-time diagnostic queries.
Because the LLM runs locally on NVIDIA GPU infrastructure, query latencies remain predictable and do not depend on external cloud API availability or network conditions. This edge deployment strategy addresses the computational and energy constraints inherent in cloud-based LLM architectures for IIoT environments [41] while maintaining data sovereignty and reducing communication overhead. Recent work on edge-cloud collaboration for LLM task offloading in industrial settings has demonstrated that local inference can reduce latency by 60–80% compared to cloud-based alternatives [41], supporting the architectural decision to deploy Qwen 2.5:7B on dedicated GPU infrastructure rather than relying on external API services.

5. Evaluation Methodology

5.1. Evaluation Objectives

This study evaluates the architectural feasibility and operational capability of the proposed hybrid AI framework in a real industrial environment, rather than pursuing large-scale statistical validation of process performance improvements.
Accordingly, the evaluation focuses on how the decision-support layer behaves when exposed to authentic CIP executions, and whether it fulfils its intended role as a real-time, operator-oriented supervisory layer that supports diagnostic interpretation and preventive maintenance planning. The study addresses the following research questions:
  • RQ1: Can the architecture maintain coherent supervisory states across diverse CIP conditions?
  • RQ2: Do the agents respond within application-level real-time constraints during production operation?
  • RQ3: Does the LLM-based analytics layer provide grounded, numerically consistent explanations over enriched process data?
  • RQ4: Is the system deployable and operable in an actual CIP installation without disrupting existing PLC/SCADA programmes?
  • RQ5: Can the architecture differentiate nominal, preventive and diagnostic alert patterns in a way that is meaningful for maintenance decision-making?
The primary goal is thus to demonstrate that the architecture can be instantiated, operated and queried in a working plant, and that its supervisory outputs remain coherent, stable, data-grounded and diagnostically useful over complete CIP executions.

5.2. Case Selection Rationale

Rather than sampling a large number of nearly identical cycles, the evaluation adopts a case study approach using three representative CIP executions drawn from ongoing production at the VivaWild Beverages plant. These executions were selected to span the typical operational envelope observed in day-to-day operation and to illustrate complementary diagnostic scenarios: (a) a fully nominal baseline, (b) executions with preventive warnings that do not compromise stage success, and (c) executions with more pronounced deviations that generate diagnostic alerts under operational stress.
These three executions were purposively selected from 24 complete CIP runs monitored by the system during a six-month deployment period at the VivaWild Beverages plant. The selection criterion was to span the diagnostic spectrum observed during deployment: nominal baseline, preventive warning patterns, and diagnostic alert conditions that remain operationally successful but provide evidence of the architecture’s ability to differentiate equipment health states meaningful for maintenance planning. Table 8 summarizes the three selected cases and their respective diagnostic rationales.
These cases are representative of normal plant operation. The CIP process at VivaWild is governed by standard operating procedures, automated recipe management and regulatory constraints, and exhibits high repeatability across executions in terms of temperature, flow and conductivity profiles.
In optimized industrial plants, CIP failures are rare by design; most executions complete successfully within specifications. The diagnostic challenge is therefore not detecting catastrophic faults—which traditional threshold-based alarms handle effectively—but rather identifying subtle, longitudinal degradation patterns in executions that still meet regulatory specifications. For example, flow rates declining 2% per cycle over five consecutive executions due to gradual pump wear may generate no critical alarms (each execution remains above the minimum regulatory threshold), yet this pattern signals actionable maintenance opportunities that can prevent unplanned downtime.
From this perspective, analyzing 100 consecutive nominally successful CIP runs would provide strong evidence of system stability and low false-alarm rates, but no evidence of diagnostic capability—it would merely confirm that the system does not fail when the process does not fail. Conversely, analyzing a small number of carefully chosen executions spanning nominal baseline, preventive warning scenarios and diagnostic alert regimes provides meaningful evidence about the architecture’s ability to differentiate operational conditions that are relevant for maintenance decision-making, extract equipment health insights from executions that meet specifications, and distinguish acceptable process variability from systematic equipment drift requiring intervention before operational impact occurs. This case study validation approach is appropriate for establishing architectural feasibility and diagnostic behavior in real industrial conditions.
Importantly, the system is deployed continuously and monitors all CIP executions; the three cases are used for detailed analysis and illustration of the architecture’s diagnostic behavior. In routine operation, the value of the system emerges from the accumulation and inspection of alert patterns across multiple cycles (e.g., repeated flow warnings indicating pump degradation, recurring temperature deviations suggesting boiler maintenance). Extending the present analysis to longitudinal trend quantification and correlation with maintenance records is left for future work.
The objective of this evaluation is therefore architectural and operational validation of the decision-support layer, not statistical inference about long-term process improvements across all historical CIP data.

5.3. Evaluation Setup

The evaluation concentrates on the behavior of the multi-agent decision-support architecture on top of the existing PLC/SCADA control layer, which remains unchanged. For each of the selected CIP executions, the agents subscribe to the live data streams and CIP events, maintain their internal contexts, and produce supervisory states, alerts, diagnostics and language-based explanations as they would during routine plant operation.
The study examines whether the architecture:
  • tracks the progression of each CIP stage and maintains coherent discrete supervisory labels;
  • provides timely detection of out-of-spec situations at the supervisory level and differentiates nominal operation from conditions that warrant preventive or more urgent maintenance; and
  • generates natural-language diagnostics and summaries that remain consistent with the enriched data seen by the agents and are usable for maintenance-oriented decision-making.
No changes are made to the underlying low-level controllers, so any deviations in stage-specification compliance reflect actual plant behavior rather than experimental manipulation.

5.4. Datasets

The experiments use logs collected from complete CIP runs covering alkaline, sanitizing and final rinse stages. Each second, the system records:
  • process variables such as instantaneous and windowed flow, temperature, conductivity, pH and accumulated volume;
  • stage-level information (CIP programme, circuit, tank, current stage and progression);
  • supervisory outputs, including discrete labels (NORMAL, WARNING, CRITICAL), fuzzy quality indices and structured reasons describing the current situation (e.g., parameters within range, low flow conditions, insufficient volume); and
  • selected responses from the LLM-based analytics agent, including numerical summaries and narrative explanations.
For each run, the enriched records thus combine raw sensor values with derived features such as moving averages, diagnostic counters (e.g., recent warnings or critical samples) and identifiers of the active CIP programme, circuit and tank. This representation allows the evaluation to relate the behavior of the decision-support layer to concrete process trajectories and stage-level specifications and to verify the numerical consistency of language-based outputs.
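To make this enrichment step concrete, the following sketch maintains a rolling window per process instance and derives a windowed average and a recent-warning counter for each incoming sample. This is a minimal Python illustration; the field names, window length and counter logic are hypothetical simplifications of the deployed agents.

```python
from collections import deque

class RollingEnricher:
    """Per-second enrichment sketch: a moving average and a warning counter
    over a fixed-length window (hypothetical fields and window size)."""

    def __init__(self, window_s=60):
        self.flow = deque(maxlen=window_s)       # recent flow samples
        self.warnings = deque(maxlen=window_s)   # 1 if sample was non-NORMAL

    def push(self, sample):
        """Ingest one raw sample and return the enriched record."""
        self.flow.append(sample["flow"])
        self.warnings.append(1 if sample["label"] != "NORMAL" else 0)
        return {
            **sample,
            "flow_avg": sum(self.flow) / len(self.flow),
            "recent_warnings": sum(self.warnings),
        }

enricher = RollingEnricher(window_s=3)
out = None
for s in [{"flow": 1.0, "label": "NORMAL"},
          {"flow": 0.8, "label": "WARNING"},
          {"flow": 1.2, "label": "NORMAL"}]:
    out = enricher.push(s)
print(out["flow_avg"], out["recent_warnings"])  # 1.0 1
```

In the deployed system, analogous enriched records are what both the rule-based agents and the LLM analytics agent consume, so derived features never have to be recomputed per query.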

5.5. Evaluation Metrics

On top of these logs, the methodology defines a set of quantitative metrics that capture complementary aspects of the decision-support behavior of the architecture. The metrics considered in this work cover:
  • stage-specification compliance (time spent within prescribed operating ranges);
  • consistency between supervisory labels and process conditions;
  • temporal stability of the labels; and
  • consistency between language-based explanations and the underlying enriched data.
Additional metrics such as alert sensitivity, specificity, reaction time and anticipation window are formally defined as part of the framework to support future, larger-scale evaluations. In the present case study campaign, quantitative results focus on compliance, label consistency and stability, while sensitivity-related metrics are inspected qualitatively in the three selected executions.

5.5.1. Stage Specification Compliance

This metric quantifies to what extent each CIP stage operates within its predefined process specification (e.g., flow, temperature and pH ranges). It measures the proportion of time during which the relevant variables remain inside their acceptable bands, providing a stage-level quality indicator that is independent of the underlying controller implementation.
We let a stage $s$ be sampled at discrete time instants $t = 1, \ldots, T_s$ and let $x_t$ denote the vector of monitored variables at time $t$. For each stage, a specification function $C_s(x_t)$ returns 1 if all relevant variables lie within their prescribed ranges at time $t$ and 0 otherwise. The stage specification compliance is then defined as
$$\mathrm{Compliance}(s) = \frac{1}{T_s} \sum_{t=1}^{T_s} C_s(x_t).$$
In the experiments, this metric is computed separately for alkaline, sanitizing and final rinse stages for each of the three representative executions, and then discussed on a case-by-case basis to characterize how the decision-support layer behaves under nominal, variable and more demanding conditions.
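As an illustration, Compliance(s) reduces to a per-sample band check averaged over the stage. The Python sketch below uses hypothetical variable names and illustrative band values, not the plant's actual specifications.

```python
def stage_compliance(samples, spec):
    """Fraction of samples whose monitored variables all lie within spec.

    samples: list of dicts mapping variable name -> value (one per second)
    spec:    dict mapping variable name -> (low, high) acceptable band
    """
    def in_spec(x):
        return all(lo <= x[var] <= hi for var, (lo, hi) in spec.items())

    return sum(in_spec(x) for x in samples) / len(samples)

# Hypothetical alkaline-stage bands (illustrative values only).
alkaline_spec = {"flow": (0.9, 1.3), "temp": (70.0, 80.0), "ph": (11.0, 13.0)}
samples = [
    {"flow": 1.1, "temp": 74.2, "ph": 12.1},
    {"flow": 0.8, "temp": 73.9, "ph": 12.0},  # flow below band -> non-compliant
    {"flow": 1.0, "temp": 75.1, "ph": 12.2},
]
print(stage_compliance(samples, alkaline_spec))  # 2 of 3 samples in spec
```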

5.5.2. Alert Sensitivity and Specificity

This metric assesses how accurately the architecture detects out-of-spec situations by comparing generated alerts against process conditions derived from the logged variables in terms of sensitivity (recall) and specificity.
We let each time instant $t$ within a stage be associated with (i) a binary ground-truth anomaly label $y_t \in \{0, 1\}$, obtained by applying stage-specific rules to the process variables (e.g., low flow, out-of-range temperature or pH), and (ii) a binary alert decision $\hat{y}_t \in \{0, 1\}$, where $\hat{y}_t = 1$ if the architecture issues a WARNING or CRITICAL label and $\hat{y}_t = 0$ otherwise.
Over a set of time instants $\mathcal{T}$, the sensitivity and specificity are defined as
$$\mathrm{Sensitivity} = \frac{\sum_{t \in \mathcal{T}} \mathbb{I}(y_t = 1 \wedge \hat{y}_t = 1)}{\sum_{t \in \mathcal{T}} \mathbb{I}(y_t = 1)},$$
$$\mathrm{Specificity} = \frac{\sum_{t \in \mathcal{T}} \mathbb{I}(y_t = 0 \wedge \hat{y}_t = 0)}{\sum_{t \in \mathcal{T}} \mathbb{I}(y_t = 0)},$$
where $\mathbb{I}(\cdot)$ denotes the indicator function. These metrics establish a complete evaluation framework for future longitudinal studies. Their systematic quantification is deferred to future work due to the limited number of confirmed critical episodes in the present case study deployment and the need for comprehensive ground-truth annotations integrated with maintenance records across larger execution cohorts.
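Given per-sample ground-truth labels and alert decisions, both metrics reduce to simple counts, as in the following sketch (the sequences are illustrative only):

```python
def sensitivity_specificity(y_true, y_pred):
    """Per-sample sensitivity (recall on anomalies) and specificity.

    y_true: iterable of 0/1 ground-truth anomaly labels
    y_pred: iterable of 0/1 alert decisions (1 = WARNING or CRITICAL issued)
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)            # number of anomalous instants
    neg = len(y_true) - pos      # number of nominal instants
    return tp / pos, tn / neg

# Illustrative sequences: one missed anomaly, one spurious alert.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
```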

5.5.3. Reaction Time to Anomalies

This metric quantifies how quickly the architecture reacts once a process variable crosses an out-of-spec threshold by measuring the delay between the onset of an anomaly and the first alert (WARNING or CRITICAL) raised by the system.
We let each anomaly episode $k$ be characterized by (i) an onset time $t_k^{\mathrm{onset}}$, when a ground-truth condition $y_t = 1$ becomes true (e.g., flow drops below a minimum threshold and remains there), and (ii) an alert time $t_k^{\mathrm{alert}}$, corresponding to the first instant $t \geq t_k^{\mathrm{onset}}$ for which $\hat{y}_t = 1$.
The reaction time for episode $k$ is defined as
$$\Delta t_k^{\mathrm{react}} = t_k^{\mathrm{alert}} - t_k^{\mathrm{onset}}.$$
Over a set of episodes $\mathcal{K}$, the average reaction time can be computed as
$$\mathrm{ReactionTime} = \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \Delta t_k^{\mathrm{react}}.$$
In the present study, reaction times are inspected qualitatively for the selected cases; a systematic computation across larger datasets is left for future work.
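Under these definitions, per-episode reaction times can be extracted from the paired label sequences. The sketch below is a simplification that assumes 1 Hz sampling (one sample equals one second) and only credits alerts raised while the episode is still active.

```python
def reaction_times(y_true, y_pred):
    """Delay (in samples) from each anomaly onset to the first alert at or
    after it. Episodes are maximal runs of y_true == 1."""
    times, in_episode = [], False
    for t, truth in enumerate(y_true):
        if truth == 1 and not in_episode:            # episode onset
            in_episode, onset, alerted = True, t, False
        if in_episode and not alerted and y_pred[t] == 1:
            times.append(t - onset)                  # first alert for episode
            alerted = True
        if truth == 0:                               # episode ends
            in_episode = False
    return times

# Anomaly starts at t=2; first alert at t=4 -> reaction time of 2 samples.
y_true = [0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 1, 0]
print(reaction_times(y_true, y_pred))  # [2]
```

The average reaction time is then the mean of the returned list over all episodes.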

5.5.4. Anticipation Window

This metric evaluates how early the architecture warns about conditions that may compromise the success of a CIP stage, such as insufficient final rinse volume or prolonged low-flow operation. It is defined as the time margin between the first relevant alert and the occurrence of a stage-level failure or constraint violation.
For each stage-level failure event $j$, we let (i) $t_j^{\mathrm{fail}}$ denote the time at which the failure or constraint violation is detected (e.g., the stage ends without meeting minimum volume or quality criteria) and (ii) $t_j^{\mathrm{warn}}$ denote the earliest time $t \leq t_j^{\mathrm{fail}}$ at which the architecture issues a WARNING or CRITICAL alert related to that failure mode.
The anticipation window for event $j$ is defined as
$$\Delta t_j^{\mathrm{anticip}} = t_j^{\mathrm{fail}} - t_j^{\mathrm{warn}}.$$
Over a set of failure events $\mathcal{J}$, the average anticipation window can be expressed as
$$\mathrm{AnticipationWindow} = \frac{1}{|\mathcal{J}|} \sum_{j \in \mathcal{J}} \Delta t_j^{\mathrm{anticip}}.$$
As with reaction time, this metric is defined to complete the evaluation framework, but its systematic quantification is outside the scope of the present case study campaign.
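For completeness, the per-event anticipation window reduces to a subtraction once the relevant warning timestamps are collected. The minimal sketch below uses illustrative timestamps and assumes that warnings have already been associated with a specific failure mode upstream.

```python
def anticipation_window(t_fail, warn_times):
    """Margin between the earliest relevant warning and the failure time.

    t_fail:     timestamp (s) at which the stage-level failure is detected
    warn_times: timestamps (s) of WARNING/CRITICAL alerts tied to the same
                failure mode
    """
    relevant = [t for t in warn_times if t <= t_fail]
    if not relevant:
        return None  # the failure was never anticipated
    return t_fail - min(relevant)

# Earliest warning at t=120 s, failure detected at t=300 s -> 180 s margin.
print(anticipation_window(300, [120, 180, 250]))  # 180
```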

5.5.5. State Specification Consistency

While stage specification compliance focuses on whether the process variables remain within their target ranges, a complementary aspect is whether the discrete supervisory labels (NORMAL, WARNING, CRITICAL) are coherent with those ranges. Intuitively, the architecture should report NORMAL when all monitored variables are within specification, and should only escalate to WARNING or CRITICAL when at least one relevant variable leaves its acceptable band.
We let $s$ be a CIP stage sampled at times $t = 1, \ldots, T_s$ and let $x_t$ and $C_s(x_t)$ be defined as above, with $C_s(x_t) = 1$ indicating that all variables are within specification and $C_s(x_t) = 0$ otherwise. We let $L_t \in \{\mathrm{NORMAL}, \mathrm{WARNING}, \mathrm{CRITICAL}\}$ denote the discrete label produced by the supervisory layer at time $t$. A sample is considered label-consistent if
$$\bigl(L_t = \mathrm{NORMAL} \wedge C_s(x_t) = 1\bigr) \quad \text{or} \quad \bigl(L_t \in \{\mathrm{WARNING}, \mathrm{CRITICAL}\} \wedge C_s(x_t) = 0\bigr).$$
Defining the indicator
$$D_s(t) = \begin{cases} 1, & \text{if the sample at time $t$ is label-consistent}, \\ 0, & \text{otherwise}, \end{cases}$$
the state specification consistency for stage $s$ over a run is given by
$$\Gamma_s = \frac{1}{T_s} \sum_{t=1}^{T_s} D_s(t).$$
Values of $\Gamma_s$ close to 1 indicate that the discrete supervisory state almost always matches the underlying process conditions, whereas lower values reveal mismatches between labels and actual operation. In the experiments, this metric is computed per stage and per execution, and then analyzed qualitatively across the three cases.
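The state specification consistency can be sketched as a per-sample agreement test between the discrete label and the specification check, as below (illustrative Python; label strings follow the NORMAL/WARNING/CRITICAL convention used above):

```python
def state_consistency(labels, in_spec_flags):
    """Fraction of samples whose discrete label agrees with the
    specification check C_s(x_t).

    labels:        list of 'NORMAL' / 'WARNING' / 'CRITICAL' strings
    in_spec_flags: list of 0/1 values of C_s(x_t)
    """
    def consistent(label, c):
        # NORMAL must coincide with in-spec; any escalation with out-of-spec.
        return (label == "NORMAL" and c == 1) or (label != "NORMAL" and c == 0)

    return sum(consistent(l, c) for l, c in zip(labels, in_spec_flags)) / len(labels)

labels  = ["NORMAL", "NORMAL", "WARNING", "CRITICAL", "NORMAL"]
c_flags = [1,        1,        0,         0,          0]  # last sample mismatched
print(state_consistency(labels, c_flags))  # 0.8
```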

5.5.6. Stability of State Labeling

In addition to being semantically coherent, the supervisory labels should exhibit a reasonable degree of temporal stability. Excessive oscillations between NORMAL, WARNING and CRITICAL states, especially in the absence of large changes in the process variables, can overload operators and reduce trust in the system.
For each stage $s$, we let $\{L_t\}_{t=1}^{T_s}$ denote the sequence of discrete labels along the execution and let $\Delta_s$ be the number of state changes within that sequence, i.e.,
$$\Delta_s = \sum_{t=2}^{T_s} \mathbb{I}\bigl(L_t \neq L_{t-1}\bigr),$$
where $\mathbb{I}(\cdot)$ is the indicator function. We let $T_s^{\min}$ be the duration of the stage in minutes. The stability of state labeling for stage $s$ is then quantified as
$$\Lambda_s = \frac{\Delta_s}{T_s^{\min}} \; [\text{changes/min}].$$
Low values of $\Lambda_s$ indicate stable supervisory behavior (few label transitions per minute), whereas high values highlight stages where the decision layer oscillates frequently between states, signaling either genuinely unstable conditions or overly sensitive thresholds in the supervision logic. In the evaluation, this metric is interpreted in conjunction with the underlying process trajectories for each case.
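Computing the stability metric from a logged label sequence is straightforward, as the following sketch shows (assuming 1 Hz labels; the example sequence is illustrative):

```python
def label_stability(labels, duration_min):
    """Label transitions per minute for a stage execution."""
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes / duration_min

# 1 Hz labels over a 2-minute stage with three transitions -> 1.5 changes/min.
labels = ["NORMAL"] * 40 + ["WARNING"] * 30 + ["NORMAL"] * 30 + ["WARNING"] * 20
print(label_stability(labels, duration_min=2.0))  # 1.5
```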

5.5.7. Consistency of Language-Based Outputs

To assess whether the language-based explanations remain faithful to the underlying enriched data, the evaluation performs spot checks of numerical summaries and counts reported by the conversational interface. For selected responses, the averages and counts stated in the explanation are compared against the statistics computed directly from the corresponding windows of enriched records. This provides qualitative evidence of data-grounded behavior and helps identify potential hallucinations or inconsistencies in the LLM layer.

5.6. Experimental CIP Runs

Figure 4 provides an overview of the three experimental CIP runs used for evaluation. Each run comprises the same sequence of stages (pre-rinse, alkaline, intermediate rinse, sanitizing and final rinse), executed by the existing CIP programmes with their configured timings. The figure illustrates the relative duration of each stage and shows that, from a timing perspective, all runs follow the expected pattern, with alkaline and sanitizing stages occupying the largest fraction of the cycle. The three runs thus provide complementary views of the same programme executed under different operational conditions, highlighting how the architecture’s supervisory and diagnostic outputs evolve from nominal baselines to preventive warnings and more pronounced alert patterns.

6. Results

This section reports the behavior of the proposed decision-support architecture over three representative CIP executions, using the evaluation metrics defined in Section 5.5. The analysis is organized by CIP stage (alkaline, sanitizing and final rinse) and focuses on stage-level compliance, label coherence, temporal stability, temporal response of alerts and notifications, and the consistency of language-based explanations with the underlying data.
As summarized in Figure 4, all evaluated runs follow the same five-stage structure with comparable stage durations. The subsequent analysis concentrates on how the decision-support architecture behaves within these fixed programmes, under nominal conditions (baseline equipment health), preventive warning scenarios (subtle deviations indicating emerging maintenance needs) and diagnostic alert regimes (more pronounced patterns requiring prioritized attention), rather than on extrapolating statistical properties to large cohorts of executions.

6.1. Stage-Level Performance

Table 9 summarizes the main evaluation metrics for each CIP stage, reporting mean and standard deviation across the three executions to provide a compact overview. Given the small number of runs and their deliberately contrasting conditions, these aggregates are interpreted qualitatively, as descriptors of the examined cases rather than as statistically representative estimates.
Table 10 and Table 11 detail the stage specification compliance per execution and the corresponding aggregates. In the sanitizing stage, all three runs operate entirely within their predefined bands, yielding a compliance of 1.00. In contrast, the alkaline stage exhibits markedly different behaviors across the three cases: CIP 1 (nominal baseline) remains fully within specification, CIP 2 (preventive warnings) spends 75% of the time within acceptable ranges with occasional excursions signaling pump or temperature regulation drift, and CIP 3 (diagnostic alerts) spends virtually no time inside the prescribed bands due to sustained flow and temperature deviations requiring prioritized maintenance review.
The alkaline spread is therefore intentional: it exposes the decision-support layer to both optimal equipment performance and degraded operational regimes that remain within regulatory safety margins but signal maintenance opportunities. This makes the alkaline stage particularly useful for evaluating whether the supervisory logic and explanations remain coherent when process conditions transition from baseline health to preventive and diagnostic alert patterns.
Beyond the raw time within specification, the state specification consistency metric $\Gamma_s$ captures how often the discrete supervisory labels agree with the process conditions. For the alkaline stages across the three executions, $\Gamma_s$ attains a mean value of approximately 0.98 with a small dispersion, while the sanitizing stages achieve $\Gamma_s = 1.00$. In other words, even when the alkaline stage spends little or no time within its specification (as in CIP 3), the supervisory layer almost never reports NORMAL when key variables are outside their prescribed bands, nor WARNING/CRITICAL when they remain within target ranges. This reinforces the internal coherence of the rule-based monitoring and labeling logic across contrasting equipment health states and operating regimes.
The stability metric $\Lambda_s$ further characterizes how these labels evolve over time. For pre-rinse, alkaline and sanitizing stages, the number of label transitions per minute is either zero or very small, with alkaline stages exhibiting on the order of 0.03 changes per minute. This suggests that, under nominal or preventive warning conditions, the supervisory state does not oscillate excessively and remains easy for operators to interpret as meaningful diagnostic trends rather than spurious noise. An interesting outlier appears in the final-rinse stage of CIP 3, where $\Lambda_s$ reaches the order of 10 changes per minute, yielding a stage-level mean of roughly 3.3 changes per minute with a large standard deviation. This behavior corresponds to the deliberately stressed diagnostic scenario with highly variable conditions, in which the supervision layer reacts aggressively as the process repeatedly crosses specification boundaries. While this confirms that supervision remains responsive, it also points to the need for additional hysteresis or smoothing mechanisms in future iterations, to avoid overwhelming operators with rapid state changes in known unstable regimes.
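One such smoothing mechanism could be a simple hold-time hysteresis that commits a new supervisory label only after it has persisted for a minimum number of samples. The sketch below is a hypothetical illustration of this idea (assuming 1 Hz labels and an illustrative hold time), not part of the deployed system.

```python
def debounce_labels(labels, hold_s=5):
    """Only commit a new supervisory label after it has persisted for
    hold_s consecutive samples, suppressing rapid oscillations."""
    smoothed, current, candidate, streak = [], labels[0], labels[0], 0
    for label in labels:
        if label == current:
            candidate, streak = label, 0       # back to the committed state
        elif label == candidate:
            streak += 1
            if streak >= hold_s:               # persisted long enough: commit
                current, streak = label, 0
        else:
            candidate, streak = label, 1       # new candidate state
        smoothed.append(current)
    return smoothed

# A 2-sample WARNING blip is suppressed; a sustained WARNING is kept.
raw = ["NORMAL"] * 5 + ["WARNING"] * 2 + ["NORMAL"] * 5 + ["WARNING"] * 8
print(debounce_labels(raw, hold_s=3)[-1])  # WARNING
```

The hold time trades detection latency against label stability, so it would need to be tuned per stage against the reaction-time requirements discussed above.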
Figure 5 illustrates these patterns per execution and per stage. Even in the alkaline run with virtually no time within specification (CIP 3, diagnostic alerts), the state specification consistency remains close to one, whereas label instability is confined to the stressed final-rinse stage of the same diagnostic execution. This case-by-case view supports the interpretation that the architecture preserves coherent supervisory states across nominal baseline, preventive warning and diagnostic alert regimes, and that observed instabilities are localized, interpretable and correspond to genuinely unstable process conditions rather than algorithmic artifacts.

6.2. Alert and Notification Timing

The architecture not only classifies supervisory states in real time but also timestamps alert and notification events, which enables an initial characterization of temporal behavior. Among the three representative executions, only CIP 3 (diagnostic alerts) exhibits sustained critical episodes according to the discrete supervisory state, so quantitative reaction times can only be meaningfully computed for that case.
Figure 6 summarizes the median reaction time between the onset of a critical episode and the corresponding alert for each execution. In CIP 3, the median reaction time is approximately 35–36 s. For the considered CIP application, this delay is acceptable: it reflects end-to-end supervisory filtering over slow hydraulic dynamics, rather than the latency of LLM inference, and still leaves sufficient time for corrective actions within the process safety margins while avoiding spurious alerts on transient sensor spikes.
In practice, this reaction time incorporates both the time needed to accumulate enough evidence from noisy process data and the filtering performed by the supervisory logic to ensure that alerts correspond to persistent deviations rather than acceptable process variability. As a result, critical alerts are raised only for sustained equipment degradation or operational stress patterns, and the LLM-based supervisor can generate coherent, context-aware explanations that support maintenance prioritization, instead of reacting to short-lived fluctuations.
The same framework was used to inspect the anticipation window of warning notifications and the LLM response latency. In CIP 2 (preventive warnings), warnings appeared tens of seconds before any escalation to critical states (when such escalation occurred), providing early visibility of deteriorating conditions and actionable lead time for preventive interventions. End-to-end LLM response times for diagnostic queries remained in the order of a few seconds across all executions. Although these measurements are limited to a small number of episodes, they indicate that language-based explanations do not become a bottleneck in the supervisory loop and that the architecture can provide operators with timely, actionable diagnostic information suitable for maintenance planning.

6.3. Diagnostic Pattern Characterization Across Executions

To illustrate how the architecture differentiates nominal, preventive and diagnostic regimes, Table 12 summarizes the distribution of supervisory states and alert patterns across the three representative executions.
CIP 1 serves as the reference baseline: all process variables remain within nominal bands, no warnings are issued, and the conversational interface confirms normal operation. CIP 2 exhibits intermittent WARNING-level diagnostics in the alkaline stage (e.g., slightly reduced flow, minor temperature excursions) that do not compromise product quality or regulatory compliance but indicate emerging equipment drift. These warnings complete successfully without escalating to critical states, and the architecture provides natural-language summaries highlighting the trend (e.g., “flow is 10% below optimal across the last three cycles, consider pump inspection”). CIP 3 presents sustained warnings and clusters of CRITICAL labels in alkaline and final-rinse stages, corresponding to more pronounced flow and conductivity deviations. While the execution still completes within regulatory bounds, the alert density and LLM-generated diagnostics signal equipment conditions that warrant prioritized maintenance review to prevent unplanned downtime.
This progression demonstrates that the architecture can distinguish between normal process variability (CIP 1), conditions that benefit from scheduled preventive maintenance (CIP 2), and patterns that call for more urgent diagnostic attention (CIP 3), addressing the supervisory challenge outlined in the introduction: interpreting operational signals rather than merely detecting catastrophic failures.

6.4. Semantic Behavior and Language-Based Outputs

Beyond numeric metrics, the logs capture the natural-language explanations and summaries generated by the conversational interface. Spot checks were performed to verify consistency between language-based outputs and enriched data, focusing on numerical summaries (average temperatures, flow rates, warning counts) reported by the LLM during diagnostic queries. When the corresponding windows of enriched records were extracted from the buffer, the computed values matched those stated in the explanations within small relative tolerances, with median errors below 3% for the audited samples.
For instance, during the alkaline stage of CIP 3 (diagnostic alerts), the assistant reported average values for temperature, pH and flow over a recent time window (e.g., temperature around 73 °C and flow close to 1.0 L/s), as well as the number of warning and critical samples. Representative examples of these checks are summarized in Table 13, which illustrates the close match between the temperatures, flows and warning counts reported by the conversational interface and those computed directly from the enriched logs, confirming that the interface grounds its summaries in actual buffered data rather than hallucinating numerical figures.
Beyond answering isolated diagnostic queries, the conversational layer can also generate compact reports that summarize recent CIP behavior based on the enriched data streams, including trend analysis across multiple cycles. In the present experiments, such reports were compared against statistics computed directly from the logs, checking that key figures such as average temperatures, flows, stage durations and warning counts remained within small tolerances. A simple fidelity indicator can be defined as the proportion of numeric quantities in a report that fall within a predefined tolerance (for example, less than 1% relative error or within a small absolute band) with respect to the values recomputed from the logs. Under this indicator, all audited reports in the current case study achieved perfect or near-perfect fidelity for the checked quantities, supporting the use of LLM-generated reports as trustworthy, data-grounded views of recent CIP operation suitable for maintenance decision-making.
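Such a fidelity indicator can be sketched as follows (illustrative Python; the quantity names, tolerance and example values are hypothetical, not audited figures from the study):

```python
def report_fidelity(reported, recomputed, rel_tol=0.01):
    """Proportion of numeric quantities in an LLM-generated report that fall
    within a relative tolerance of the values recomputed from the logs.

    reported / recomputed: dicts keyed by quantity name (hypothetical keys).
    """
    def within(a, b):
        return abs(a - b) <= rel_tol * max(abs(b), 1e-9)

    keys = reported.keys() & recomputed.keys()
    return sum(within(reported[k], recomputed[k]) for k in keys) / len(keys)

reported   = {"avg_temp_C": 73.0, "avg_flow_Lps": 1.00, "warning_count": 42}
recomputed = {"avg_temp_C": 73.4, "avg_flow_Lps": 1.02, "warning_count": 42}
print(report_fidelity(reported, recomputed))  # 2 of 3 quantities within tolerance
```

Per-quantity absolute bands (e.g., for counts) could be handled analogously by swapping the tolerance test.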
Similarly, in another interaction during CIP 3, the assistant answered that the current alkaline stage had accumulated dozens of warning states and explicitly reported the number of warnings observed up to that point. Counting the samples flagged as warnings and critical in the corresponding enriched log around the response timestamp yielded counts that were consistent with the reported figures within the temporal window under inspection. These checks, combined with the high state specification consistency and the low rate of spurious label transitions in nominal and preventive regimes, provide convergent evidence that the multi-agent architecture not only maintains coherent discrete supervision but also exposes that supervision through language in a way that remains faithful to the underlying data and actionable for preventive maintenance planning.

6.5. Operational Supervision Capabilities

Beyond numeric performance metrics, the architecture changes how CIP supervision and decision-making can be carried out in real time. Table 14 contrasts typical capabilities of traditional CIP supervision with those provided by the proposed multi-agent architecture.
Across the three evaluated runs, the conversational interface handled several dozen real-time queries, including requests to inspect ongoing stages, generate charts, compute numerical diagnostics and compare recent executions against historical baselines. Each diagnostic response was based on hundreds to thousands of recent enriched records, effectively externalizing ad hoc analysis that would otherwise require manual navigation of HMI screens, offline tools and cross-referencing of maintenance logs. In combination with the stage-level metrics, alert timing observations, diagnostic pattern characterization and semantic consistency checks, these interactions suggest that the architecture not only preserves a coherent and stable supervisory view of the CIP process but also makes that view more accessible, actionable and maintenance-oriented for human operators in real time.

7. Discussion

Existing LLM-based assistants for industrial systems largely focus on providing conversational access to documentation, historical databases or SCADA/IoT tags in real time, often evaluating performance in terms of question answering or task completion rates. While this line of work has demonstrated that natural-language interfaces can reduce the effort required to retrieve information, it typically leaves the underlying supervisory structure unchanged and rarely quantifies how the assistant behaves with respect to process specifications, stage-level quality metrics or maintenance-oriented diagnostic patterns.
In contrast, the architecture presented here targets a concrete CIP process and evaluates a decision-support layer that sits on top of existing programmes, combining enriched data, rule-based supervision and language-based interaction. Through three representative CIP executions spanning nominal baseline (CIP 1), preventive warning (CIP 2) and diagnostic alert (CIP 3) conditions, the proposed metrics show that the architecture maintains high time within specification for sanitizing stages, that its discrete states are both coherent with the process ranges and temporally stable in most stages and that its language-based summaries remain consistent with the numerical logs during real CIP runs. Furthermore, the case study analysis demonstrates that the architecture can differentiate between normal equipment health, emerging maintenance needs (e.g., pump wear, boiler drift) and conditions requiring prioritized diagnostic review, addressing the supervisory challenge of interpreting operational signals rather than merely detecting catastrophic failures.
From an architectural perspective, the decision-support layer reuses component-based, microservice-style principles explored in previous work for industrial cyber–physical systems and CIP supervision, but extends them with a stronger focus on diagnostic interpretation, preventive maintenance support and natural-language explanation. The agents and reasoning services operate as loosely coupled services that can be replicated and orchestrated across multiple CIP circuits, and the case study evaluation demonstrates that such an architecture can maintain coherent and stable supervisory states, faithful language-based summaries and actionable diagnostic alerts when applied to real plant data. The observed behavior across the three executions—ranging from fully nominal to sustained diagnostic alerts—suggests that the architecture is robust to diverse operational regimes and provides meaningful support for maintenance decision-making.
While several works advocate modular or microservice-based architectures for industrial CPS, and some recent systems allow LLM-based assistants to invoke multiple tools or services, these approaches typically remain agnostic to specific batch processes and rarely treat CIP stages as first-class entities in the agent design. Here, the architecture combines an orchestrator with CIP-aware agents that load their context according to the active programme and stage, enabling new decision-support agents (LLM-based or otherwise) to be added incrementally without modifying the underlying CIP control logic or disrupting existing automation. This process-centric context management enables agents to track equipment health trends across multiple cycles and produce diagnostics that are aligned with maintenance planning horizons, rather than operating as isolated, per-query chatbots.
Most existing LLM-based assistants for industrial environments load context on demand by retrieving tags, time series or documents relevant to a given user query, operating over loosely structured data and remaining largely agnostic to the lifecycle of specific batch processes. In contrast, the agents in the proposed architecture load CIP-aware contexts that explicitly encode the active programme, stage, enriched variables, rule-based specifications and historical alert patterns, and then operate continuously within that context to produce supervisory states, alerts and explanations. This process-centric way of managing context goes beyond per-query retrieval and aligns the behavior of the agents with the execution of real CIP runs and the accumulation of diagnostic trends over time, making stage-level supervision, cross-cycle trend analysis and preventive maintenance recommendations more consistent and actionable.
An important aspect of this work is that the architecture is evaluated in a real production environment, rather than in a laboratory testbed or synthetic simulation. The reported CIP runs correspond to actual executions at the VivaWild Beverages plant, where the decision-support agents operate on the same data streams and timing as the plant’s automation system. This increases the practical relevance of the observed behavior of the supervisory states, alerts, diagnostic patterns and explanations, and shows that the architecture can be deployed alongside existing PLC/SCADA infrastructure without disrupting normal operation or compromising safety-critical control logic.

7.1. Comparison with Related Work

Table 15 positions the present work relative to recent contributions on LLM-based industrial supervision. Unlike generic chatbots over documentation or offline code-generation tools, the proposed architecture integrates real-time analytics, deterministic supervision and conversational assistance in a deployed CIP environment, providing quantifiable process-level metrics and demonstrating diagnostic pattern differentiation across nominal, preventive and critical regimes.
The key distinction is that the proposed system operates continuously on live CIP executions, maintains process-specific contexts per cycle, and combines deterministic safety-critical alerts with flexible LLM-driven analytics, quantifying both technical performance (compliance, consistency, stability) and diagnostic capability (differentiation of nominal, preventive and critical alert patterns). This enables direct comparison with baseline SCADA supervision in terms of operator workload, diagnostic coverage and maintenance planning support, which generic LLM assistants do not address.

7.2. Relation to Commercial CIP Supervision Systems

Commercial CIP supervision and recipe management platforms, such as Siemens Braumat [43] or Rockwell Automation FactoryTalk Batch [44], provide robust HMI, recipe configuration and compliance reporting. These systems excel at deterministic control, audit trails and regulatory documentation, but typically offer limited support for natural-language queries, root-cause exploration, cross-cycle trend analysis and preventive maintenance recommendations during runtime. Operators must rely on predefined screens, fixed alarm thresholds and manual data export for deeper diagnostics and maintenance planning.
The proposed architecture does not replace such commercial platforms; instead, it complements them by adding a conversational analytics and diagnostic interpretation layer on top of the existing CIP control logic. The deterministic supervision and LLM-based analytics agents subscribe to the same data streams that feed the SCADA/HMI but provide richer explanations, on-demand statistical views, cross-execution trend queries and maintenance-oriented diagnostics without requiring modifications to the PLC programs or recipe structures. This hybrid approach preserves the determinism and certification status of the underlying control system while extending its decision-support capabilities through AI-assisted interfaces that support preventive maintenance planning and equipment health monitoring.

7.3. LLM Safety and Hallucination Mitigation

A key architectural decision is the strict separation between deterministic supervision (rule-based state estimation, hard safety alerts) and LLM-driven analytics (exploratory queries, narrative explanations, trend analysis). Safety-critical logic remains entirely within the Deterministic Supervisor and CIP Master Controller agents, which operate on fixed rule sets and do not depend on LLM outputs. The LLM-based Real-time Monitoring Service is confined to an advisory role: it receives enriched data (already validated by deterministic agents), computes aggregate statistics, generates natural-language summaries and highlights diagnostic trends, but it does not issue control commands or override alarms.
This design ensures that even if the LLM hallucinates or produces incorrect summaries, the process remains safe and the operator continues to receive deterministic alerts through the structured notification panel. The spot-check evaluation (Section 6.4) shows median absolute errors below 3% for key variables, suggesting that hallucinations are rare in the deployed configuration. The architecture’s three-layer mitigation strategy (deterministic enrichment, compact prompts, safety bypass) preserves process safety and regulatory compliance. Future work will implement automated fidelity audits comparing all LLM outputs against ground-truth logs to quantify hallucination frequency across the full six-month deployment (thousands of queries).
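The separation between deterministic supervision and advisory LLM output can be illustrated with a minimal sketch. All names, thresholds and the flow variable are hypothetical, not the deployed agent API; the point is that the rule-based alert path never depends on the LLM's text.

```python
# Illustrative sketch of the deterministic/advisory separation described
# above; Alert, thresholds and field names are hypothetical.
from dataclasses import dataclass

@dataclass
class Alert:
    level: str      # "critical" or "warning"
    message: str

def deterministic_supervisor(flow_lpm: float, min_flow: float = 80.0) -> list[Alert]:
    """Fixed rule set: fires regardless of any LLM output (safety bypass)."""
    alerts = []
    if flow_lpm < min_flow:
        alerts.append(Alert("critical",
                            f"flow {flow_lpm:.1f} L/min below minimum {min_flow}"))
    return alerts

def llm_advisory_summary(enriched: dict) -> str:
    """Advisory layer: summarizes validated data but issues no control
    commands and cannot suppress alerts; a hallucinated summary would
    leave the deterministic alert list untouched."""
    return f"Mean flow {enriched['mean_flow']:.1f} L/min over last window."

enriched = {"mean_flow": 72.4}
alerts = deterministic_supervisor(enriched["mean_flow"])   # fired deterministically
summary = llm_advisory_summary(enriched)                   # advisory text only
```

Even if `llm_advisory_summary` returned nonsense, `alerts` would still reach the operator through the structured notification panel.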

7.4. Evaluation Philosophy: Diagnostic Capability vs. Statistical Generalization

The case-study approach adopted in this work reflects the operational reality of well-optimized industrial plants: CIP failures are rare, and most executions complete successfully by design. In such environments, the value of a decision-support architecture lies not in detecting catastrophic faults—which traditional threshold-based alarms handle effectively—but in identifying subtle, longitudinal equipment degradation patterns in executions that still meet regulatory specifications.
Consider a scenario where flow rates decline 2% per cycle over five executions due to gradual pump wear. Each individual CIP completes successfully within specifications, generating no critical alarms. However, without trend analysis across cycles, maintenance is deferred until flow falls below the regulatory minimum, triggering an unplanned shutdown and reactive maintenance. The proposed architecture addresses this gap by issuing preventive warnings when flow enters the lower tolerance band (e.g., 10% below target but still above minimum), clustering these warnings across stages and cycles, and providing natural-language diagnostic reports that connect the observed pattern to probable equipment causes (pump wear, boiler efficiency degradation, dosing system calibration drift).
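The degradation scenario above can be sketched as a simple tolerance-band classifier applied across consecutive cycles. The target, minimum and 10% warning band below are illustrative values, not the plant's configured specifications.

```python
# Sketch of the tolerance-band logic described above; thresholds are
# illustrative, not the plant's configured values.
def classify_flow(flow: float, target: float = 100.0,
                  minimum: float = 85.0, warn_band: float = 0.10) -> str:
    """Return 'ok', 'preventive_warning', or 'critical'."""
    if flow < minimum:
        return "critical"                        # regulatory minimum breached
    if flow < target * (1.0 - warn_band):        # below the 10% tolerance band
        return "preventive_warning"              # but still above the minimum
    return "ok"

# Eight consecutive cycles with ~2% decline per cycle due to pump wear:
flows = [100.0 * (0.98 ** k) for k in range(8)]
states = [classify_flow(f) for f in flows]
```

No individual cycle is critical, yet the trailing cycles accumulate preventive warnings, which is exactly the cross-cycle pattern the architecture clusters into a maintenance recommendation.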
From this perspective, analyzing 100 consecutive nominal CIP executions would provide strong evidence of system stability and low false-alarm rates but no evidence of diagnostic capability. The three-case evaluation presented here deliberately spans the operational spectrum—nominal baseline (CIP 1, no warnings), preventive warning scenarios (CIP 2, intermittent warnings indicating emerging drift) and diagnostic alert regimes (CIP 3, sustained warning clusters requiring prioritized review)—to demonstrate that the architecture can differentiate these regimes and provide actionable maintenance insights. This case-study validation approach is appropriate for establishing architectural feasibility and diagnostic behavior; longitudinal studies correlating alert patterns with confirmed equipment failures and quantifying maintenance cost reduction are planned as future work once sufficient operational history and CMMS integration are available.

7.5. Limitations and Future Work

This paper presented a real-time decision-support architecture for industrial batch processes, instantiated and evaluated on a CIP use case in an operational beverage plant. The architecture combines enriched data streams, rule-based supervision and LLM-based interaction to support diagnostic interpretation and preventive maintenance planning. Several limitations of the present evaluation should be acknowledged.
First, the detailed evaluation analyzed three representative CIP executions—selected from 24 runs monitored during a six-month deployment—spanning nominal baseline conditions (CIP 1), preventive warning scenarios (CIP 2) and diagnostic alert regimes (CIP 3). While this span demonstrates the architecture’s ability to differentiate equipment health states and operational patterns meaningful for maintenance decision-making, it does not constitute a statistically representative sample of operating conditions or failure modes.
Second, the temporal analysis of alerts and notifications focuses on median reaction and anticipation times for a limited number of episodes, rather than on full distributions across a broad set of events and operating scenarios. A more comprehensive temporal characterization requires a larger pool of annotated anomalies, warning episodes and confirmed maintenance interventions.
Third, the assessment of language-based explanations and reports relies on targeted spot checks and a simple fidelity criterion comparing selected numerical quantities against log-derived values, instead of a systematic, large-scale audit of LLM outputs. More extensive semantic evaluation protocols, including automated consistency checks, longitudinal trend fidelity assessments and user studies on trust, usability and maintenance decision quality, are needed to fully characterize the behavior of the conversational layer in production environments and its impact on operator workload and equipment uptime.
A related limitation concerns the systematic quantification of LLM hallucination rates and semantic fidelity across all production queries. The spot-check evaluation (Section 6.4) demonstrates low numerical error (median absolute deviations below 2–3% for key variables) for the audited queries, but systematic quantification of hallucination rates across the thousands of queries generated during the six-month deployment remains future work. The current architecture mitigates hallucination risk through three complementary mechanisms: (i) deterministic enrichment provides validated aggregate statistics and structured diagnostics before LLM inference, reducing the probability of confabulation; (ii) compact prompts with explicit numerical data anchor the LLM’s responses to ground-truth sensor values; and (iii) safety-critical alerts bypass the LLM entirely and are issued by deterministic agents. Future work will implement automated consistency checks comparing all LLM summaries against ground-truth logs, flagging responses with deviations above a configurable threshold (e.g., 5% relative error) for operator review and refinement of prompt templates.
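The automated consistency check outlined above could take the following shape: extract each numeric value quoted in an LLM summary, compare it with the log-derived ground truth, and flag deviations above the configurable threshold. The function name, summary format and 5% default are illustrative assumptions.

```python
# Hypothetical sketch of the automated consistency check outlined above;
# the summary format, names and 5% threshold are illustrative.
import re

def audit_summary(summary: str, ground_truth: dict[str, float],
                  max_rel_error: float = 0.05) -> list[str]:
    """Return the names of quantities whose quoted value deviates by more
    than `max_rel_error` from the log-derived value (or is missing)."""
    flagged = []
    for name, true_val in ground_truth.items():
        m = re.search(rf"{name}\s*=\s*([0-9.]+)", summary)
        if m is None:
            flagged.append(name)          # quantity missing from the summary
            continue
        quoted = float(m.group(1))
        if abs(quoted - true_val) / abs(true_val) > max_rel_error:
            flagged.append(name)          # deviation above threshold
    return flagged

summary = "mean_temp = 74.8 over the sanitizing stage; mean_flow = 96.0"
truth = {"mean_temp": 75.1, "mean_flow": 104.2}   # mean_flow off by ~8%
flags = audit_summary(summary, truth)
```

Flagged responses would then be routed to operator review, as described in the text.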
Fourth, while the architecture successfully differentiates nominal, preventive and diagnostic alert patterns within the three evaluated executions, quantifying the predictive value of these patterns for actual equipment failures, unplanned downtime or maintenance cost reduction requires longitudinal correlation with maintenance records, work orders and equipment replacement logs. Future work will integrate the decision-support layer with computerized maintenance management systems (CMMS) to track alert-to-failure lead times, assess the accuracy of preventive recommendations and measure the impact on overall equipment effectiveness (OEE).
Finally, the current architecture is exercised in an advisory role without closing the loop to automatic control actions, so its impact on overall plant performance, safety margins and operator workload remains to be quantified through controlled studies. Controlled A/B comparisons in which operators alternate between traditional SCADA-based supervision and the proposed conversational system, measuring detection time, time to diagnosis, maintenance planning accuracy, false-alarm acknowledgment rate, workload and operator satisfaction, are needed to quantify the added value in operational terms. Future work will also consider longitudinal deployments covering multiple products and CIP programmes, as well as integration with alternative decision-support or optimization strategies.

8. Conclusions

This paper presented a real-time decision-support architecture for industrial batch processes that combines enriched data streams, rule-based supervision and LLM-based interaction to support diagnostic interpretation and preventive maintenance planning. The architecture was instantiated and evaluated in a Clean-in-Place (CIP) system at an operational beverage plant, where it operated continuously for six months without modifications to existing PLC/SCADA infrastructure.

8.1. Demonstrated Contributions in the CIP Deployment

The architecture was evaluated through detailed analysis of three representative CIP executions—selected from 24 runs monitored during the six-month deployment—spanning nominal baseline conditions (CIP 1, no warnings), preventive warning scenarios (CIP 2, intermittent warnings indicating emerging equipment drift such as pump wear or boiler efficiency degradation) and diagnostic alert regimes (CIP 3, sustained warning clusters and critical alerts signaling conditions requiring prioritized maintenance review). This evaluation demonstrates the following technical achievements within the CIP deployment context.

8.1.1. Non-Invasive Integration with Brownfield Infrastructure

The architecture operates as an overlay supervision layer, consuming sensor data via passive monitoring (MODBUS/Serial protocols) and providing safety-validated command injection without requiring PLC firmware modifications, control logic rewriting, or recertification processes. The decision-support layer operates on top of existing CIP programmes without degrading the underlying supervisory logic, providing coherent, actionable diagnostic patterns aligned with equipment health monitoring and maintenance planning needs. This non-invasive integration strategy enabled deployment in a facility with decades-old automation infrastructure, demonstrating practical viability in settings where wholesale control system replacement is economically or operationally infeasible.

8.1.2. Operational Regime Differentiation and Diagnostic Pattern Recognition

The case study analysis demonstrated that the architecture successfully differentiates between nominal equipment operation (CIP 1, no warnings), emerging maintenance needs indicated by intermittent preventive warnings (CIP 2, flow or temperature drift suggesting pump wear or boiler efficiency degradation) and conditions requiring prioritized diagnostic review signaled by sustained warning clusters and critical alerts (CIP 3). This pattern differentiation addresses the supervisory challenge identified in the introduction: interpreting operational signals and subtle equipment degradation trends rather than merely detecting catastrophic failures.
The proposed metrics showed that the architecture maintains a high time within specification for sanitizing stages across all evaluated runs (100% compliance), and that the discrete supervisory states are both coherent with the process ranges (achieving a state-specification consistency Γ_s ≥ 0.98 across alkaline and sanitizing stages) and temporally stable under nominal and preventive conditions (with label transition rates below 0.03 changes per minute in most stages). The observed median reaction time of approximately 35–36 s for critical alerts in CIP 3, combined with anticipation windows of tens of seconds for preventive warnings in CIP 2, provides operators with actionable lead time for maintenance interventions while avoiding spurious alarms on transient sensor fluctuations.
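Two of the metrics above can be sketched in a few lines; the exact deployed definitions may differ, and the sample data are illustrative.

```python
# Minimal sketches of two of the metrics above; the deployed definitions
# may differ, and the sample data are illustrative.
def time_within_spec(samples: list[float], lo: float, hi: float) -> float:
    """Fraction of samples inside the specification band [lo, hi]."""
    return sum(lo <= s <= hi for s in samples) / len(samples)

def transition_rate(labels: list[str], duration_min: float) -> float:
    """Supervisory-state label changes per minute over a stage."""
    changes = sum(a != b for a, b in zip(labels, labels[1:]))
    return changes / duration_min

temps = [75.2, 75.0, 74.9, 75.3, 75.1]                  # sanitizing stage samples
compliance = time_within_spec(temps, 74.0, 76.0)         # fully within spec
rate = transition_rate(["OK", "OK", "WARN", "WARN", "OK"], duration_min=10.0)
```

A low transition rate indicates stable supervisory labels, while spikes in the rate correspond to the label instability discussed for the stressed diagnostic scenario.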
Label instability was confined to the deliberately stressed diagnostic scenario (CIP 3), where additional hysteresis or smoothing mechanisms could be introduced in future iterations to reduce operator alert fatigue while preserving diagnostic sensitivity.

8.1.3. Grounded Conversational Analytics with Hallucination Mitigation

From an operational perspective, the architecture moves part of the real-time analytical burden from the operator to the agents, enabling more proactive, trend-oriented supervision. Instead of manually correlating alarms, trends and stage timings across multiple HMI screens and offline reports, operators can issue stage-specific queries, request cross-cycle trend analyses and receive numerically grounded summaries that are consistent with the enriched logs and support preventive maintenance decisions.
The spot-check evaluation demonstrated close agreement between values reported in language-based explanations and those recomputed from the logs (median absolute errors below 2–3% for key variables). Together with the observed reaction and anticipation times for alerts and notifications, this suggests that the LLM-based components remain faithful to the data, produced no observable hallucinations in the audited queries, and do not become a bottleneck in the supervisory loop. This grounding mechanism, implemented through retrieval-augmented generation (RAG) with enriched time-series data and execution summaries, successfully supported operator diagnostic queries without generating factually incorrect recommendations that would erode trust or trigger inappropriate interventions.
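The numerically anchored prompting behind this grounding can be sketched as follows. The function name, field names and template are hypothetical, not the deployed prompt format; the point is that deterministic aggregates are embedded verbatim so the model answers from validated values rather than free recall.

```python
# Hypothetical sketch of a compact, numerically anchored prompt; field
# names and template are illustrative, not the deployed format.
def build_grounded_prompt(stage: str, aggregates: dict[str, float],
                          query: str) -> str:
    """Embed deterministic aggregates directly in the prompt so the LLM's
    answer is anchored to validated values."""
    facts = "; ".join(f"{k}={v:.1f}" for k, v in sorted(aggregates.items()))
    return (
        f"Stage: {stage}\n"
        f"Validated aggregates: {facts}\n"
        f"Answer using ONLY the values above.\n"
        f"Operator query: {query}"
    )

prompt = build_grounded_prompt(
    "sanitizing",
    {"mean_temp_c": 75.1, "mean_flow_lpm": 96.3},
    "Was the temperature stable in this stage?",
)
```

Keeping the prompt compact also bounds inference latency, supporting the observation that the conversational layer does not become a supervisory bottleneck.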

8.1.4. Process-Aware Agent Architecture and Microservice Modularity

The work illustrates how process-aware agents and LLM-based interfaces can be embedded into existing cyber–physical architectures using modular, microservice-style components. The process-centric approach to context and supervision—where agents load CIP-aware contexts aligned with the active programme, stage and equipment health history—provides a path to extend the architecture to other batch operations and decision-support tasks, such as comparative analysis across runs, cross-cycle trend quantification, what-if evaluations, and integration with computerized maintenance management systems (CMMS), while preserving the separation between established automation logic and higher-level, human-facing decision support.

8.2. Limitations and Scope of the Evaluation

While the architectural pattern is designed to be process-agnostic (Section 3, Table 3), the empirical evaluation presented in this work is limited to a single process type (CIP) at a single industrial site. The following limitations should be considered when interpreting the results.

8.2.1. Limited Sample Size and Process Diversity

The detailed evaluation is based on three representative execution scenarios (CIP 1, CIP 2, CIP 3) selected from a larger corpus of 24 CIP cycles monitored over six months. While these scenarios span the relevant operational regimes (nominal, preventive, diagnostic), they do not constitute a statistically comprehensive sample across all possible equipment configurations, chemical formulations, cleaning protocols, or failure modes. The observed system behavior is validated within the specific CIP deployment context, but may not generalize to:
  • Other CIP configurations with different circuit topologies, tank geometries, or sanitization chemistry (e.g., peracetic acid-based protocols vs. the alkaline/sanitizer sequence evaluated here).
  • Fundamentally different batch processes with distinct dynamics, sensor modalities, and operational constraints (fermentation with pH/dissolved oxygen control, distillation with reflux ratio management, chemical synthesis with exothermic reaction monitoring).
  • Industrial contexts with more stringent safety requirements (pharmaceutical GMP facilities, nuclear applications) where formal verification of deterministic supervision logic may be required beyond the rule-based validation implemented in the Deterministic Supervisor.
Validation across a larger sample of CIP executions (spanning the full 24-run cohort and subsequent production cycles) and extension to other batch process types at the same facility are planned as future work to assess generalizability and quantify configuration effort for new deployments.

8.2.2. Absence of Comparative Baselines and Ablation Studies

The deployment context—a brownfield industrial facility with no pre-existing LLM-based decision support—precluded controlled A/B comparisons with alternative architectures or ablation studies isolating individual component contributions. The evaluation demonstrates that the proposed system functions as intended in production use (maintaining 100% compliance, differentiating operational regimes, grounding LLM outputs with median absolute error below 2–3%) but does not quantitatively compare its effectiveness against:
  • Traditional supervision approaches (manual operator correlation of alarms and trends without LLM assistance) in terms of time-to-diagnosis, maintenance planning accuracy, or operator cognitive load.
  • Alternative architectural choices such as monolithic LLM deployment (without temporal domain separation), direct consumption of raw sensor data (without signal enrichment), or ungrounded conversational interfaces (without RAG-based retrieval).
  • Different LLM architectures, parameter scales (7B vs. 70B+ models), or fine-tuning strategies (domain-adapted models vs. general-purpose language models with prompt engineering).
Controlled experimentation is often incompatible with production continuity requirements, regulatory constraints, and the economic realities of brownfield deployment. The presented work should be interpreted as a demonstration of technical feasibility and operational viability in one industrial context, rather than as definitive proof of superior performance relative to alternative designs. Quantitative comparative evaluation is identified as a priority for future research.

8.2.3. Qualitative Operator Feedback and Human Factors Evaluation

Operator acceptance and perceived utility are assessed qualitatively through informal interviews and observed usage patterns over the six-month deployment period, rather than through formal usability studies, controlled human factors experiments, or quantitative metrics such as System Usability Scale (SUS) scores, NASA Task Load Index (TLX) cognitive load assessments, or time-to-diagnosis measurements. While the continued use of the system by operators suggests practical value, the strength of this evidence is limited by the absence of rigorous human–AI interaction evaluation.
Future work will incorporate structured operator feedback protocols, quantitative usability metrics, and comparative assessment of diagnostic accuracy and decision confidence to strengthen the empirical basis for claims regarding LLM effectiveness and operator productivity improvements.

8.2.4. Transferability to Other Batch Processes: Conceptual vs. Empirical Validation

The architectural pattern (layered microservices, event-driven communication, process-aware context management, temporal domain separation) is presented as transferable to other batch processes by reconfiguring rules, ontologies, and enrichment logic (Section 3, Table 3). This transferability is demonstrated conceptually—through the generic design, instantiation examples comparing CIP and fermentation configurations, and the clear separation between process-agnostic components (orchestration, event bus, data persistence) and process-specific parameters (recipes, thresholds, fuzzy membership functions).
However, this transferability has not been validated empirically through deployment in fermentation, distillation, pasteurization, or other industrial batch contexts. While the architectural abstraction suggests that the same components should remain effective under different process dynamics, the actual configuration effort, operator acceptance, diagnostic accuracy, and maintenance planning value in other process types remain to be demonstrated. Extension to fermentation and distillation processes at the same facility is planned to provide this empirical validation and assess the practical reusability of the architectural pattern.

8.3. Future Work and Research Directions

Several directions for future research emerge from the limitations identified above and from the operational experience gained during the six-month CIP deployment.

8.3.1. Longitudinal Validation and Maintenance Correlation

Future work will extend the evaluation to:
  • The full cohort of 24 CIP executions monitored during the deployment period, plus subsequent production runs, to assess system stability, false positive/negative rates, and long-term operator acceptance across a larger sample.
  • Quantitative correlation between alert patterns (preventive warnings, diagnostic clusters) and confirmed equipment failures or maintenance interventions recorded in the plant’s computerized maintenance management system (CMMS), providing empirical evidence of predictive maintenance value.
  • Automated fidelity audits of all LLM outputs against ground-truth logs, systematically measuring factual accuracy (median absolute error, hallucination rate) across thousands of operator queries rather than relying on spot-check evaluation.
These longitudinal studies will strengthen the empirical basis for claims regarding operational reliability, diagnostic accuracy, and preventive maintenance effectiveness.

8.3.2. Comparative Evaluation and Ablation Studies

To address the absence of comparative baselines, future research will pursue:
  • Controlled A/B testing comparing traditional supervision (manual operator correlation without LLM assistance) with the proposed architecture, measuring time-to-diagnosis, maintenance planning accuracy, operator workload (NASA-TLX), and overall equipment effectiveness (OEE) improvements.
  • Component-level ablation studies isolating the contributions of specific design choices: temporal domain separation vs. monolithic LLM deployment, signal enrichment vs. raw data grounding, RAG-based retrieval vs. direct prompting, different LLM parameter scales (7B vs. 13B vs. 70B models).
  • Comparative assessment of alternative AI techniques for anomaly detection and trend analysis (e.g., recurrent neural networks for equipment degradation forecasting, reinforcement learning for maintenance scheduling optimization), provided that model training and deployment workflows remain compatible with temporal domain separation and safety barrier constraints.
Such experiments would quantitatively establish the relative importance of architectural components and guide optimization for future deployments.

8.3.3. Multi-Site, Multi-Process Validation

To empirically validate the transferability claims, future work will deploy the architecture in additional industrial contexts:
  • Fermentation processes in biopharma or brewing, where key variables shift from temperature/flow/conductivity to pH, dissolved oxygen, biomass proxies, and where batch durations extend from hours to days.
  • Distillation processes in petrochemicals or spirits production, where control focuses on reflux ratio management, column pressure/temperature profiles, and product composition tracking.
  • Pasteurization processes in food manufacturing, where thermal treatment envelopes, hold time compliance, and rapid cooling trajectories require different diagnostic patterns and alarm thresholds.
Comparative analysis across sites and process types would reveal architectural components that generalize reliably vs. those requiring domain-specific customization, informing best practices for industrial AI deployment.

8.3.4. Formal Verification of Safety Properties

While the Deterministic Supervisor enforces a safety barrier through deterministic rule-based validation, formal methods (model checking, theorem proving) could provide mathematical guarantees that the architecture cannot violate specified safety properties, such as:
  • “LLM-generated recommendations cannot command actuators without deterministic approval passing through the Supervisor’s constraint validation logic.”
  • “Real-time supervision latency remains bounded (<100 ms worst-case) under LLM inference load, ensuring that safety-critical alarms are not delayed by non-deterministic analytics.”
  • “The three-layer data persistence strategy maintains O(1) memory consumption per active batch, preventing resource exhaustion under concurrent multi-line operation.”
Such verification would strengthen confidence in deployment for higher-criticality applications (pharmaceutical manufacturing under FDA 21 CFR Part 11, nuclear facilities under IEC 61513 [45]) where informal validation through testing may be insufficient for regulatory approval.
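The bounded-memory property in the third safety property above can be realized with a fixed-capacity rolling buffer; the sketch below, with hypothetical names and capacity, shows the idea using a deque that evicts the oldest sample on every append, so memory use per batch is constant regardless of run length.

```python
# Sketch of an O(1)-memory-per-batch rolling buffer; class name and
# capacity are illustrative.
from collections import deque

class RollingBuffer:
    def __init__(self, capacity: int):
        self._buf = deque(maxlen=capacity)   # oldest sample evicted automatically

    def append(self, sample: float) -> None:
        self._buf.append(sample)

    def __len__(self) -> int:
        return len(self._buf)

    def mean(self) -> float:
        return sum(self._buf) / len(self._buf)

buf = RollingBuffer(capacity=100)
for t in range(10_000):          # a long CIP run...
    buf.append(float(t))
# ...the buffer still holds only the most recent 100 samples
```

Formally verifying this property would amount to showing that the buffer's size is bounded by its capacity for every possible append sequence.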

8.3.5. Human Factors and Explainability Research

Systematic evaluation of operator interaction patterns, cognitive load, trust calibration, and explainability preferences would inform interface design and LLM prompt engineering:
  • Understanding how operators interpret enriched data representations (linguistic variables, trend indicators, supervisory state labels) and what level of granularity best supports diagnostic decision-making without overwhelming with information.
  • Assessing trust dynamics: under what conditions do operators validate LLM-generated explanations against raw data, and what types of responses (numerical summaries, historical comparisons, causal narratives) most effectively support maintenance planning decisions.
  • Refining conversational analytics capabilities based on observed query patterns, common misunderstandings, and operator feedback, potentially incorporating active learning to improve response relevance over time.
Human–AI collaboration research in industrial contexts remains underexplored, and insights from this deployment could inform broader design principles for trustworthy AI in operational environments.

8.3.6. Advanced AI Techniques for Predictive Maintenance

The current Monitoring Service implements anomaly detection and trend analysis using statistical methods (rolling means, deviation scores) and lightweight machine learning (one-class SVM, isolation forests for pattern recognition in the Signal Enrichment Pipeline). Integration of more sophisticated predictive models could enhance preventive maintenance capabilities:
  • Recurrent neural networks (RNNs, LSTMs) for equipment degradation forecasting, trained on historical execution data to predict remaining useful life of pumps, valves, boilers based on observed wear patterns.
  • Gaussian processes or Bayesian neural networks for uncertainty-aware predictions, enabling maintenance scheduling decisions that balance equipment failure risk against planned downtime costs.
  • Reinforcement learning agents for optimized maintenance scheduling, learning policies that maximize equipment availability, minimize unscheduled downtime, and reduce maintenance costs based on long-term operational data.
Such extensions would require careful integration to preserve temporal domain separation (predictive models must not delay deterministic supervision) and safety barrier constraints (model recommendations must pass through Supervisor validation before affecting operations).
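The rolling-mean deviation score mentioned above as the Monitoring Service's current statistical method can be sketched as follows; the window size and sample data are illustrative, not the deployed configuration.

```python
# Sketch of a rolling-mean deviation score (z-score style); window size
# and data are illustrative, not the deployed configuration.
import statistics

def deviation_scores(series: list[float], window: int = 5) -> list[float]:
    """Deviation of each new sample from the rolling mean of the
    preceding `window` samples, normalized by the rolling std."""
    scores = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.fmean(hist)
        sd = statistics.pstdev(hist) or 1e-9   # guard against zero variance
        scores.append((series[i] - mu) / sd)
    return scores

# Stable flow followed by a step degradation:
series = [100.0, 100.2, 99.8, 100.1, 99.9, 100.0, 92.0]
scores = deviation_scores(series)
```

A predictive model (e.g., an LSTM forecaster) would replace the rolling mean with a learned expectation, but, as noted above, its inference must not sit on the deterministic supervision path.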

8.4. Concluding Remarks

This work demonstrated that LLM-enabled decision support can be integrated into industrial batch supervision in a manner that preserves safety, operates under resource constraints typical of edge computing environments and provides practical value to operators in production settings. The architecture successfully addresses AI-specific technical challenges—non-deterministic behavior in real-time contexts, bounded context windows for time-series data, hallucination prevention in safety-critical environments, resource-efficient edge deployment—through temporal domain separation, signal enrichment pipelines, retrieval-augmented generation, and process-aware microservice design.
However, the empirical validation remains limited to a single process type (CIP) and deployment context (beverage plant). The presented results demonstrate technical feasibility and operational viability within this scope but do not constitute definitive proof of superior performance relative to alternative approaches or universal applicability across all batch process types. Future work extending the evaluation to diverse industrial settings, incorporating quantitative comparative metrics, exploring advanced AI techniques for predictive maintenance and conducting rigorous human factors research will strengthen the generalizability and impact of this architectural pattern.
The integration of artificial intelligence into industrial automation is not merely a matter of deploying state-of-the-art models but requires careful co-design of AI components, safety mechanisms, data transformation pipelines and human interfaces to meet the operational, regulatory and economic constraints of real production environments. This work provides one reference architecture and empirical validation case that industrial AI practitioners can adapt, extend, and improve as the technology and deployment practices continue to mature.

Author Contributions

Conceptualization, A.G.-P., D.M.-C. and C.M.P.; methodology, A.G.-P., D.M.-C., L.J.M. and V.G.F.; software, A.G.-P.; validation, A.G.-P., D.M.-C., C.M.P. and L.J.M.; formal analysis, A.G.-P., A.O.-B. and L.J.M.; investigation, A.G.-P.; resources, D.M.-C., L.J.M. and R.A.F.-C.; data curation, A.G.-P. and A.O.-B.; writing—original draft preparation, A.G.-P.; writing—review and editing, A.G.-P., R.M.-P. and V.G.F.; visualization, A.G.-P.; supervision, A.G.-P. and A.O.-B.; project administration, A.G.-P. and D.M.-C.; funding acquisition, D.M.-C., A.O.-B. and R.A.F.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external grant funding. The industrial deployment and operational validation were supported by VivaWild Beverages as part of their internal process improvement initiatives.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The enriched process logs and evaluation scripts generated during this study are available from the corresponding author upon reasonable request. Raw industrial data cannot be shared due to confidentiality agreements with VivaWild Beverages.

Acknowledgments

The authors gratefully acknowledge VivaWild Beverages for industrial access and operational support throughout the six-month validation campaign. This work represents collaborative efforts among the University of Colima, Universidad Politécnica de Sinaloa, and Universidad Autónoma de Occidente, whose contributions in methodological development, system architecture, and experimental analysis were essential. Special thanks to plant operators and maintenance staff for their technical collaboration during deployment.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Mandenius, C.F.; Titchener-Hooker, N.J. Measurement, Monitoring, Modelling and Control of Bioprocesses; Advances in Biochemical Engineering/Biotechnology; Springer: Berlin/Heidelberg, Germany, 2013; Volume 132.
2. Moerman, F.; Rizoulières, P.; Majoor, F. Cleaning in place (CIP) in food processing. In Hygiene in Food Processing, 2nd ed.; Lelieveld, H., Holah, J., Napper, D., Eds.; Woodhead Publishing Series in Food Science, Technology and Nutrition; Woodhead Publishing: Cambridge, UK, 2014; pp. 305–383.
3. Van Asselt, A.; Van Houwelingen, G.; Te Giffel, M. Monitoring System for Improving Cleaning Efficiency of Cleaning-in-Place Processes in Dairy Environments. Food Bioprod. Process. 2002, 80, 276–280.
4. Meneses, Y.E.; Flores, R.A. Feasibility, safety, and economic implications of whey-recovered water in cleaning-in-place systems: A case study on water conservation for the dairy industry. J. Dairy Sci. 2016, 99, 3396–3407.
5. Adolphs, P.; Bedenbender, H.; Dirzus, D.; Drath, R.; Horch, A.; Zentgraf, P.; Griesbach, D.; Häner, M.; Hoffmeister, M.; Zimmer, J.; et al. Reference Architecture Model Industrie 4.0 (RAMI4.0). Technical Report, ZVEI and VDI, Status Report, Plattform Industrie 4.0, 2015. Available online: https://www.plattform-i40.de (accessed on 21 January 2026).
6. Ibarra-Junquera, V.; González, A.; Paredes, C.M.; Martínez-Castro, D.; Nuñez-Vizcaino, R.A. Component-Based Microservices for Flexible and Scalable Automation of Industrial Bioprocesses. IEEE Access 2021, 9, 58192–58207.
7. Galdino, M.; Hamann, T.; Abdelrazeq, A.; Isenhardt, I. Large Language Model-Based Cognitive Assistants for Quality Management Systems in Manufacturing: A Requirement Analysis. Eng. Rep. 2025, 7, e70437.
8. Chkirbene, Z.; Hamila, R.; Gouissem, A.; Devrim, U. Large Language Models (LLM) in Industry: A Survey of Applications, Challenges, and Trends. In Proceedings of the 2024 IEEE 21st International Conference on Smart Communities: Improving Quality of Life Using AI, Robotics and IoT (HONET), Doha, Qatar, 3–5 December 2024; pp. 229–234.
9. Mustafa, F.E.; Ahmed, I.; Basit, A.; Alvi, U.E.H.; Malik, S.H.; Mahmood, A.; Ali, P.R. A review on effective alarm management systems for industrial process control: Barriers and opportunities. Int. J. Crit. Infrastruct. Prot. 2023, 41, 100599.
10. ul Hassan, I.; Panduru, K.; Walsh, J. Predictive Maintenance in Industry 4.0: A Review of Data Processing Methods. Procedia Comput. Sci. 2025, 257, 896–903.
11. Tiddens, W.; Braaksma, J.; Tinga, T. Exploring predictive maintenance applications in industry. J. Qual. Maint. Eng. 2020, 28, 68–85.
12. Serrano-Magaña, H.; González-Potes, A.; Ibarra-Junquera, V.; Balbastre, P.; Martínez-Castro, D.; Simó, J. Software Components for Smart Industry Based on Microservices: A Case Study in pH Control Process for the Beverage Industry. Electronics 2021, 10, 763.
13. Yuan, C.; Xie, Y.; Xie, S.; Tang, Z. Interval type-2 fuzzy stochastic configuration networks for soft sensor modeling of industrial processes. Inf. Sci. 2024, 679, 121073.
14. Song, X.L.; Zhang, N.; Shi, Y.; He, Y.L.; Xu, Y.; Zhu, Q.X. Quality-driven deep feature representation learning and its industrial application to soft sensors. J. Process Control 2024, 142, 103300.
15. Zhou, X.; Lu, J.; Ding, J. Fuzzy Hierarchical Stochastic Configuration Networks for Industrial Soft Sensor Modeling. IEEE Trans. Fuzzy Syst. 2025, 33, 2336–2347.
16. Maschler, B.; Ganssloser, S.; Hablizel, A.; Weyrich, M. Deep learning based soft sensors for industrial machinery. Procedia CIRP 2021, 99, 662–667.
17. Peres, R.S.; Jia, X.; Lee, J.; Sun, K.; Colombo, A.W.; Barata, J. Industrial Artificial Intelligence in Industry 4.0—Systematic Review, Challenges and Outlook. IEEE Access 2020, 8, 220121–220139.
18. Li, C.; Chang, Q.; Fan, H.T. Multi-agent reinforcement learning for integrated manufacturing system-process control. J. Manuf. Syst. 2024, 76, 585–598.
19. Yi, Z.; Ouyang, J.; Xu, Z.; Liu, Y.; Liao, T.; Luo, H.; Shen, Y. A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems. ACM Comput. Surv. 2025, 58, 148.
20. Li, Z.; Deldari, S.; Chen, L.; Xue, H.; Salim, F.D. SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition. arXiv 2025, arXiv:2410.10624.
21. Mukherjee, A.; Karande, A.; Häfner, P.; Poonia, M.D.; Kimmig, A.; Kreuzwieser, S.; Vlas, R.; Klar, M.; Sykora, T.; Grethler, M. A LLM-based voice user interface for voice dialogues between user and industrial machines. Procedia CIRP 2025, 134, 378–383.
22. Han, S.; Wang, M.; Zhang, J.; Li, D.; Duan, J. A Review of Large Language Models: Fundamental Architectures, Key Technological Evolutions, Interdisciplinary Technologies Integration, Optimization and Compression Techniques, Applications, and Challenges. Electronics 2024, 13, 5040.
23. Yang, L.; Su, R. From Machine Learning-Based to LLM-Enhanced: An Application-Focused Analysis of How Social IoT Benefits from LLMs. IoT 2025, 6, 26.
24. Wang, E.; Xie, W.; Li, S.; Liu, R.; Zhou, Y.; Wang, Z.; Ma, S.; Yang, W.; Wang, B. Large Language Model-Powered Protected Interface Evasion: Automated Discovery of Broken Access Control Vulnerabilities in Internet of Things Devices. Sensors 2025, 25, 2913.
25. Kreisberg-Nitzav, A.; Kenett, Y.N. Creativeable: Leveraging AI for Personalized Creativity Enhancement. AI 2025, 6, 247.
26. Lim, J.; Kovalenko, I. A Large Language Model-Enabled Control Architecture for Dynamic Resource Capability Exploration in Multi-Agent Manufacturing Systems. In Proceedings of the 2025 IEEE 21st International Conference on Automation Science and Engineering (CASE), Los Angeles, CA, USA, 17–21 August 2025; pp. 2088–2095.
27. Zhao, Z.; Tang, D.; Liu, C.; Wang, L.; Zhang, Z.; Zhu, H.; Chen, K.; Nie, Q.; Ji, Y. A Large language model-based multi-agent manufacturing system for intelligent shopfloors. Adv. Eng. Inform. 2026, 69, 103888.
28. Garcia, C.I.; DiBattista, M.A.; Letelier, T.A.; Halloran, H.D.; Camelio, J.A. Framework for LLM applications in manufacturing. Manuf. Lett. 2024, 41, 253–263.
29. Keskin, Z.; Joosten, D.; Klasen, N.; Huber, M.; Liu, C.; Drescher, B.; Schmitt, R.H. LLM-Enhanced Human–Machine Interaction for Adaptive Decision-Making in Dynamic Manufacturing Process Environments. IEEE Access 2025, 13, 44650–44661.
30. Chen, L.C.; Pardeshi, M.S.; Liao, Y.X.; Pai, K.C. Application of retrieval-augmented generation for interactive industrial knowledge management via a large language model. Comput. Stand. Interfaces 2025, 94, 103995.
31. Alsaif, K.M.; Albeshri, A.A.; Khemakhem, M.A.; Eassa, F.E. Multimodal Large Language Model-Based Fault Detection and Diagnosis in Context of Industry 4.0. Electronics 2024, 13, 4912.
32. Kim, K.; Ghimire, P.; Huang, P.C. Framework for LLM-Enabled Construction Robot Task Planning: Knowledge Base Preparation and Robot–LLM Dialogue for Interior Wall Painting. Robotics 2025, 14, 117.
33. Liu, Y.; Palmieri, L.; Koch, S.; Georgievski, I.; Aiello, M. DELTA: Decomposed Efficient Long-Term Robot Task Planning Using Large Language Models. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 19–23 May 2025; pp. 10995–11001.
34. Fan, H.; Liu, X.; Fuh, J.Y.H.; Lu, W.F.; Li, B. Embodied Intelligence in Manufacturing: Leveraging Large Language Models for Autonomous Industrial Robotics. J. Intell. Manuf. 2025, 36, 1141–1157.
35. Tariq, M.T.; Hussain, Y.; Wang, C. Robust mobile robot path planning via LLM-based dynamic waypoint generation. Expert Syst. Appl. 2025, 282, 127600.
36. Colabianchi, S.; Costantino, F.; Sabetta, N. Assessment of a large language model based digital intelligent assistant in assembly manufacturing. Comput. Ind. 2024, 162, 104129.
37. Jeon, J.; Sim, Y.; Lee, H.; Han, C.; Yun, D.; Kim, E.; Nagendra, S.L.; Jun, M.B.; Kim, Y.; Lee, S.W.; et al. ChatCNC: Conversational machine monitoring via large language model and real-time data retrieval augmented generation. J. Manuf. Syst. 2025, 79, 504–514.
38. Xia, Y.; Jazdi, N.; Zhang, J.; Shah, C.; Weyrich, M. Control Industrial Automation System with Large Language Models. In Proceedings of the 30th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Porto, Portugal, 9–12 September 2025; IEEE: New York, NY, USA, 2025.
39. Kaltenpoth, S.; Skolik, A.; Müller, O.; Beverungen, D. A Step Towards Cognitive Automation: Integrating LLM Agents with Process Rules. In Proceedings of the Business Process Management: 23rd International Conference, BPM 2025, Seville, Spain, 31 August–5 September 2025; Springer: Berlin/Heidelberg, Germany, 2025; pp. 308–324.
40. Clarke, G.R.; Reynders, D.; Wright, E. Practical Modern SCADA Protocols: DNP3, 60870.5 and Related Systems, 1st ed.; Newnes: Oxford, UK, 2004; ISBN 978-0-7506-5799-0.
41. Ren, Y.; Zhang, H.; Yu, F.R.; Li, W.; Zhao, P.; He, Y. Industrial Internet of Things With Large Language Models (LLMs): An Intelligence-Based Reinforcement Learning Approach. IEEE Trans. Mob. Comput. 2025, 24, 4136–4152.
42. Ren, L.; Wang, H.; Dong, J.; Wang, H.; Liu, S.; Laili, Y.; Zhang, L. MetaIndux-PLC: A Control Logic-Guided LLM for PLC code generation in industrial control systems. Appl. Soft Comput. 2025, 184, 113673.
43. Siemens AG. Siemens Braumat: Brewery Automation and Process Control. 2024. Available online: https://www.siemens.com/braumat (accessed on 12 January 2025).
44. Rockwell Automation. FactoryTalk Batch: Batch Management Software. 2024. Available online: https://www.rockwellautomation.com/en-gb/products/software/factorytalk/operationsuite/view.html (accessed on 12 January 2025).
45. IEC 61513:2011; Nuclear Power Plants—Instrumentation and Control Important to Safety—General Requirements for Systems. International Electrotechnical Commission: Geneva, Switzerland, 2011.
Figure 1. Conceptual layered architecture separating cyber and physical domains. The cyber layer comprises user interfaces, an orchestration layer (200–300 ms), and specialized agents coordinated through an Event Bus. The Agent Control Layer separates deterministic hard real-time components from non-deterministic soft real-time analytics (1–2 s). The physical layer maintains PLC/SCADA control authority (50–100 ms) with non-invasive integration via MODBUS/Serial protocols. The Signal Enrichment Pipeline transforms raw sensor data into linguistic variables and supervisory states. The Data Persistence Layer implements a three-tier strategy: Layer 1 (raw time-series, 100–1000 ms), Layer 2 (enriched time-series, 1–5 s), and Layer 3 (execution summaries). (Solid arrows: data flow; dashed arrows: asynchronous communication; gray arrows: monitoring/logging paths.)

Sensor signals are enriched through AI techniques before publication to the Event Bus, enabling LLM-based agents to reason over high-level process concepts. In the CIP deployment, these components manage cleaning cycles and support preventive maintenance decision-making; in other batch processes, the same pattern supervises fermentation, distillation, or synthesis runs with diagnostic capabilities aligned to equipment health monitoring.
Figure 2. Detailed component architecture with explicit temporal constraints and data flow patterns. Deterministic agents operate under hard real-time constraints (<100 ms); non-deterministic components execute in soft real-time (1–2 s). The Signal Enrichment Pipeline transforms raw sensor streams (100–1000 ms) through fuzzy logic and statistical analysis. Three-tier data persistence: Layer 1 (raw in-memory buffers), Layer 2 (enriched time-series with configurable retention), Layer 3 (aggregated summaries). The architecture integrates non-invasively with the physical layer via standard industrial protocols (MODBUS, RS-232/485 Serial, I2C, ISP). The Deterministic Supervisor enforces safety barriers preventing non-deterministic LLM outputs from directly commanding actuators. (Black arrows: data flow; red arrows: control paths; call-LLM nodes: agent context management.).
Figure 3. Configurable real-time window architecture with adaptive memory management. Three window patterns support different process semantics: activity-based (batch execution), time-based (continuous monitoring), and event-based (alarm tracking). Data reside in three tiers: HOT (in-memory, <10 ms), WARM (disk-backed, 50–200 ms auto-load), COLD (archived, TTL expiry). Query Router directs requests to Deterministic Path (HOT data, no LLM) or Analytical Path (WARM data with LLM inference, 1–2 s). Transition triggers and retention policies are application-defined. (Solid arrows: data flow; dashed arrows: query routing.).
Figure 4. Illustrative timelines of CIP stages for the three experimental runs, showing the sequence and duration of each stage from the start of pre-rinse (time zero).
Figure 5. Per-execution values of state specification consistency Γs and stability of state labeling Λs for each CIP stage. The alkaline stages maintain high Γs across all executions (nominal baseline, preventive warnings, diagnostic alerts), even when time within specification is low, while label instability is concentrated in the stressed final-rinse stage of CIP 3.
Figure 6. Median reaction time between the onset of a critical episode and the corresponding alert for each CIP execution. Only CIP 3 (diagnostic alerts) exhibits critical episodes followed by alerts, resulting in a non-zero median reaction time of approximately 35–36 s.
Table 1. Design Principles to Implementation Technology Mapping. Italicized text denotes architectural principles from prior work; bold text highlights specific technologies or components implemented in this architecture.
| Principle from Prior Work | Architectural Realization | Technology | Rationale |
|---|---|---|---|
| Microservices scalability and inter-component messaging | Layered component architecture with efficient event-driven communication | Redis Streams Event Bus | O(1) append latency for time-series data; single-node deployment efficiency (typical brownfield facilities); preserves composability of independent components without message-broker complexity |
| Edge deployment and resource constraints | Embedded LLM inference on industrial computing hardware | Qwen 2.5 (7B params) | Code generation capability; instruction-tuned for domain tasks; <2 s inference on consumer-grade NVIDIA RTX; multilingual support for global facilities |
| Reliability and non-invasive integration | Safety-barrier validation preventing non-deterministic AI from compromising control | Deterministic Supervisor + command validation | Inherited safety-first principle; LLM outputs inform decisions but cannot directly command actuators without deterministic constraint checking |
| Data semantic bridging | Transform industrial numerical signals into natural-language representations | Fuzzy logic (interpretable enrichment exemplar) | Interpretability critical for operator trust in safety-sensitive systems; alternatives (SVM, Isolation Forest) could provide equivalent enrichment |
| Component composability | Independent layer deployment and rule/ontology reconfiguration | Agent-based architecture + configurable rules | Maintains flexibility from prior work; different processes adopt the architecture by reconfiguring domain-specific parameters without architectural redesign |
Table 2. Real-time window pattern configurations for representative industrial applications.
| Application | Pattern Type | Window Definition | Hot Memory | Transition Trigger |
|---|---|---|---|---|
| CIP cleaning | Activity-based | Active batch (2 h, 1 Hz) | 5–7 MB | Batch completion |
| Cold chain | Time-based | Rolling 24 h (1/min) | 500 KB | Data > 24 h old |
| Alarm tracking | Event-based | Last 100 events + context | 200 KB | Alarm resolution + 15 min |
| Fermentation | Activity-based | Active batch (72 h, 1/min) | 3–5 MB | Harvest event |
| Equipment health | Time-based | Rolling 7 d (1/min) | 5 MB | Data > 7 d old |
Table 3. Generic batch architecture components and process-specific instantiations.
| Component | Generic Role | CIP Instance | Fermentation Instance |
|---|---|---|---|
| Batch Master | Orchestrate programme execution | Manage cleaning circuits and stages | Manage inoculation, feeding, harvest |
| Administration | Manage recipes and equipment | Programmes, circuits, tanks | Recipes, vessels, media prep. |
| Deterministic Supervisor | Compute states and alarms | Temp., flow, cond. thresholds | pH, DO, temp., biomass thresholds |
| Real-time Monitoring | LLM analytics | Ad hoc queries, trend analysis on CIP | Ad hoc queries, trend analysis on fermentation |
| Offline Analysis | Post-run reports | KPI trends, cleaning efficiency, maintenance signals | KPI trends, yield metrics, bioreactor health |
Table 4. Main data windows per CIP instance and their intended use.
| Data View | Horizon | Primary Use |
|---|---|---|
| Short window | ≈2 min | Instantaneous state, fast alarms |
| Medium window | 5–10 min | Periodic diagnostics, preventive warnings |
| Full-cycle buffer | 30–90 min | Real-time exploratory analysis, trend queries |
| Historical archives | Days–weeks | Offline reports, benchmarking, maintenance correlation |
Table 5. Example query profiles, data sources and typical response times.
| Query Type | Source | Window | Latency |
|---|---|---|---|
| Current state of batch X | Determ. agent | Short | <100 ms |
| Warnings in last stage | Determ. + LLM | Medium | 0.3–0.8 s |
| Compare current vs. previous | LLM on buffer | Full-cycle | 1–2 s |
| Compare with previous cycles | LLM + archives | Historical | 2–5 s |
| Weekly summary + maintenance recommendations | Offline analysis | Historical | s–min |
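The routing implied by these query profiles can be sketched in a few lines. This is a simplified illustration, not the deployed Query Router: it assumes the target window alone decides the path, whereas in practice medium-window queries combine deterministic and LLM sources. The function and label names are ours.

```python
def route_query(window: str) -> str:
    """Dispatch a query to the deterministic or the analytical (LLM) path
    based on the data window it targets. Illustrative sketch only."""
    deterministic_windows = {"short"}                       # HOT data, no LLM, <100 ms
    llm_windows = {"medium", "full-cycle", "historical"}    # WARM/COLD data, LLM inference
    if window in deterministic_windows:
        return "deterministic"
    if window in llm_windows:
        return "analytical"
    raise ValueError(f"unknown window: {window}")

assert route_query("short") == "deterministic"   # current-state query stays fast
assert route_query("full-cycle") == "analytical" # cross-stage comparison goes to the LLM
```

The design point the table makes survives the simplification: state queries never wait on LLM inference, so the sub-100 ms path is preserved regardless of analytical load.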
Table 6. Illustrative example of an enriched CIP data record.
| Field | Example Value |
|---|---|
| timestamp | 2025-11-19T18:05:50.123Z |
| cip_id | cip20251119T180530_abc12345 |
| program | ALKA_STANDARD |
| stage | ALKALINE |
| seconds_elapsed | 320.5 |
| seconds_remaining | 579.5 |
| progress_percent | 35.6 |
| temp [°C] | 61.5 |
| flow [L/s] | 2.8 |
| cond [µS/cm] | 1657.2 |
| volume [L] | 204.1 |
| state | NORMAL |
| quality_grade | 0.82 |
| motives | Parameters within target ranges |
| actions | Continue current stage |
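The same enriched record, written as a plain dictionary, makes the internal consistency of the fields easy to check. The key names with units folded in (`temp_c`, `flow_l_s`, `cond_us_cm`, `volume_l`) are our renaming for code purposes; the schema check is illustrative, not the deployed validator.

```python
# The Table 6 record as a plain dict (key names adapted for code).
record = {
    "timestamp": "2025-11-19T18:05:50.123Z",
    "cip_id": "cip20251119T180530_abc12345",
    "program": "ALKA_STANDARD",
    "stage": "ALKALINE",
    "seconds_elapsed": 320.5,
    "seconds_remaining": 579.5,
    "progress_percent": 35.6,
    "temp_c": 61.5,
    "flow_l_s": 2.8,
    "cond_us_cm": 1657.2,
    "volume_l": 204.1,
    "state": "NORMAL",
    "quality_grade": 0.82,
    "motives": "Parameters within target ranges",
    "actions": "Continue current stage",
}

# Progress should equal elapsed / (elapsed + remaining), to within rounding:
# 320.5 / 900.0 = 35.61%, which matches the stored 35.6 after rounding.
total = record["seconds_elapsed"] + record["seconds_remaining"]
assert abs(record["progress_percent"] - 100 * record["seconds_elapsed"] / total) < 0.1

# The supervisory state must be one of the three discrete labels used throughout.
assert record["state"] in {"NORMAL", "WARNING", "CRITICAL"}
```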
Table 7. Computational resource profile for the deployed decision-support architecture (single-line CIP monitoring).
| Component | Resource Type | Usage |
|---|---|---|
| Redis in-memory buffer (10⁴ records) | RAM | ≈5 MB per CIP |
| Enriched record (per sample) | RAM | ≈0.5 kB |
| Qwen 2.5 7B model (Ollama) | VRAM (GPU) | 4.5 GB (shared) |
| Agent services (Python 3.11 containers) | RAM | 100–150 MB per agent |
| LLM query (median) | Latency | 1–2 s |
| Deterministic state update | Latency | <100 ms |
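The first two rows of the table are mutually consistent, which is worth verifying: roughly 10⁴ enriched records at ≈0.5 kB each should land near the reported ≈5 MB per CIP instance. A back-of-envelope check (ours, using binary megabytes):

```python
# Buffer sizing check for the resource profile above.
records = 10_000       # ~10^4 enriched records per CIP run
record_kb = 0.5        # ~0.5 kB per enriched record
buffer_mb = records * record_kb / 1024   # kB -> MB (binary)
print(f"{buffer_mb:.1f} MB")             # 4.9 MB, consistent with the reported ~5 MB
```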
Table 8. Selected CIP executions and rationale.
| Case | Type | Rationale |
|---|---|---|
| CIP 1 | Nominal baseline | Baseline behavior under standard operating conditions, with variables close to their nominal profiles and no warnings or alerts. Serves as a reference for normal equipment health and operator performance. |
| CIP 2 | Preventive warnings | Typical process variations in alkaline and rinsing stages (e.g., slightly reduced flow or temperature excursions that remain within specification) that trigger WARNING-level diagnostics. These runs complete successfully but indicate emerging maintenance needs (e.g., pump wear, boiler efficiency drift). |
| CIP 3 | Diagnostic alerts | More pronounced deviations, including clusters of alerts related to flow or conductivity, used to stress-test alert generation and diagnostic capabilities under more demanding conditions that remain operationally successful but call for prioritized maintenance. |
Table 9. Summary of evaluation metrics per CIP execution. Metrics marked with "—" are computed only for executions exhibiting confirmed critical episodes (CIP 3 only) *.
| Metric | Alkaline | Sanitizing | Final Rinse |
|---|---|---|---|
| Stage specification compliance | 0.58 ± 0.52 | 1.00 ± 0.00 | n/a |
| State specification consistency Γs | 0.98 ± 0.03 | 1.00 ± 0.00 | n/a |
| Stability of state labeling Λs (changes/min) | 0.03 ± 0.05 | 0.00 ± 0.00 | 3.28 ± 5.68 |
| Alert sensitivity | — | — | — |
| Alert specificity | — | — | — |
| Reaction time (s) | — | — | — |
| Anticipation window (s) | — | — | — |
* Sensitivity, specificity, reaction time and anticipation window require ground-truth critical episodes for computation.
Table 10. Stage specification compliance per representative CIP execution.
| CIP | Alkaline Stage | Sanitizing Stage |
|---|---|---|
| CIP 1 (nominal baseline) | 1.00 | 1.00 |
| CIP 2 (preventive warnings) | 0.75 | 1.00 |
| CIP 3 (diagnostic alerts) | 0.00 | 1.00 |
Table 11. Stage specification compliance across the three executions (mean ± std).
| Stage | Compliance (Mean) | Compliance (Std) |
|---|---|---|
| Alkaline | 0.58 | 0.52 |
| Sanitizing | 1.00 | 0.00 |
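The aggregates in Table 11 follow directly from the per-run compliance values in Table 10, assuming the reported spread is the sample standard deviation over the three executions. A quick check:

```python
from statistics import mean, stdev

# Per-run stage specification compliance from Table 10 (CIP 1-3).
alkaline = [1.00, 0.75, 0.00]
sanitizing = [1.00, 1.00, 1.00]

# Reproduces the Table 11 aggregates (sample std over three runs).
assert round(mean(alkaline), 2) == 0.58
assert round(stdev(alkaline), 2) == 0.52
assert mean(sanitizing) == 1.00 and stdev(sanitizing) == 0.00
```

The large alkaline-stage standard deviation (0.52 against a mean of 0.58) thus reflects the deliberate spread of the three selected runs from nominal to degraded, not measurement noise.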
Table 12. Diagnostic alert patterns across representative CIP executions, illustrating the progression from nominal baseline (CIP 1) to preventive warnings (CIP 2) and diagnostic alerts (CIP 3).
| Execution | Alert Pattern | Maintenance Implication | Operator Action |
|---|---|---|---|
| CIP 1 | No warnings/alerts | Baseline equipment health | Routine monitoring |
| CIP 2 | Intermittent warnings | Emerging drift (pump, boiler) | Schedule preventive maintenance |
| CIP 3 | Sustained warnings + critical | Equipment degradation | Prioritize maintenance review |
Table 13. Examples of consistency between language-based outputs and enriched logs, illustrating numerical fidelity across nominal (CIP 1), preventive (CIP 2) and diagnostic (CIP 3) executions.
| Description | Value Reported in Explanation | Value Computed from Logs |
|---|---|---|
| Average temperature in alkaline stage (CIP 3) | ≈73 °C | 72.96 °C |
| Average flow in alkaline stage (CIP 3) | ≈1.0 L/s | 1.00 L/s |
| Warnings in current alkaline stage (CIP 3) | "dozens of warnings" / reported count | 66 warnings and 2 critical samples |
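The spot-check audit behind this table reduces to a tolerance comparison: a numeric value quoted in a language-based summary is accepted if it matches the value recomputed from the enriched logs within a small relative tolerance. The helper and the 2% tolerance below are our illustration of the idea, not the audit code used in the study.

```python
def consistent(reported: float, computed: float, rel_tol: float = 0.02) -> bool:
    """Accept a reported value if it is within rel_tol of the log-derived value."""
    return abs(reported - computed) <= rel_tol * abs(computed)

assert consistent(73.0, 72.96)     # "~73 C" vs 72.96 C from logs: accepted
assert consistent(1.0, 1.00)       # "~1.0 L/s" vs 1.00 L/s: accepted
assert not consistent(73.0, 61.5)  # would flag a fabricated temperature
```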
Table 14. Supervision capabilities: traditional CIP supervision vs proposed architecture, highlighting diagnostic and preventive maintenance support.
| Aspect | Traditional CIP Supervision | Proposed Multi-Agent Architecture |
|---|---|---|
| State representation | Threshold-based alarms on individual variables | Aggregated NORMAL/WARNING/CRITICAL state combining multiple variables and rules |
| Explanations | Fixed alarm texts, limited context | Contextual explanations linking stage, variables, recent history and maintenance implications |
| Operator queries | Predefined HMI screens and fixed trends | Natural-language queries over current and past CIP runs, including cross-cycle trend analysis |
| Stage-level summaries | Manual inspection of logs and reports | Automatic per-stage summaries with statistics, discrete states and preventive recommendations |
| Decision support | Reactive acknowledgement of alarms | Proactive highlighting of abnormal patterns, emerging equipment drift and maintenance prioritization |
| Trend analysis | Offline, manual correlation across cycles | On-demand, conversational trend queries (e.g., "show flow degradation over last 10 cycles") |
Table 15. Comparison with related work on LLM integration in industrial automation.
| Work | Domain | LLM Integration | Deployment | Metrics |
|---|---|---|---|---|
| LLM4IAS [38] | Generic automation | Control loop | Lab testbed | No |
| MetaIndux-PLC [42] | PLC codegen | Offline tool | Simulation | No |
| [24] | Doc. retrieval | RAG chatbot | Cloud API | No |
| This work | CIP batch | RT analytics + supervision | Plant | 7+ patterns |

Share and Cite

MDPI and ACS Style

González-Potes, A.; Martínez-Castro, D.; Paredes, C.M.; Ochoa-Brust, A.; Mena, L.J.; Martínez-Peláez, R.; Félix, V.G.; Félix-Cuadras, R.A. Hybrid AI and LLM-Enabled Agent-Based Real-Time Decision Support Architecture for Industrial Batch Processes: A Clean-in-Place Case Study. AI 2026, 7, 51. https://doi.org/10.3390/ai7020051

