Process Mining of Sensor Data for Predictive Process Monitoring: A HACCP-Guided Pasteurization Study Case

Moradbeikie, Azin; Ayub da Costa Barbon, Ana Paula; Grigore, Iuliana Malina; Barbin, Douglas Fernandes; Barbon Junior, Sylvio

doi:10.3390/systems13110935

by Azin Moradbeikie ¹, Ana Paula Ayub da Costa Barbon ², Iuliana Malina Grigore ¹, Douglas Fernandes Barbin ² and Sylvio Barbon Junior ¹,*

¹ Department of Engineering and Architecture, Università degli Studi di Trieste (UNITS), 34127 Trieste, Italy

² Department of Food Engineering, Universidade Estadual de Campinas (UNICAMP), Campinas 13083-862, São Paulo, Brazil

* Author to whom correspondence should be addressed.

Systems 2025, 13(11), 935; https://doi.org/10.3390/systems13110935

Submission received: 22 September 2025 / Revised: 17 October 2025 / Accepted: 20 October 2025 / Published: 22 October 2025

(This article belongs to the Special Issue Data-Driven Analysis of Industrial Systems Using AI)


Abstract

Industrial processes governed by food safety regulations, such as high-temperature short-time (HTST) pasteurization, rely on continuous sensor monitoring to ensure compliance with standards like Hazard Analysis and Critical Control Points (HACCP). However, extracting actionable process insights from raw sensor data remains a non-trivial task, largely due to the continuous, multivariate, and often high-frequency characteristics of the signals, which can obscure clear activity boundaries and introduce significant variability in temporal patterns. This paper proposes a process mining framework to extract activity-based representations from multivariate sensor data in a pasteurization scenario. By modelling temperature, pH, conductivity, viscosity, turbidity, flow, and pressure signals, the approach segments continuous data into discrete operational phases and generates event logs aligned with domain semantics. Unsupervised learning techniques, including Hidden Markov Models (HMMs), are used to infer latent process stages, while domain knowledge guides their interpretation in accordance with critical control points (CCPs). The extracted models support conformance checking against HACCP-based procedures and enable predictive process-monitoring tasks such as next-activity prediction and remaining time estimation. Experimental results on synthetic, literature-grounded data demonstrated the method’s ability to enhance safety, compliance, and operational efficiency. This study illustrates how integrating process mining with regulatory principles can bridge the gap between continuous sensor streams and structured process analysis in food manufacturing.

Keywords: hidden Markov models; multivariate time series; critical control points; food manufacturing; activity recognition

1. Introduction

Industrial processes across sectors such as food, chemical, pharmaceutical, and energy production are usually monitored and controlled through industrial control systems (ICSs), which often align operations into phases for practical supervision and analysis [1]. Although in most cases, phases are formally defined in control logic, their boundaries are not always sharply delimited in practice. Transitions must often be inferred from overlapping or combined behaviour of multiple sensor signals. Cyber–physical systems are therefore required to interpret the temporal dynamics of heterogeneous measurements to infer stage transitions and segment processes into meaningful phases suitable for analysis [2]. Pasteurization in the dairy industry exemplifies this challenge. Operators typically refer to phases such as heating and cooling, yet these transitions cannot be captured by fixed thresholds alone and must instead be recognised through the combined dynamics of multiple sensor signals [3,4].

In practice, this difficulty is further accentuated by regulatory frameworks such as Hazard Analysis and Critical Control Points (HACCP) [5], which mandate the identification and monitoring of critical control points (CCPs) [6] rather than prescribing rigid operational stages. In the case of pasteurization, CCPs are defined by specific time–temperature combinations that must be continuously respected to ensure food safety. Consequently, the ability to segment continuous multivariate sensor data into meaningful phases is essential not only for accurate process analysis but also for ensuring regulatory compliance, reducing waste, and safeguarding product quality.

Traditional approaches in ICSs often rely on mathematical formalisms, such as differential equations, to capture system dynamics. While accurate, these models require deep domain expertise and are difficult to scale in complex, heterogeneous environments [7]. To overcome these limitations, data-driven methods are increasingly adopted. They exploit historical sensor data to characterize normal dynamics and detect deviations in real-time. Machine learning and deep learning have proven to be effective in capturing non-linear dependencies and long-term patterns, but they face drawbacks such as long training times, high computational cost, and limited interpretability [8,9].

Process mining (PM) offers a complementary paradigm that bridges process-based and data-driven perspectives. Unlike purely statistical or black-box learning approaches, PM explicitly connects raw data to operational activities, enabling process discovery, conformance checking, and predictive monitoring. This connection allows us to uncover valuable insights into system behaviour that are difficult to obtain with traditional PM or machine learning alone, while also ensuring results remain explainable and actionable [10,11].

A central challenge in bridging industrial sensor data with PM lies in the fundamental difference in how information is represented. Sensor data is inherently continuous and multivariate, producing streams of measurements such as temperature, pH, and conductivity. These signals evolve over time, often with overlapping dynamics, noise, and context-dependent variability. In contrast, PM frameworks operate on discrete, activity-based perspectives, where operations are modeled as well-defined stages with explicit start and end points. The divergence between these two representations complicates the direct mapping of raw measurements to process activities [12].

To operationalize this mapping, we employed digital signal-processing techniques and Hidden Markov Models (HMMs) [13,14], extending the approach introduced by Tokatli et al. [15], who focused on fault detection using temperature and pressure sensors. The choice of HMMs in this study was motivated by their ability to explicitly model sequential dependencies between latent operational stages and observable sensor dynamics. Unlike clustering-based techniques such as k-means or Gaussian Mixture Models, which assume independent samples, HMMs capture the probabilistic transitions between process states, an essential property for continuous, stage-driven industrial processes such as pasteurization. Deep sequence models (e.g., recurrent or convolutional neural networks) can also represent temporal behaviour, but they require large labelled datasets, entail higher computational cost, and offer limited interpretability, which constrains their applicability in HACCP-regulated environments. HMMs therefore provide a balanced solution that combines unsupervised segmentation, temporal awareness, and transparent probabilistic interpretation of the process flow. For these reasons, they were adopted as the core modelling approach in this work. Building on this foundation, we incorporated additional signals, including pH, conductivity, viscosity, and turbidity, to address further operational tasks. In the present study, we report the results of transforming raw sensor streams into a structured pasteurization process and demonstrate how this enables the application of recent advances in Predictive Process Monitoring (PPM), thereby supporting proactive interventions and compliance with HACCP-based regulations.

In this paper, considering the regulatory requirements established by HACCP and the advantages provided by PM, we explored the high-temperature short-time (HTST) pasteurization scenario described in [3] as a case study for mapping raw sensor data into defined process activities. With respect to the CCP, the time–temperature combinations required for pasteurization are used explicitly to drive the control.

Pasteurization represents one of the most widespread and safety-critical thermal treatments in the food industry, particularly in the dairy sector. While it is a well-established process governed by standardized regulations—such as those defined by HACCP protocols, the Codex Alimentarius, and the U.S. Pasteurized Milk Ordinance—its execution involves multiple tightly controlled operational stages, including heating, holding, cooling, and product discharge. These stages must be continuously monitored to ensure compliance with temperature–time specifications and to maintain product quality and safety. Precisely due to its regulatory importance and reliance on multivariate sensor data, pasteurization offers an ideal testbed for exploring the integration of process mining and predictive monitoring techniques. It provides a realistic and structured scenario in which to investigate how continuous signals can be transformed into interpretable process representations that support conformance analysis, traceability, and real-time forecasting.

Within the scope of PM, a process is understood as a set of coordinated activities that transform inputs into outputs under specific constraints, whereas an activity denotes a distinct operational unit characterized by a start and an end, typically associated with observable changes in stage or resources [11]. In the case of pasteurization, these activities include operational phases pragmatically inferred from sensor dynamics. The main contributions of this work are as follows:

We propose a methodological framework that bridges continuous, multivariate sensor data with the activity-oriented perspective required in process mining, supporting HACCP compliance at the CCP.
We incorporate multiple sensors (temperature, pH, conductivity, viscosity, turbidity, flow, and pressure) into a simulated setting, thereby enabling the monitoring and analysis of the HTST pasteurization process.
We demonstrate how the transformation of raw sensor streams into structured process representations facilitates the application of recent PPM techniques, enabling proactive interventions in the dairy industry.

The remainder of this paper is structured as follows: Section 2 defines the problem and outlines the research questions; Section 3 presents the theoretical background on pasteurization, HMM, and PM; Section 4 describes the proposed methodology, including the sensor configuration, activity modelling, and event log construction; Section 5 reports the experimental results and evaluates the performance of the proposed framework in both process discovery and predictive monitoring tasks; Section 6 discusses the conclusions and identifies directions for future work.

2. Problem Definition

Ensuring food safety in the dairy industry requires robust monitoring of processes such as pasteurization, where biological hazards (e.g., bacteria, viruses, parasites) must be effectively controlled [16,17]. HTST pasteurization is widely adopted because it guarantees microbial inactivation while preserving nutritional quality and extending shelf life [18]. The process relies on continuous monitoring of multivariate sensor data—temperature, pH, conductivity, viscosity, turbidity, and pressure—to ensure that the critical time–temperature combinations mandated by HACCP are satisfied.

Machine learning techniques (supervised and unsupervised) offer models for monitoring the process by learning directly from data. Supervised methods require labeled ground truth, which is rarely available in industrial practice, whereas unsupervised methods can reveal hidden patterns but often lack interpretability. HMMs offer a principled framework for bridging this gap, as they infer latent operational stages from observable sensor data streams and allow for the segmentation of continuous processes into discrete phases.

However, even when using HMMs, two challenges persist. First, the phases discovered from raw sensor data must be meaningfully aligned with regulatory concepts such as CCP. Second, data-driven monitoring must not only identify stages but also verify whether the observed process conforms to its intended design and safety constraints. PM provides a promising paradigm to address these challenges, as it connects sensor-derived data with process-based representations, enabling discovery, conformance checking, and PPM. In this context, our study focuses on the HTST pasteurization process as a representative case to investigate the integration of sensor-driven modelling with PM. Specifically, we address the following research questions:

RQ1: How can process activities be reliably identified from combined multivariate sensor data, given the absence of formally defined stage boundaries?
RQ2: What added value does PM provide over conventional threshold-based monitoring in capturing process dynamics and checking conformance (in our case, with HACCP)?
RQ3: Can PPM derived from PM enhance safety, regulatory compliance, and operational efficiency in pasteurization?

While our method applies unsupervised modelling techniques, specifically, HMMs, to discover latent process stages from multivariate sensor signals, we acknowledge that domain-specific thresholds, derived from HACCP regulations (e.g., time–temperature requirements), inform both the simulation and interpretation stages of the pipeline. This does not imply supervision in the conventional sense (i.e., the presence of labeled training data), but rather reflects the incorporation of domain knowledge to map inferred stages to interpretable process activities. In this respect, the approach may be more accurately described as unsupervised segmentation with domain-informed post-processing. This balance allows the method to remain data-driven while maintaining regulatory alignment, a necessity in safety-critical environments such as food manufacturing.

3. Background

3.1. Pasteurization

Pasteurization is a thermal treatment designed to inactivate non–spore-forming pathogenic microorganisms in milk while, as far as possible, preserving its technological and sensory properties. For milk, the classical safety benchmark is high-temperature, short-time (HTST; ≥72 °C for ≥15 s) or low-temperature, long-time (LTLT; ≥63 °C for ≥30 min), or equivalent time–temperature combinations, provided that efficacy is demonstrated by a negative alkaline phosphatase (ALP) test immediately after treatment. This framework is set out in Regulation (EC) No 853/2004 and Commission Implementing Regulation (EU) 2019/627, which also allows for validated alternatives based on HACCP where ALP is unsuitable [19,20,21].

From a microbiological perspective, the historical design organism is Coxiella burnetii, regarded as the most heat-resistant non-spore-forming pathogen in milk; a ≥5-log₁₀ reduction during pasteurization is the commonly accepted performance target reflected in Codex-aligned literature and recent kinetic studies [22].

On the quality side, under HTST conditions, nutrient and sensory impacts tend to be modest when the process is well controlled; shelf life is often limited by post-process contamination, psychrotrophic/spore-forming bacteria, and cold-chain management rather than the heating step itself [23,24]. We mention this for context; investigating post-pasteurization dynamics is beyond the scope of this work, which focuses on the thermal CCP.

In plate or tubular pasteurizers, raw milk is preheated in the regenerator, brought to the setpoint, and held in a holding tube for the minimum time; if criteria are not met, a flow-diversion valve routes product to recirculation/discharge. Approved product is then pre-cooled in the regenerator and finally cooled (typically to refrigeration temperatures, e.g., ≤6 °C). Sanitary integrity in the regenerator is maintained via positive differential pressure ($\Delta P$) with the pasteurized side > raw side; process conformity is verified by ALP-negative results and by meeting process–hygiene criteria (e.g., Enterobacteriaceae at the end of the manufacturing process for pasteurized milk) [21,25]. To ensure that these hazards and barriers are identified, controlled, and evidenced consistently, EU hygiene legislation requires a HACCP-based food safety management system.

HACCP is a preventive, risk-based system that identifies hazards, defines CCPs, sets measurable critical limits, and requires continuous monitoring, verification, and documentation [26]. In fluid-milk HTST pasteurization, the thermal step is the core CCP with legally grounded time–temperature criteria; engineering safeguards such as automatic flow diversion and maintaining positive $\Delta P$ (pasteurized > raw) protect sanitary integrity across the heat exchanger. The HACCP program comprises seven basic principles; for pasteurized milk (HTST), they can be staged as follows:

Conduct a hazard analysis. Identify biological hazards relevant to milk and to the process, with C. burnetii as the reference organism for thermal lethality.
Determine the CCPs. Define the pasteurization stage as the primary CCP considering time–temperature criteria.
Establish critical limits. Specify time–temperature combinations that ensure lethality at the CCP.
Establish corrective actions. Pre-define actions for any deviation or predicted deviation: keep product in recirculation/activate diversion, adjust heat input or residence time, segregate the affected volume, investigate root causes (including sensor faults), and document the decision before release.
Establish verification procedures. Verify that the system consistently achieves lethality and preserves sanitary barriers: routine ALP testing, calibration, and functional checks of sensors/recorders and the flow-diversion valve (FDV), verification of positive $\Delta P$, and periodic review of monitoring records and corrective actions.
Establish documentation and record-keeping. Maintain complete, tamper-evident records linking raw sensor traces, derived process stages, alarms, interventions, verification outcomes, and release decisions for each batch/lot.

In sum, pasteurization is a thermal CCP with a robust legal basis (time–temperature). Its effectiveness and final quality depend on tight control of the thermal profile and residence time, sanitary barriers in the heat exchanger ($\Delta P$), automatic diversion, rapid cooling, and cold-chain discipline. Current evidence underscores the primacy of continuous temperature control as well as post-process hygiene and cold-chain management for safety and shelf life in HTST systems, and shows how data-driven monitoring can complement HACCP to support proactive interventions [23,24].
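The thermal critical limit discussed in this section (≥72 °C for ≥15 s) can be illustrated with a minimal sketch that checks a temperature trace for a sufficiently long continuous hold. The function names, the example trace, and the 1 s sampling interval are assumptions made for illustration; this is not the monitoring logic of any particular plant or of the framework proposed here.

```python
from typing import Sequence


def longest_hold_seconds(temps_c: Sequence[float], limit_c: float = 72.0,
                         dt_s: float = 1.0) -> float:
    """Longest continuous time span (s) in which every sample is >= limit_c.

    n consecutive qualifying samples spaced dt_s apart span (n - 1) * dt_s
    seconds; this conservative reading is the one used here.
    """
    best = run = 0
    for temp in temps_c:
        run = run + 1 if temp >= limit_c else 0
        best = max(best, run)
    return max(best - 1, 0) * dt_s


def meets_htst_limit(temps_c: Sequence[float], limit_c: float = 72.0,
                     min_hold_s: float = 15.0, dt_s: float = 1.0) -> bool:
    """True if the trace contains a hold of at least min_hold_s at >= limit_c."""
    return longest_hold_seconds(temps_c, limit_c, dt_s) >= min_hold_s


# Hypothetical 1 Hz trace: warm-up, 16 samples at 72.4 °C, then cooling.
trace = [65.0] * 5 + [72.4] * 16 + [40.0] * 5
print(meets_htst_limit(trace))
```

A single sample below the limit resets the run, which is why a fixed threshold of this kind is sensitive to sensor noise around the setpoint.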

3.2. Hidden Markov Models

An HMM describes the joint probability distribution of a sequence of hidden variables and observed variables under the Markov assumption: the hidden stage at time $t$ depends only on the hidden stage at time $t - 1$, and the current observation depends only on the current hidden stage. In the proposed method, the hidden stages correspond to operational phases of the process (e.g., Idle, Fill, Heat-up, Hold, Cool, Discharge), while the observations correspond to vectors of multivariate sensor readings (temperature, pH, conductivity, viscosity, turbidity, and pressure/flow). Let $X_{t}$ be the hidden stage at time $t$, with $N$ possible values $s_{1}, \dots, s_{N}$ representing pasteurization phases. The transition probabilities $a_{ij}$ between two stages $i$ and $j$ are assumed to be time-independent, as shown in Equation (1).

$$a_{ij} = P(X_{t} = s_{j} \mid X_{t-1} = s_{i}), \qquad A = [a_{ij}] \in \mathbb{R}^{N \times N}. \tag{1}$$

where $A$ is the transition probability matrix of the system. The probability of stage $i$ at time $t$ is denoted $\pi_{i}^{(t)}$, and the initial stage distribution is $\pi_{i}^{(0)} = P(X_{1} = s_{i})$, where $\sum_{i=1}^{N} \pi_{i}^{(0)} = 1$. At each time $t$, we observe a multivariate sensor vector $Y_{t} = y_{t} \in \mathbb{R}^{d}$. The emission probability of observing $y_{t}$ given stage $s_{j}$ is modeled as a Gaussian distribution, as shown in Equation (2).

$$b_{j}(y_{t}) = P(Y_{t} = y_{t} \mid X_{t} = s_{j}) = \mathcal{N}(y_{t} \mid \mu_{j}, \Sigma_{j}). \tag{2}$$
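To make Equations (1) and (2) concrete, the following sketch evaluates the joint log-probability of one candidate stage sequence together with its observations. It assumes diagonal covariances for brevity (Equation (2) allows a full $\Sigma_{j}$), and every numerical value, as well as the two-stage setup, is hypothetical.

```python
import math


def log_gaussian_diag(y, mean, var):
    """Log density of a Gaussian with diagonal covariance (one variance per dimension)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (yi - m) ** 2 / v)
               for yi, m, v in zip(y, mean, var))


def joint_log_prob(states, obs, pi, A, means, variances):
    """log P(X_1..X_T = states, Y_1..Y_T = obs) for a Gaussian HMM."""
    first = states[0]
    logp = math.log(pi[first]) + log_gaussian_diag(obs[0], means[first], variances[first])
    for t in range(1, len(states)):
        i, j = states[t - 1], states[t]
        # One transition term a_ij and one emission term b_j(y_t) per time step.
        logp += math.log(A[i][j]) + log_gaussian_diag(obs[t], means[j], variances[j])
    return logp


# Hypothetical two-stage model on temperature only: 0 = Heat-up, 1 = Hold.
pi = [0.9, 0.1]
A = [[0.8, 0.2], [0.1, 0.9]]
means = [[62.0], [72.0]]
variances = [[9.0], [1.0]]
obs = [[60.0], [65.0], [72.0], [72.5]]

plausible = joint_log_prob([0, 0, 1, 1], obs, pi, A, means, variances)
implausible = joint_log_prob([1, 1, 0, 0], obs, pi, A, means, variances)
print(plausible > implausible)
```

The comparison shows how the model prefers stage sequences whose emission means match the readings and whose transitions are likely under $A$.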

In the proposed method, we used an unsupervised HMM, which is fitted in two steps: (i) initialization before training and (ii) re-estimation during Expectation–Maximization. In the first step, the parameters are initialized randomly subject to the following rules:

$\pi_{i}^{(0)}$: $\sum_{i=1}^{N} \pi_{i}^{(0)} = 1$;
$a_{ij}^{(0)}$: $\sum_{j=1}^{N} a_{ij}^{(0)} = 1$, $i, j = 1, \dots, N$;
$\mu_{j}^{(0)}$: sampled from the empirical distribution of the observations;
$\Sigma_{j}^{(0)} = \mathrm{diag}(\sigma_{1}^{2}, \dots, \sigma_{d}^{2})$, where $\sigma_{k}^{2}$ denotes the empirical variance of the $k$-th sensor dimension.
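The initialization rules above can be sketched as follows. The helper names, the seed, and the example readings are assumptions for illustration; the sketch draws each initial mean from the observed vectors and stores the diagonal of each initial covariance as a list of per-sensor variances.

```python
import random
import statistics


def random_stochastic_row(n, rng):
    """Random non-negative weights normalised to sum to 1."""
    weights = [rng.random() + 1e-6 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def init_hmm(obs, n_states, seed=0):
    """Random initial parameters satisfying the stated constraints."""
    rng = random.Random(seed)
    d = len(obs[0])
    pi = random_stochastic_row(n_states, rng)                           # sums to 1
    A = [random_stochastic_row(n_states, rng) for _ in range(n_states)]  # each row sums to 1
    means = [list(rng.choice(obs)) for _ in range(n_states)]            # drawn from the data
    emp_var = [statistics.pvariance([y[k] for y in obs]) for k in range(d)]
    variances = [list(emp_var) for _ in range(n_states)]                # diagonal of Sigma_j
    return pi, A, means, variances


# Hypothetical readings: (temperature in °C, pH).
readings = [[20.0, 6.7], [45.0, 6.6], [72.0, 6.6], [72.5, 6.5], [10.0, 6.7]]
pi0, A0, mu0, var0 = init_hmm(readings, n_states=3)
```

Because the starting point is random, different seeds can lead Expectation–Maximization to different local optima, so running several restarts is a common precaution.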

In the next step, the parameters $\theta = (A, \mu, \Sigma, \pi)$ are estimated using the Baum–Welch algorithm, an Expectation–Maximization procedure that alternates between computing posterior probabilities and updating the model parameters using the following equations:

$$\pi_{i}^{*} = \gamma_{i}(1), \tag{3}$$

$$a_{ij}^{*} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_{i}(t)}, \tag{4}$$

$$\mu_{j}^{*} = \frac{\sum_{t=1}^{T} \gamma_{j}(t)\, y_{t}}{\sum_{t=1}^{T} \gamma_{j}(t)}, \tag{5}$$

$$\Sigma_{j}^{*} = \frac{\sum_{t=1}^{T} \gamma_{j}(t)\, (y_{t} - \mu_{j}^{*}) (y_{t} - \mu_{j}^{*})^{\top}}{\sum_{t=1}^{T} \gamma_{j}(t)}. \tag{6}$$

where $\gamma_{i}(t)$ is the posterior probability of being in stage $s_{i}$ at time $t$, and $\xi_{ij}(t)$ is the posterior probability of transitioning from $s_{i}$ at time $t$ to $s_{j}$ at time $t + 1$, both conditioned on the full observation sequence. The iteration continues until convergence. This unsupervised approach automatically segments raw sensor data into meaningful process phases without requiring labeled annotations of stage transitions, which enables the subsequent process mining analysis.
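As a hedged illustration of the re-estimation step, the sketch below implements one Baum–Welch iteration for the special case of a single sensor ($d = 1$), so that $\Sigma_{j}$ reduces to a scalar variance. The per-step scaling, the variance floor, the function names, and the toy temperature trace are assumptions added for numerical convenience and illustration; they are not taken from the study.

```python
import math


def gauss(y, mu, var):
    """Scalar Gaussian density N(y | mu, var)."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def baum_welch_step(obs, pi, A, mu, var, var_floor=1e-3):
    """One EM iteration; returns new (pi, A, mu, var) and the log-likelihood
    of obs under the parameters that were passed in."""
    T, N = len(obs), len(pi)
    b = [[gauss(y, mu[j], var[j]) for j in range(N)] for y in obs]

    # E-step, forward pass with per-step scaling to limit underflow.
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[j] * b[0][j] for j in range(N)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * b[t][j]
                   for j in range(N)]
        c = sum(row)
        scale.append(c)
        alpha.append([v / c for v in row])

    # E-step, backward pass reusing the same scaling factors.
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * b[t + 1][j] * beta[t + 1][j]
                             for j in range(N)) / scale[t + 1]

    # Posteriors: gamma[t][i] and the time-summed xi for each pair (i, j).
    gamma = [[alpha[t][i] * beta[t][i] for i in range(N)] for t in range(T)]
    xi_sum = [[sum(alpha[t][i] * A[i][j] * b[t + 1][j] * beta[t + 1][j] / scale[t + 1]
                   for t in range(T - 1)) for j in range(N)] for i in range(N)]

    # M-step, Equations (3)-(6) specialised to d = 1.
    new_pi = list(gamma[0])
    new_A = [[xi_sum[i][j] / sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_mu, new_var = [], []
    for j in range(N):
        weight = sum(gamma[t][j] for t in range(T))
        mean = sum(gamma[t][j] * obs[t] for t in range(T)) / weight
        spread = sum(gamma[t][j] * (obs[t] - mean) ** 2 for t in range(T)) / weight
        new_mu.append(mean)
        new_var.append(max(spread, var_floor))
    loglik = sum(math.log(c) for c in scale)
    return new_pi, new_A, new_mu, new_var, loglik


# Toy temperature trace (°C): cold, hot, cold again. All values are hypothetical.
obs = [20, 21, 20, 22, 70, 72, 73, 72, 21, 20]
params = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [30.0, 60.0], [100.0, 100.0])
history = []
for _ in range(10):
    *params, ll = baum_welch_step(obs, *params)
    history.append(ll)
```

The recorded log-likelihoods should not decrease from one iteration to the next, which gives a simple convergence check; a production implementation would also guard against stages that receive no posterior mass.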

3.3. Process Mining

PM [27] represents an area of data science that integrates model-oriented and data-oriented perspectives and offers methods and techniques to investigate how business processes are actually carried out in organisations. Specifically, it aims to extract knowledge from event logs generated by various IT systems to uncover process behaviour, evaluate performance, and ultimately transform raw execution data into actionable insights for organisational improvement.

In this context, a process is a set of coordinated activities that collectively pursue a business objective. Each execution of a process is referred to as a case, which represents a specific instance of the process (e.g., a customer order or a patient treatment). A case unfolds through a sequence of activities, corresponding to the observable steps executed during the process. At a finer level of granularity, each activity can be decomposed into tasks, which are the smallest operational units, executed either by human actors or automated systems.

An event log, such as the example reported in Table 1, is a structured collection of events that records process executions. Each event must contain at least three attributes: a case identifier, to distinguish process instances and group their events; an activity, which specifies the executed step; and a timestamp, which defines the chronological order of events and enables performance-related analyses. Beyond these, additional attributes such as resources, costs, or locations may be included to enrich the analysis.

Definition 1 (Event Log [28]). An event log is a tuple $L = (E, \#, \prec)$ where:

$E \subseteq U_{ev}$ is a finite set of events;
$\# : E \to U_{map}$ is a mapping assigning attribute-value mappings to events;
$\prec \subseteq E \times E$ is a strict partial order over events.

For any $e \in E$ and $att \in dom(\#(e))$, $\#_{att}(e) = \#(e)(att)$ denotes the value of attribute $att$ for event $e$. For instance, $\#_{act}(e)$, $\#_{case}(e)$, and $\#_{time}(e)$ represent the activity, case, and timestamp of event $e$, respectively.

The ordering of events respects time, i.e., if $e_{1}, e_{2} \in E$ with $\#_{time}(e_{1}) \neq \bot$, $\#_{time}(e_{2}) \neq \bot$, and $\#_{time}(e_{1}) < \#_{time}(e_{2})$, then $e_{2} \nprec e_{1}$.
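A minimal sketch of Definition 1 follows, assuming the order over events is given by list position (a total order, which is a special case of a strict partial order) and that `None` plays the role of the undefined timestamp $\bot$. The `Event` class, the function name, and the example batch are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    case: str                  # value of the case attribute
    act: str                   # value of the activity attribute
    time: Optional[datetime]   # value of the timestamp attribute, None if undefined


def respects_time(ordered_events: Iterable[Event]) -> bool:
    """True if no event with a later timestamp precedes one with an earlier timestamp.

    Events without a timestamp are ignored, as in the definition.
    """
    stamps = [e.time for e in ordered_events if e.time is not None]
    return all(earlier <= later for earlier, later in zip(stamps, stamps[1:]))


log = [
    Event("batch-001", "Fill", datetime(2025, 1, 1, 8, 0, 0)),
    Event("batch-001", "Heat-up", None),
    Event("batch-001", "Hold", datetime(2025, 1, 1, 8, 5, 0)),
]
print(respects_time(log))
```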

The most common applications of PM include process discovery, conformance checking, process enhancement, and PPM [28]. Process discovery focuses on automatically deriving a process model directly from an event log and provides a transparent view of how processes are executed in reality without the need for prior knowledge. Conformance checking instead compares a predefined process model with the actual behaviour recorded in the event log, making it possible to detect, locate, and explain deviations, measure their severity, and even detect potential fraud. Process enhancement goes one step further by enhancing or enriching existing process models with insights from the event data, e.g., by identifying bottlenecks, analysing performance, or highlighting resource allocation patterns. Finally, PPM uses historical event logs to forecast the future behaviour of ongoing operations, for example, by predicting the next activity, estimating the remaining time to completion, or assessing the likelihood of a delay or missed deadline.
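Next-activity prediction, one of the PPM tasks mentioned above, can be illustrated with a deliberately simple frequency baseline: count which activity most often followed each activity in historical traces. This is only a sketch of the task, not the predictive model evaluated in this study, and the traces below are hypothetical.

```python
from collections import Counter, defaultdict


def fit_next_activity(traces):
    """Count, for every activity, how often each successor followed it."""
    counts = defaultdict(Counter)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, current):
    """Most frequent successor of `current`, or None if it was never followed."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


# Hypothetical historical traces; the third contains a re-heating loop.
normal = ["Idle", "Fill", "Heat-up", "Hold", "Cool", "Discharge"]
reheated = ["Idle", "Fill", "Heat-up", "Hold", "Heat-up", "Hold", "Cool", "Discharge"]
model = fit_next_activity([normal, normal, reheated])
print(predict_next(model, "Hold"))
```

Published PPM approaches typically condition on a longer prefix of the running case and on event attributes rather than on the last activity alone.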

Building on these foundations, recent research has extended the scope of PM beyond structured event logs to unstructured and sensor-generated data, where the continuous nature of the signals requires additional pre-processing and interpretation. In this context, Brzychczy et al. [12] provide a comprehensive overview of sensor-based PM by classifying sensor types, domains, and approaches and identifying two key challenges: data abstraction, i.e., transforming noisy continuous data streams into discrete events at the right level of granularity, and data contextualisation, i.e., enriching these events with semantic meaning and case associations.

Similarly, Elkodssi et al. [29] emphasise the importance of developing robust transformation pipelines capable of dealing with heterogeneous logs and propose comparative criteria—system type, log nature, transformation method, and PM domain—to evaluate existing approaches. These contributions show that the integration of sensor data with PM is not only feasible but also increasingly necessary for industrial applications where processes are characterised by high variability, overlapping signals, and strict compliance requirements.

These contributions underline that while sensor-to-event abstraction and contextualisation are essential for enabling PM on industrial data [12,29], their full potential unfolds when coupled with PPM. By exploiting historical sensor-derived events, PPM not only supports compliance in highly regulated domains but also provides early warnings of deviations, enhances product quality, and improves operational efficiency. In this way, PPM extends the value of sensor data integration from retrospective analysis to proactive decision-making, positioning it as a cornerstone for industrial process optimization and smart manufacturing [10,11].

4. Pasteurization as a Process

Pasteurization, particularly in HTST systems, is not a monolithic operation but a sequence of coordinated stages, such as filling, heating, holding, cooling, and discharging, each governed by specific physical and regulatory conditions. Modeling pasteurization as a structured process enables the application of process-aware analysis techniques, which are essential for ensuring compliance with food safety standards such as HACCP. From a data-driven perspective, this abstraction facilitates the transformation of continuous multivariate sensor streams into interpretable process states, allowing for both real-time monitoring and retrospective process mining. Recognizing pasteurization as a process, rather than merely a temperature–time function, thus provides a conceptual foundation for integrating control systems, sensor data, and predictive analytics in a unified monitoring framework.

Figure 1 presents an overview of the proposed framework, which models pasteurization as a multistage process driven by multivariate sensor data. The diagram illustrates how raw measurements from inline sensors (monitoring variables such as temperature, pH, conductivity, viscosity, turbidity, flow, and pressure) are processed through an HMM to detect process stages, align them with CCPs, and enable the construction of interpretable event logs. These logs are then used for process discovery and online predictive monitoring, including estimations of batch progression, stage durations, transition forecasting, and SLA compliance.

4.1. Expected Process Activities (Stages)

In this paper, HTST pasteurization is pragmatically described as a sequence of stages: Idle, where the system is in standby; Fill, characterized by the inflow of raw milk; Heat-up, where the product temperature increases toward the target setpoint; Hold, the CCP during which milk is maintained at a minimum of 72 °C for at least 15 s; Cool, where the product temperature is rapidly lowered to 4–10 °C; and Discharge, in which the pasteurized milk is transferred to storage or packaging. While these stages are not formally defined with strict boundaries, they can be inferred from the joint behaviour of the sensor signals. This multivariate integration provides the foundation for segmenting continuous sensor data into meaningful activities, enabling their abstraction into event logs suitable for discovery, conformance checking, and PPM for a wide range of industrial processes.
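The expected stage order described above can serve as a reference for a very simple conformance check. The sketch below only compares direct transitions and stage presence against that order; it is far simpler than alignment-based conformance checking, and the function name and example trace are hypothetical.

```python
EXPECTED = ["Idle", "Fill", "Heat-up", "Hold", "Cool", "Discharge"]


def deviations(trace, expected=EXPECTED):
    """List transitions that are not in the reference order and stages that never occur."""
    allowed = set(zip(expected, expected[1:]))
    issues = []
    for current, following in zip(trace, trace[1:]):
        if (current, following) not in allowed:
            issues.append(f"unexpected transition {current} -> {following}")
    for stage in expected:
        if stage not in trace:
            issues.append(f"missing stage {stage}")
    return issues


# Hypothetical batch in which the Hold stage (the CCP) was skipped.
skipped_hold = ["Idle", "Fill", "Heat-up", "Cool", "Discharge"]
print(deviations(skipped_hold))
```

A skipped Hold stage is reported twice, once as an illegal transition and once as a missing stage, which makes the CCP violation easy to locate in the trace.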

Figure 2 illustrates the temporal evolution of sensor signals across the different process stages. Each sensor exhibits distinct dynamics, both in relation to the other sensors and depending on the operational stage. Among them, $Q_{out}$, $Q_{in}$, and $T$ show partially comparable patterns, though with notable differences. Before the onset of certain stages, the temperature remains stable; once the stage is entered, however, specific dynamics become evident.

For $Q_{in}$ and $T$, the temperature decreases after the corresponding stage is reached and then stabilizes during the subsequent phases. In particular, $Q_{in}$ increases during the Heat-up stage, with minor oscillations, before dropping to its baseline level. Conversely, $T$ rises continuously from Heat-up through Hold, and subsequently decreases, returning to its initial range.

The behaviour of $Q_{out}$ is more irregular. Its signal remains almost constant throughout most stages, except for a sharp increase at the end of Cool. During Discharge, $Q_{out}$ alternates between increases and decreases, showing higher variability compared to the other flow-related sensors.

Sensors $P$ and $pH$ follow similar trajectories, characterized by persistent oscillations across all stages. Sensor $K$ also presents oscillatory dynamics, although with slightly reduced intensity. A distinct deviation occurs at the end of Fill, where a sharp decrease is followed by an increase during Heat-up, after which $K$ resumes its general oscillatory pattern.

Finally, sensors $\mu$ and $\tau$ display differentiated behaviours. Sensor $\mu$ combines mild oscillations with a gradual decrease starting midway through Heat-up and continuing into Hold; toward the end of this stage, the signal rises again and stabilizes during Discharge. In contrast, $\tau$ remains stable until the end of Fill, then increases and maintains a relatively constant level throughout Hold and Cool, before steadily decreasing during Discharge.

In summary, the joint analysis of these sensor signals highlights the heterogeneous yet complementary dynamics that characterize each stage of pasteurization, providing the basis for reliable stage segmentation.

4.2. Sensors

The pasteurization process can be monitored using six inline sensors integrated into the tank system [4]. In our study, each sensor provided continuous measurements sampled at a fixed interval of $\Delta t = 1$ s, corresponding to a frequency of 1 Hz. These signals, as Table 2 shows, capture complementary aspects of the process, enabling the recognition of operational stages and their subsequent conversion into activity-based representations for process mining.

Each of these signals exhibits distinct temporal patterns that correspond to specific process stages. Temperature and viscosity capture thermal dynamics, pH and conductivity provide information on chemical stability, turbidity reflects transitions between product and water interfaces, while flow and pressure characterize hydraulic activity. Taken together, these sensors provide a multivariate perspective sufficient to delineate the key operational stages of HTST pasteurization.

4.3. Sensor-to-Activity Mapping

As mentioned above, each sensor signal exhibits characteristic ranges and dynamics that correspond to specific process activities. Activities were not defined purely on absolute values but on transitions between regimes. These transitions form the basis for discrete event generation, each associated with a start and an end timestamp. While acceptance bands capture domain knowledge, they are sensitive to noise and small fluctuations. To ensure robust segmentation, we integrated HMMs as probabilistic models that infer hidden operational stages from noisy multivariate signals.

As a result, continuous multivariate signals are probabilistically segmented into discrete activities. The output of this step is a sequence of timestamped activity labels that form the basis of the event log construction in Section 4.4.
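To make the segmentation step concrete, the following is a minimal sketch of Viterbi decoding for a one-dimensional Gaussian HMM. It is not the implementation used in the study: the three states, their means and standard deviations, the left-to-right transition matrix, and the temperature readings are hand-set illustrative assumptions, whereas the framework described here learns its parameters from multivariate features.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """log(p), with log(0) mapped to -inf so forbidden transitions stay forbidden."""
    return math.log(p) if p > 0 else NEG_INF

def log_gauss(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def viterbi(obs, states, log_pi, log_A, means, stds):
    """Most likely hidden-state path for a 1-D Gaussian HMM (log domain)."""
    # delta[s]: best log-probability of any path that ends in state s
    delta = {s: log_pi[s] + log_gauss(obs[0], means[s], stds[s]) for s in states}
    back = []  # one dict per step: best predecessor of each state
    for x in obs[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] + log_A[r][s])
            new_delta[s] = delta[prev] + log_A[prev][s] + log_gauss(x, means[s], stds[s])
            pointers[s] = prev
        delta = new_delta
        back.append(pointers)
    # Backtrack from the best final state
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Illustrative (assumed) parameters: temperature-only emissions in °C
states = ["Heat-up", "Hold", "Cool"]
means = {"Heat-up": 50.0, "Hold": 72.0, "Cool": 20.0}
stds = {"Heat-up": 10.0, "Hold": 0.5, "Cool": 10.0}
pi = {"Heat-up": 1.0, "Hold": 0.0, "Cool": 0.0}
A = {
    "Heat-up": {"Heat-up": 0.9, "Hold": 0.1, "Cool": 0.0},
    "Hold": {"Heat-up": 0.0, "Hold": 0.9, "Cool": 0.1},
    "Cool": {"Heat-up": 0.0, "Hold": 0.0, "Cool": 1.0},
}
log_pi = {s: safe_log(pi[s]) for s in states}
log_A = {r: {s: safe_log(A[r][s]) for s in states} for r in states}

temps = [35.0, 50.0, 65.0, 72.0, 72.1, 71.9, 40.0, 20.0, 8.0]
path = viterbi(temps, states, log_pi, log_A, means, stds)
print(path)
```

The narrow Hold emission (standard deviation of 0.5 °C in this sketch) is what lets the decoder separate the plateau from the ramps on either side, while the transition structure keeps isolated noisy samples from flipping the label.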

4.4. Event Log Construction

The output of the sensor-to-activity mapping is a sequence of hidden stages $x_{1:T}^{*}$, where each stage corresponds to a process activity $a \in A$. To integrate this representation with PM techniques, the sequence must be converted into an event log (L), which is a collection of cases composed of temporally ordered events with associated attributes. In our study, the case is the batch ($b \in B$) of milk processed through the system. Each batch constitutes one process instance (or trace), with its corresponding sequence of activities. An event $e_{i} = (a_{i}, t_{i}^{start}, t_{i}^{end})$ is generated whenever an activity ends. This transformation converts contiguous time steps of identical hidden stages into single event instances, thereby aligning with the activity-based abstraction of process mining.
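The collapse of contiguous identical stage labels into events can be sketched as follows. The function name, the dictionary keys, and the convention that an event's end equals its start plus the run length times the sampling interval are assumptions made for illustration; the labels are toy values.

```python
from itertools import groupby

def labels_to_events(batch_id, labels, t0=0, dt=1):
    """Collapse per-sample stage labels into one event per contiguous run.

    Timestamps are expressed in seconds from t0; dt is the sampling interval.
    """
    events, t = [], t0
    for activity, run in groupby(labels):
        n = len(list(run))  # number of consecutive samples with this label
        events.append({
            "case_id": batch_id,
            "activity": activity,
            "start": t,
            "end": t + n * dt,
        })
        t += n * dt
    return events

labels = ["Idle"] * 3 + ["Fill"] * 4 + ["Heat-up"] * 5 + ["Hold"] * 2
events = labels_to_events("batch_001", labels)
for e in events:
    print(e)
```

Each resulting dictionary corresponds to one row of an event log, with the batch identifier as the case and the start and end times delimiting the activity.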

4.5. Process Discovery and Monitoring

PM enables the extraction of structured process knowledge from event data, offering a framework for analysing the behaviour of physical processes captured through sensors. In this work, we apply the two primary tasks of PM: Process Discovery and PPM, as defined in [28], to map sensor data into interpretable process models and derive regulatory and operational insights.

Event logs derived from multivariate sensor signals are used to reconstruct the underlying workflow of the pasteurization process. Using unsupervised activity recognition, we segment raw data streams into discrete stages—such as Fill, Heat-up, Hold and Cool—which are then treated as process activities.

Process discovery is the first step of PM and is responsible for transforming event logs, consisting of case identifiers, activity labels, and timestamps, into data-driven models that capture the actual execution of processes. In our case study, the event logs derived from sensor signals are analysed using discovery algorithms to create process models that reflect the observed batch behaviour of pasteurisation. These models, expressed through notations such as Petri nets or Business Process Model and Notation (BPMN) diagrams, enable transparent visualisation of workflows, highlighting variability between different executions and revealing inefficiencies or deviations that are not apparent from the raw data. By making complex operations easier to interpret, they enable a clearer understanding of process dynamics and support both compliance checking and process optimisation, which is in line with the principles of process mining [28].

PPM, in contrast, extends traditional process mining by forecasting future process behaviour from partially executed traces [30]. The ability to foresee future events, the remaining time, or the outcome of a case can be crucially important during decision-making in industrial processes. PPM builds models from previous executions and applies them to ongoing cases in real time. Conceptually, the workflow unfolds in two phases. In the model construction phase, a predictive model is learned from a historical event log containing completed cases. In the model application phase, the trained model is deployed on running, previously unseen cases to generate predictions in real time. These predictions can then be presented to operators and decision-makers, enabling proactive interventions in accordance with regulatory protocols (such as HACCP). Time-related prediction is particularly relevant for pasteurization. It includes estimating the timestamp of the next activity transition, the completion time of a case, or the remaining duration of an ongoing stage. Such time-aware monitoring enables proactive actions.

To implement PPM, we employed SkPM¹ (Scikit-learn Extension for Process Mining), which integrates event log–based process representations with machine learning pipelines from the scikit-learn ecosystem [31]. In this study, we produced three predictions using SkPM, explained in the following. For training and evaluation, 80% of the dataset was used for learning and 20% for testing.

Remaining Time of the Batch: This task estimates the remaining duration until the batch is completed. For a partial trace $\sigma_{1:k}$ of a batch b, the remaining time prediction is computed using Equation (7):

$$\hat{T}_{rem}(b, k) = \hat{T}_{case}(b) - T_{elapsed}(b, k), \tag{7}$$

where $\hat{T}_{case}(b)$ is the total predicted duration of the batch and $T_{elapsed}(b, k)$ is the observed execution time of the prefix.

Remaining Time in the Current Stage: This task estimates the time remaining until the ongoing activity completes. For an ongoing activity a that started at $t^{start}$, the goal is to predict the value of Equation (8):

$$\hat{T}_{rem}(a, t) = E[T(a)] - (t - t^{start}), \tag{8}$$

where $E[T(a)]$ is the expected duration of the activity learned from historical logs. This is particularly important for monitoring the Hold stage to ensure HACCP compliance.

Next Transition Timestamp: The timestamp of the next activity change is defined as:

$$\hat{t}_{next} = t + \hat{T}_{rem}(a, t). \tag{9}$$

This enables forecasting the precise moment when the process will move from one stage (e.g., Heat-up) to the next (e.g., Hold), providing operators with an anticipatory signal of upcoming stage changes.
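As a worked illustration of the three time-related targets, the sketch below evaluates the formulas for the batch remaining time, the activity remaining time, and the next transition timestamp on made-up numbers. The function names and all numeric values (a 325 s predicted batch duration, a 20 s expected Hold duration, and the timestamps) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def remaining_batch_time(pred_case_duration, elapsed):
    """Batch remaining time: predicted total duration minus elapsed prefix time."""
    return pred_case_duration - elapsed

def remaining_activity_time(expected_duration, t, t_start):
    """Activity remaining time: expected duration minus time already spent in it."""
    return expected_duration - (t - t_start)

def next_transition_timestamp(t, remaining):
    """Next transition: current time plus the activity's remaining time."""
    return t + remaining

# Hypothetical situation: a batch is 140 s into its run, and Hold began at t = 130 s
t_now = 140.0
batch_rem = remaining_batch_time(pred_case_duration=325.0, elapsed=140.0)
hold_rem = remaining_activity_time(expected_duration=20.0, t=t_now, t_start=130.0)
t_next = next_transition_timestamp(t_now, hold_rem)

print(batch_rem, hold_rem, t_next)
```

In practice the predicted batch duration and the expected activity duration would come from a trained model rather than from constants; the sketch only shows how the quantities relate to one another.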

5. Experimental Results

5.1. Synthetic Sensor Data Generation

Developing and validating process mining frameworks for industrial applications often faces practical constraints due to the limited availability of high-quality, fully annotated real-world datasets. These limitations are particularly evident in safety-critical domains such as food processing, where data access is restricted for confidentiality and traceability reasons, and manually labeled ground truth is seldom available due to the high cost and operational disruption required for data annotation.

To overcome these challenges, we constructed a synthetic dataset grounded in domain knowledge and supported by validated literature on HTST pasteurization dynamics [4,15]. This simulation-based approach enables reproducible experimentation in a controlled, parameterized environment that mimics real operational complexity—including multivariate sensor behavior, activity transitions, and regulatory thresholds. By modeling realistic temperature, pH, conductivity, viscosity, turbidity, flow, and pressure signals over full batch cycles, we ensured that the synthetic data reflects not only nominal behavior but also variability and overlaps typical of industrial settings.

Such an approach is common in the study of process mining from sensor data (e.g., [12,29]), where synthetic traces serve both to test methodological assumptions and to provide a benchmark for evaluating segmentation and prediction techniques. Moreover, it allows researchers to include additional variables and controlled anomalies that may not be simultaneously accessible in a single installation, thereby supporting generalization across multiple monitoring scenarios.

To evaluate the proposed approach, the synthetic dataset replicates the batch-based operation of an HTST pasteurizer. The simulator models production cycles consisting of the stages Idle, Fill, Heat-up, Hold, Cool, and Discharge. Each stage is associated with specific operational signatures in terms of process variables, including temperature (°C), pH, electrical conductivity (mS/cm), flow rate (L/min), pressure (bar), and viscosity (Pa·s). Figure 3 illustrates the time series of all sensors, together with the mean behaviour and the standard deviation.

The dataset is generated at a temporal resolution of one second, yielding multivariate time series for each batch. A total of 1000 production batches were generated, resulting in approximately 90 h of operation, 324,514 rows, and 2,569,112 sensor readings across all variables. For unsupervised HMM, 60% of the batches were used to estimate the model parameters, while the remaining 40% were reserved for independent evaluation against ground truth labels. From this 40% evaluation subset, we further split 80/20 for predictive PPM.
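The nested split described above can be sketched at the batch level as follows. Splitting by whole batches (rather than by rows) for the 80/20 step, the shuffling seeds, and the batch identifier format are assumptions of this sketch, not details confirmed by the study.

```python
import random

def split_batches(batch_ids, frac, seed=0):
    """Shuffle batch identifiers and split them into two disjoint groups."""
    ids = list(batch_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(len(ids) * frac))
    return ids[:cut], ids[cut:]

batches = [f"batch_{i:04d}" for i in range(1000)]

# 60% of batches to fit the HMM, 40% held out for evaluation
hmm_train, evaluation = split_batches(batches, 0.60)

# The held-out 40% is split again 80/20 for the predictive models
ppm_train, ppm_test = split_batches(evaluation, 0.80, seed=1)

print(len(hmm_train), len(evaluation), len(ppm_train), len(ppm_test))
```

Keeping every sample of a batch on the same side of a split avoids leaking within-batch temporal correlation from training into test data.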

5.2. From Sensors to Process Discovery (Goal 1 – RQ1)

To address the first research question, we applied an unsupervised HMM to the multivariate dataset, mapping hidden stages to the six operational stages. In the implemented unsupervised HMM, the parameters ($\theta = (A, \mu, \sigma, \pi)$) are initialized randomly and learned entirely from the unlabeled sequence data via the Expectation-Maximization algorithm. The base input signals to the HMM included temperature (T), electrical conductivity ($\kappa$), and the inflow/outflow rates ($Q_{in}$, $Q_{out}$). In the next step, the normalized values of T and $Q_{in}$, and the logarithmic flow ratio $\log(1 + Q_{in}) - \log(1 + Q_{out})$, were introduced to capture nonlinear dependencies between thermal and hydraulic behaviour. From these, we extracted temporal derivatives ($\frac{dT}{dt}$ and $\frac{d^{2}T}{dt^{2}}$) to capture heating and cooling dynamics, which are helpful to distinguish between the Heat-up, Hold, and Cool stages. In the next step, phase-specific indicators were constructed to emphasize stability near the critical threshold of the Hold stage, while binary indicators such as strong heating ($dT/dt > 0.1$) and heating phase ($dT/dt > 0.05$ and $T < 72$) explicitly encoded signatures of the Heat-up phase. Rolling-window statistics (short-term standard deviation of T) and positional features within each batch were also included to capture local variability and process progression.
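A subset of the features described above can be sketched in plain Python. The backward-difference approximation of the derivatives, the zero-padding at the start of the series, the feature names, and the toy signal values are assumptions of this sketch; normalization, rolling statistics, and positional features are omitted for brevity.

```python
import math

def stage_features(T, q_in, q_out, dt=1.0):
    """Per-sample features from temperature (°C) and inflow/outflow rates."""
    rows = []
    for i in range(len(T)):
        # Backward differences; set to zero where no history is available
        dT = (T[i] - T[i - 1]) / dt if i >= 1 else 0.0
        dT_prev = (T[i - 1] - T[i - 2]) / dt if i >= 2 else 0.0
        d2T = (dT - dT_prev) / dt if i >= 2 else 0.0
        rows.append({
            "dT_dt": dT,
            "d2T_dt2": d2T,
            "log_flow_ratio": math.log(1 + q_in[i]) - math.log(1 + q_out[i]),
            "strong_heating": dT > 0.1,
            "heating_phase": dT > 0.05 and T[i] < 72,
        })
    return rows

# Toy signals: a short ramp from 20 °C that ends on a 72 °C plateau
T = [20.0, 20.0, 21.0, 23.0, 72.0, 72.0]
q_in = [0.0, 5.0, 5.0, 5.0, 5.0, 5.0]
q_out = [0.0, 0.0, 0.0, 0.0, 5.0, 5.0]

feats = stage_features(T, q_in, q_out)
for row in feats:
    print(row)
```

On the plateau the heating-phase indicator switches off because the temperature is no longer below 72, even when the preceding step was steep, which is the kind of signature that helps separate Heat-up from Hold.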

The model achieved an overall accuracy of 97%, with macro-averaged precision, recall, and F1-score around 0.94–0.95, demonstrating the ability of the HMM to reliably segment continuous signals into discrete activities. Table 3 reports the performance metrics obtained in the different stages of the test set.

Idle, Fill, and Discharge were classified with very high agreement. Both Heat-up and Cool were classified with strong precision but slightly reduced recall, reflecting their partial overlap with Hold. The Hold stage itself achieved perfect recall (1.0) but lower precision (0.66), since some Heat-up and Cool intervals were conservatively assigned to Hold. Importantly, this conservative misclassification ensures that true Hold intervals were never missed, aligning with HACCP's safety-critical requirement.
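The combination of perfect recall and reduced precision for Hold can be reproduced on a toy example. The label sequences below are invented for illustration and do not come from the study's test set; they only show how labelling a few neighbouring Heat-up and Cool samples as Hold leaves Hold recall at 1.0 while lowering its precision.

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall from two aligned label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: the predicted Hold window is wider than the true one on both sides
y_true = ["Heat-up"] * 4 + ["Hold"] * 2 + ["Cool"] * 4
y_pred = ["Heat-up"] * 3 + ["Hold"] * 4 + ["Cool"] * 3

hold_p, hold_r = precision_recall(y_true, y_pred, "Hold")
heat_p, heat_r = precision_recall(y_true, y_pred, "Heat-up")
print(hold_p, hold_r, heat_p, heat_r)
```

The same widening that costs Hold its precision is what costs Heat-up and Cool part of their recall, which matches the pattern reported for the neighbouring stages.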

Figure 4 presents the confusion matrix, which makes these patterns explicit. Misclassifications are primarily concentrated around the transitions into and out of the Hold phase, where temperature plateaus at approximately 72 °C and viscosity stabilizes. In contrast, Idle, Fill, and Discharge show virtually no confusion, since their signatures in flow and turbidity are distinct and unambiguous.

Taken together, these results demonstrate that the HMM can reliably segment continuous multivariate sensor streams into discrete pasteurization activities, even in the absence of formally defined stage boundaries, thus positively answering RQ1.

Once the continuous space is segmented and the corresponding time intervals of change are identified, each interval can be associated with a specific activity, thereby enabling the construction of the sensor-based event log. In line with the standard definition of PM, we obtain an event log in which each batch is assigned a unique Case ID. The activities discovered through the HMM are recorded in the Activity column, and the corresponding time intervals are stored in the Timestamp columns. Table 4 presents an illustrative portion of the resulting event log.

A closer inspection of Table 4 reveals two timestamp columns. This representation remains consistent with the standard event log definition in PM; the explicit distinction between start and end timestamps is necessary to support the application of PPM techniques, which require precise knowledge of activity durations. The Activity column contains the stage detected by the HMM.

As previously outlined, the event log obtained from sensor data is used as input for process discovery to derive an interpretable process model. Among the available algorithms, we adopted the Directly-Follows Graph (DFG), a directed graph that represents sequential relationships between activities. In this representation, nodes correspond to activities, edges denote the directly-following relations observed in the log, and edge weights capture their frequency, thus highlighting the most common paths and exposing potential bottlenecks or inefficiencies.

Figure 5 shows the discovered DFG for the pasteurization process. The six rectangles represent the identified activities, which correspond to the six operational stages of the process. The start and end points follow standard DFG notation: a single circle with a central dot for the start, and two concentric circles enclosing a rectangle for the end. Weighted edges illustrate the execution frequencies of activity transitions.

In this case, transition frequencies are largely uniform (except for a single edge), indicating that the workflow is stable and balanced, without evident bottlenecks. Most cases follow the expected path, beginning with Idle and ending with Discharge. However, five cases deviate, starting with Heat-up, followed by Idle and Fill, before returning to Heat-up and continuing along the same trajectory as the majority. This behaviour suggests that the sensor captured a temperature pattern distinct from the standard Idle stage, pointing to a possible anomaly or variation in the initial process conditions.
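Counting directly-follows relations is simple enough to sketch without a process mining library. The traces below are toy data: three batches on the expected path and one that starts with Heat-up before returning to Idle, mimicking the kind of deviation described above. The function name is an assumption of this sketch.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often each activity is directly followed by another."""
    dfg = Counter()
    for trace in traces:
        dfg.update(zip(trace, trace[1:]))
    return dfg

normal = ["Idle", "Fill", "Heat-up", "Hold", "Cool", "Discharge"]
deviating = ["Heat-up", "Idle", "Fill", "Heat-up", "Hold", "Cool", "Discharge"]

dfg = directly_follows([normal, normal, normal, deviating])
for (source, target), count in sorted(dfg.items(), key=lambda kv: -kv[1]):
    print(f"{source} -> {target}: {count}")
```

The single low-frequency edge from Heat-up to Idle stands out against the otherwise uniform counts, which is how an unusual start condition becomes visible in the graph.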

5.3. Monitoring Pasteurization Using the Discovered Model

Conventional threshold-based monitoring methods rely on predefined static limits to enforce compliance with CCPs. In contrast, process mining leverages multivariate sensor data and event logs to capture the temporal dependencies between activities, track compliance across the entire workflow, and support predictive foresight. To demonstrate this added value, we applied PPM using SkPM (Section 4.5) to event logs derived from the HMM-based activity segmentation. In the predictive component, a Random Forest Regressor (RF) was employed to train a separate model for each of the following tasks: (i) estimating the remaining time of a batch, (ii) estimating the remaining duration of the current activity, and (iii) predicting the timestamp of the next activity transition. All models were trained using the default hyperparameters provided by the scikit-learn implementation.

5.3.1. Remaining Time of the Batch

The first task estimates the remaining duration of an entire pasteurization batch given a partially executed trace. The RF model achieved an $R^{2}$ score of 0.94 for this task, with a Mean Absolute Error (MAE) of 15.49 s and a Root Mean Square Error (RMSE) of 19.58 s. Considering that batch durations are typically on the order of several minutes, this level of accuracy is sufficient to anticipate completion times and adjust downstream operations such as storage preparation or packaging.
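For readers less familiar with the three evaluation metrics, the sketch below computes MAE, RMSE, and the coefficient of determination from scratch using their standard definitions. The remaining-time values are invented for illustration and are unrelated to the reported results.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two aligned sequences of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

# Toy remaining times in seconds (true vs. predicted)
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae, rmse, r2)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE; a large gap between the two indicates that a few predictions are far off even if most are close.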

To further analyse prediction quality, we evaluated the remaining batch time predictions conditioned on current activity (Table 5).

Results show that the model achieved consistent accuracy across all stages, with MAE between 6.24 s and 18.94 s, corresponding to relative mean errors below 12%. These results confirm that the predicted average remaining times were always close to the actual values.

5.3.2. Remaining Time of the Current Activity

The second task focuses on predicting the residual time of the ongoing activity. This is particularly critical for the Hold stage to ensure microbial safety. The model achieved an $R^{2}$ score of 0.96, with an MAE of only 3.76 s and an RMSE of 5.45 s. The results of predicting the remaining time of the current activity for different activities are presented in Table 6.

Based on the results, the RF model achieved high accuracy across all stages, with MAE values ranging from 0.43 s (Idle) to 5.94 s (Discharge). Relative mean errors remain below 10% for all activities except Heat-up and Cool, where the rapid temperature dynamics increase variability. These results demonstrate that predictive monitoring can reliably anticipate the completion time of each activity with sub-second to a few-second precision, enabling proactive assurance that critical stages such as Hold satisfy HACCP time requirements.

5.3.3. Next Activity Transition Timestamp

The final task involved predicting the exact time at which the process transitions from the current stage to the next. The model achieved highly accurate results, with an $R^{2}$ score of 0.99, an MAE of 0.41 s, and an RMSE of 0.96 s. This performance enables operators to anticipate critical transitions (e.g., from Heat-up to Hold or from Cool to Discharge) almost in real time. The results for predicting the timestamp of the next activity transition are shown in Table 7.

Based on the results, the model achieved near-real-time accuracy across all transitions, with MAE values ranging from 0.20 s for the Idle to Fill transition to 1.92 s for the Heat-up to Hold transition, enabling operators to synchronize downstream tasks almost perfectly with actual process execution.

The summary of the results is shown in Table 8. Taken together, they show that PM not only enforces CCPs but also integrates predictive capabilities that conventional threshold monitoring cannot provide.

5.3.4. Explaining Monitoring Prediction

As the last step, to provide transparency and explainability, and to allow process experts to validate whether the model bases its predictions on semantically meaningful process variables, we applied Decision Predicate Graph (DPG) analysis to the RF models. DPG is a graph-based representation that summarizes the logical decision boundaries extracted from ensemble models [32]. Table 9 presents the key results of the DPG analysis.

Based on the results, converting sensor data into a process and implementing PPM yields three kinds of interpretive output: important information, most used features, and decision boundaries. The important information reveals which attributes are first used by the model to guide predictions. The DPG analysis report revealed that the most influential nodes in prediction included the activity_sequence, minute, and event_count features. The most used features section further confirms that the model relied primarily on cumulative process time and execution length to provide a prediction. Finally, the extracted decision boundaries provide concrete thresholds used by the model. For instance, the model often splits on duration_seconds values between 35 and 120 s and on event_count values in a similar range, reflecting the temporal structure of pasteurization batches. Likewise, stage-specific thresholds such as activity_HeatUp ≤ 0.25 or activity_Hold ≤ 0.14 are decision thresholds frequently used by the RF as cut-off points (not rule confidences) and capture the transition signatures between heating, holding, and cooling.

5.4. Predictive Monitoring for Safety, Compliance, and Efficiency (Goal 3 – RQ3)

In our context, two key predictive tasks were implemented: next activity prediction and remaining time estimation. These tasks allow the system to forecast not only the forthcoming operational stage but also the time until its occurrence or completion.

From a safety standpoint, the ability to predict the remaining duration of the Hold stage helps ensure that the legal time–temperature limit for the thermal CCP will be met, enabling pre-deviation interventions when a shortfall is forecast. This contributes to the proactive enforcement of CCPs, reducing the risk of under-processing and contamination.

From a compliance perspective, PPM acts as a complement to conformance checking by identifying potential violations before they manifest. Anticipating abnormal deviations—such as an unexpectedly short Hold duration or skipped transitions—enables early interventions, thereby supporting traceability and regulatory accountability.

From an efficiency viewpoint, accurate estimates of the time to next activity transitions (e.g., from Cool to Discharge) allow for optimized scheduling of downstream operations such as cleaning, packaging, or storage allocation. This reduces idle time, mitigates bottlenecks, and supports better resource planning. These capabilities reinforce food safety, regulatory conformance, and operational performance, thereby addressing RQ3 and demonstrating the added value of our approach.

Building on the foundational elements discussed in Section 3.1 (namely the legal time–temperature requirements, FDV operation, and the maintenance of a positive pressure differential, $\Delta P$), we demonstrate how PPM can be effectively operationalized within a HACCP framework to support real-time decision-making and regulatory compliance.

5.4.1. Critical Control Point (CCP) Guardrails

When the predicted residual duration of the Hold phase—maintained at or above the validated temperature setpoint—approaches the legally mandated minimum (e.g., 15 s at 72 °C), the system proactively issues an early warning. This pre-deviation alert allows operators to take timely corrective measures, such as diverting or recirculating the product flow, or adjusting the heating profile to reestablish compliance. Conversely, if forecasts indicate unnecessary overprocessing, the system can issue symmetric notifications to optimize energy efficiency and preserve product quality.
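The guardrail logic can be sketched as a simple classification of the forecast Hold duration against the legal minimum. The function name, the 2 s warning margin, the 10 s overprocessing allowance, and the message strings are hypothetical choices for illustration; only the 15 s minimum is taken from the example in the text.

```python
def hold_guardrail(predicted_hold_s, min_hold_s=15.0, margin_s=2.0, over_s=10.0):
    """Classify a forecast Hold duration (seconds at or above the setpoint).

    margin_s and over_s are illustrative tuning values, not regulatory limits.
    """
    if predicted_hold_s < min_hold_s:
        return "ALERT: forecast shortfall"
    if predicted_hold_s < min_hold_s + margin_s:
        return "WARNING: approaching minimum"
    if predicted_hold_s > min_hold_s + over_s:
        return "NOTICE: possible overprocessing"
    return "OK"

for forecast in (14.0, 16.0, 20.0, 30.0):
    print(forecast, "->", hold_guardrail(forecast))
```

In a deployed system the margin would have to reflect the prediction error of the model, so that a forecast inside the margin is treated as uncertain rather than as compliant.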

5.4.2. Diversion Assistance and Delta P Integrity

Predictive models also assess the likelihood of sanitary barrier failures by analysing FDV behaviour and pressure differential ($\Delta P$) dynamics. A declining or negative $\Delta P$ signal, for instance, may indicate a potential cross-contamination risk between pasteurized and raw milk streams. In such cases, the system can preemptively initiate product recirculation and delay discharge to prevent non-compliant product from entering downstream processing or distribution.
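A rule-based version of this check can be sketched as follows. It is a simplification: it inspects recent pressure-differential readings directly rather than using a predictive model, and the function name, the five-sample window, the return labels, and the readings are all hypothetical.

```python
def delta_p_risk(delta_p, window=5):
    """Flag a non-positive or steadily declining pressure differential.

    delta_p is a non-empty list of recent readings, with the newest last.
    """
    recent = delta_p[-window:]
    if recent[-1] <= 0:
        return "recirculate"  # barrier lost: hold back discharge
    declining = len(recent) >= 2 and all(b < a for a, b in zip(recent, recent[1:]))
    if declining:
        return "watch"  # still positive, but falling at every step
    return "ok"

print(delta_p_risk([0.5, 0.4, 0.3, 0.2, 0.1]))
print(delta_p_risk([0.5, 0.2, -0.1]))
print(delta_p_risk([0.5, 0.5, 0.6, 0.5, 0.6]))
```

A strictly monotone test like this is sensitive to noise; a real implementation would more plausibly fit a trend over the window or forecast the time at which the differential crosses zero.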

5.4.3. Traceability and Auditability (HACCP Principles 4–7)

All predictive inferences, alarms, operator responses, and automated interventions are recorded in structured event logs, which are securely time-stamped and tamper-evident. This logging infrastructure supports multiple HACCP principles: monitoring (Principle 4), corrective action (Principle 5), verification (Principle 6), and record-keeping (Principle 7). Moreover, by linking predictive analytics with conformance checking, the system enhances the traceability and auditability of process executions, facilitating both internal reviews and external inspections.

5.4.4. Limitations

While the proposed framework demonstrates the feasibility and advantages of applying PM to multivariate sensor data in pasteurization, certain limitations must be acknowledged. First, the approach assumes that sensor data are well-calibrated and consistent across batches. In real-world industrial environments, however, sensor drift, noise, and calibration errors can introduce variability that may impact the accuracy of activity recognition and predictive models. Mitigating these effects would require periodic recalibration procedures, signal-preprocessing techniques, or the incorporation of sensor confidence measures.

Second, the generalizability of the framework to other processes or installations depends on the availability and comparability of sensor configurations and process semantics. Although the simulation reflects realistic HTST dynamics based on literature and HACCP requirements, transferability to other equipment, product types, or regulatory regimes may require retraining of models and adaptation of activity definitions. Extending the framework to work across heterogeneous settings—possibly through domain adaptation or transfer learning—remains a promising direction for future work.

Finally, while synthetic data allow for controlled experimentation and reproducibility, they may not fully capture the complexity, noise characteristics, or unexpected behavior present in real industrial environments, which could affect the performance of the proposed methods when deployed in practice.

6. Conclusions and Future Work

In this work, we proposed a methodological framework for bridging multiple continuous sensor streams with activity-oriented PM in the context of HTST pasteurization. The results show that process activities can be reliably identified from multivariate sensor integration, achieving an overall accuracy of 97% with HMMs. The model successfully captured activity boundaries even in the absence of formally defined stage transitions. In the next step, we demonstrated that PM captures complex multivariate dynamics, verifies conformance across the full workflow, and enables predictive foresight. Using SkPM, we achieved accurate predictions for batch remaining time, activity remaining time, and next transition timestamp, illustrating how PPM enriches compliance verification. As a result, we explained how PPM enhances safety by ensuring that HACCP critical limits are consistently met, supports compliance through early warnings of deviations, and improves efficiency by enabling proactive scheduling of downstream operations.

While the study was based on a synthetic dataset, it provided a controlled environment to validate the proposed framework. Future work will focus on the integration of real-time streaming PM, enabling online monitoring and early intervention in live systems. Furthermore, combining PPM with causal explainability methods would provide operators with not only forecasts but also interpretable root causes of deviations. Finally, the methodology can be generalized to other food safety-critical processes beyond pasteurization, supporting broader adoption of PM in the food industry.

Author Contributions

Conceptualization: A.M., A.P.A.d.C.B., S.B.J., I.M.G.; methodology: A.M., A.P.A.d.C.B., I.M.G., S.B.J.; software: A.M., A.P.A.d.C.B., I.M.G., S.B.J.; writing—original draft preparation: A.M., A.P.A.d.C.B., S.B.J., I.M.G.; supervision: D.F.B. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge financial support under the National Recovery and Resilience Plan (NRRP), M4C2I1.1, funded by the European Union–NextGenerationEU–Project Title aRtificial intElligence for Process Analytics (REPA)–Grant Assignment Decree No. 2022CJWPNA by the Italian Ministry of University and Research (MUR). This research was also carried out in the framework of the PORTRAIT project (Port to Rail Digital Twin in the Adriatic Region), funded under the PR FESR 2021–2027, Action A1.1.2; the funding was awarded through the DGR 784/2023 of the FVG Region (Italy).

Data Availability Statement

The dataset is available on Zenodo with the DOI: https://doi.org/10.5281/zenodo.17412725, accessed on 19 October 2025. The latest version of the data can also be accessed via the GitHub repository: https://github.com/azinmoradbeikie/PM_HACCP_PASTEURIZATION, accessed on 19 October 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

Note

1	https://github.com/raseidi/skpm?tab=readme-ov-file; accessed on 22 October 2025.

References

1. García, A. Methodology and Tools to Integrate Industry 4.0 Cyber–Physical Systems into Process Design and Management: The ISA-88 Use Case. Information 2022, 13, 226.
2. Nenzi, L.; Bartocci, E.; Bortolussi, L.; Loreti, M. A Logic for Monitoring Dynamic Networks of Spatially-distributed Cyber-Physical Systems. arXiv 2021, arXiv:2105.11400.
3. Kosebalaban, F.; Cinar, A. Integration of multivariate SPM and FDD by parity space technique for a food pasteurization process. Comput. Chem. Eng. 2001, 25, 473–491.
4. Tokatli, F.K.; Cinar, A.; Schlesser, J.E. HACCP with multivariate process monitoring and fault diagnosis techniques: Application to a food pasteurization process. Food Control 2005, 16, 411–422.
5. U.S. Food & Drug Administration. HACCP Principles & Application Guidelines. Available online: https://www.fda.gov/food/hazard-analysis-critical-control-point-haccp/haccp-principles-application-guidelines (accessed on 29 August 2025).
6. National Advisory Committee on Microbiological Criteria for Foods (NACMCF). HACCP Principles & Application Guidelines. J. Food Prot. 1998, 61, 1246–1259.
7. Vitale, F.; Guarino, S.; Flammini, F.; Faramondi, L.; Mazzocca, N.; Setola, R. Process mining for digital twin development of industrial cyber-physical systems. IEEE Trans. Ind. Inform. 2024, 21, 866–875.
8. Fahim, M.; Sharma, V.; Cao, T.V.; Canberk, B.; Duong, T.Q. Machine learning-based digital twin for predictive modeling in wind turbines. IEEE Access 2022, 10, 14184–14194.
9. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. Statistical and Machine Learning forecasting methods: Concerns and ways forward. PLoS ONE 2018, 13, e0194889.
10. Ceravolo, P.; Tavares, G.M.; Junior, S.B.; Damiani, E. Evaluation goals for online process mining: A concept drift perspective. IEEE Trans. Serv. Comput. 2020, 15, 2473–2489.
11. Janssen, D.; Mannhardt, F.; Koschmider, A.; van Zelst, S.J. Process model discovery from sensor event data. In Proceedings of the International Conference on Process Mining 2020, Padua, Italy, 5–8 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 69–81.
12. Brzychczy, E.; Aleknonyte-Resch, M.; Janssen, D.; Koschmider, A. Process mining on sensor data: A review of related works. Knowl. Inf. Syst. 2025, 67, 4915–4948.
13. Rabiner, L.R. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proc. IEEE 1989, 77, 257–286.
14. Afzal, M.S. Forecasting in Industrial Process Control: A Hidden Markov Model Approach. Procedia Comput. Sci. 2017, 115, 404–412.
15. Tokatli, F.K.; Cinar, A. Fault detection and diagnosis in a food pasteurization process with hidden Markov models. Can. J. Chem. Eng. 2004, 82, 1252–1262.
16. Yu, Z.; Jung, D.; Park, S.; Hu, Y.; Huang, K.; Rasco, B.A.; Wang, S.; Ronholm, J.; Lu, X.; Chen, J. Smart traceability for food safety. Crit. Rev. Food Sci. Nutr. 2022, 62, 905–916.
17. Liao, H.; Hu, Z.; Zhang, Z.; Tang, M.; Banaitis, A. Outranking-based failure mode and effects analysis considering interactions between risk factors and its application to food cold chain management. Eng. Appl. Artif. Intell. 2023, 126, 106831.
18. Zhao, Z.; Dong, J.; Qi, B.; Duan, N.; Qian, H. A survey on machine learning methods for food safety risk assessment: Approaches, challenges, and future outlook. Eng. Appl. Artif. Intell. 2025, 154, 110960.
19. European Food Safety Authority (EFSA); Clawin-Rädecker, I.; De Block, J.; Egger, L.; Willis, C.; Da Silva Felicio, M.T.; Messens, W. The use of alkaline phosphatase and possible alternative testing to verify pasteurisation of raw milk, colostrum, dairy and colostrum-based products. EFSA J. 2021, 19, e06576.
20. Komorowski, E.S. New dairy hygiene legislation. Int. J. Dairy Technol. 2006, 59, 97–101.
21. European Commission. Commission Implementing Regulation (EU) 2019/627 of 15 March 2019 laying down uniform practical arrangements for the performance of official controls on products of animal origin intended for human consumption in accordance with Regulation (EU) 2017/625 of the European Parliament and of the Council and amending Commission Regulation (EC) No 2074/2005 as regards official controls. Off. J. Eur. Union 2019, 131, 51–100.
Wittwer, M.; Hammer, P.; Runge, M.; Valentin-Weigand, P.; Neubauer, H.; Henning, K.; Mertens-Scholz, K. Inactivation kinetics of Coxiella burnetii during high-temperature short-time pasteurization of milk. Front. Microbiol. 2022, 12, 753871. [Google Scholar] [CrossRef]
Lott, T.; Wiedmann, M.; Martin, N. Shelf-life storage temperature has a considerably larger effect than high-temperature, short-time pasteurization temperature on the growth of spore-forming bacteria in fluid milk. J. Dairy Sci. 2023, 106, 3838–3855. [Google Scholar] [CrossRef] [PubMed]
Qian, C.; Murphy, S.; Lott, T.; Martin, N.; Wiedmann, M. Development and deployment of a supply-chain digital tool to predict fluid-milk spoilage due to psychrotolerant sporeformers. J. Dairy Sci. 2023, 106, 8415–8433. [Google Scholar] [CrossRef]
Verdure, C. The (EC) Regulation on microbiological Criteria: A general Overview. Eur. Food Feed Law Rev. 2008, 3, 172–177. [Google Scholar]
Lupien, J.R. The Codex Alimentarius Commission: International science-based standards, guidelines and recommendations. AgBioForum 2000, 3, 192–196. [Google Scholar]
van der Aalst, W. Process Mining—Data Science in Action; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar] [CrossRef]
van der Aalst, W.M.P. Process Mining: A 360 Degree Overview. In Process Mining Handbook; van der Aalst, W.M.P., Carmona, J., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2022. [Google Scholar] [CrossRef]
Elkodssi, I.; Driss, L.M.; Hanae, S. Applying Process Mining to Sensor Data in Smart Environment: A Comparative Study. In Innovations in Smart Cities Applications Volume 6; Springer International Publishing: Berlin/Heidelberg, Germany, 2023. [Google Scholar] [CrossRef]
Ceravolo, P.; Comuzzi, M.; De Weerdt, J.; Di Francescomarino, C.; Maggi, F.M. Predictive process monitoring: Concepts, challenges, and future research directions. Process Sci. 2024, 1, 2. [Google Scholar] [CrossRef]
Oyamada, R.S.; Marques Tavares, G.; Ceravolo, P. A scikit-learn extension dedicated to process mining purposes. Ceur Workshop Proc. 2023, 3552, 11–15. [Google Scholar]
Arrighi, L.; Pennella, L.; Marques Tavares, G.; Barbon Junior, S. Decision predicate graphs: Enhancing interpretability in tree ensembles. In Proceedings of the World Conference on Explainable Artificial Intelligence 2024, Valletta, Malta, 17–19 July 2024; Springer: Berlin/Heidelberg, Germany, 2024; p. 311. [Google Scholar]

Figure 1. Overview of the proposed framework for process-oriented sensor integration in HTST pasteurization. Multivariate sensor data—including temperature, pH, conductivity, viscosity, turbidity, flow, and pressure—are continuously collected and processed through a Hidden Markov Model (HMM) to detect process stages. The identified stages are aligned with HACCP-based Critical Control Points (CCPs), enabling event log generation, process model discovery, and online predictive monitoring. The framework supports downstream tasks such as estimating remaining time, forecasting transitions, and verifying compliance with regulatory and operational constraints.

Figure 2. Expected process stages that sensor data processing must recover as identifiable activities. The blue lines show the value of each sensor over time.

Figure 3. Time series of all sensors, shown with their mean and standard deviation.

Figure 4. Confusion matrix illustrating the performance of the HMM-based stage detection used for event log generation.

Figure 5. Process model discovered from sensor-derived event logs using a Directly-Follows Graph (DFG). Each node represents a detected process stage, while edges indicate the observed transitions between stages, annotated with frequency counts. This visualization demonstrates how multivariate sensor data can be structured into an interpretable process model aligned with the operational flow of pasteurization.
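As a rough illustration of how a Directly-Follows Graph like the one in Figure 5 can be counted from an event log, the sketch below groups events by case, orders them by start time, and tallies each consecutive activity pair. The function name `directly_follows_graph` and the toy log (start times shortened from the rows of Table 4) are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict


def directly_follows_graph(events):
    """Count directly-follows pairs per case.

    `events` is an iterable of (case_id, activity, start_timestamp) tuples;
    timestamps only need to be sortable within a case.
    """
    traces = defaultdict(list)
    for case_id, activity, start in events:
        traces[case_id].append((start, activity))

    dfg = Counter()
    for steps in traces.values():
        ordered = [activity for _, activity in sorted(steps)]
        # Each consecutive pair (a, b) adds one to the edge a -> b.
        dfg.update(zip(ordered, ordered[1:]))
    return dfg


# Toy log shaped like Table 4 (hypothetical shortening to HH:MM:SS strings).
log = [
    (601, "Idle", "00:00:00"),
    (601, "Fill", "00:01:21"),
    (601, "Heat-up", "00:02:16"),
    (601, "Hold", "00:02:53"),
    (601, "Cool", "00:03:23"),
    (1000, "Fill", "00:01:55"),
    (1000, "Heat-up", "00:02:45"),
    (1000, "Hold", "00:03:21"),
    (1000, "Cool", "00:03:48"),
    (1000, "Discharge", "00:04:20"),
]

dfg = directly_follows_graph(log)
for (source, target), count in sorted(dfg.items()):
    print(f"{source} -> {target}: {count}")
```

The edge counts play the role of the frequency annotations on the arcs of the graph; a dedicated process mining library would normally be used for discovery and visualization in practice.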

Table 1. Illustrative example of a traditional event log, showing three Case IDs with their activities, execution timestamps, and (optionally) the resources involved.

Case ID	Activity	Timestamp	Resource
1	A	01-09-2025 09:00	John
2	A	01-09-2025 09:05	Mike
3	A	01-09-2025 09:10	Sarah
1	B	01-09-2025 09:15	Anna
3	B	01-09-2025 09:25	Sarah
2	C	01-09-2025 09:30	John
1	C	01-09-2025 09:45	John
3	D	01-09-2025 09:50	John
2	D	01-09-2025 10:00	Anna

Table 2. Summary of simulated inline sensors and their setup during HTST pasteurization.

Sensor	Setup	Expected Behaviour in Pasteurization
Temperature (T)	Setpoint 72 °C; band ±0.2 °C; cooling target 4–10 °C	Increases during Heat-up; stabilizes at 72 ± 0.2 °C during Hold (≥15 s); decreases rapidly during Cooling to ≤10 °C.
pH	Baseline 6.65 ± 0.01	Stable around 6.6–6.8, typical of raw milk; deviations may indicate anomalies in milk quality or contamination.
Conductivity ( $κ$ )	Baseline 4.8 ± 0.1 mS/cm	Stabilizes at ∼4.8 mS/cm during production; used to support evaluation of milk quality and detect product-to-water interfaces.
Viscosity ( $μ$ )	Cold ∼2.2 cP at 10 °C; hot ∼1.6 cP at 72 °C	Decreases with increasing temperature; remains relatively stable during Hold and approximates the viscosity of water at higher temperatures.
Turbidity ( $τ$ )	Baseline 1.0 ± 0.05 NTU	Stabilizes around 1.0–1.2 NTU; indicates presence of milk solids; sensitive to product-to-water transitions at Fill and Discharge.
Flow/Pressure (Q/P)	Nominal flow 1.5 L/min (inlet/outlet); baseline pressure 1.2 ± 0.1 bar	Flow increases during Fill and Discharge; difference remains close to zero in Idle and Hold; pressure follows pump operation at ∼1.2 bar during circulation and drops during idle phases.
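The temperature row of Table 2 (setpoint 72 °C, band ±0.2 °C, Hold of at least 15 s) can be turned into a simple check on a sampled temperature trace. The sketch below is only an illustration of those table values: the function name, the 1 s sampling period, and the sample traces are assumptions, and it is not a substitute for a validated HACCP critical-limit check.

```python
def longest_hold_seconds(temps_c, sample_period_s=1.0,
                         setpoint_c=72.0, band_c=0.2):
    """Return the longest continuous time (s) spent inside setpoint ± band.

    Assumes evenly spaced samples; `sample_period_s` is a hypothetical
    sampling period, not a value taken from the paper.
    """
    tolerance = 1e-9  # guards against floating-point noise at the band edge
    longest = current = 0
    for temp in temps_c:
        if abs(temp - setpoint_c) <= band_c + tolerance:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest * sample_period_s


MIN_HOLD_S = 15.0  # minimum Hold duration from Table 2

# Hypothetical traces: heat-up, an in-band plateau, then cooling.
good_batch = [65.0, 70.0] + [72.0, 71.9, 72.1] * 6 + [40.0, 10.0]
short_batch = [65.0, 70.0] + [72.0] * 10 + [40.0, 10.0]

good_ok = longest_hold_seconds(good_batch) >= MIN_HOLD_S
short_ok = longest_hold_seconds(short_batch) >= MIN_HOLD_S
print(good_ok, short_ok)
```

Measuring the longest uninterrupted in-band run, rather than the total in-band time, reflects the requirement that the Hold be continuous.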

Table 3. Performance of the unsupervised HMM in identifying pasteurization process stages from multivariate sensor data. Precision, Recall, and F1-score are computed by comparing the inferred stages with ground-truth labels from the simulated dataset.

Activity	Precision	Recall	F1-Score	Support
Idle	0.99	1.00	0.99	35,143
Fill	1.00	1.00	1.00	17,620
Heat-up	1.00	0.85	0.92	14,709
Hold	0.66	1.00	0.80	7841
Cool	1.00	0.88	0.94	18,172
Discharge	1.00	1.00	1.00	35,734
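The F1-scores in Table 3 follow from the reported precision and recall through the harmonic mean, F1 = 2PR / (P + R). The short sketch below recomputes them from the table's rounded values; the dictionary and function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from Table 3.
table3 = {
    "Idle": (0.99, 1.00),
    "Fill": (1.00, 1.00),
    "Heat-up": (1.00, 0.85),
    "Hold": (0.66, 1.00),
    "Cool": (1.00, 0.88),
    "Discharge": (1.00, 1.00),
}

recomputed = {stage: round(f1_score(p, r), 2) for stage, (p, r) in table3.items()}
for stage, value in recomputed.items():
    print(f"{stage}: {value:.2f}")
```

For Hold, perfect recall combined with a precision of 0.66 means every true Hold sample is found but about a third of the samples labelled Hold belong to other stages, which is consistent with the lower recall reported for Heat-up and Cool.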

Table 4. Event log derived from HMM-segmented sensor signals, structured with Case ID, Activity, Start_Timestamp, and End_Timestamp.

Case ID	Activity	Start_Timestamp	End_Timestamp
601	Idle	01-01-2023 00:00:00	01-01-2023 00:01:20
601	Fill	01-01-2023 00:01:21	01-01-2023 00:02:15
601	Heat-up	01-01-2023 00:02:16	01-01-2023 00:02:52
601	Hold	01-01-2023 00:02:53	01-01-2023 00:03:22
601	Cool	01-01-2023 00:03:23	01-01-2023 00:03:54
⋮	⋮	⋮	⋮
1000	Fill	01-01-2023 00:01:55	01-01-2023 00:02:44
1000	Heat-up	01-01-2023 00:02:45	01-01-2023 00:03:20
1000	Hold	01-01-2023 00:03:21	01-01-2023 00:03:47
1000	Cool	01-01-2023 00:03:48	01-01-2023 00:04:19
1000	Discharge	01-01-2023 00:04:20	01-01-2023 00:05:46
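One plausible way to obtain rows like those in Table 4 is to collapse the per-sample stage labels produced by the HMM into runs, emitting one event per run with its first and last sample time. The sketch below assumes a fixed 1 s sampling period and a hypothetical helper name `labels_to_events`; the paper's actual construction may differ.

```python
from datetime import datetime, timedelta
from itertools import groupby


def labels_to_events(case_id, labels, start,
                     sample_period=timedelta(seconds=1)):
    """Collapse consecutive identical stage labels into events.

    Each event records the timestamps of the first and last sample of a run.
    The 1 s default sampling period is an assumption for illustration.
    """
    events = []
    index = 0
    for activity, run in groupby(labels):
        length = sum(1 for _ in run)
        events.append({
            "case_id": case_id,
            "activity": activity,
            "start": start + index * sample_period,
            "end": start + (index + length - 1) * sample_period,
        })
        index += length
    return events


# Label sequence whose run lengths reproduce the first rows of case 601.
labels = ["Idle"] * 81 + ["Fill"] * 55 + ["Heat-up"] * 37
events = labels_to_events(601, labels, datetime(2023, 1, 1))
for event in events:
    print(event["case_id"], event["activity"],
          event["start"].strftime("%d-%m-%Y %H:%M:%S"),
          event["end"].strftime("%d-%m-%Y %H:%M:%S"))
```

In practice, very short runs caused by label flicker near stage boundaries would probably need smoothing or merging before the log is used for discovery.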

Table 5. Prediction performance of remaining batch time per activity.

Activity	# Events	MAE (s)	Mean Error (%)	Avg. Actual (s)	Avg. Predicted (s)
Fill	80	18.94	7.91	237.66	232.89
Idle	80	17.67	5.39	325.68	320.50
Heat-up	80	16.95	8.61	193.29	187.74
Hold	80	16.80	10.18	162.22	156.72
Cool	80	16.31	12.11	132.54	126.74
Discharge	80	6.24	7.02	91.58	89.21

Table 6. Prediction performance of the remaining time of the current activity.

Activity	# Events	MAE (s)	Mean Error (%)	Avg. Actual (s)	Avg. Predicted (s)
Discharge	80	5.94	6.46	91.58	88.87
Cool	80	3.55	9.14	39.96	38.74
Heat-up	80	2.79	9.53	30.06	30.26
Fill	80	2.71	6.35	43.38	42.55
Hold	80	2.31	8.14	28.69	28.46
Idle	80	0.43	0.61	87.01	86.85

Table 7. Prediction performance of the next activity transition timestamp.

Activity	Next Activity	N_Transitions	MAE (s)	Std_AE (s)	Mean Error (%)
Idle	Fill	80	0.20	0.24	0.24
Fill	Heat-up	80	0.68	0.87	0.51
Heat-up	Hold	80	1.92	1.89	1.16
Hold	Cool	80	1.61	2.21	0.84
Cool	Discharge	80	1.40	1.90	0.59

Table 8. PPM results obtained with SkPM for three tasks.

Task	$R^{2}$	MAE (s)	RMSE (s)
Remaining time of the batch	0.94	15.49	19.58
Remaining time of the current activity	0.96	3.76	5.45
Next activity transition timestamp	0.99	0.41	0.96
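For readers less familiar with the three metrics in Table 8, the sketch below computes R², MAE, and RMSE from paired actual and predicted values using their standard definitions. The input numbers are hypothetical toy values chosen for easy checking, not data from the study.

```python
import math


def regression_metrics(actual, predicted):
    """Return (r2, mae, rmse) for two equal-length sequences."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot  # undefined if all actual values are equal
    return r2, mae, rmse


# Hypothetical remaining-time values in seconds.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]

r2, mae, rmse = regression_metrics(actual, predicted)
print(f"R2={r2:.3f} MAE={mae:.2f} RMSE={rmse:.2f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE, and a large gap between the two points to a few large misses, which is one way to read the MAE and RMSE columns side by side.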

Table 9. Key analysis results of DPG from the RF model used for pasteurization process monitoring.

Category	Finding
Important information	`activity_sequence`, `minute`, `event_count`
Most used features	`duration_seconds` (303), `event_count` (301), `minute` (41)
Stage-specific signals	`activity_HeatUp ≤ 0.25`, `activity_Hold ≤ 0.14`, `activity_Idle ≤ 0.35`
Decision boundaries	`duration_seconds: 35–117 s`, `event_count: 16–118 s`


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

