2. State of the Art
In smart manufacturing, CPSs combine shop-floor assets, IIoT sensors, and edge–cloud computing to support automated QC and process optimization [8,9]. These systems increasingly adopt service-oriented and event-driven architectures that integrate AI-based analytics and support responsive, data-driven decision-making [10,11]. Reference architecture models for digital manufacturing, such as RAMI 4.0, provide the groundwork for orchestrating interoperable workflows across enterprise and shop-floor levels [12].
Operationally, smart manufacturing relies on edge–cloud cooperation so that latency-critical control and QC tasks can be handled at the edge, while scalable analytics and reporting are performed in the cloud [13]. In this context, Nain et al. outline patterns for distributed edge-based deployment of AI systems and for integrating their outcomes into the MIS to improve responsiveness and reliability in dynamic production settings [14].
Digital twins have emerged as a key enabler in this architecture, providing synchronized virtual representations of machines and processes. Embedded into production workflows, digital twins can improve synchronization of process parameters and quality-related indicators, creating a more robust link between operating conditions and product properties [15,16]. By fusing physics-based models with data-driven learners, digital twins support proactive QC by forecasting deviations in process variables and triggering early corrective actions [17]. Edge–cloud digital-twin architectures place sensing, inference, and feedback control near machines, while coordinating model retraining and analytics in the cloud to minimize latency and scale oversight of quality-related KPIs [18].
At the same time, AI-based QC and digital-twin pipelines must be hardened against adversarial perturbations and data poisoning that can mislead classifiers and control policies operating on edge IIoT streams [19]. Deep neural networks are known to be vulnerable to small, carefully crafted perturbations that induce misclassifications [20], and empirical studies on time-series data have shown that such perturbations can degrade fault detection, underscoring the need for robust defenses integrated into the manufacturing data pipeline [21]. Digital-twin-driven security frameworks propose the use of physics and AI residuals to detect inconsistencies between expected and measured behavior, enabling online attack detection and mitigation within IIoT infrastructures [18]. However, such frameworks typically focus on detection and alerting without addressing in-band purification of corrupted or malicious data streams [19,22]. In addition, they are often validated on generic CPS benchmarks rather than within production-grade QC infrastructures, and they do not exploit twin states to guide generative reconstruction of inputs [23,24]. In parallel, adversarial purification methods are largely developed outside manufacturing contexts, leaving a gap between robust purification techniques and digital-twin-aware deployment in smart factories [25].
Modern approaches to adversarial purification are typically diffusion-based [26], treating the defense as a generative denoising step that maps a perturbed input back to the clean data manifold before classification [27]. Diffusion-based purification methods exploit a forward noising process and a learned reverse denoising trajectory to remove perturbations while approximately preserving the statistical characteristics of the data [28]. Compared to adversarial training, such purification operates as a model-agnostic preprocessing pipeline that can improve robustness without modifying or retraining downstream AI models [29].
In image-based domains, guided pixel-space diffusion defenses have been proposed that reverse diffusion with auxiliary signals to better preserve label-discriminative content during purification [30]. In a similar manner, Du et al. used task-conditioned variants that adapt the denoising path to downstream objectives, such as position-sensitive conditioning for crowd counting in vision-based AI pipelines, to resist localized perturbations [31]. Beyond vision, diffusion-based purification has been tailored for automatic modulation classification in communications, improving robustness without altering the recognition network [32,33].
Outside of vision-based attacks, diffusion-based adversarial purification has been extended to sequential modalities. As presented by Zhu et al., purification of time-series signals acts as model-agnostic pre-processing, since corrupted signals can negatively impact downstream AI inference [34]. In this context, diffusion models for time-series data streams have demonstrated high-fidelity reconstruction and noise suppression while preserving temporal dependencies, underscoring their suitability as purification frontends for sequence AI classifiers [35,36,37]. In RF pipelines, diffusion has also been applied directly to raw data streams for data augmentation and restoration, validating that diffusion-based models naturally operate in time-series spaces relevant to defense-side pre-processing [38]. However, diffusion-based purification remains largely unexplored in the manufacturing domain, with studies focusing more on the healthcare and medical sector, creating a significant gap in smart manufacturing research [39].
In addition, across existing diffusion-based purification studies, guidance is typically task- or label-oriented, while the notion of process feasibility is rarely enforced during sampling [40]. In contrast, manufacturing digital twins already provide synchronized state estimates and constraint residuals that encode the physical aspects of a process [41]. Conditioning latent diffusion on these twin signals can offer a direct way to reduce post-purification semantic drift and maintain the validity of reconstructions [42].
Despite their promise, generative purification methods present several limitations. Pre-trained generators may project inputs onto the training manifold and inadvertently alter class-relevant details, leading to post-purification semantic drift and misclassification [43]. Standard diffusion sampling relies on iterative denoising, which can introduce non-trivial latency and energy overheads at the edge, motivating the use of one-step or few-steps guidance as a mitigation strategy [44]. Generative purification remains susceptible to adaptive attacks that differentiate through the defense pipeline [45], and a persistent trade-off between adversarial robustness and accuracy on clean data has been reported, alongside generalization gaps across datasets and threat models [46,47].
To emphasize the limitations of existing work on diffusion-based purification and robustness, including diffusion models applied to industrial time-series denoising and digital-twin- and physics-based security monitoring, Table 1 contrasts DTCDP against recent studies. The comparison highlights that prior works typically use diffusion or digital twins in isolation, whereas DTCDP uses digital-twin physics residuals as an explicit guidance signal that shapes the reverse denoising path during purification.
To address the aforementioned observations while remaining compatible with existing QC models and production monitoring workflows, this work proposes a DTCDP framework, which combines latent diffusion-based denoising with guidance derived from a lightweight digital twin of the manufacturing process, as detailed in the next section.
3. Methodology
3.1. The Architecture of the Digital-Twin-Conditioned Diffusion Purification Framework
This work introduces the DTCDP framework (Figure 1), which combines a latent diffusion purifier with physics-aware guidance provided by a lightweight digital twin. The twin exposes process states and constraint residuals that are translated into soft penalties shaping the denoising trajectory, so that purified signals remain on a physically feasible manifold while adversarial perturbations are removed. DTCDP is designed for execution at the network edge and can integrate with the MIS to expose auditable events, latency metrics, and QC-related KPIs.
Latent diffusion contributes a data-manifold prior that removes high-dimensional adversarial noise, but on its own, it can introduce semantic drift by projecting inputs toward statistically plausible, yet physically infeasible trajectories. Digital-twin residual guidance complements this by acting as a feasibility prior, continuously steering denoising toward reconstructions that remain consistent with process physics and operating constraints.
The framework comprises three main layers:
- The data integration and digital-twin layer acquires production signals, coupled with their timestamps and metadata. It also provides online estimation of process and system states and derives the physics bounds and residuals that quantify violations relevant to the QC process of the line.
- The diffusion-based purification layer is composed of four modules:
  - A latent representation module (encoder/decoder) that maps inputs to a compact latent space where purification is performed.
  - A diffusion purifier, implemented as a noise predictor with a deterministic, one-step or few-steps sampler to meet real-time constraints.
  - A twin-guided conditioner that injects a soft physics penalty between denoising steps to keep reconstructions within operational manifolds.
  - A closed-loop purification and validation pipeline that iteratively purifies and evaluates signals against the digital twin before releasing them to the downstream AI components.
- The integration layer, detailed in Section 3.4, forwards purified signals and audit metadata to the downstream AI components and the MIS.
DTCDP extends standard latent diffusion purification by injecting digital-twin-derived physics residuals into the sampling dynamics and by using one-step or few-steps sampling tailored to edge latency constraints.
3.2. Data Integration and Digital-Twin Layer
The data integration and digital-twin layer unifies heterogeneous sensor streams into event packets that couple each observation with a synchronized state estimate from the digital twin and associated physics residuals. Its role is to (i) align multi-rate data with consistent timing and metadata, (ii) estimate latent operational states in real time, and (iii) derive compact, physically meaningful residuals that guide the diffusion-based purification layer.
Let x_t denote a raw sensor measurement at time t, and let m_t be the associated metadata vector, including information such as sensor ID, sampling rate, and calibration tags. For each timestamp t, the layer builds an event e_t based on Equation (1). In Equation (1), ŝ_t is a state estimate obtained from the digital twin and synchronized to t. The twin provides state estimates ŝ_{t_k} at times t_k, which may not coincide with measurement timestamps. For t_k ≤ t ≤ t_{k+1}, temporal alignment is obtained by linear interpolation as presented in Equation (2).
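A plausible rendering of Equations (1) and (2), assuming the event simply stacks the measurement, metadata, and synchronized state (symbol names follow the definitions above and are otherwise assumptions), is:

```latex
% Assumed forms of Equations (1)-(2): event construction and linear interpolation.
\begin{align}
  e_t &= \left(x_t,\; m_t,\; \hat{s}_t\right), \tag{1}\\
  \hat{s}_t &= \hat{s}_{t_k}
    + \frac{t - t_k}{t_{k+1} - t_k}\left(\hat{s}_{t_{k+1}} - \hat{s}_{t_k}\right),
    \qquad t_k \le t \le t_{k+1}. \tag{2}
\end{align}
```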
The digital twin follows a discrete-time state-space formulation, presented in Equation (3), where s_k is the latent state of operating conditions, u_k denotes exogenous inputs (e.g., operator adjustments), y_k is the vector of measurable quantities expected under state s_k, f and h describe the process and measurement mappings, and w_k and v_k represent process and measurement disturbances. The twin is assumed to be identified and calibrated prior to deployment, and provides filtered state estimates ŝ_k and expected measurements ŷ_k during operation.
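A generic rendering of this state-space model (a sketch of Equation (3), with symbols as defined above) is:

```latex
% Assumed form of Equation (3): generic discrete-time state-space model of the twin.
\begin{equation}
\begin{aligned}
  s_{k+1} &= f\left(s_k, u_k\right) + w_k,\\
  y_k     &= h\left(s_k\right) + v_k.
\end{aligned}
\tag{3}
\end{equation}
```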
To supply physics information suitable for purification, the twin specifies a set of constraints, as seen in Equation (4). These constraints encode physically feasible behavior, such as bounds on process variables, geometric or temporal consistency, conservation relationships, or domain-specific policies. For each event e_t, non-negative residuals r_{j,t} are computed using Equation (5). Also, for each event, the layer exposes the physics-residual map R_t based on Equation (6). Then, a scalar physics penalty ρ_t is defined based on Equation (7), where r_t is the stacked residual vector and A is a non-negative weight matrix, for example, reflecting state uncertainty or constraint importance. Equations (5)–(7) convert constraint satisfaction into a piecewise-differentiable objective: the residuals quantify constraint violations and ρ_t aggregates them into a scalar penalty used for guidance in Equation (14), providing a differentiable constraint proxy that limits drift during purification by penalizing violations of bounds, rate limits, and consistency relations encoded by the twin. By utilizing Equations (1)–(6), the layer emits a tuple of the type seen in Equation (8).
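A plausible reading of Equations (5)–(7), assuming hinge-style residuals and a weighted quadratic penalty (the exact published forms may differ), is:

```latex
% Assumed forms of Equations (5)-(7): hinge residuals, residual map, weighted penalty.
\begin{align}
  r_{j,t} &= \max\!\left(0,\; g_j\!\left(x_t, \hat{s}_t\right)\right),
             \qquad j = 1, \dots, L, \tag{5}\\
  R_t     &= \left\{\, r_{j,t} \,\right\}_{j=1}^{L}, \tag{6}\\
  \rho_t  &= r_t^{\top} A\, r_t,
             \qquad r_t = \left[r_{1,t}, \dots, r_{L,t}\right]^{\top}. \tag{7}
\end{align}
```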
This tuple provides the normalized data stream, the synchronized state estimate, residuals that encode physical feasibility, and an auditable context that can be provided to legacy and supervisory systems falling under the umbrella of the MIS.
Next, the tuple is consumed by the diffusion purification layer, where the residual map and scalar penalty steer denoising towards physically plausible reconstructions.
3.3. Diffusion-Based Purification Layer
The diffusion-based purification layer performs generative denoising in a compact latent space, adjusted by physics residuals from the digital twin. It is modality-agnostic and operates as a preprocessing defense before sensor streams are ingested by downstream AI components for in-line QC.
Let x be an input observation (e.g., a multivariate time-series window) and let E and D be an encoder and a decoder, respectively. The encoder maps x to a latent representation z_0 (Equation (9)). The decoder reconstructs an observation from a latent z. Diffusion and purification are carried out in latent space to reduce dimensionality and sampling costs, while decoding is used to evaluate physics residuals and obtain the final purified output.
During training, clean latents z_0 are progressively noised according to a standard diffusion process. For a timestep t, the forward process can be written based on Equation (10), with α_t = 1 − β_t and ᾱ_t the cumulative product of the α coefficients, for a variance schedule β_t. Equation (10) defines the forward latent noising process q(z_t | z_0), which produces progressively more corrupted latents as t increases and yields a Gaussian reference distribution for large t.
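A standard form of this forward process, consistent with the coefficients defined above (a sketch of Equation (10)), is:

```latex
% Assumed form of Equation (10): standard DDPM-style forward noising in latent space.
\begin{equation}
  q\!\left(z_t \mid z_0\right)
  = \mathcal{N}\!\left(z_t;\; \sqrt{\bar{\alpha}_t}\, z_0,\; \left(1 - \bar{\alpha}_t\right) I\right),
  \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \left(1 - \beta_s\right). \tag{10}
\end{equation}
```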
A time-conditioned noise-prediction network ε_θ(z_t, t) is trained to approximate the noise added at each step by minimizing Equation (11), where
- ε_θ(z_t, t): the neural network that predicts the noise at step t,
- θ: the trainable weights of the neural network,
- the expectation in Equation (11) is taken over timesteps, data latents, and Gaussian noise.
Minimizing Equation (11) trains ε_θ to estimate the injected noise at each timestep, which provides the denoising direction used during reverse-time sampling.
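A standard ε-prediction objective consistent with this description (a sketch of Equation (11)) is:

```latex
% Assumed form of Equation (11): epsilon-prediction training objective.
\begin{equation}
  \mathcal{L}(\theta)
  = \mathbb{E}_{t,\, z_0,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[\, \left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t\right) \right\rVert_2^2 \,\right],
  \qquad z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon. \tag{11}
\end{equation}
```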
At purification time, a noisy latent z_t is mapped back toward a clean latent using a deterministic sampler. A per-step estimate of the clean latent, ẑ_0, is obtained using Equation (12), and the less noisy latent z_{t−1} is then computed according to Equation (13), where
- ẑ_0: the per-step estimate of the clean latent,
- ᾱ_{t−1}: the cumulative coefficient at t − 1,
- ε_θ(z_t, t): the neural network that predicts the noise at step t.
Equations (12) and (13) implement the reverse update by forming a clean-latent estimate ẑ_0 and propagating to a less noisy latent z_{t−1} using ε_θ. The number of steps K controls the accuracy-to-latency trade-off at inference, where K denotes the number of reverse updates executed during purification.
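These updates correspond to a DDIM-style deterministic sampler; a sketch consistent with the definitions above is:

```latex
% Assumed forms of Equations (12)-(13): DDIM-style deterministic reverse update.
\begin{align}
  \hat{z}_0 &= \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\;\epsilon_\theta\!\left(z_t, t\right)}
                    {\sqrt{\bar{\alpha}_t}}, \tag{12}\\
  z_{t-1}   &= \sqrt{\bar{\alpha}_{t-1}}\;\hat{z}_0
             + \sqrt{1 - \bar{\alpha}_{t-1}}\;\epsilon_\theta\!\left(z_t, t\right). \tag{13}
\end{align}
```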
To ensure that the purification process remains within physically feasible regions defined by the digital twin, a soft guidance step is applied between sampler updates. Let x̂_t denote the decoded observation at step t, and let ρ(x̂_t) be the physics penalty from Section 3.2 computed using the residuals R_t. Guidance is implemented as presented in Equation (14), where
- λ: a guidance strength chosen to balance latency and constraint adherence,
- the gradient of the penalty is computed by backpropagating through the decoder D.
Equation (14) adds physics guidance by modifying the reverse update with a gradient term derived from ρ, steering reconstructions toward constraint satisfaction; λ sets the strength of this feasibility-to-fidelity trade-off.
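A standard rendering of this guided update, consistent with the description above (a sketch; the exact published form of Equation (14) may differ), is:

```latex
% Assumed form of Equation (14): gradient-based physics guidance between sampler steps.
\begin{equation}
  z_{t-1} \;\leftarrow\; z_{t-1}
  \;-\; \lambda\, \nabla_{z_{t-1}}\, \rho\!\left(D\!\left(z_{t-1}\right)\right). \tag{14}
\end{equation}
```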
The closed-loop purification module orchestrates denoising and guidance. For each event e_t, it
- encodes the observation into an initial latent,
- performs the deterministic denoising updates using Equations (12) and (13),
- applies physics guidance based on Equation (14),
- decodes the resulting latent into a purified observation,
- recomputes residuals to verify constraint satisfaction.
The guidance term presented in Equation (14) shapes the reverse denoising trajectory itself rather than performing a post-hoc filtering, projecting each latent update toward the twin-feasible set. As a result, short-lived but physically plausible transients are less likely to be over-smoothed, while adversarial components that increase physics residuals are actively suppressed.
Depending on latency requirements, the module can operate in a one-step mode or a few-steps mode. The layer returns the purified observation, its latent, and audit parameters, including the used timesteps, the number of sampler steps K, residual statistics, and the guidance strength λ, which are forwarded to the integration layer for MIS traceability.
3.4. Integration with Downstream AI-Based Components and MIS
The integration layer serves as middleware between purification and enterprise oversight. After DTCDP has produced purified data streams and associated audit metadata, downstream AI components already present in the manufacturing environment (e.g., defect classifiers, anomaly detectors) consume the purified signals and return decision packages. The internal design of these AI components is out of scope, and DTCDP treats them as black-box consumers.
Decision packages, together with purification metadata, are ingested into the MIS for traceability and quality-related KPI calculation. To minimize data exposure, raw signals ingested into the framework remain on edge nodes and are released to higher-level systems only on escalation or for offline investigation. All decisions are accompanied by versioned provenance and physics-guidance diagnostics, including residual statistics, guidance strength, and sampler steps. This enables reproducibility, change control, and systematic comparisons of alternative purifier and threshold configurations within existing MIS procedures. In this way, robustness improvements remain decoupled from any specific AI model while being appropriate for enterprise reporting, audit, and risk management workflows.
4. Implementation
The DTCDP framework was implemented as an edge-ready pipeline for in-line QC. The prototype runs on a Windows 10 workstation with an Intel Core i9 CPU, 32 GB RAM, and an NVIDIA RTX 2070 GPU. The data flow follows the logical architecture of Section 3.1 and is summarized in the UML sequence diagram shown in Figure 2. At the integration layer, an Apache Kafka messaging bus receives raw sensor and production data from the shop floor. A Node-RED instance subscribes to the relevant topics, time-synchronizes streams using a UTC time server, and forwards events to the digital twin via RESTful HTTP POST requests implemented in FastAPI on Python (version 3.9.12). The twin is deployed as a set of Python microservices and returns a physics residual map in JSON format (Table 2), including timestamp, state, residuals, weights, the scalar penalty ρ, and metadata that are later used to guide purification.
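As an illustration of this interface, a minimal FastAPI endpoint returning a residual map with the fields listed above could look as follows; the endpoint path, field names, channel names, and constraint values are assumptions rather than the deployed service:

```python
# Minimal sketch (not the production service): a FastAPI endpoint that accepts a
# synchronized event and returns a physics residual map shaped like Table 2.
from typing import Dict, List
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Event(BaseModel):
    timestamp: float           # UTC-synchronized event time
    signals: Dict[str, float]  # latest per-channel measurements
    state: List[float]         # twin state estimate aligned to the timestamp

BOUNDS = {"T_exit": (1150.0, 1280.0), "P_water": (80.0, 140.0)}   # example limits
WEIGHTS = {"T_exit": 1.0, "P_water": 0.7}                          # example weights

@app.post("/twin/residuals")
def residuals(event: Event) -> dict:
    res, w = {}, {}
    for name, value in event.signals.items():
        lo, hi = BOUNDS.get(name, (-np.inf, np.inf))
        # Hinge residual: positive only when the measurement violates its bounds.
        res[name] = float(max(0.0, lo - value) + max(0.0, value - hi))
        w[name] = WEIGHTS.get(name, 1.0)
    rho = float(sum(w[k] * res[k] ** 2 for k in res))  # scalar penalty (Eq. (7) style)
    return {
        "timestamp": event.timestamp,
        "state": event.state,
        "residuals": res,
        "weights": w,
        "rho": rho,
        "metadata": {"twin_version": "example"},
    }
```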
The diffusion-based purification layer is implemented as a temporal convolutional autoencoder in TensorFlow, exposed via a FastAPI HTTP interface. Its architecture is shown in Figure 3. The encoder maps each normalized data window to a compact latent representation, while the decoder reconstructs perturbed signals so that physics residuals can be evaluated on the decoded data. During inference, a Python microservice applies a fast one-step guidance update following the guidance equation in Section 3.3 to obtain a purified observation in near real time.
In addition, Table 3 includes a pseudocode block that summarizes the inference-time physics guidance used in DTCDP. After encoding the normalized window into a latent vector, the service performs a deterministic denoising update and then applies a single gradient step that minimizes the digital-twin physics penalty presented in Equation (14) by backpropagating through the decoder. As seen in Table 3, the physics-guidance step uses a single gradient update with guidance strength λ. The guidance strength is set to 0.05, selected on the validation split from {0.01, 0.02, 0.05, 0.10} as the best robustness trade-off without over-constraining reconstructions. In long-term edge deployments, λ, along with the residual weights in the matrix A, may require adaptive re-tuning as operating regimes and twin fidelity change.
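The following TensorFlow sketch illustrates the kind of one-step guidance update summarized in Table 3; the function and model names, as well as the handling of the fixed intermediate timestep, are illustrative assumptions:

```python
# Illustrative sketch of the inference-time physics guidance described in Table 3
# (encoder/decoder/eps_model and the twin_penalty callable are assumed names).
import tensorflow as tf

LAMBDA_GUIDE = 0.05  # guidance strength selected on the validation split

def purify_one_step(x, encoder, decoder, eps_model, twin_penalty, t_star, alpha_bar):
    """One deterministic denoising update followed by a single physics-guidance step."""
    z = encoder(x, training=False)                        # encode window into latent
    a_t = alpha_bar[t_star]                               # cumulative coefficient at t*
    t_vec = tf.fill(tf.shape(z)[:1], t_star)              # per-sample timestep index
    eps = eps_model([z, t_vec], training=False)
    z0_hat = (z - tf.sqrt(1.0 - a_t) * eps) / tf.sqrt(a_t)  # clean-latent estimate (Eq. 12)

    z_guided = tf.Variable(z0_hat)
    with tf.GradientTape() as tape:
        x_hat = decoder(z_guided, training=False)         # decode to evaluate residuals
        rho = twin_penalty(x_hat)                         # scalar physics penalty (Eq. 7)
    grad = tape.gradient(rho, z_guided)
    z_guided.assign_sub(LAMBDA_GUIDE * grad)              # single guidance update (Eq. 14)

    return decoder(z_guided, training=False)              # purified observation
```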
The temporal convolutional autoencoder uses an encoder with three Conv1D blocks (64/128/128 filters, kernel sizes 5/5/3, stride 2, and ReLU), producing a latent tensor of approximately 75 × 128, and a mirrored decoder with up-sampling convolutions to reconstruct the data window. The model is trained with a Mean Squared Error (MSE) reconstruction loss using Adam with a learning rate of 1 × 10−3, batch size 256, up to 100 epochs with early stopping (patience set to 10), and light regularization with dropout set to 0.1.
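A minimal Keras sketch matching this description (layer details beyond those stated above are assumptions) is:

```python
# Minimal sketch of the temporal convolutional autoencoder (600x9 windows,
# three strided Conv1D encoder blocks, mirrored up-sampling decoder).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_autoencoder(window_len=600, channels=9):
    inp = layers.Input(shape=(window_len, channels))
    x = layers.Conv1D(64, 5, strides=2, padding="same", activation="relu")(inp)
    x = layers.Conv1D(128, 5, strides=2, padding="same", activation="relu")(x)
    latent = layers.Conv1D(128, 3, strides=2, padding="same", activation="relu")(x)  # ~75 x 128

    x = layers.UpSampling1D(2)(latent)
    x = layers.Conv1D(128, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling1D(2)(x)
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(x)
    x = layers.UpSampling1D(2)(x)
    x = layers.Dropout(0.1)(x)
    out = layers.Conv1D(channels, 5, padding="same")(x)   # reconstructed window

    model = Model(inp, out, name="temporal_conv_autoencoder")
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    return model

autoencoder = build_autoencoder()
```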
The diffusion model is trained in the autoencoder latent space using ε-prediction with an MSE objective over the diffusion timesteps. The noise predictor is a lightweight 1D U-Net/TCN operating on the 75 × 128 latent tensor, with sinusoidal time embeddings of 128 dimensions, following a three-level 1D U-Net with channel widths [128, 256, 256]. Each level contains two residual temporal blocks with kernel size 3, dilations {1, 2, 4, 8} cycled across blocks, and GroupNorm and SiLU activations. Stride-2 down-sampling and nearest-neighbor up-sampling are used, with 1D convolutions on the skip path.
The latent diffusion process uses T = 200 noise levels with a linear variance schedule. The noise predictor is trained with AdamW (learning rate 1 × 10−4), batch size 128, gradient clipping at 1.0, and a decay of 0.999, for up to 250 epochs with early stopping on validation loss with a patience of 20.
All services run on an edge node connected to sensor devices over low-latency industrial Ethernet. After purification, the edge node publishes compact quality-control records on an Apache Kafka topic, including identifiers, defect type, classifier confidence, a purified/not-purified flag, and end-to-end latency. These records are consumed by the MIS to compute shift-level indicators such as first-pass yield, scrap mass, false reject/accept rates, and latency service-level objectives. Each decision is also logged with a small provenance vector, including attack flags, twin configuration hash, guidance parameters, and model version, to support governance and audit. Low confidence or latency violations trigger abstention and escalation to a manual review, while raw sensor streams remain on the edge node and are exported only upon escalation. In this way, both digital-twin inference and purification stay close to the line, and higher-level systems receive only summarized results and metadata.
5. Use Case
The DTCDP framework was applied to a hot-forming use case for steel parts. Steel bars are heated in an induction furnace, descaled with pressurized water and then transferred to a forming station. The use case aims to test the capacity of physics-guided diffusion purification to improve the robustness of time-series-based quality estimation under adversarial attacks and realistic data disturbances.
The DTCDP was deployed in the real-world environment of the use case, which already possesses a wide range of sensors collecting process-related data that support the pre-existing QC pipeline. Edge sensors such as pyrometers stream data to the DTCDP framework through the Apache Kafka bus. The main signals from the heating and descaling processes monitored by DTCDP are summarized in Table 4.
A dataset was collected over four weeks of operation of the hot-forming process. Each steel bar is represented by a multivariate time window of 6 s centered at the furnace exit, resampled at 100 Hz, resulting in sequences of 600 timesteps with the signals in Table 4. Quality labels of the product are derived from MIS historical data, with the classes being OK and NOK, where NOK signifies the presence of a defect. With a production cycle time of approximately 45 s and an average daily production of 1500 steel bars, a dataset containing over 40,000 per-product data signals was assembled. Across the four weeks, the class distribution is imbalanced, with OK comprising approximately 91% of the data and NOK approximately 9%. This class imbalance motivated the use of class weighting in the downstream LSTM classifier. The dataset was split at the batch level into training (65%), validation (15%), and test (20%) subsets to avoid data leakage. This batch-level split serves as a cross-batch evaluation, as the test set contains unseen production batches and the associated operating-condition variability, rather than random windows from the same batches. The collected windows span both nominal and boundary operating regimes, covering the full constraint envelope used by the twin. In the test split, approximately 10% of the included signals fall within the outer 5% of at least one channel's admissible range, representing near-boundary operating conditions.
In the hot-forming use case, the digital twin, which pre-exists this study, is instantiated as a low-order discrete-time state-space model following Equation (3), whose state captures the dominant thermal and electro-hydraulic dynamics of induction heating and descaling. The twin state comprises the effective bar thermal state near the furnace exit, an effective water-system load proxy (pressure- and flow-driven), the effective electrical power state, and a slowly varying disturbance term due to material emissivity and induction coil coupling drift. Exogenous inputs correspond to line speed, water pressure, and descaling flow. The model uses a 0.1 s update rate and is linearly interpolated to the aggregated 100 Hz sensor grid when forming event tuples.
Currently, the twin dynamics are modeled with a linear discrete-time update s_{k+1} = F s_k + G u_k, where F captures dominant thermal inertia and slow drift, and G captures the direct influence of line speed, water pressure, and descaling flow on the latent process state. The matrix F is denoted in Equation (15), while the matrix G is denoted in Equation (16).
In addition, the measurement mapping y_k = H s_k produces the twin-expected quantities used by the energy residuals. The matrix H is presented in Equation (17). The measurement disturbance follows a zero-mean Gaussian with covariance R, with the R matrix presented in Equation (18). Lastly, the process disturbance, which is also used to parametrize the twin transition, follows a zero-mean Gaussian with covariance Q, with the Q matrix described in Equation (19).
For each timestep in the 6 s window, the physics residual map is built from (i) per-channel amplitude constraints, (ii) per-channel rate constraints, and (iii) two global consistency constraints, yielding L = 3C + 2 residuals for C = 9 channels. Specifically, the constraint set includes per-channel amplitude bounds and rate limits for all signals in Table 4, plus an energy consistency term over the window and a twin attraction term. The diagonal weighting matrix A prioritizes thermally critical constraints (temperature and energy) over auxiliary hydraulics, while keeping all constraints active during guidance. The constraint sets are presented in Table 5.
The two global constraints include the energy consistency, with a tolerance of 100 kJ per 6 s window (approximately 5% of the nominal energy at 280 kW), and the twin attraction, with a tolerance of 1.0 in normalized units (approximately 1σ average deviation). In terms of weights, the diagonal entries of the matrix A were set piecewise as (i) 1.0 for all amplitude residuals, (ii) 0.7 for all rate residuals, (iii) 2.0 for all energy residuals, and (iv) 1.5 for all attraction residuals.
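To illustrate how such a residual map and weighted penalty could be assembled per window, a simplified sketch follows; the channel handling, the energy computation, and the helper names are assumptions, while the tolerances and weights mirror the values above:

```python
# Illustrative residual-map computation for one 600x9 window (bounds/rate limits are
# those of Table 5 in practice; the energy proxy and shapes here are examples only).
import numpy as np

def window_penalty(x, bounds, rate_limits, energy_measured, energy_expected,
                   twin_expected, tau_energy=100.0, tau_attr=1.0,
                   w_amp=1.0, w_rate=0.7, w_energy=2.0, w_attr=1.5):
    """x: (T, C) window; bounds: (C, 2); rate_limits: (C,); returns residuals, penalty."""
    residuals, weights = [], []

    # (i) Amplitude residuals: worst violation of per-channel admissible bounds.
    lo, hi = bounds[:, 0], bounds[:, 1]
    amp = np.maximum(0.0, lo - x).max(axis=0) + np.maximum(0.0, x - hi).max(axis=0)
    residuals.append(amp); weights.append(np.full_like(amp, w_amp))

    # (ii) Rate residuals: violation of per-channel rate limits on first differences.
    rate = np.maximum(0.0, np.abs(np.diff(x, axis=0)).max(axis=0) - rate_limits)
    residuals.append(rate); weights.append(np.full_like(rate, w_rate))

    # (iii) Global energy-consistency residual over the window.
    energy_gap = max(0.0, abs(energy_measured - energy_expected) - tau_energy)
    residuals.append(np.array([energy_gap])); weights.append(np.array([w_energy]))

    # (iv) Global twin-attraction residual (average deviation from twin-expected traces).
    attr = max(0.0, float(np.mean(np.abs(x - twin_expected))) - tau_attr)
    residuals.append(np.array([attr])); weights.append(np.array([w_attr]))

    r = np.concatenate(residuals)
    a = np.concatenate(weights)
    rho = float(np.sum(a * r ** 2))   # scalar penalty in the spirit of Equation (7)
    return r, rho
```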
It should be pointed out that the matrices F, G, H and the noise covariances Q, R were identified on the training split using prediction-error minimization on windows flagged as normal operation, with a batch-level split to avoid leakage, after z-score normalization per channel. Constraint bounds were initialized from the [0.5, 99.5] percentiles of the training data per channel and expanded by a safety margin of approximately 3%, while rate limits were set to the 99th percentile of the corresponding per-channel rate of change under normal operation. Lastly, the energy and attraction tolerances were set from the 95th percentile of their respective residual distributions.
In terms of preprocessing, before window extraction, raw streams are time-aligned and resampled to 100 Hz. Short missing segments (below 0.2 s) are filled by linear interpolation per channel. To suppress acquisition glitches, impulsive spikes are detected using a median filter and replaced with the local median, while any remaining extreme values are clipped to the admissible per-signal bounds. Lastly, all channels are z-score normalized using training-set statistics, with the same parameters applied to the validation and test splits.
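A simplified sketch of this per-channel preprocessing, with illustrative helper names and an assumed spike-detection threshold, could look as follows:

```python
# Illustrative preprocessing for one raw channel: 100 Hz resampling, short-gap
# interpolation (<0.2 s), median-filter despiking, clipping, train-statistics z-scoring.
import numpy as np
import pandas as pd
from scipy.signal import medfilt

def preprocess_channel(t, v, lo, hi, train_mean, train_std,
                       fs=100.0, max_gap_s=0.2, spike_kernel=5, spike_thresh=4.0):
    # Resample onto a uniform 100 Hz grid, interpolating only short gaps.
    s = pd.Series(v, index=pd.to_datetime(t, unit="s"))
    grid = s.resample(f"{int(1000 / fs)}ms").mean()
    grid = grid.interpolate(limit=int(max_gap_s * fs), limit_direction="both")
    x = grid.to_numpy(dtype=float)

    # Replace impulsive spikes with the local median (robust MAD-based threshold).
    med = medfilt(x, kernel_size=spike_kernel)
    mad = np.median(np.abs(x - med)) + 1e-9
    spikes = np.abs(x - med) > spike_thresh * mad
    x[spikes] = med[spikes]

    # Clip remaining extreme values to the admissible per-signal bounds.
    x = np.clip(x, lo, hi)

    # z-score normalize with training-set statistics.
    return (x - train_mean) / (train_std + 1e-9)
```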
The autoencoder used in the diffusion layer was trained on the training set and validated on the validation set. Its reconstruction performance is reported in Table 6.
These metrics indicate that the autoencoder captures the relevant heating and descaling dynamics without severe overfitting, providing a suitable latent representation for subsequent diffusion-based purification.
The downstream QC module is an LSTM-based classifier that consumes the per-bar time-series signals and outputs one of the two quality labels. The model consumes the normalized 600 × 9 windows and outputs OK/NOK probabilities via a two-layer LSTM with an MLP head. The classifier is trained with binary cross-entropy using Adam (learning rate 1 × 10−3), batch size 256, up to 50 epochs with a patience of 7, and class weighting to account for the NOK imbalance. This model pre-existed this study: the LSTM QC model was retained to remain fully compatible with the plant's deployed QC pipeline and edge latency constraints, and to isolate the impact of DTCDP as a drop-in defense. More complex sequence models were not adopted, as they would require redesigning the legacy QC stack and would potentially increase inference cost, without being necessary for the study's scope.
To probe robustness, white-box ℓ∞ attacks, created using the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD), are generated on the test signals with perturbation budgets corresponding to 1%, 2%, and 3% of the physical range per data signal, constrained to remain within actuator limits so that perturbed traces are visually plausible. A test-time evasion threat model is adopted, where an adversary with white-box knowledge of the QC classifier tampers with the multivariate sensor streams at the edge. Perturbations are modeled as additive changes on the 6 s per-bar window and are bounded in the ℓ∞ norm; thus, the maximum change to any single sample in each channel is limited to 1–3% of its physical range. This captures small yet worst-case sensor manipulations that remain within actuator limits while being optimized to misguide the QC model.
White-box, untargeted test-time evasion attacks (FGSM/PGD) are evaluated against the downstream QC classifier, with perturbations bounded in ℓ∞ and scaled per channel for ε ∈ {1%, 2%, 3%} of the physical range. For PGD, each iteration applies projection onto the ℓ∞ ball around the original window and clipping to the admissible bounds.
FGSM is applied as a single-step ℓ∞ attack with ε set to 1–3% of the physical range of each channel as presented in Table 4, using the classifier loss gradient and clipping the perturbed signal to the admissible bounds as seen in Table 5. PGD uses projected iterative updates with 20 iterations, a fixed step size, and a random start uniformly sampled within the ℓ∞ ball.
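A minimal sketch of such a PGD loop, assuming a TensorFlow classifier and illustrative helper names (the step size and loss are assumptions; the text above specifies 20 iterations, a random start, ℓ∞ projection, and clipping), is:

```python
# Illustrative ℓ∞ PGD attack on the QC classifier.
import tensorflow as tf

def pgd_linf(model, x, y, eps, step, lo, hi, iters=20):
    """x: (B, 600, 9) windows; eps/lo/hi broadcastable per channel; y: labels."""
    loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=False)
    # Random start uniformly sampled within the ℓ∞ ball.
    x_adv = x + eps * tf.random.uniform(tf.shape(x), -1.0, 1.0)
    x_adv = tf.clip_by_value(x_adv, lo, hi)

    for _ in range(iters):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = loss_fn(y, model(x_adv, training=False))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + step * tf.sign(grad)                 # ascent step on the loss
        x_adv = tf.clip_by_value(x_adv, x - eps, x + eps)    # project onto the ℓ∞ ball
        x_adv = tf.clip_by_value(x_adv, lo, hi)              # respect admissible bounds
    return x_adv
```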
In addition, non-adversarial disturbances are injected, including small calibration shifts due to sensor drift, missing-sample dropouts reconstructed by interpolation, and isolated spikes of up to 50% of the signal range, approximating common sensor failures and glitches observed in edge data streams.
Four purification strategies are tested to highlight the advantages of the proposed DTCDP framework: (1) no defense, where the LSTM operates directly on normalized signals; (2) simplified preprocessing, consisting of signal-wise low-pass filtering and clipping; (3) unconditioned diffusion, where the latent diffusion purifier is used without physics guidance (λ = 0); and (4) the DTCDP framework, where the latent diffusion is guided by the residuals produced by the existing hot-forming digital twin.
For the two diffusion-based strategies (unconditioned diffusion and DTCDP), inference-time sampling was run in a one-step mode (K = 1) using a deterministic reverse update from a fixed intermediate timestep, followed by the physics-guidance update. A few-steps setting (K = 10) was also tested during tuning; however, due to edge latency constraints imposed by the use case owner and the need for real-time rather than near-real-time inference, the evaluation was performed using the one-step mode.
The evaluation of the proposed framework is based on accuracy-related metrics of the QC module. The evaluation focuses on clean classification accuracy, robust accuracy under attack, first-pass yield (FPY) on clean and under-attack data, the false reject rate under attack, and the 95th percentile end-to-end latency per produced product. The evaluation results can be seen in Table 7.
As seen in Table 7, on clean data, all strategies achieve comparable performance, with DTCDP maintaining classification accuracy and FPY within one percentage point of the no-defense baseline. It also introduces a latency overhead of approximately 193 ms, which is well below the 45 s task time of the hot-forming process. Under adversarial perturbations, the unprotected baseline exhibits a drop in robust accuracy and FPY, as well as a corresponding increase in false rejects, particularly around edge cases where the classification probability of defective products is borderline. Lastly, simplified pre-processing recovers part of this loss but remains vulnerable to attacks that preserve low-frequency trends while altering transient behavior.
In the hot-forming process, the most damaging corruptions are (i) impulsive spikes on the process energy and (ii) slow, structured drifts that preserve low-frequency trends while violating rate limits. DTCDP suppresses these modes because the guidance backpropagates a weighted physics-residual penalty during denoising, preferentially damping changes that increase consistency residuals while keeping physically plausible transients intact. This effect is more pronounced for iterative PGD than for single-step FGSM, since multi-step attacks can accumulate small but systematic constraint violations that are counteracted by per-step feasibility projection. Lastly, the guidance strength λ controls how aggressively the sampler is pulled toward the twin-feasible manifold; a larger strength improves constraint adherence but can over-constrain sharp yet valid transitions. Hence, a strength of 0.05 was selected as the best compromise between robustness and clean accuracy.
To illustrate robustness trends, a robustness curve has been computed that shows the downstream QC accuracy under attack as a function of the perturbation budget (ε = 1%, 2%, and 3% of the per-signal physical range). Figure 4 presents the robustness curves, demonstrating that DTCDP degrades less as ε increases.
To characterize error modes under attack, the row-normalized confusion matrices for OK vs. NOK at the default decision threshold of 0.5 are reported in Figure 5. The matrices highlight that DTCDP reduces false rejects while also improving NOK recall compared to the no-defense baseline.
To quantify uncertainty, all reported rates in Table 7 have been computed on the held-out test split and are accompanied by 95% confidence intervals estimated via non-parametric bootstrap (1000 resamples at the bar level). Pairwise method comparisons were evaluated on the same test windows using a McNemar test on paired predictions for accuracy, robust accuracy, and FPY-derived rates, with Holm correction for multiple comparisons. ROC-AUC differences were assessed using a paired AUC test on the classifier probabilities. Threshold-swept ROC and precision-recall (PR) curves are computed from the downstream classifier probabilities for the same attack settings used in evaluation (PGD/FGSM attacks). The ROC curves are illustrated in Figure 6, with the dashed line representing the random-guess baseline, while the PR curves are illustrated in Figure 7.
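For reference, a minimal sketch of the bar-level bootstrap procedure described above (function and variable names are illustrative) is:

```python
# Illustrative non-parametric bootstrap for a per-bar rate (e.g., robust accuracy),
# matching the 1000-resample, bar-level procedure described above.
import numpy as np

def bootstrap_ci(correct, n_resamples=1000, alpha=0.05, seed=0):
    """correct: binary array (one entry per bar); returns point estimate and 95% CI."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = len(correct)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        idx = rng.integers(0, n, size=n)        # resample bars with replacement
        stats[b] = correct[idx].mean()
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return correct.mean(), (lo, hi)

# Example: accuracy, (ci_low, ci_high) = bootstrap_ci(per_bar_correct_flags)
```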
As seen from Figure 6 and Figure 7, the ROC/PR curves confirm that DTCDP improves separability under attack across a wide range of thresholds. In particular, the physics-guided pipeline maintains higher true-positive rates at comparable false-positive rates, consistent with the reduced false-reject metrics under attack.
Increasing K yields diminishing returns in robust accuracy, while latency scales approximately linearly, which justifies the selection of the one-step mode (K = 1). The comparative results of changes in steps, latency, and robustness are reported in Figure 8.
Both diffusion-based approaches improve robustness, with the physics-guided DTCDP yielding the highest robust accuracy and FPY under attack. From an MIS perspective, this translates into fewer bars unnecessarily routed to reheating or manual inspection and more stable quality indicators, without sacrificing responsiveness or requiring changes to the downstream QC logic.
Furthermore, to assess dependence on digital-twin accuracy, the DTCDP evaluation was repeated while perturbing the twin outputs and state alignment. The results are presented in Table 8. As seen in Table 8, robustness degrades as twin fidelity decreases; however, the DTCDP framework remains above the unprotected baseline. Lastly, the fluctuations in the p95 latency are statistically insignificant.
To address deployability under different compute resources and transient burst loads, the end-to-end inference latency was computed for each purification strategy with GPU acceleration enabled and in CPU-only mode. Load was emulated by replaying windows with an increasing number of concurrent requests to the purification service, representing short backlogs or multi-stream operation, while reporting median (p50) and tail (p95) latency. The median results are reported in Figure 9, while Figure 10 reports the tail-latency results.
As seen from Figure 9 and Figure 10, latency increases with load, and the effect is most pronounced in the diffusion-based pipelines, where denoising and guidance substantially increase computing time. GPU acceleration consistently reduces tail latency, keeping DTCDP within a sub-second threshold even under moderate burst loads, whereas CPU-only execution shows a steeper degradation as concurrency increases. Lastly, in terms of memory consumption, peak memory with GPU acceleration was 1.2 GB at a concurrency of 1, rising to 1.6 GB at a concurrency of 8, while in CPU-only execution, RAM usage was 0.9 GB at a concurrency of 1, rising to 1.4 GB at a concurrency of 8.
In the context of per-window runtime and throughput, each product corresponds to a 6 s window (600 samples at 100 Hz). The sample-level runtime can be approximated as the end-to-end latency divided by 600, while throughput is computed as concurrency divided by the median latency. For DTCDP in one-step mode, this corresponds to approximately 6 windows per second on the GPU at a concurrency of 1, i.e., roughly 0.3 ms per sample.
As DTCDP is deployed on the edge, the critical loop of purification and QC inference remains local, while only compact QC records or metadata are published to the MIS; therefore, degraded uplink connectivity primarily affects reporting rather than inference. To probe resilience to weak network conditions within the edge stack, a small-scale experiment was carried out in which synthetic delay and packet loss were injected on the HTTP path between Node-RED and the twin service (1% packet loss and 100 ms delay), with a 150 ms timeout and reuse of the last valid twin state. In this scenario, DTCDP maintained stable outputs with a slight tail-latency increase of approximately 79 ms in one-step mode at a concurrency of 1.
Following up on this small-scale experiment, a similar one was carried out to assess scalability to larger sensing configurations. The signals were expanded from 9 to 18 and 27 channels, preserving the sampling rate and window length and keeping the same latent architecture. Runtime scaled approximately linearly with channel count, with p95 latency rising from approximately 193 ms for 9 channels to 240 ms for 18 channels and 339 ms for 27 channels on GPU, pointing to feasibility for moderately expanded sensor suites, although further experimentation is needed.
Lastly, as generative purification can be bypassed by gradient-based adaptive attacks, a lightweight Backward Pass Differentiable Approximation (BPDA)-style evaluation of DTCDP in one-step mode was performed, where gradients are backpropagated through the purification stage using a straight-through approximation, with expectation over transformation (EOT) set to five samples and the same PGD settings. The resulting robustness is lower than in the non-adaptive PGD evaluation, with a drop in robust accuracy of approximately 4.5%.
6. Conclusions
This work introduced a DTCDP framework that combines latent diffusion with physics-aware guidance to mitigate adversarial perturbations in manufacturing QC. The approach couples a data integration and digital-twin layer, which derives synchronized states and physics residuals from shop floor time-series data, with a latent diffusion purifier and a downstream AI-based classifier deployed at the edge and integrated with MIS workflows.
A pilot application in a hot-forming station for steel parts was used to test the framework in a real-world production environment. Time-series signals were fed to a temporal autoencoder that provided a compact latent representation for the diffusion process. Given the low reconstruction errors of the autoencoder, the latent space preserves the data dynamics relevant for subsequent purification. On top of this representation, DTCDP improved the robustness of an LSTM-based quality classifier under attacks and disturbances, raising robust accuracy from 61% for the no-defense baseline to 81.5%.
The pilot also highlights several practical considerations. In real-world deployments, edge nodes may operate under tight power, thermal, and memory budgets and may not include a dedicated GPU, which can increase tail latency under burst loads. Consequently, the DTCDP should be configured in a one-step mode to maintain latency SLOs on CPU-only hardware.
In addition, the effectiveness of physics guidance depends on the fidelity and maintenance of the digital twin and on how well the constraint set captures feasible operating regimes, suggesting the need for mechanisms to detect and compensate for model drift in both the twin and the autoencoder. Digital-twin mismatch can bias feasibility-guided sampling toward twin-consistent yet physically incorrect reconstructions, particularly under regime changes or unmodeled dynamics. Performance is also sensitive to the selection of guidance parameters, such as the guidance strength λ and the residual weights, which govern the trade-off between constraint satisfaction and reconstruction fidelity. In particular, overly strong guidance may over-regularize valid transients, whereas weak guidance may leave violations unresolved. A practical extension is adaptive or data-driven tuning of these parameters under varying operating conditions, for example, via uncertainty-aware scaling based on twin state-estimation confidence, closed-loop adjustment to keep the physics penalty within a target range learned from nominal windows, or periodic re-calibration using trusted data to mitigate long-horizon drift in edge deployments. Furthermore, the current evaluation focuses on white-box attacks and a specific hot-forming process. This study has performed only a limited quantification of robustness under adaptive attackers that differentiate through purification, and does not evaluate long-term distribution shifts. Therefore, future work will extend the analysis to a wider range of adaptive attackers, alternative threat models, and sensor failures, and will investigate robustness across different production lines through application to other manufacturing systems such as EV battery assembly.
Lastly, DTCDP is modality-agnostic by design, which opens further extensions to multi-modal data streams combining time-series and image data, as well as coupling purification with online adversarial training of downstream models. Longer-term studies could also quantify the impact of physics-guided diffusion purification on MIS-level indicators such as scrap-related costs, stability of quality metrics, and effectiveness of escalation procedures in smart manufacturing environments. Future work will also focus on deployment hardening, including model compression and quantization, and automated monitoring of constraint drift and guidance-weight stability to support reliable long-term operation.