1. Introduction
Hot strip rolling is a critical technological process in steel strip production, comprising several interconnected stages such as reheating, rough rolling, high-pressure descaling, finish rolling, run-out table cooling, and coiling. As a major production route for high-quality steel sheet in modern steel manufacturing, hot strip rolling yields products that are widely used in the automotive, construction, household appliance, and energy industries. With increasing demand for superior dimensional accuracy and surface quality in downstream applications, thickness control precision and strip flatness have become key indicators for evaluating the performance of hot rolling technology.
However, fluctuations in strip thickness and flatness defects (such as edge waves, center waves, and warpage) occur frequently in practical production. These issues not only reduce yield but also lead to deterioration of subsequent processing performance and compromise the end-use quality of the products. Therefore, systematically analyzing the key factors influencing thickness and flatness quality in the hot rolling process and establishing an effective root-cause tracing mechanism are of great significance for optimizing process parameters and improving overall product quality [1,2,3,4,5,6].
While contemporary studies on thickness and flatness control in hot strip rolling have achieved notable advances in process parameter optimization, rolling mill dynamic compensation, and roll condition monitoring, prevailing approaches predominantly concentrate on isolated factor analysis or segmental process control, and a systemic account of quality evolution mechanisms involving multi-stage synergy and parameter interdependencies is still lacking [7,8].
The intricate causal relationships among process parameters (rolling force, tension profile, thermal gradient), machine conditions (roll thermo-mechanical performance, bearing dynamic characteristics), and material behaviors (strain resistance evolution, transformation kinetics) have not been fully elucidated, particularly under extreme operating conditions featuring elevated temperatures, severe deformation, and high-speed rolling. Consequently, the fundamental causes of quality deviations cannot be precisely determined via conventional correlation-based analytical approaches [9,10,11].
Such knowledge gaps prevent current methodologies from effectively reconstructing causality chains of quality defects during trans-process and multi-temporal scale variations in manufacturing systems, much less enabling causality-informed process optimization strategies. The development of causal analysis frameworks capable of deconvoluting nonlinear interactions among multivariate factors now constitutes a pivotal scientific challenge for overcoming quality control limitations in hot strip rolling processes.
The paradigm of causal inference and root-cause diagnostics [12] has gained significant traction both in mining associative information from process manufacturing data and in establishing variable causality to identify thickness/flatness-related information propagation pathways. A root-cause localization framework was developed in [13] to resolve the critical challenge of pinpointing underlying failure origins during fault incidents, encompassing both stationary and non-stationary fault scenarios. In [14], the authors constructed a research framework of “causal analysis–performance prediction–process optimization” that reconstructs causal networks from data to support decision-making and break the dual black-box dilemma in complex industrial processes. Recent advancements in root-cause analysis methodologies, including Granger causality, transfer entropy, and Bayesian networks, have enabled the characterization of causally related physical variables through causal topology structures extracted from process mechanics and domain knowledge. In complex systems requiring concurrent analysis of fault features and their interdependencies, transfer entropy and Granger causality have emerged as the predominant causal inference technologies [15].
In [16], the Normalized Transfer Entropy (NTE) and Normalized Direct Transfer Entropy (NDTE) were established as core statistical metrics, accompanied by an enhanced statistical verification method to ascertain significance thresholds, facilitating robust causality determination. In [17], the authors established a data-driven correlated fault propagation pathway recognition model integrating KPCA for fault detection with an innovative transfer entropy algorithm for causal graph formulation, culminating in a kernel extreme learning machine-enabled fault path tracing methodology. In [18], a fault knowledge graph was constructed utilizing operational/maintenance logs, complemented by a lightweight graph neural network architecture for concurrent fault detection within graph-structured data. In [19], a gated regression neural network architecture was engineered to refine conditional Granger causality modeling, enabling precise diagnosis of quality-relevant fault origins and propagation route identification. The authors of [20] proposed a unified model that integrates Granger causality-based causal discovery with fault diagnosis within a single framework, enhancing the traceability of diagnostic results. Finally, [21] integrated physics-informed constraints with Graph Neural Networks (GNNs); by employing entropy-enhanced sampling and conformal learning, the authors were able to improve the accuracy of causal discovery and reduce spurious connections.
For robust reconstruction of weighted Granger causality networks in stationary multivariate linear dynamics, [22] devised a systematic methodology integrating sparse optimization with novel scalar correlation functions to achieve parsimonious model selection. An innovative fusion of PCA and Granger causality yielded a PCA-enhanced fault magnification algorithm for root-cause detection and propagation path tracing, with multivariate Granger analysis decoding causal relationships from algorithmic outputs [23].
In addition, some studies have conducted causal analysis on multi-factor fault diagnosis problems. In [24], the authors designed a modular structure called the Sparse Causal Residual Neural Network (SCRNN), which uses a prediction target with layered sparse constraints and extracts multiple lagged linear and nonlinear causal relationships between variables to analyze the complete topology of fault propagation. On this basis, an R-value measurement is introduced to quantify the impact of each fault variable and accurately locate the root cause.
The majority of current research concentrates on static association analysis or univariate temporal causality reasoning, lacking comprehensive incorporation of the distinctive multi-lag coupling dynamics prevalent in industrial processes. This limitation becomes particularly pronounced in manufacturing systems such as hot strip rolling that exhibit substantial inter-stand interaction effects, where prevailing causal analysis methodologies demonstrate notable constraints. In multi-stand hot rolling operations, process perturbations originating from upstream stands generally exhibit delayed propagation characteristics, becoming detectable in downstream quality metrics only after several sampling intervals. These deterministic/stochastic time delays may induce conventional causality analysis approaches (e.g., Granger causality tests) to generate erroneous causal directionality determinations or attenuated causality magnitude estimations.
Consequently, this study investigates the hot strip rolling process, focusing on thickness and flatness quality challenges, and combines theoretical analysis, computational modeling, and industrial data analytics to perform time-delayed root-cause diagnostics.
This paper’s primary contributions and innovations are as follows:
Critical process parameters, mill conditions, and material properties affecting slab thickness and flatness are systematically identified and analyzed. The dynamic interactions and temporal dependencies among these factors are elucidated, providing a theoretical basis for subsequent causal inference.
A correlation matrix computation method integrating Dynamic Time Warping (DTW) and Mutual Information (MI) is proposed to effectively address temporal misalignment in heterogeneous industrial time series. This method quantitatively characterizes cross-factor association strengths, enabling efficient preselection of candidate causal variable pairs.
A transformer-based framework is developed that approximates time-varying transfer entropy using attention mechanisms and input masking techniques. Without requiring explicit probabilistic modeling, this framework directly extracts dynamic information transfer patterns among process variables from operational data, enabling interpretable data-driven causal reasoning.
Rather than simply combining existing techniques (DTW, MI, and transformers), this work introduces a novel causal-reasoning paradigm. In contrast to conventional causal inference methods such as Granger causality and transfer entropy, the model computes information gain implicitly by learning attention response differences under masked perturbations of each variable, which provides a mathematically interpretable surrogate of causal strength rather than a generic statistical association.
The rest of this article is organized as follows:
Section 2 systematically examines critical quality determinants, establishing the theoretical framework for later investigations;
Section 3 develops DTW-MI-based correlation matrices, achieving reduced data dimensionality while deriving undirected association networks among process variables;
Section 4 proposes a transformer-empowered information gain approximation approach to delineate causal information transfer mechanisms;
Section 5 conducts comprehensive validation of the proposed framework with real-world industrial production datasets; finally,
Section 6 concludes with the key findings and contributions of this research.
4. Transformer-Based Information Gain Approximation Reasoning
Traditional probabilistic causal modeling approaches such as Bayesian networks or Gaussian processes face inherent limitations when handling high-dimensional time-varying industrial data with complex dynamic interactions. In contrast, the proposed causal inference framework is grounded in information-theoretic causality, which emphasizes directed and lag-dependent information flow rather than symmetric correlation relationships. The transformer-based structure enables direct learning of dynamic dependencies from data without explicitly constructing joint probability density models or performing likelihood calculations. Its attention mechanism not only captures temporal dependencies but also reveals the relative information contribution of each variable and its historical time slices to the prediction target. Therefore, the learned attention weights provide interpretable visual evidence of dynamic causal relationships among process variables.
4.1. Time-Varying Transfer Entropy
Although mutual information can measure correlations between variables, its symmetry limits its application in causal direction identification. Therefore, this paper introduces the Time-Varying Transfer Entropy (TVTE), which estimates information flow between variables across different time intervals using a sliding window technique, thereby characterizing the dynamic evolution of causal relationships.
Transfer entropy is an asymmetric information measure based on conditional probability, defined as
$$
TE_{X \to Y} = \sum_{y_{t+1},\, \mathbf{y}_t^{(k)},\, \mathbf{x}_t^{(l)}} p\!\left(y_{t+1}, \mathbf{y}_t^{(k)}, \mathbf{x}_t^{(l)}\right) \log \frac{p\!\left(y_{t+1} \mid \mathbf{y}_t^{(k)}, \mathbf{x}_t^{(l)}\right)}{p\!\left(y_{t+1} \mid \mathbf{y}_t^{(k)}\right)},
$$
where $\mathbf{y}_t^{(k)} = (y_t, \ldots, y_{t-k+1})$ and $\mathbf{x}_t^{(l)} = (x_t, \ldots, x_{t-l+1})$ are historical state vectors of $Y$ and $X$, respectively, with $k$ and $l$ being the history orders. To capture the dynamic evolution of causal relationships, this study employs a sliding window mechanism to partition the time series into sub-intervals, computing the transfer entropy independently for each window, defined as follows:
$$
TE_{X \to Y}^{(w)} = \left. TE_{X \to Y} \right|_{t \in [(w-1)W + 1,\; wW]},
$$
where $W$ is the window length and $w$ denotes the window index. For each candidate variable pair $(X, Y)$, bidirectional transfer entropy, $TE_{X \to Y}^{(w)}$ and $TE_{Y \to X}^{(w)}$, is computed within each window, and we define the causal strength difference as
$$
\Delta TE^{(w)} = TE_{X \to Y}^{(w)} - TE_{Y \to X}^{(w)}.
$$
If $\Delta TE^{(w)} > 0$, then a causal direction $X \to Y$ is considered to exist in the current window. By counting the occurrence frequency of causal directions across all windows for each variable pair, the directional prevalence ratio is obtained:
$$
\rho_{X \to Y} = \frac{\left| \{ w : \Delta TE^{(w)} > 0 \} \right|}{N_w},
$$
where $|\cdot|$ denotes the cardinality of a set and $N_w$ is the number of windows. Let $\rho_0$ be the threshold; if $\rho_{X \to Y} > \rho_0$, then a stable causal relationship $X \to Y$ is considered to exist.
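For concreteness, the following minimal Python sketch shows how the windowed transfer entropy and the directional prevalence ratio could be estimated with plug-in histogram probabilities. This is not the paper's implementation: the history orders k = l = 1, the bin count, and the window length are illustrative assumptions.
```python
# Minimal numpy sketch of windowed transfer entropy and the directional
# prevalence ratio. Histogram (plug-in) probability estimation, history orders
# k = l = 1, and the bin/window sizes are illustrative assumptions.
import numpy as np

def transfer_entropy(x, y, bins=8):
    """Plug-in estimate of TE_{X->Y} with history orders k = l = 1."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins))
    yd = np.digitize(y, np.histogram_bin_edges(y, bins))
    y_next, y_hist, x_hist = yd[1:], yd[:-1], xd[:-1]

    def p(*cols):
        # Empirical joint probability of each sample's discrete state.
        _, inv, cnt = np.unique(np.stack(cols, axis=1), axis=0,
                                return_inverse=True, return_counts=True)
        return cnt[inv] / len(inv)

    # TE = E[ log( p(y+|y,x) / p(y+|y) ) ], evaluated sample-wise.
    return float(np.mean(np.log((p(y_next, y_hist, x_hist) / p(y_hist, x_hist))
                                / (p(y_next, y_hist) / p(y_hist)))))

def directional_prevalence(x, y, W=500):
    """Fraction of windows in which TE(X->Y) exceeds TE(Y->X)."""
    hits = [transfer_entropy(x[i*W:(i+1)*W], y[i*W:(i+1)*W])
            > transfer_entropy(y[i*W:(i+1)*W], x[i*W:(i+1)*W])
            for i in range(len(x) // W)]
    return np.mean(hits)

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = np.roll(x, 1) + 0.5 * rng.normal(size=5000)   # y lags x: true direction X -> Y
print(directional_prevalence(x, y))               # expected well above 0.5
```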
Explicit probability distribution estimation using Equation (11) requires computing the high-dimensional joint probability $p(y_{t+1}, \mathbf{y}_t^{(k)}, \mathbf{x}_t^{(l)})$ and the conditional probabilities $p(y_{t+1} \mid \mathbf{y}_t^{(k)}, \mathbf{x}_t^{(l)})$ and $p(y_{t+1} \mid \mathbf{y}_t^{(k)})$; moreover, sliding window methods struggle to adapt to rapidly changing dynamic systems.
Therefore, this paper introduces a transformer to directly learn dynamic dependencies between variables without explicit probability computation. Traditional methods typically use Kernel Density Estimation (KDE) for conditional probability estimation; however, KDE suffers from the curse of dimensionality in high-dimensional state spaces, causing large estimation bias and high computational cost.
Thus, this paper proposes a transformer-based deep model to replace traditional explicit probability modeling, directly learning dynamic information flow relationships between variables from the data.
4.2. End-to-End Joint Training
The time-varying transfer entropy measures the additional contribution of $X$'s history to $Y$'s future information given $Y$'s own history. The estimation of these conditional probabilities is highly challenging.
Instead of manually computing the TVTE, we train a transformer to predict the target variable $Y$ and infer the contribution of variable $X$ to the prediction via information gain, thereby obtaining an approximate TVTE. In short, improved prediction capability ≈ enhanced information flow ≈ increased TVTE.
For any target variable $Y$, this framework constructs a transformer-based prediction model
$$
\hat{y}_{t+1} = f_\theta\!\left(\mathbf{y}_t^{(k)}, \mathbf{x}_t^{(l)}\right),
$$
trained using a regular supervised learning objective function
$$
\mathcal{L}(\theta) = \frac{1}{T} \sum_{t} \left(y_{t+1} - \hat{y}_{t+1}\right)^2.
$$
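A minimal sketch of such a prediction model is given below, assuming a PyTorch encoder-only transformer; the class name, layer sizes, and dummy data are illustrative placeholders rather than the paper's exact architecture.
```python
# Minimal PyTorch sketch of the prediction model f_theta trained with an MSE
# objective. Layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class VariablePredictor(nn.Module):
    def __init__(self, n_vars=6, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_vars, d_model)          # per-step variable embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                # one-step-ahead prediction

    def forward(self, x):                                # x: (batch, T, n_vars)
        h = self.encoder(self.embed(x))
        return self.head(h[:, -1])                       # last token summarizes history

model = VariablePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

# One supervised training step on dummy windows of 50 steps over 6 variables.
x, y = torch.randn(32, 50, 6), torch.randn(32, 1)
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()
```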
In our proposed transformer-based causal modeling framework, the attention mechanism not only learns temporal dependency features but also reflects the “information contribution degree” of input variables to the target variable. Specifically, when inputs consist of multiple time series variables, the attention weights in the transformer can be interpreted as the model’s reliance on different variables and their temporal slices, providing visual evidence for causal relationships.
The attention layer in the transformer calculates attention weights for all input tokens:
$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V,
$$
where $Q$ (Query), $K$ (Key), and $V$ (Value) are representations obtained from the input sequence through different linear mappings, $d_k$ is the dimension of the key, used for scaling to stabilize gradients, and softmax normalizes the attention scores into a probability distribution. The weight matrix $A = \mathrm{softmax}(QK^\top / \sqrt{d_k})$, defined over the temporal (or cross-variable) dimension and output by each attention head, represents the current token's level of attention to all input tokens. By aggregating these weights, we can calculate the attention contribution of a target location (such as the predicted $\hat{y}_{t+1}$) to each input variable at each time step.
We aggregate the attention weights along the variable dimension (e.g., summing over time):
$$
\alpha_{X_j} = \sum_{t} A\!\left[\,\hat{y},\,(X_j, t)\,\right],
$$
where a larger $\alpha_{X_j}$ indicates that the transformer relies more on $X_j$ for prediction, suggesting a potential causal influence $X_j \to Y$ to some extent.
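As an illustration, the following numpy sketch aggregates an attention row of the prediction token over flattened (variable, time) tokens into per-variable reliance scores; the token layout and the random attention row standing in for a trained model's weights are assumptions.
```python
# Illustrative numpy sketch of aggregating token-level attention into
# per-variable reliance scores. The flattened (variable, time) token layout
# is an assumption for demonstration.
import numpy as np

n_vars, T = 6, 50
# One softmax-normalized attention row of the prediction token over all input
# tokens (drawn at random here in place of a trained model's weights).
A = np.random.default_rng(0).dirichlet(np.ones(n_vars * T))

alpha = A.reshape(n_vars, T).sum(axis=1)   # sum over time per variable
print("variable reliance scores:", np.round(alpha, 3))
print("most relied-on variable:", int(np.argmax(alpha)))
```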
4.3. Information Gain Approximation Reasoning
To evaluate the marginal contribution of each input variable to target variable prediction, this framework introduces an input masking mechanism. This mechanism constructs masked input models by artificially “blocking” or “perturbing” partial input variables to observe changes in model prediction performance. Specifically, we replace a variable $X_j$ in the original input sequence with fixed values (e.g., zeros, mean, noise, or learned tokens) to create input samples lacking that variable's information. We construct the control model $f_\theta^{\mathrm{mask}(j)}$, which is the same network evaluated on the masked input.
Thus, the approximate Soft Transfer Entropy Index (Soft TVTE) is defined as
$$
\mathrm{SoftTVTE}_{X_j \to Y} = \beta \cdot \mathrm{DTD}_j + (1 - \beta) \sum_{d} w_d \sum_{h} A^{(d,h)}_{X_j},
$$
where $\mathrm{DTD}_j = \mathcal{L}\!\left(f_\theta^{\mathrm{mask}(j)}\right) - \mathcal{L}(f_\theta)$ quantifies the influence by directly comparing prediction differences, representing the Direct Transfer Difference (DTD); the second term represents the weighted sum based on attention responses, in which $A^{(d,h)}_{X_j}$ denotes the attention allocated to input variable $X_j$ by the $h$-th head in the $d$-th layer, reflecting $X_j$'s contribution to predicting $Y$, with layer weights $w_d \geq 0$, $\sum_d w_d = 1$.
By computing $\mathrm{SoftTVTE}_{X_i \to X_j}$ across all variable pairs, a time-varying causal graph can be constructed:
$$
G_t = \left[\mathrm{SoftTVTE}_{X_i \to X_j}(t)\right]_{i,j},
$$
where $G_t$ represents the causal graph adjacency matrix at time $t$, with the $(i, j)$-th element reflecting the information transfer intensity from $X_i$ to $X_j$. Note that this is a directed weighted graph, and as such its structure may vary across time $t$ to capture the dynamic evolution of system causality.
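The sketch below illustrates one plausible realization of this computation; the mean-value mask, the mixing weight `beta`, and the stand-in linear model are assumptions for illustration, with the transformer of Section 4.2 and its aggregated attention weights taking their places in practice.
```python
# Plausible realization of the Soft TVTE computation via input masking.
# `model`, `attn_scores`, and `beta` are illustrative stand-ins.
import torch
import torch.nn as nn

def soft_tvte(model, loss_fn, x, y, attn_scores, beta=0.5):
    """One Soft TVTE score per input variable for the target y."""
    model.eval()
    scores = []
    with torch.no_grad():
        base_loss = loss_fn(model(x), y).item()
        for j in range(x.shape[-1]):
            x_masked = x.clone()
            x_masked[..., j] = x[..., j].mean()                   # block variable j
            dtd = loss_fn(model(x_masked), y).item() - base_loss  # DTD term
            scores.append(beta * dtd + (1 - beta) * attn_scores[j])
    return scores

model, loss_fn = nn.Linear(6, 1), nn.MSELoss()
x, y = torch.randn(128, 6), torch.randn(128, 1)
print(soft_tvte(model, loss_fn, x, y, attn_scores=[1 / 6] * 6))
# Stacking these scores for every (source, target) pair, with one trained
# predictor per target, fills the adjacency matrix G_t of the causal graph.
```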
Figure 4 illustrates the transformer-based information gain inference process.
5. Experiment
Process data were collected in real-time through a multi-source sensor network deployed at key process nodes in a 2250 mm hot strip mill of a steel plant. The raw data were first normalized as shown in Figure 5, where $v_1$–$v_6$ represent roll condition, rolling force distribution, rolling force, tension, incoming strip crown, and incoming thickness/hardness variation, respectively.
Based on field experience, the causal relationships between $v_1$–$v_6$ are shown in Figure 6, with the actual causal paths being $v_1 \to v_2 \to v_3$, $v_1 \to v_4 \to v_3$, and $v_6 \to v_5$. The production significance is as follows: abnormal roll condition ($v_1$) affects rolling force distribution ($v_2$) and tension ($v_4$), leading to abnormal rolling force ($v_3$), while incoming thickness/hardness variation ($v_6$) causes abnormal strip crown ($v_5$). Ultimately, abnormal rolling force and incoming crown jointly cause thickness and shape defects.
5.1. Calculation of the Correlation Matrix Based on DTW-MI
Taking $v_1$–$v_6$ as an example, we set the DTW time window to 50 time steps and the mutual information threshold to 0.05. The DTW processing is shown in Figure 7, Figure 8 and Figure 9.
From the data in Figure 7, it can be seen that the plotted variable pair has a causal relationship, with the effect variable changing after the cause with a certain delay. Comparing Table 1 and Table 2, the correlation coefficient for this variable pair in Table 2 is significantly higher than in Table 1, showing that DTW processing can enhance truly causal variable pairs for better causal inference. After DTW processing, six variable pairs were selected for information gain approximation.
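To make the screening step concrete, the following numpy-only sketch performs DTW alignment followed by a histogram mutual information estimate. The DTW recursion and MI estimator are textbook constructions, and the synthetic delayed pair and bin count are illustrative, while the 50-step window and the 0.05 threshold follow the values quoted above.
```python
# Numpy-only sketch of the DTW-MI screening step (illustrative, not the
# paper's implementation).
import numpy as np

def dtw_path(a, b):
    """Classic dynamic-programming DTW; returns the optimal warping path."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = abs(a[i-1] - b[j-1]) + min(D[i-1, j], D[i, j-1], D[i-1, j-1])
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):                      # backtrack along cheapest moves
        path.append((i - 1, j - 1))
        i, j = min([(i-1, j), (i, j-1), (i-1, j-1)], key=lambda t: D[t])
    return path[::-1]

def mutual_info(a, b, bins=8):
    """Histogram plug-in estimate of MI between two aligned series."""
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

rng = np.random.default_rng(1)
u = rng.normal(size=50)                                      # 50-step DTW window
v = np.concatenate([np.zeros(5), u[:-5]]) + 0.1 * rng.normal(size=50)  # delayed copy

path = dtw_path(u, v)                                        # align away the lag
a, b = np.array([u[i] for i, _ in path]), np.array([v[j] for _, j in path])
print("MI after DTW alignment:", mutual_info(a, b))          # keep pair if > 0.05
```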
5.2. Causal Path Inference
Each sample consists of observation sequences from the past
time steps, with the model task being to predict the target variable
Y at the next time step. A lightweight transformer model was built with two encoder layers, four attention heads, a hidden dimension of 64, learning rate 0.01, and MSE as the training loss.The total data volume used in this work comprised 200,000+ samples. The dataset was partitioned into 70% training, 15% validation, and 15% testing. The training process was monitored using early stopping criteria, where training was halted if the validation loss did not improve for ten consecutive epochs. After training, information gain was calculated for each input variable
according to (
21). Additionally, conventional Granger causality analysis and TVTE causality analysis were selected for comparison.
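A minimal training-loop sketch consistent with this protocol (MSE loss, early stopping with a patience of ten epochs) is given below; the stand-in linear model and random loaders are placeholders, with the transformer of Section 4 and the 70/15/15 split of the plant data used in practice.
```python
# Minimal training loop with early stopping (patience = 10 epochs).
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_early_stop(model, opt, loss_fn, train_loader, val_loader,
                     max_epochs=200, patience=10):
    best_val, best_state, stale = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
        if val < best_val:                      # improvement: checkpoint weights
            best_val, best_state, stale = val, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:               # ten stale epochs in a row: stop
                break
    model.load_state_dict(best_state)
    return model

model = nn.Linear(6, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loader = lambda n: DataLoader(TensorDataset(torch.randn(n, 6), torch.randn(n, 1)),
                              batch_size=32)
train_early_stop(model, opt, nn.MSELoss(), loader(700), loader(150))
```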
The information flow chains inferred by each method among $v_1$–$v_6$ are shown in Figure 10, Figure 11 and Figure 12: Figure 10 identifies four information flow chains, Figure 11 shows three, and Figure 12 indicates five.
The causal paths obtained by Granger causality analysis differ most from the actual paths in Figure 6, indicating nonlinear relationships between flatness/thickness and the process variables that Granger causality cannot handle due to its linearity assumption. TVTE-based causal inference produces paths closer to those in Figure 6 but still shows significant directional errors in the information transfer between variables. The transformer-based information gain chain generates the causal paths most consistent with Figure 6, more accurately reconstructing the true causal relationships.
Table 6 compares the proposed method with Granger causality, TVTE, the PC algorithm, and LiNGAM. The results show that our method outperforms the others in accuracy, recall, and F1-score. The PC algorithm relies on conditional independence tests and LiNGAM on a linear non-Gaussian (ICA-based) model; both are therefore limited to largely linear relationships and less suitable for the nonlinear coupling common in hot rolling. Granger causality captures time dependence but typically uses fixed-lag modeling, which struggles under changing operating conditions. In contrast, the proposed transformer-based information gain framework effectively models nonlinear causal correlations and complex dynamic couplings, adaptively adjusts causal influence with process state changes via self-attention, and provides interpretable attention weights that quantify the contribution of variables and their time slices to quality outcomes.
To further evaluate the robustness and industrial applicability of the proposed causal inference framework, we conducted uncertainty, sensitivity, and generalization analyses. Uncertainty was quantified through multiple Monte Carlo training experiments with different random initializations and sampling conditions; the variance of attention-derived causal strengths remained below 0.07 across repeated runs, indicating stable causal edge confidence. Sensitivity analysis was performed by perturbing key process variables within ±5% to ±10% of realistic operating fluctuations and masking selected input features. The resulting causal pathways exhibited consistent directionality and preserved more than 90% of the major inferred causal edges, demonstrating robustness against measurement noise and process disturbances. In addition, a model trained on data from one production line was tested on independent datasets, collected from another production line, with different rolling schedules and equipment configurations, to evaluate cross-site generalization ability. Measured by the overlap rate of causal edges, the discovered causal structure maintained a structural consistency of more than 85%, highlighting its strong adaptability to different industrial conditions and confirming its practical effectiveness in multi-plant deployment.
5.3. Practical Deployment Feasibility in Industrial Hot Rolling
To demonstrate industrial applicability, the proposed causal inference framework is designed to be directly integrated into existing Level-2 automation systems without additional hardware modification. The model processes real-time process data streams (50–200 Hz sampling frequency) and updates causal indicators every 1–5 s, fully matching the decision cycle of thickness and shape control. Benefiting from efficient transformer inference, each forward computation requires less than 40 ms on a standard industrial server, ensuring real-time causal tracing and early warning alerts for potential defects. In production deployment, the inferred causal pathways are utilized to perform targeted corrective interventions, including adaptive tuning of the roll thermal crown, bending force distribution, roll gap, and cooling strategy. This capability significantly shortens the diagnosis-to-action latency compared with conventional correlation-based monitoring or offline expert analysis, thereby reducing defect propagation risks and improving overall operational stability. Furthermore, the proposed framework supports both offline and online operational modes. Offline analysis enables root-cause investigation for historical production deviations and assists in determining long-term process optimization strategies. Online monitoring continuously evaluates dynamic causal influence among critical variables, with actionable results fed back to the control system for timely parameter adjustment. While most responses can be executed immediately, certain control actions may exhibit slight inherent delays due to mechanical and thermal inertia in the rolling equipment. Overall, the proposed system provides high engineering feasibility with a favorable cost–benefit profile, contributing to improved product quality and yield in industrial hot rolling operations.