1. Introduction
Global steel output is on the order of two billion tons per year. Among ironmaking technologies, the blast furnace ironmaking process (BFIP) remains the dominant route for large-scale pig iron production due to its high productivity, operational stability, and cost efficiency. A schematic representation of the BFIP is shown in Figure 1. In this process, alternating layers of iron-bearing materials, coke, and flux are continuously charged from the furnace top, while preheated blast air, often enriched with oxygen and auxiliary fuels, is injected through tuyeres located near the furnace hearth. As the burden descends, the iron oxides are progressively reduced by ascending reducing gases generated from coke gasification, forming molten iron and slag. The molten iron collects in the hearth and is periodically tapped, whereas the top gas is recycled after dust removal and heat recovery [1].
The BFIP is essentially a highly coupled, nonlinear, and multiphase chemical-physical reactor characterized by strong spatiotemporal variability, high temperatures, and intense energy exchange. These complexities pose significant challenges for process optimization, energy conservation, and environmental protection. Given the substantial energy consumption and emissions associated with ironmaking, improving process efficiency and ensuring operational safety are key priorities for the metallurgical and chemical process industries [2,3].
In practical blast furnace ironmaking operations, fault scenarios such as furnace collapse, abnormal rises or drops in furnace temperature, and channeling events frequently occur under complex and strongly coupled process conditions. These faults not only differ in their physical mechanisms and observable patterns but also exhibit highly imbalanced occurrence frequencies. More importantly, misclassification of certain fault types, especially missed detection of severe faults, may lead to substantial safety risks, equipment damage, and economic losses. Therefore, fault diagnosis in the BFIP is not merely a pattern recognition task but a risk-sensitive decision-making problem that must account for asymmetric misclassification costs. This practical characteristic motivates the formulation of BFIP fault diagnosis as a cost-sensitive sequential decision problem in this study.
Fault diagnosis technology is essential for maintaining stable furnace operation and avoiding catastrophic failures. Fault diagnosis methods are broadly categorized as either model-based or data-driven. Model-based techniques depend on accurate mathematical models derived from first-principles mechanisms. However, constructing such a model for a process as complex as the BFIP, where gas–solid reactions, heterogeneous heat and mass transfer, and dynamic burden descent coexist, is difficult [4,5]. A vast amount of process data is now continuously collected from blast furnace instrumentation systems. This has enabled the rise of data-driven fault diagnosis, in which knowledge of the system's health state is inferred directly from historical and real-time measurements [6,7]. Consequently, data-driven diagnosis has become a promising research frontier for improving operational reliability and efficiency in the BFIP [3].
Traditional machine learning algorithms have been widely applied in this context due to their relatively straightforward implementation and interpretability. For instance, Zhou et al. [8] applied principal component analysis (PCA)-based methods for monitoring the ironmaking process in a blast furnace. Zhang et al. [9] proposed a two-stage PCA for fault diagnosis in the blast furnace ironmaking process. Tian et al. [10] developed a support vector machine (SVM) ensemble for fault diagnosis in blast furnaces. Liu et al. [11] introduced an optional SVM approach for fault diagnosis of blast furnaces handling imbalanced data. Pan et al. [12] presented robust principal component pursuit for fault detection in blast furnace processes.
Despite these developments, traditional machine learning methods are limited by their dependence on manually engineered features and their shallow representational capacity. Deep learning (DL) has therefore emerged as a powerful alternative, offering automated extraction of deep hierarchical features from high-dimensional process data [13]. For example, Jang et al. [14] proposed an adversarial autoencoder-based approach to learn features for fault detection in industrial processes. Wang et al. [15] applied a stacked supervised autoencoder for fault-relevant feature extraction and classification. Yu et al. [16] introduced a supervised convolutional autoencoder for fault-relevant feature learning in fault diagnosis. Zhu et al. [17] explored variational autoencoders for fault detection and diagnosis in industrial processes. Xu et al. [18] employed a hybrid framework combining long short-term memory (LSTM) networks and convolutional neural networks (CNNs) to extract deep temporal and spatial features from time-series and image data, respectively. Building on LSTM architectures, bidirectional long short-term memory (BiLSTM) models have also been widely used to capture bidirectional temporal dependencies in industrial time-series analysis [19]. In addition to recurrent models, the Transformer architecture [20], powered by multi-head self-attention, has recently been adopted for industrial fault diagnosis due to its strong capability in capturing long-range global dependencies.
However, two practical challenges persist in blast furnace fault diagnosis. The first is data imbalance: abnormal conditions occur infrequently, so labeled faulty data are scarce. The second is asymmetric misclassification cost: false negatives (missed fault detections) can lead to serious operational or safety risks, while false positives cause unnecessary process interruptions and productivity losses. These issues hinder the deployment of conventional classification models in reliable industrial applications. Reinforcement learning (RL) offers a promising way to overcome these limitations by optimizing long-term rewards through continuous interaction with the environment, reducing the need for extensive labeled datasets [21]. Its capability to account for penalties, uncertainties, and deferred consequences makes RL well suited to decision-making in sophisticated industrial environments [22]. By framing fault diagnosis within a Markov decision process (MDP), the agent progressively refines its diagnostic policy while integrating both diagnostic results and operational costs [23]. This methodology is highly compatible with fault diagnosis in the BFIP, where dynamic decisions are required to balance precision, operational safety, and cost-effectiveness [24]. Nevertheless, traditional RL still struggles with the rarity of fault samples during learning. Prioritized Experience Replay (PER) [25] addresses this issue by sampling experiences in proportion to their temporal-difference error, enabling the agent to focus more heavily on informative but infrequent fault-related transitions. Furthermore, to better handle the asymmetric misclassification costs inherent in industrial diagnosis, an adaptive penalty mechanism [26] can be incorporated into the reward design, dynamically adjusting penalty magnitudes based on evolving diagnostic performance to achieve a better tradeoff between safety and false alarms.
Although cost-sensitive learning can alternatively be achieved within supervised learning frameworks through techniques such as class-weighted cross-entropy or focal loss, these approaches typically rely on fixed and predefined penalty structures. In contrast, the reinforcement learning paradigm adopted in this study provides a more flexible and policy-oriented solution. By formulating the fault diagnosis task as a cost-sensitive Markov decision process, the diagnostic policy is optimized through a reward-driven learning mechanism that explicitly integrates adaptive misclassification penalties and prioritized experience replay. This formulation allows the learned policy to account for long-term decision consequences and asymmetric fault risks, rather than optimizing only instantaneous classification loss. Consequently, the model can place greater emphasis on high-consequence fault categories while maintaining overall diagnostic effectiveness under normal operating conditions. This risk-aware decision formulation is particularly suitable for the blast furnace ironmaking process, where fault occurrences are highly imbalanced and the costs associated with different misclassification types vary substantially.
Considering the characteristics of blast furnace process data, effective fault diagnosis requires models that can simultaneously capture global correlations among multivariate variables and fine-grained temporal evolution patterns. Transformer architectures, equipped with multi-head self-attention, are well suited for extracting long-range contextual dependencies, while BiLSTM networks provide strong capabilities in modeling forward–backward temporal dynamics in sequential industrial data. However, even with powerful feature extractors, the challenges of data imbalance and asymmetric misclassification costs remain difficult to address using purely supervised learning. Double Deep Q-Network (DDQN) offers a principled solution by enabling cost-aware decision-making through interaction with the environment and by incorporating mechanisms such as prioritized experience replay to emphasize rare fault samples. Recent studies have made notable progress in this domain. Lou et al. [27] proposed a deep correlation network representation regression framework for modeling and monitoring blast furnace ironmaking processes. Wang et al. [28] introduced multiple local manifold learning for fault detection in blast furnace systems. Domain-adaptation and knowledge-driven methods have also been developed to handle limited fault samples and distribution shifts in blast furnace fault diagnosis [29,30]. Meanwhile, deep reinforcement learning approaches with reward adjustment for minority classes and cost-sensitive classifiers have been explored to address class imbalance in industrial fault diagnosis [31]. Despite these advancements, existing hybrid architectures (e.g., CNN-LSTM, Transformer-based models) often lack effective joint spatiotemporal modeling for strongly coupled process variables. Furthermore, most reinforcement learning-based methods still fail to explicitly address asymmetric misclassification costs under extreme class imbalance. Building upon the above discussions, this study proposes a Transformer–BiLSTM DDQN (TBDDQN) framework for imbalanced fault diagnosis in the BFIP. The proposed method integrates a Transformer-based feature extraction branch to capture global contextual dependencies with a bidirectional LSTM-attention branch for dynamic temporal pattern learning. The outputs of both branches are fused to construct discriminative spatiotemporal representations. On this basis, a reinforcement learning strategy based on a DDQN is employed, incorporating prioritized experience replay and an adaptive cost-sensitive reward mechanism to handle data imbalance and optimize diagnostic decision-making dynamically. The main contributions of this work are summarized as follows:
A dual-branch spatiotemporal feature extraction framework is proposed, in which a Transformer-based spatial encoder captures global inter-variable dependencies, while a BiLSTM-attention module models temporal dynamics in BFIP data.
A cost-sensitive reinforcement learning formulation for fault diagnosis is developed by modeling the task as a Markov decision process and embedding adaptive misclassification penalties into a Double DQN architecture with prioritized experience replay, enabling effective handling of class imbalance and asymmetric fault costs.
Extensive experiments conducted on real-world blast furnace datasets demonstrate that the proposed TBDDQN achieves competitive performance compared with conventional deep learning and reinforcement learning baselines in terms of weighted-averaged and macro-averaged evaluation metrics.
The remainder of this paper is organized as follows. Section 2 reviews related work on the Classification Markov Decision Process, DDQN learning strategies, and Transformer encoders. Section 3 presents a detailed description of the proposed framework and methodology. Section 4 presents the experimental validation on real-world BFIP data and performance comparisons. Finally, Section 5 concludes the paper and outlines potential future research directions.
2. Related Work
2.1. Classification Markov Decision Process
The Classification Markov Decision Process (CMDP) extends the conventional MDP paradigm to address classification-oriented learning tasks [32]. Unlike standard MDPs designed for sequential decision-making under uncertainty, CMDP reformulates the supervised classification problem into a sequential decision process, allowing reinforcement learning agents to iteratively improve classification accuracy through interaction with labeled data.
The CMDP can be expressed as a tuple (S, A, P, R, γ), built upon a training dataset D = {D_1, D_2, …, D_K}, where each subset D_k represents samples belonging to the kth class and contains labeled pairs (x_i, y_i). The key components of the CMDP are defined as follows.
Action Space (A): For a classification task involving K categories, the action set is given by A = {1, 2, …, K}, where each action corresponds to the agent's predicted class label. Selecting an action thus reflects the agent's classification decision at each step.
State Space (S): Each state s_t in the CMDP corresponds to one labeled sample drawn from D. The initial state is associated with the first sample in the training dataset, and the agent transitions through subsequent samples as it progresses through the classification episode.
Transition Probability (P): The environment transitions deterministically between samples in a fixed sequential order. Specifically,
P(s_{t+1} = x_{t+1} | s_t = x_t, a_t) = 1.
This deterministic setting ensures that each classification attempt is treated as a sequential step rather than a stochastic sampling process.
Reward Function (R): To evaluate the correctness of classification actions, the reward function is defined as
r_t = +1 if a_t = y_t, and r_t = −1 otherwise.
This binary reward design provides immediate feedback on prediction accuracy. However, such a formulation overlooks the issue of class imbalance: misclassifying minority samples incurs the same penalty as misclassifying majority ones, an important limitation motivating later extensions to the cost-sensitive CMDP.
Discount Factor (γ): The parameter γ ∈ [0, 1] determines the weight of future rewards, maintaining consistency with standard reinforcement learning frameworks [33].
A complete CMDP trajectory can thus be expressed as {s_1, a_1, r_1, s_2, a_2, r_2, …, s_T, a_T, r_T}, where T denotes the terminal step. Each episode terminates either when all samples in the training dataset have been processed or when the agent makes an incorrect prediction, marking the end of a classification sequence.
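To make the episode mechanics concrete, the CMDP described above can be sketched as a minimal environment in Python. The toy dataset, the hand-crafted threshold policy, and all class and variable names are illustrative assumptions for demonstration, not the implementation used in this work.

```python
class CMDPEnvironment:
    """Minimal Classification MDP: states are labeled samples visited in a
    fixed order; the episode ends when all samples are processed or when
    the agent makes an incorrect prediction."""

    def __init__(self, samples, labels):
        self.samples = samples  # feature vectors
        self.labels = labels    # ground-truth class indices
        self.t = 0              # current position in the sequence

    def reset(self):
        self.t = 0
        return self.samples[0]

    def step(self, action):
        correct = (action == self.labels[self.t])
        reward = 1 if correct else -1  # binary CMDP reward
        self.t += 1
        done = (not correct) or self.t >= len(self.samples)
        next_state = None if done else self.samples[self.t]
        return next_state, reward, done

# A toy two-class dataset: the deterministic transition visits samples in order.
env = CMDPEnvironment(samples=[[0.1], [0.9], [0.2]], labels=[0, 1, 0])
state = env.reset()
trajectory = []
done = False
while not done:
    action = 0 if state[0] < 0.5 else 1  # a hand-crafted stand-in "policy"
    state, reward, done = env.step(action)
    trajectory.append(reward)
```

Because this stand-in policy classifies every toy sample correctly, the episode runs through all three samples before terminating.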
2.2. Double Deep Q-Network
The DDQN algorithm is developed to address the systematic overestimation of action values in Deep Q-Network (DQN). In the standard DQN framework, a single neural network is used for both selecting and evaluating actions, which often leads to overly optimistic value estimates due to the coupling of these two processes.
In DQN, the target value is defined as:
y_t^DQN = r_t + γ max_a Q(s_{t+1}, a; θ⁻),
where θ⁻ represents the parameters of the target network. The max operator in this formulation tends to favor actions with overestimated Q-values, resulting in biased learning and unstable convergence.
To mitigate this issue, the DDQN framework introduces a crucial modification by decoupling the action selection and evaluation processes. DDQN retains the core advantages of DQN, specifically the use of deep neural networks in Q-learning, while substantially reducing the overestimation bias. Two separate networks with identical architectures are maintained: the main network with parameters θ and the target network with parameters θ⁻. The main network selects the optimal action for the next state, while the target network evaluates the value of that chosen action, leading to the revised target for DDQN:
y_t^DDQN = r_t + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ); θ⁻).
Here, the main network determines the action argmax_a Q(s_{t+1}, a; θ), and the target network evaluates its Q-value. This separation effectively mitigates the overestimation induced by the max operator, yielding a more accurate and stable learning signal.
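The difference between the two target definitions can be illustrated with a short numpy sketch; the Q-values below are invented toy numbers, chosen so that the target network overestimates one action.

```python
import numpy as np

def dqn_target(reward, q_next_target, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the action,
    # so the max operator latches onto overestimated Q-values.
    return reward + gamma * np.max(q_next_target)

def ddqn_target(reward, q_next_main, q_next_target, gamma=0.99):
    # Double DQN: the main network selects the action,
    # the target network evaluates it.
    best_action = int(np.argmax(q_next_main))
    return reward + gamma * q_next_target[best_action]

# Toy Q-values for the next state over three actions.
q_main = np.array([1.0, 2.0, 0.5])  # main-network estimates
q_targ = np.array([0.8, 1.5, 3.0])  # target-network estimates (action 2 inflated)

y_dqn = dqn_target(0.0, q_targ)            # uses max of target network: 3.0
y_ddqn = ddqn_target(0.0, q_main, q_targ)  # evaluates action 1 instead: 1.5
```

In this toy case the DDQN target is smaller because the inflated Q-value of action 2 is never selected by the main network.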
Similar to DQN, the DDQN framework also incorporates two essential mechanisms which are Experience Replay (ER) and a target network to improve sample efficiency and stabilize training. Experience tuples gathered from interactions with the environment are stored in a replay buffer, from which mini-batches are sampled for training.
The network parameters are optimized by minimizing the mean squared error (MSE) loss function:
L(θ) = E[(y_t − Q(s_t, a_t; θ))²],
and updated via gradient descent, θ ← θ − η ∇_θ L(θ), where η denotes the learning rate. The target network parameters θ⁻ are updated from the main network either periodically or through a continuous averaging process to ensure smoother convergence and training stability.
In summary, DDQN effectively inherits the representational power of deep Q-learning while mitigating the overestimation bias of DQN through the decoupling of action selection and evaluation. This improvement enhances both the reliability and robustness of the reinforcement learning process.
2.3. Transformer Encoder
The Transformer model, originally proposed by Vaswani et al. [20], introduces a network architecture based solely on attention mechanisms, representing a paradigm shift in sequence modeling away from recurrent and convolutional architectures. This innovation addresses fundamental limitations of recurrent neural networks (RNNs), particularly their sequential computation, which precludes parallelization and struggles with long-range dependencies.
The core component of the Transformer Encoder is the multi-head self-attention mechanism, which allows the model to jointly attend to information from different representation subspaces at different positions. The self-attention function maps a query and a set of key–value pairs to an output computed as a weighted sum of the values, where the weights are determined by the compatibility between queries and keys. The scaled dot-product attention formulation is expressed as:
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,
where Q, K, and V represent the queries, keys, and values, respectively, and d_k is the dimensionality of the keys. The scaling factor 1/√d_k prevents the softmax function from entering regions of extremely small gradients.
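The scaled dot-product attention above maps directly to a few lines of numpy; the sequence lengths and dimensionality below are arbitrary illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # query-key compatibility, scaled
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V, weights         # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))  # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))  # 6 key positions
V = rng.normal(size=(6, 8))  # one value per key
out, w = scaled_dot_product_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, with the attention weights summing to one per query.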
The encoder is composed of multiple identical layers stacked together, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Residual connections [34] are employed around each sub-layer, followed by layer normalization [35]. The feed-forward network applies two linear transformations with a Rectified Linear Unit (ReLU) activation in between:
FFN(x) = max(0, xW₁ + b₁) W₂ + b₂.
A critical innovation in the Transformer architecture is the positional encoding, which injects information about the relative or absolute position of tokens in the sequence since the model contains no recurrence or convolution. This design allows the model to extrapolate to sequence lengths longer than those encountered during training and enables learning of relative positional relationships.
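The fixed sinusoidal encoding of the original Transformer can be sketched as follows; the sequence length and model dimension are illustrative, and learnable encodings (as used later in this paper's spatial encoder) would replace this table with trained parameters.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions carry sine terms
    pe[:, 1::2] = np.cos(angles)  # odd dimensions carry cosine terms
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
```

Because each dimension is a sinusoid of a different wavelength, relative offsets correspond to fixed linear transformations of the encoding, which is what enables learning relative positional relationships.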
The overall architecture of the Transformer Encoder is illustrated in Figure 2, which shows how each encoder layer consists of a multi-head self-attention block followed by a feed-forward network, enabling global dependency modeling across multivariate features.
Compared to recurrent and convolutional layers, the Transformer Encoder offers significant advantages, including enabling full parallelization during training and demonstrating strong performance in capturing long-range dependencies. These properties make the Transformer Encoder a foundational building block in numerous subsequent architectures across various domains beyond natural language processing.
3. Proposed Method
This section introduces a novel intelligent fault diagnosis framework for blast furnaces based on deep reinforcement learning. The proposed method formulates the fault classification task as a Cost-Sensitive Markov Decision Process (CSMDP), which explicitly incorporates the asymmetric costs associated with different types of misclassifications. This allows the model to address real-world class imbalance by accounting for the different consequences of false positives and false negatives.
To effectively model complex spatial and temporal dependencies inherent in industrial time-series data, a dual-branch architecture based on Double Deep Q-Networks is proposed. The TBDDQN model integrates a Transformer module for spatial feature extraction and a BiLSTM-attention network for temporal modeling. Additionally, to improve sample efficiency and training stability, a PER mechanism is incorporated. The overall architecture of the TBDDQN model is shown in Figure 3.
From an implementation perspective, the proposed TBDDQN framework adopts a modular design. The Transformer-based spatial encoder and the BiLSTM-attention temporal extractor follow standard deep learning implementations and can be readily developed using existing libraries. Compared with traditional supervised fault classifiers, the additional implementation effort primarily arises from the reinforcement learning-based decision layer, including the formulation of the diagnostic process as a Markov decision process and the design of cost-sensitive rewards. Nevertheless, this layer operates independently of the sensing and preprocessing modules, enabling the proposed framework to be integrated into existing industrial monitoring systems with moderate additional effort while providing enhanced robustness under class imbalance and asymmetric fault costs.
3.1. Spatiotemporal Feature Extraction
The proposed TBDDQN framework adopts a dual-branch structure for spatiotemporal feature extraction, designed to capture both spatial correlations among process variables and temporal dependencies within sequential data. As depicted in Figure 3, the dual-branch design consists of two complementary components, each responsible for extracting different aspects of the data.
The first branch employs a Transformer-based spatial encoder (TSE) to model the global dependencies between multivariate features. The input time-series segment is first linearly projected into a latent embedding space, then combined with learnable positional encodings. The data are processed through multiple stacked Transformer encoder layers, each consisting of a multi-head self-attention mechanism and a feed-forward subnetwork with residual connections. This structure enables the model to adaptively capture cross-feature relationships across time, providing a comprehensive spatial understanding of the process state. Finally, a global pooling layer aggregates these contextual representations into a compact spatial embedding.
The second branch focuses on capturing temporal dependencies within the data using a BiLSTM network, enhanced with an attention pooling mechanism. The BiLSTM captures both forward and backward temporal dependencies, enabling the model to represent the complete sequence of events. The attention mechanism assigns adaptive weights to each time step based on its relevance to the overall state, thus producing a more informative temporal representation.
The outputs of both branches are concatenated along the feature dimension, resulting in a fused spatiotemporal representation. This joint feature vector is then passed through a fully connected layer, which estimates Q-values for all possible actions, enabling the TBDDQN model to make integrated spatiotemporal decisions.
Beyond representation learning, the proposed dual-branch architecture also provides intrinsic interpretability. Specifically, the Transformer-based spatial encoder employs multi-head self-attention to explicitly model the relative importance and interactions among different process variables, enabling the identification of critical variables that contribute most to fault discrimination. Meanwhile, the attention mechanism embedded in the BiLSTM branch assigns adaptive weights to different time steps, highlighting key temporal segments that are most informative for fault diagnosis. As a result, the fused spatiotemporal representation not only improves diagnostic performance but also offers interpretable insights into both variable-level and temporal-level contributions underlying the decision-making process.
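The fusion step at the end of the dual-branch pipeline can be sketched as follows. The embedding sizes, the number of fault categories, and the randomly drawn weights are purely illustrative assumptions; in the actual model, the two embeddings come from the trained Transformer and BiLSTM-attention branches.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical pooled embeddings produced by the two branches for one window.
spatial_embedding = rng.normal(size=(64,))   # Transformer spatial encoder output
temporal_embedding = rng.normal(size=(64,))  # BiLSTM-attention branch output

# Fusion: concatenate along the feature dimension.
fused = np.concatenate([spatial_embedding, temporal_embedding])  # (128,)

# A fully connected head maps the fused representation to Q-values,
# one per candidate action (fault category); weights here are illustrative.
num_actions = 5
W = rng.normal(size=(num_actions, fused.shape[0])) * 0.1
b = np.zeros(num_actions)
q_values = W @ fused + b

predicted_class = int(np.argmax(q_values))  # greedy diagnostic decision
```

Concatenation (rather than, say, summation) keeps the spatial and temporal features in separate coordinates, letting the Q-value head weight them independently.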
3.2. Cost-Sensitive Markov Decision Process
To address the challenge of asymmetric misclassification costs and class imbalance in blast furnace fault diagnosis, this study reformulates the condition classification problem as a CSMDP. Within this framework, the learning agent interacts with the environment to make diagnostic decisions while explicitly incorporating the varying severities of misclassification through a cost-weighted reward mechanism.
At each time step t, the system state s_t is represented as a sliding window containing w consecutive process observations:
s_t = [x_{t−w+1}, x_{t−w+2}, …, x_t],
where x_t denotes the multivariate feature vector at time t and w represents the sliding window length. The action a_t ∈ {1, 2, …, K} corresponds to the predicted operating condition or fault category.
To model the asymmetric impact of different diagnostic outcomes, a cost-sensitive reward function is defined as:
r_t = +1 if a_t = y_t, and r_t = −λ_{y_t, a_t} otherwise,
where λ_{y_t, a_t} represents the predefined penalty coefficient derived from a misclassification cost matrix. This matrix encodes the relative severity of various error types, ensuring that high-risk misclassifications yield greater penalties and promoting cost-aware decision-making behavior.
To further enhance robustness under imbalanced data conditions and mitigate the persistent misclassification of rare fault types, the penalty term is dynamically adjusted according to historical error frequencies:
Here,
denotes the number of times class
has been misclassified as
,
represents the total occurrences of class
, and
is a small positive constant introduced to prevent division by zero. This adaptive reward mechanism encourages the agent to focus more on high-cost or frequently misclassified categories, thereby achieving balanced learning and improved diagnostic precision, particularly for minority and critical fault classes.
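As an illustration, the cost-sensitive reward with an adaptive penalty can be sketched as follows. The specific scaling rule (base cost multiplied by one plus the historical error rate), the toy cost matrix, and the counts are assumptions for demonstration, not the exact coefficients used in this work.

```python
def adaptive_penalty(base_cost, n_misclassified, n_total, eps=1e-6):
    # Scale the base cost-matrix entry by the observed error frequency:
    # classes that keep being confused receive progressively larger penalties.
    return base_cost * (1.0 + n_misclassified / (n_total + eps))

def cost_sensitive_reward(action, label, base_costs, counts, totals):
    if action == label:
        return 1.0  # correct diagnosis
    lam = adaptive_penalty(base_costs[label][action],
                           counts[label][action], totals[label])
    return -lam     # asymmetric misclassification penalty

# Toy 2-class cost matrix: missing a fault (true=1, predicted=0) is costlier
# than raising a false alarm (true=0, predicted=1).
base_costs = [[0.0, 1.0],
              [4.0, 0.0]]
counts = [[0, 2], [3, 0]]  # historical misclassification counts N_ij
totals = [10, 5]           # class occurrence counts N_i

r_false_alarm = cost_sensitive_reward(1, 0, base_costs, counts, totals)
r_missed_fault = cost_sensitive_reward(0, 1, base_costs, counts, totals)
```

Under these toy numbers the missed-fault penalty is several times larger than the false-alarm penalty, reflecting the asymmetric risk structure.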
Unlike conventional supervised classification models, the diagnostic decisions of the proposed TBDDQN are explicitly guided by a cost-sensitive reward design. The learned Q-values therefore encode not only classification confidence but also the potential operational risk associated with different misclassification types. This explicit coupling between reward and decision-making allows the diagnostic policy to be interpreted as a rational trade-off between accuracy and fault-related risk, which cannot be directly achieved by static loss reweighting strategies.
Unlike the standard CMDP framework that assumes deterministic temporal transitions, the proposed CSMDP adopts a stochastic transition mechanism to enhance exploration diversity. Notably, in our formulation, the transition to the next state is independent of the agent’s action because the diagnostic decision does not alter the underlying blast furnace process. Therefore, after each interaction, the subsequent state is randomly sampled from the training dataset instead of following a fixed sequence. This randomized sampling strategy enhances exploration by exposing the agent to diverse states and reduces overfitting to local temporal correlations that are common in industrial process data, while preserving the Markov property for decision learning and preventing bias toward temporally clustered fault patterns.
The agent ultimately learns an optimal policy that maximizes the expected cumulative discounted reward under the cost-sensitive constraints, thereby achieving an intelligent and risk-aware fault diagnosis strategy that balances diagnostic accuracy and operational safety.
3.3. Prioritized Experience Replay
To enhance sample efficiency and improve the stability of reinforcement learning in fault diagnosis, the proposed TBDDQN framework integrates the PER mechanism [25]. Unlike conventional experience replay, which samples past interactions uniformly from the replay buffer, PER assigns higher sampling probabilities to experiences with greater learning potential, thereby allowing the agent to focus more on transitions that are critical for value estimation improvement.
In standard experience replay, all stored transitions are treated equally, leading to inefficient use of valuable experiences and potentially slowing convergence. To address this limitation, PER evaluates the importance of each transition based on its temporal-difference (TD) error, assigning the priority:
p_i = |δ_i| + ε,
where δ_i denotes the TD error of the ith transition and ε is a small positive constant ensuring non-zero priority. A higher TD error indicates a larger discrepancy between the predicted and target Q-values, implying a higher learning potential. The normalized sampling probability for each transition is computed as:
P(i) = p_i^α / Σ_k p_k^α,
where α controls the degree of prioritization.
Since prioritized sampling introduces bias into the training distribution, importance-sampling (IS) weights are used to compensate for this effect and maintain unbiased gradient updates:
w_i = (1 / (N · P(i)))^β / max_j w_j.
Here, N is the buffer size, P(i) is the sampling probability of transition i, and β adjusts the strength of bias correction. These weights ensure that less frequently sampled transitions receive proportionally higher importance during the optimization process, promoting balanced and effective learning.
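The priority, probability, and IS-weight computations can be sketched together in numpy; the TD errors and hyperparameter values (α = 0.6, β = 0.4) are illustrative choices.

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-5):
    # Priority p_i = |delta_i| + eps; probability p_i^alpha / sum_k p_k^alpha.
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

def importance_weights(probs, beta=0.4):
    # w_i = (1 / (N * P(i)))^beta, normalized by the maximum weight.
    N = len(probs)
    w = (1.0 / (N * probs)) ** beta
    return w / w.max()

td_errors = np.array([0.05, 0.5, 2.0, 0.01])  # illustrative TD errors
probs = per_probabilities(td_errors)
weights = importance_weights(probs)
```

The transition with the largest TD error gets the highest sampling probability but the smallest IS weight, so its gradient contribution is damped to keep the updates unbiased.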
To further stabilize the training process, the proposed framework incorporates several reinforcement learning techniques in conjunction with PER. Specifically, a soft target-network update [36] is employed to prevent abrupt changes in Q-value estimation:
θ⁻ ← τ θ + (1 − τ) θ⁻,
where τ denotes the soft update coefficient controlling the rate of parameter blending. Meanwhile, an exponentially decaying ε-greedy exploration strategy [21] dynamically balances exploration and exploitation as training progresses, while gradient clipping [37] prevents gradient explosion and improves numerical stability.
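The soft update and the decaying exploration schedule can be sketched as follows; the values of τ, the decay rate, and the ε bounds are illustrative assumptions rather than the tuned hyperparameters of this work.

```python
import numpy as np

def soft_update(theta_target, theta_main, tau=0.005):
    # theta_target <- tau * theta_main + (1 - tau) * theta_target
    return tau * theta_main + (1.0 - tau) * theta_target

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay=0.001):
    # Exponentially decaying epsilon-greedy schedule: starts near eps_start,
    # asymptotically approaches eps_end as training progresses.
    return eps_end + (eps_start - eps_end) * np.exp(-decay * step)

theta_main = np.array([1.0, 2.0])
theta_target = np.array([0.0, 0.0])
theta_target = soft_update(theta_target, theta_main, tau=0.1)  # gentle blend

eps_early = decayed_epsilon(0)      # mostly exploration at the start
eps_late = decayed_epsilon(10000)   # mostly exploitation late in training
```

With a small τ, the target network trails the main network smoothly instead of jumping to it, which is what keeps the Q-value targets stable.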
Through the integration of PER, soft target updates, and adaptive exploration, the TBDDQN agent achieves a robust learning process with enhanced convergence efficiency and fault diagnosis accuracy. The overall training procedure is summarized in Algorithm 1.
| Algorithm 1 Cost-Sensitive Reinforcement Learning Environment for BFIP Diagnosis |
Require: Training subset D_train, base cost matrix Λ = [λ_{ij}], adaptive flag
Ensure: Environment state s, reward r
1: Initialize: misclassification counts N_{ij} ← 0, class counts N_i ← 0, small constant ε
2: procedure Reset
3:    Randomly select index k; s ← x_k ▹ Return initial state sample
4:    return s
5: end procedure
6: procedure Step(a)
7:    Obtain true label y of the current sample
8:    if a = y then
9:        r ← +1 ▹ Correct classification reward
10:   else
11:       if adaptive then
12:           N_{y,a} ← N_{y,a} + 1
13:           λ̃_{y,a} ← λ_{y,a}(1 + N_{y,a}/(N_y + ε))
14:       end if
15:       r ← −λ̃_{y,a} ▹ Penalty based on cost sensitivity
16:   end if
17:   N_y ← N_y + 1; randomly select next index k′; s′ ← x_{k′}
18:   return s′, r
19: end procedure
3.4. Proposed Imbalanced Fault Diagnosis via TBDDQN
The proposed TBDDQN-based imbalanced fault diagnosis framework is designed to achieve accurate and stable classification in the BFIP, where the inherent class imbalance and strong spatiotemporal dependencies pose major challenges. As shown in Figure 4, the framework integrates Transformer-based spatial encoding, BiLSTM-attention temporal modeling, PER, and a CSMDP into a unified reinforcement learning paradigm.
Step 1 Data preprocessing. To preserve temporal continuity, each sample is constructed by applying a sliding window to consecutive process measurements. All variables are standardized to eliminate scale discrepancies.
Step 2 Spatiotemporal feature representation. The preprocessed sequence is simultaneously fed into two parallel encoding branches. The Transformer-based spatial encoder captures global inter-variable correlations and dependencies, while the BiLSTM-attention branch learns bidirectional temporal dynamics within each window. The feature embeddings from both branches are concatenated to form a unified spatiotemporal representation, which is subsequently passed through fully connected layers to estimate Q-values corresponding to candidate fault categories.
Step 3 Adaptive learning under cost-sensitive MDP. To address class imbalance, the learning process is formulated as a CSMDP. In this setting, rewards are adaptively scaled according to class-specific misclassification costs, thereby mitigating bias toward majority classes. Meanwhile, PER is incorporated to improve sampling efficiency by emphasizing transitions with higher TD errors. Together, these mechanisms enable the agent to focus on rare but informative fault patterns. A DDQN strategy is adopted to decouple action selection and evaluation, thereby reducing overestimation bias. The target network parameters are updated softly as θ⁻ ← τθ + (1 − τ)θ⁻ to ensure stable convergence. Moreover, ε-greedy exploration with exponential decay maintains a balance between exploration and exploitation during training.
Step 4 Fault classification. Once training is complete, the agent operates in a greedy mode, selecting the action with the maximum estimated Q-value for each input sequence. The resulting policy enables accurate and balanced fault identification across both minority and majority fault classes, demonstrating the effectiveness of the TBDDQN framework in handling imbalanced industrial diagnosis tasks.
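Steps 1 and 4 can be sketched as follows, with a random linear readout standing in for the trained TBDDQN Q-network; the window size and variable count mirror the settings described in the case study, and the stand-in network is for illustration only:

```python
import numpy as np

def sliding_windows(data, window=10):
    """Step 1: build overlapping windows from consecutive measurements."""
    return np.stack([data[i:i + window] for i in range(len(data) - window + 1)])

def standardize(x):
    """Zero-mean, unit-variance scaling per process variable."""
    return (x - x.mean(axis=(0, 1))) / (x.std(axis=(0, 1)) + 1e-8)

def diagnose(windows, q_network):
    """Step 4: greedy policy - select the action with maximum Q-value."""
    q_values = q_network(windows)            # shape (n_windows, n_classes)
    return q_values.argmax(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 10))             # 50 time steps, 10 process variables
w = standardize(sliding_windows(data, window=10))
# Random linear stand-in for the trained Q-network (5 fault categories).
preds = diagnose(w, lambda x: x.reshape(len(x), -1) @ rng.normal(size=(100, 5)))
```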
4. Case Study
To evaluate the capability and effectiveness of the proposed method, datasets were collected from the No. 2 blast furnace system at Guangxi Liuzhou Iron and Steel Company Ltd., located in Southern China. The datasets reflect the typical characteristics of BFIP, including strongly coupled multivariate variables, complex operating conditions, and highly imbalanced fault occurrences. Based on expert knowledge, 10 critical process variables were selected for fault diagnosis, as listed in
Table 1. The proposed TBDDQN framework is developed with these blast furnace-specific data characteristics in mind, aiming to address the challenges of imbalanced fault diagnosis and asymmetric misclassification risks in BFIP. Although the dataset cannot be publicly released due to industrial confidentiality constraints, the methodological design is applicable to other industrial processes with similar characteristics.
According to the operational log, a total of 4343 samples were collected across 5 distinct operational scenarios, comprising 3018 normal cases, 125 furnace collapse incidents, 216 instances of furnace temperature rise, 739 channeling events, and 245 cases of furnace temperature down. These faults correspond to distinct operational risks in BFIP. Furnace collapse incidents may lead to burden instability and emergency shutdowns; furnace temperature rise affects hot metal quality and refractory lifespan; channeling events result in uneven gas distribution and reduced reaction efficiency; and furnace temperature down is often related to abnormal fuel injection or blast conditions, directly impacting productivity and energy efficiency. The severe consequences and highly imbalanced occurrence frequencies of these faults further motivate the adoption of cost-sensitive decision modeling. The dataset was split into training and testing sets in an 8:2 ratio.
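The degree of imbalance can be quantified directly from the reported class counts; a short sketch using only the figures above:

```python
# Reported class counts for the five operational scenarios.
counts = {"Normal": 3018, "Collapse": 125, "Temp. rise": 216,
          "Channeling": 739, "Temp. down": 245}

total = sum(counts.values())                              # 4343 samples in total
proportions = {k: v / total for k, v in counts.items()}   # per-class share
# Majority-to-minority ratio: normal cases vs. furnace collapse incidents.
imbalance_ratio = max(counts.values()) / min(counts.values())
```

The roughly 24:1 ratio between the normal class and furnace collapse incidents illustrates why an unweighted classifier would be biased toward the majority class.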
4.1. Parameter Settings
For comparative analysis, several baseline models are employed, including CNN, LSTM, CNN-LSTM, CNN-DQN, and CNN-LSTM-DQN. The baseline models are configured as follows:
CNN model: Four convolutional layers with channel sizes of 1, 4, 16, and 32, followed by two fully connected layers with 100 and 5 neurons, respectively, for spatial feature compression and classification.
LSTM model: A single LSTM with 64 hidden units, where the final hidden state is used to capture temporal dynamics.
CNN-LSTM model: A hybrid framework integrating two convolutional layers (with 4 and 8 channels) as the CNN feature extractor and an LSTM with 64 hidden units.
CNN-DQN model: A CNN backbone comprising two 1D convolutional layers (64 and 128 channels) and a fully connected layer with 256 neurons for Q-value prediction.
CNN-LSTM-DQN model: Extends the CNN-DQN by adding a parallel LSTM branch with 128 hidden units to capture temporal dependencies.
All baseline models are trained for 100 epochs using the Adam optimizer with a learning rate of 0.001. For the DQN-based models, a discount factor γ and an ε-greedy strategy are employed. Each epoch contains 512 steps, and the target networks are updated every 1000 steps to maintain training stability.
In the proposed TBDDQN model, the sliding window size is set to 10. In the Transformer encoder branch, the input is projected to 256 dimensions using a linear layer, combined with learnable positional encodings, and processed by three Transformer encoder layers with eight attention heads each, generating 256-dimensional spatial features. In the LSTM branch, a single-layer, bidirectional LSTM with a hidden size of 128 is used, enhanced with an attention mechanism that adaptively weights all time steps, producing a 256-dimensional temporal feature vector. These representations are concatenated to form a 512-dimensional feature vector (256 from Transformer + 256 from BiLSTM), which is subsequently passed through a fully connected layer with five outputs corresponding to the action space.
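At the level of tensor shapes, the described feature fusion can be sketched as follows; random arrays stand in for the outputs of the Transformer and BiLSTM-attention branches, which are not implemented here:

```python
import numpy as np

rng = np.random.default_rng(42)
batch = 8

# Stand-in embeddings for the two branches (the real model uses a
# 3-layer, 8-head Transformer encoder and a bidirectional LSTM with
# attention pooling, each producing a 256-dimensional feature).
spatial = rng.normal(size=(batch, 256))     # Transformer branch output
temporal = rng.normal(size=(batch, 256))    # BiLSTM-attention branch output

# Concatenate into the unified 512-dimensional spatiotemporal feature,
# then map to five Q-values (one per candidate fault category).
fused = np.concatenate([spatial, temporal], axis=1)
w_out = rng.normal(size=(512, 5))           # fully connected output head
q_values = fused @ w_out
```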
The discount factor is set to γ, and an ε-greedy exploration strategy is adopted in which ε decays exponentially from its maximum to its minimum value with a decay rate of 0.05. The Adam optimizer is used with a learning rate of 0.001. Training is performed over 100 epochs with 512 steps per epoch, and the target network is updated softly with coefficient τ. For prioritized experience replay, the buffer capacity is set to 10,000, with prioritization parameters α and β (the latter annealed to 1.0 during training). The cost-sensitive reward is constructed using the following base cost matrix:
To further refine the reward design, an adaptive penalty adjustment mechanism is introduced, defined in Equation (11), which adaptively penalizes frequently misclassified actions based on historical error statistics.
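Since Equation (11) is only referenced here, the following sketch shows one plausible rate-based form of such an adjustment; the scaling factor `lam` and the linear form are assumptions for illustration, not the paper's exact rule:

```python
def adaptive_penalty(base_cost, error_count, total_count, lam=0.5):
    """Hypothetical adaptive adjustment: scale the base misclassification
    cost by the historical error rate of a (true, predicted) pair."""
    error_rate = error_count / max(total_count, 1)
    return base_cost * (1.0 + lam * error_rate)
```

Under this form, a fault pair that is never confused keeps its base cost, while a frequently repeated confusion is penalized progressively more heavily.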
4.2. Results Analysis
This subsection evaluates the diagnostic performance using precision, recall, F1-score, t-SNE visualizations, and confusion matrices to demonstrate TBDDQN’s effectiveness in handling imbalance. In multi-class classification tasks, performance metrics such as precision, recall, and F1-score are calculated for each class and then aggregated to evaluate overall performance. To assess the fault diagnosis capability of different models, we employ two widely adopted aggregation strategies, macro-averaging and weighted-averaging, as discussed by Sokolova and Lapalme [
38]. Macro-averaging treats all classes equally by computing the unweighted mean of the per-class metrics, providing a fair reflection of performance on minority fault categories. In contrast, weighted-averaging weights each class by its support (sample size), which better reflects the overall performance dominated by frequent classes. The corresponding formulations are expressed as:
M_macro = (1/K) ∑_{k=1}^{K} M_k,    M_weighted = ∑_{k=1}^{K} (N_k / ∑_{j=1}^{K} N_j) M_k,

where M_k represents the metric (precision, recall, or F1-score) of class k, N_k is the number of samples in class k, and K denotes the total number of classes.
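The two averaging rules can be expressed in a few lines of Python; the per-class metric values and supports below are illustrative, chosen to show how a high-scoring majority class dominates the weighted average:

```python
def macro_average(metrics):
    """Unweighted mean over the K per-class metric values."""
    return sum(metrics) / len(metrics)

def weighted_average(metrics, supports):
    """Mean weighted by the per-class sample counts N_k."""
    total = sum(supports)
    return sum(m * n for m, n in zip(metrics, supports)) / total

per_class = [0.95, 0.60]   # metric for a majority and a minority class
supports = [900, 100]      # corresponding sample counts
```

Here the macro average (0.775) exposes the weak minority class, while the weighted average (0.915) is dominated by the majority class.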
Table 2,
Table 3 and
Table 4 present the diagnostic performance in terms of precision, F1-score, and recall, respectively. As shown in
Table 2, the CNN model achieves a weighted-averaged precision of 0.873 and a macro-averaged precision of 0.835. LSTM raises the weighted-averaged precision to 0.902, while its macro-averaged precision remains at 0.835. The CNN-LSTM model, by combining spatial and temporal feature extraction, further enhances performance, achieving a weighted-averaged precision of 0.931 and a macro-averaged precision of 0.865.
Reinforcement learning-based models exhibit more significant improvements. CNN-DQN attains a weighted-averaged precision of 0.929 and a macro-averaged precision of 0.899, indicating its ability to handle class imbalance effectively. The CNN-LSTM-DQN model achieves a slightly higher macro-averaged precision of 0.903, albeit with a marginally lower weighted-averaged precision of 0.925. In contrast, the proposed TBDDQN achieves the best performance, with a weighted-averaged precision of 0.934 and a macro-averaged precision of 0.970, clearly outperforming all other baselines.
Similarly,
Table 4 shows that TBDDQN achieves the highest weighted-averaged recall of 0.928, together with a macro-averaged recall of 0.906, reflecting a comparative advantage in sensitivity and robustness to class imbalance. The CNN-LSTM model also demonstrates high recall, with a weighted-averaged recall of 0.920 and a slightly higher macro-averaged recall of 0.912, verifying the benefit of fusing temporal-spatial dependencies. As illustrated in
Table 3, TBDDQN also exhibits the best overall F1-score, achieving a weighted-averaged F1-score of 0.922 and a macro-averaged F1-score of 0.929. These results indicate that TBDDQN effectively balances precision and recall while maintaining stability across both major and minor fault classes.
Furthermore, in terms of computational complexity, the proposed TBDDQN contains approximately 745,478 trainable parameters (0.75 M). The forward pass complexity is 0.010 GFLOPs. Training on the BFIP dataset (approximately 3474 training samples) for 100 epochs takes about 20 min on an NVIDIA RTX 3080 GPU. The average per-sample inference latency is 1.75 ms on GPU and 3.48 ms on CPU, which is suitable for real-time monitoring in industrial control systems. Compared with baseline models, TBDDQN provides a favorable trade-off between diagnostic accuracy and computational efficiency.
To further verify the representational capability, t-distributed stochastic neighbor embedding (t-SNE) is used to visualize the features learned by different models. As shown in
Figure 5, features extracted by TBDDQN display clearer cluster separations than those of CNN, LSTM, CNN-LSTM, CNN-DQN, and CNN-LSTM-DQN. The improved discriminability indicates that the dual-branch structure and adaptive cost-sensitive reward mechanism enable the model to learn more distinct class boundaries in imbalanced industrial scenarios.
Figure 6 presents the confusion matrices of the six models across five operational states. The diagonal elements represent the precision for each class, indicating the proportion of correctly classified samples among all samples predicted as that class, while the off-diagonal elements illustrate the misclassification patterns between different classes. Several critical observations can be made as follows. First, TBDDQN demonstrates effective rare-fault detection, achieving 100% precision for both Fault 2 (Furnace temperature rise) and Fault 3 (Channeling). This validates the effectiveness of the adaptive cost-sensitive reward design in addressing severe class imbalance. Second, CNN-LSTM shows an improvement in capturing temporal-spatial correlations but still exhibits a precision drop for Fault 1 (Collapse), suggesting insufficient robustness under data scarcity. Lastly, CNN-LSTM-DQN achieves good balance but exhibits slightly lower performance on certain rare faults due to limited samples. Overall, the proposed TBDDQN achieves the highest precision consistency across all fault categories, confirming its strong performance in intelligent blast furnace fault diagnosis.
4.3. Ablation Study
To thoroughly investigate the contribution of each key component in the proposed TBDDQN framework, we conducted a comprehensive ablation study. Five variants were evaluated: (1) the full TBDDQN model, (2) w/o BiLSTM-attention (Transformer branch only), (3) w/o Transformer (BiLSTM-attention branch only), (4) w/o PER (replaced with uniform replay buffer), and (5) w/o Adaptive Cost (fixed cost matrix). Performance was assessed using per-class recall, F1-score, and precision, with macro-averaged and weighted-averaged metrics reported to account for class imbalance in blast furnace fault diagnosis. The results are summarized in
Table 5,
Table 6 and
Table 7.
As shown in
Table 5, the complete TBDDQN model achieves the highest macro-averaged precision of 0.970, outperforming all ablated variants. Removing the BiLSTM-attention branch results in the most pronounced degradation, with macro-averaged precision dropping sharply to 0.804. In particular, the precision for Fault 1 (Collapse) decreases sharply from 1.000 to 0.500, indicating that the BiLSTM-attention module is crucial for capturing discriminative temporal patterns associated with sudden and high-risk collapse events. In contrast, removing the Transformer branch leads to a moderate decline in macro-averaged precision to 0.942, highlighting the importance of global inter-variable dependency modeling provided by the Transformer encoder. Additionally, both the w/o PER and w/o Adaptive Cost variants exhibit noticeable reductions in macro-averaged precision and weighted-averaged precision, confirming the effectiveness of prioritized experience replay in emphasizing rare fault samples and adaptive cost-sensitive rewards in penalizing high-risk misclassifications.
Table 6 reports the F1-score comparison among different variants. The full TBDDQN achieves the highest macro-averaged F1-score of 0.929 and weighted-averaged F1-score of 0.922, demonstrating a well-balanced diagnostic capability across all fault categories. Removing the BiLSTM-attention branch again causes the most severe performance drop, with macro-averaged F1-score decreasing by 0.134 to 0.795. Notably, the F1-score for Fault 1 (Collapse) drops from 1.000 to 0.571, and for Fault 4 (Furnace temperature down) from 0.948 to 0.557, underscoring the critical role of attention-enhanced temporal modeling in distinguishing subtle and imbalanced fault signatures. Excluding the Transformer branch also leads to a clear decline in macro-averaged F1-score to 0.906, confirming its effectiveness in capturing long-range dependencies within the sliding time window. Furthermore, disabling prioritized experience replay reduces the macro-averaged F1-score to 0.899, while replacing adaptive cost-sensitive rewards with fixed penalties yields a macro-averaged F1-score of 0.913, demonstrating that both mechanisms contribute to improving learning efficiency and mitigating bias toward the majority class.
The recall comparison in
Table 7 further corroborates these observations. The complete TBDDQN achieves the highest weighted-averaged recall of 0.928, proving to be the most effective approach in achieving overall fault coverage. Removing the BiLSTM-attention branch leads to the most significant degradation, with macro-averaged recall dropping to 0.802 and the recall for Fault 4 (Furnace temperature down) decreasing sharply from 0.958 to 0.458. This confirms that the BiLSTM-attention mechanism is indispensable for capturing critical temporal dependencies and long-term fault evolution patterns in multivariate time-series data. Conversely, removing the Transformer branch severely affects the detection of Fault 3 (Channeling), where recall drops to 0.603, demonstrating the Transformer’s advantage in modeling global contextual relationships across strongly coupled process variables. The effects of prioritized experience replay and adaptive cost-sensitive rewards are also evident. Eliminating PER reduces the weighted-averaged recall to 0.911, indicating that prioritizing high-error and rare fault transitions is beneficial for improving fault coverage under class imbalance. Although removing the adaptive cost mechanism yields a slightly higher macro-averaged recall, the full TBDDQN achieves the highest weighted-averaged recall, which is more representative of real-world diagnostic risk. This demonstrates that the adaptive cost-sensitive reward effectively emphasizes safety-critical and high-frequency fault misclassifications during training, thereby improving the overall robustness of the diagnosis system under imbalanced industrial conditions.
Overall, the results in
Table 5,
Table 6 and
Table 7 consistently demonstrate that each component of the proposed TBDDQN framework contributes positively to fault diagnosis performance. The Transformer branch enhances global dependency modeling, the BiLSTM-attention module captures discriminative temporal dynamics, prioritized experience replay improves learning efficiency for rare faults, and the adaptive cost-sensitive reward mechanism effectively addresses asymmetric misclassification risks. Their synergistic integration enables TBDDQN to achieve well-balanced diagnostic performance that surpasses that of any individual component alone.