Article

Transformer-Based Dynamic Flame Image Analysis for Real-Time Carbon Content Prediction in BOF Steelmaking

1 Automation and Electrical Engineering, Institute of Industrial Internet, University of Science and Technology Beijing, Beijing 100083, China
2 Shunde Graduate School, University of Science and Technology Beijing, Foshan 528399, China
3 School of Science and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Authors to whom correspondence should be addressed.
Metals 2026, 16(2), 185; https://doi.org/10.3390/met16020185
Submission received: 3 January 2026 / Revised: 27 January 2026 / Accepted: 2 February 2026 / Published: 4 February 2026

Abstract

Accurately predicting molten steel carbon content plays a crucial role in improving productivity and energy efficiency during the Basic Oxygen Furnace (BOF) steelmaking process. However, current data-driven methods primarily focus on endpoint carbon content prediction, while lacking sufficient investigation into real-time curve forecasting during the blowing process, which hinders real-time closed-loop BOF control. In this article, a novel Transformer-based framework is presented for real-time carbon content prediction. The contributions include three main aspects. First, the prediction paradigm is reconstructed by converting the regression task into a sequence classification task, which demonstrates superior robustness and accuracy compared to traditional regression methods. Second, the focus is shifted from traditional endpoint-only forecasting to long-term prediction by introducing a Transformer-based model for continuous, real-time prediction of carbon content. Last, spatial–temporal feature representation is enhanced by integrating an optical flow channel with the original RGB channels, and the resulting four-channel input tensor effectively captures the dynamic characteristics of the converter mouth flame. Experimental results on an independent test dataset demonstrate favorable performance of the proposed framework in predicting carbon content trajectories. The model achieves high accuracy, reaching 84% during the critical decarburization endpoint phase where carbon content decreases from 0.0829 to 0.0440, and delivers predictions with approximately 75% of errors within ±0.05. Such performance demonstrates the practical potential for supporting intelligent BOF steelmaking.

1. Introduction

BOF steelmaking is the primary steelmaking route, accounting for 70% of global steel production [1]. Accurate prediction of carbon content is crucial for optimizing production efficiency, reducing operational costs, and minimizing resource consumption [2]. With the advancement of endpoint control technology for BOF steelmaking, various prediction methods, such as empirical control, static control, dynamic control, and intelligent control, have been developed and applied in the field [3,4].
Empirical control is a traditional endpoint prediction approach that relies on operators observing flame shape and color at the converter mouth [1]. Operators determine whether the molten steel meets target requirements for tapping by combining their experience and steelmaking-related data. However, traditional manual judgment faces significant challenges. The adverse shop-floor conditions, marked by high temperatures and process instability, are unsuitable for sustained human assessment. Moreover, the subjectivity introduced by varying levels of operator experience further compromises the consistency and accuracy of the predictions.
Static control adopts a predictive model that captures the correlation between oxygen consumption and endpoint carbon content by combining theoretical and statistical models [5,6]. Specifically, the theoretical model uses metallurgical reaction kinetics and thermochemical equations to simulate decarburization reaction pathways, while the statistical model determines key parameters such as decarburization efficiency coefficients based on historical data. Compared with empirical methods, static prediction models automate process parameter optimization and carbon content prediction. However, the static endpoint prediction model is unsuitable for practical use because it relies only on initial static data and ignores the dynamic, time-series nature of the steelmaking process.
Dynamic control methods aim to achieve real-time parameter adjustments by predicting the endpoint carbon content in molten steel [7,8,9,10]. Specifically, dynamic models are constructed based on time-series monitoring data, such as steel composition, oxygen lance position, off-gas composition, and spectral flame characteristics. In practice, sub-lance detection, flame spectral analysis, and off-gas analysis are three commonly used dynamic detection techniques. During the middle and later stages of the blowing process, namely the intense decarburization and final endpoint adjustment stages, the sub-lance detection method directly measures temperature and carbon content via probe insertion. The measurements thereby acquired guide the adjustment of raw material addition and oxygen lance height [11]. The flame spectral analysis method predicts endpoint composition online by analyzing spectral characteristics of flame radiation [12,13]. Off-gas analysis calculates molten steel carbon content dynamically using mathematical models based on measured off-gas chemical composition [14], thereby enabling process control. However, the extremely harsh high-temperature environment of molten steel means that each probe can be used only once, making sub-lance detection costly and incapable of providing continuous real-time data. Both flame spectral analysis and off-gas analysis are difficult to deploy widely due to the high operational and maintenance costs caused by equipment exposure to corrosive high-temperature environments.
The methods described above suffer from several limitations, such as the complexity of metallurgical reaction kinetics, high operational costs, and inadequate reliability for real-time control, which have collectively hindered widespread practical application. With the development of data collection and intelligent modeling, data-driven intelligent prediction methods have shown significant advantages. Advanced methods such as support vector machines [15,16,17], artificial neural networks [18], and random forests (RFs) [19,20] are capable of capturing complex nonlinear relationships in the steelmaking process. Additionally, hybrid modeling strategies that integrate metallurgical mechanisms with data-driven algorithms have been developed to improve interpretability and generalization [21]. A clear trend observed over the past decade shows modeling of the BOF process evolving toward more sophisticated artificial neural networks and hybrid models [22]. However, the majority of existing research predominantly focuses on predicting the endpoint state, while lacking research on continuous, real-time monitoring of key parameters throughout the blowing process. Such a limitation hinders the achievement of real-time, closed-loop control of the BOF process.
To address the above issues, this study proposes a Transformer-based architecture that realizes end-to-end real-time prediction of carbon content through flame videos. This methodology is founded on the well-established physical principle that the morphology of the flame at the converter mouth serves as a direct visual proxy for the instantaneous decarburization rate. The carbon–oxygen reaction governs the combustion state, creating a deterministic and observable link between flame appearance and molten steel carbon content. Key contributions of the research include the following:
  • This study innovatively employs a classification-based approach for carbon content prediction by discretizing the carbon content into multiple distinct categories. Compared to conventional regression-based methods, the proposed classification strategy not only reduces prediction difficulty but also better aligns with industrial requirements.
  • This study introduces a Transformer-based architecture for the task of full-process carbon content prediction during the later stage of the BOF steelmaking process. Unlike previous works focused solely on endpoint prediction, the proposed model targets continuous, real-time, and long-duration forecasting throughout the steelmaking process.
  • This study proposes a data augmentation strategy by constructing four-channel input tensors combining RGB information with optical flow features. Incorporating optical flow characteristics enhances the model’s capability to capture dynamic flame motion patterns in steelmaking.
The remaining part of this article is organized as follows. Section 2 provides a review of the related work. In Section 3, a detailed description of the proposed Transformer-based framework for real-time carbon content prediction is presented. Section 4 discusses the experimental environment and results, while Section 5 concludes the work with future directions.

2. Related Work

Empirical control is the most primitive control method in the BOF steelmaking process, but its prediction hit rate is relatively low, usually ranging from 30% to 50%, owing to differences in subjective experience and insufficient stability. With the advancement of metallurgical theory and computational technology, static control methods were put forward to model the converter steelmaking process based on material and heat balance principles. By introducing metallurgical kinetics and thermochemical equations to construct a steel composition prediction framework, the prediction accuracy has been improved to 40% to 60%. For instance, Wang et al. [23] developed a highly accurate static prediction model for endpoint carbon and temperature using an automatically optimized twin support vector regression. Similarly, Liu et al. [24] proposed a hybrid PCA-GA-BP neural network model to predict endpoint phosphorus and oxygen contents, demonstrating the application of advanced data-driven approaches for multi-component static prediction. However, the static model performed poorly in endpoint carbon content prediction due to its limited adaptability to dynamic process changes.
With the advancement of detection technology, various advanced sensors have been deployed for endpoint control in BOF steelmaking, thereby promoting the development of dynamic control methodologies. Three primary sensing approaches have been widely investigated, which include sub-lance measurement, off-gas analysis, and flame spectral analysis. Sub-lance is the most widely used detection equipment in developing countries. Numerous studies have utilized sub-lance detection data to establish dynamic models for predicting the endpoint carbon content of molten steel. For instance, Hubbeling et al. [25] demonstrate that real-time carbon content and temperature data acquired by the sub-lance during the mid-to-late blowing stage can be used for online correction of the blowing model, allowing for dynamic adjustments of lance height, oxygen blowing intensity, and coolant addition, which significantly improves endpoint hit rate and shortens the tap-to-tap time. Yue et al. [26] established prediction models for endpoint carbon content, temperature, phosphorus, and manganese content based on exponential models, heat balance, and thermodynamic equations. In addition to sub-lance, non-contact detection methods have been explored, with some researchers employing flame spectral analysis and off-gas analysis techniques for detection. For instance, Zhao et al. [12] proposed a high-accuracy prediction model for converter temperature and carbon content by extracting key features from flame spectra using intelligent algorithms. Sun et al. [27] developed an off-gas model technology to replace sub-lance operation for endpoint carbon control. The above dynamic control methods have improved prediction accuracy to some extent, but still face many challenges. Specifically, sub-lance detection cannot achieve continuous detection, and early models based on sub-lance usually only fit the relationship between oxygen supply (amount/time) and carbon content, without considering the influence of oxygen lance position and bottom blowing flow rate on carbon content. In addition, off-gas detection and flame spectral analysis technology are difficult to widely apply due to high equipment usage and maintenance costs, and off-gas detection has a lag problem because the equipment is far away from the reaction zone in the converter bath, making it difficult to achieve real-time prediction of carbon content.
Data-driven dynamic and intelligent carbon prediction is now achievable in the BOF steelmaking process driven by the rapid development of auto-detection methods, mathematical models, and algorithms. The evolution of machine learning techniques in BOF modeling demonstrates a clear trend of moving from shallow to deep learning architectures and from single to hybrid models. Early research was dominated by Support Vector Regression (SVR) and Artificial Neural Networks (ANNs). SVR captured nonlinear relationships via radial basis kernel functions [15,16], while researchers further reduced prediction errors by using improved twin SVR [17]. The mainstream architecture of ANN is Multi-Layer Perceptron (MLP), which achieves complex mapping through nonlinear activation functions. Bae et al. achieved joint prediction of carbon content, temperature, and phosphorus content through three-layer MLP [18]. Feature selection methods have also been introduced to improve model performance. For instance, Z. Chen et al. [28] proposed an Improved Grey Wolf Optimizer (IGWO) to stabilize the identification of optimal feature subsets, thereby enhancing regression accuracy for endpoint temperature and carbon content.
To address the limitations of single models, researchers proposed hybrid strategies to improve the prediction accuracy of the BOF process. Wang et al.’s “K-means clustering + ANN” two-stage framework [29] processes data through clustering algorithms, with neural network modeling subsequently applied. However, traditional clustering methods often overlook feature importance and are sensitive to parameters. To overcome this drawback, Zhu et al. [21] developed a multi-level integration approach for endpoint phosphorus content prediction by synergizing metallurgical mechanisms with industrial data, achieving high accuracy through mechanism-guided feature selection and a composite loss function that enforces physical consistency. Similarly, some research proposed a method that amplifies key features based on metallurgical mechanisms and employs the Grey Wolf Optimizer to improve Affinity Propagation clustering, establishing a highly accurate temperature prediction model. In addition, Feng et al. proposed the ensemble model combining SVR, Random Forest, and ANN [30], improving prediction robustness by combining the capabilities of multiple models. Other hybrid strategies include combining multiple linear regression with Gaussian process regression to model both global trends and local variations [31]. Comparative studies have further enriched the understanding of different ML techniques. For instance, some works compared RF, ANN, and SVR [19], while others demonstrated the superior performance of ANN and SVR over other methods in predicting key endpoints [18]. More recent comparative assessments have included models such as RF, gradient boosting regressor (GBR), CNN, and metallurgical mechanism models, with findings indicating that RF and GBR outperformed others in certain contexts [20]. In recent years, researchers have explored advanced deep learning methods, such as Convolutional Neural Networks (CNNs) being applied to model off-gas time-series data [32], and Long Short-Term Memory (LSTM) networks to enhance temporal modeling capabilities [33]. Additionally, Liang et al. [34] developed an attention-based ConvLSTM network for predicting carbon content, further advancing sequence-aware modeling in BOF processes. The exploration of Transformer architectures in this domain has also begun, with work such as TTB-BOFNet demonstrating their effectiveness for endpoint prediction tasks on industrial-scale data [35]. In addition, Lu et al. [36] pioneered a novel framework that integrates thermodynamic models with the Transformer-based TabPFN algorithm for endpoint temperature prediction in electric arc furnaces, showcasing the potential of hybrid modeling approaches in steelmaking process intelligence. Other models such as transfer learning [37], Graph Neural Networks (GNNs) [38], and autoencoder Bayesian networks [39] are demonstrating growing potential in the field of BOF.
Despite the growing sophistication of models, a significant limitation persists. The vast majority of existing research was exclusively designed for endpoint carbon content prediction, overlooking the critical need to track the dynamic evolution of the carbon content throughout the blowing process. The exclusive focus on a single endpoint value fails to support real-time process control. To address the above challenge, the application of a standard Transformer architecture to this problem is introduced in the present research, utilizing its self-attention mechanism to model spatial features and temporal dynamics from flame video sequences. A novel paradigm is introduced that reconstructs the traditional regression task into sequence classification. The reconstruction discretizes continuous carbon content values into 36 distinct categories, thereby enhancing robustness against minor fluctuations while maintaining alignment with industrial measurement precision. Additionally, the model’s perception of flame motion characteristics is enhanced through integrated optical flow features combined with standard RGB inputs. The framework enables real-time prediction of carbon content in the later stage of the BOF steelmaking blowing process, providing a more comprehensive solution for converter control.

3. Methodology

3.1. Theoretical Foundation: Transformer Architecture

The Transformer architecture was originally designed for natural language processing tasks [40], with its core being the dynamic modeling of long-range dependencies between sequence elements through a self-attention mechanism. Unlike RNNs that process elements sequentially, the architecture significantly improves model efficiency by parallelizing the entire input sequence for global relationship calculation.
To maintain the spatial–temporal order of input elements, sinusoidal positional encoding [40] is employed. For each position $pos$ and dimension index $i$ in the sequence, the positional encoding is defined as:
$PE_{(pos,\,2i)} = \sin\left(\dfrac{pos}{10000^{2i/d_{\text{model}}}}\right)$
and
$PE_{(pos,\,2i+1)} = \cos\left(\dfrac{pos}{10000^{2i/d_{\text{model}}}}\right)$
where $pos$ represents the position, $i$ represents the dimension index, and $d_{\text{model}}$ represents the feature dimension of the model.
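As an illustration, a minimal PyTorch sketch of this sinusoidal encoding is given below; the function name and the example sequence length are assumptions for illustration, not details of the paper's implementation.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Build the (seq_len, d_model) table of sinusoidal positional encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)            # (seq_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                        # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions use cosine
    return pe

# e.g., encodings for a sequence of 264 patch embeddings with d_model = 512
pe = sinusoidal_positional_encoding(264, 512)
```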
Multi-head self-attention (MHSA) allows the model to collaboratively focus on information from different positions of the input sequence in different representation subspaces, using the scaled dot-product attention mechanism defined in [40]:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$
where $Q$, $K$, and $V$ are the query, key, and value matrices derived from linear projections of the input. The superscript $T$ denotes matrix transposition, indicating that the key matrix $K$ is transposed before being multiplied with the query matrix $Q$. $d_k$ denotes the dimension of the key matrix $K$ in each attention head, and softmax denotes the softmax activation function, which normalizes the scores into a probability distribution so that the attention weights sum to 1.
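The following sketch is a direct PyTorch translation of this scaled dot-product attention; the tensor shapes noted in the comments are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Q, K, V: (batch, heads, seq_len, d_k) tensors obtained from linear projections."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)               # each row of attention weights sums to 1
    return weights @ V                                # weighted sum of the value vectors
```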
The encoder is composed of $N = 6$ identical layers stacked together, and each encoder layer contains two sublayers: MHSA and FFN. MHSA projects the input of dimension $d_{\text{model}}$ into $h$ different subspaces ($h = 8$ in the experiments), calculates attention independently in each subspace, and then concatenates the results. This design enables the model to simultaneously attend to information from different representation subspaces at different positions.
The FFN is composed of two linear transformations with a ReLU activation in between, as specified in [40]:
$\text{FFN}(x) = W_2\,\text{ReLU}(W_1 x + b_1) + b_2$
where $W_1$ and $W_2$ are the weight matrices of the first and second linear transformations in the FFN, $b_1$ and $b_2$ are the corresponding bias terms, and the internal dimension is $d_{\text{ff}} = 2048$. Each sublayer employs a residual connection followed by layer normalization, as introduced in [40]:
$y = \text{LayerNorm}(x + \text{Sublayer}(x))$
where $\text{Sublayer}(x)$ represents the processing result of the sublayer for input $x$. This design effectively alleviates the vanishing gradient problem in deep networks.
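Put together, one encoder layer might look like the following sketch, which wraps the MHSA and FFN sublayers in the residual connection and layer normalization described above; it builds on PyTorch's nn.MultiheadAttention, and the dropout placement is an assumption.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: MHSA and FFN sublayers, each wrapped in residual + LayerNorm."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, seq_len, d_model)
        attn_out, _ = self.mhsa(x, x, x)                    # self-attention over the patch sequence
        x = self.norm1(x + self.dropout(attn_out))          # y = LayerNorm(x + Sublayer(x))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x
```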
The decoder also follows a multi-layer architecture similar to that of the encoder, while introducing two enhanced mechanisms. One is a masked self-attention module, which prevents access to subsequent tokens during the training phase. The other is an encoder–decoder cross-attention module, whose function is to align decoder states with the encoded representation.

3.2. Model Architecture Design

Figure 1 illustrates the end-to-end architecture of the proposed Transformer-based model for real-time carbon content classification, which processes flame images to predict discrete carbon content. Flame video frames are first transformed into four-channel input tensors combining RGB information with optical flow data in order to capture both static appearance and dynamic motion characteristics. Following the transformation, the frames undergo division into patches, and convolutional layers embed these patches into high-dimensional sequences. Positional encodings enhance these sequences to preserve spatiotemporal relationships. A Transformer encoder then processes the encoded sequence, modeling complex dependencies across all patches and frames through multi-head self-attention mechanisms for the integration of global spatial and temporal information. Guided by a learnable query vector, a decoder employs cross-attention to focus on the encoded representation, thereby aggregating the most relevant contextual features into a compact context vector. The vector passes through a classification head that maps it onto a probability distribution across 36 predefined carbon content categories, completing the prediction process. The specific implementation details are as follows.

3.2.1. Four-Channel Input Embedding and Patch Processing

Traditional three-channel RGB images only provide static spatial information and have difficulty capturing the dynamic changes in flames. In order to enhance the model's ability to perceive flame motion features, an optical flow feature channel is fused with the original three-channel image to construct a four-channel input tensor $X \in \mathbb{R}^{H \times W \times 4}$. The first three channels correspond to the RGB channels of the video frame, and the fourth channel is the introduced optical flow feature channel used to encode flame dynamic information. The structure of this four-channel input tensor is illustrated in Figure 2.
Optical flow features are used to quantify the motion of pixels between consecutive frames. As demonstrated in recent research, explicitly modeling inter-frame motion provides more discriminative features than static appearance information alone, particularly in scenes with complex motion and dynamic textures [41]. The Farnebäck algorithm [42] is adopted for optical flow field calculation due to its dense estimation capability and robustness to illumination changes, offering significant advantages over sparse feature-matching alternatives such as Lucas–Kanade in the challenging furnace environment. The algorithm implementation employed a pyramid scale of 0.5 with three pyramid levels, a 15-pixel window size, three iterations, a polynomial neighborhood size of 5, and a polynomial standard deviation of 1.2, thereby balancing motion detail preservation with computational efficiency. The algorithm uses polynomial expansion theory to estimate the dense two-dimensional vector field $(u, v)$ between adjacent frames, where $u$ and $v$ represent the displacement components of pixels in the horizontal and vertical directions, respectively. Two basic motion characteristics, magnitude and direction, can be derived from the original optical flow vector field $(u, v)$ [42]:
Motion Magnitude refers to the intensity of pixel movement:
$F_{\text{magnitude}} = \sqrt{u^2 + v^2}$
Motion Direction refers to the angular direction of pixel movement:
$F_{\text{angle}} = \operatorname{arctan2}(u, v) \in [-\pi, \pi]$
In order to generate single channel features that are compatible with RGB channels and can effectively represent motion information, the fusion feature is defined as follows:
$F_{\text{fused}} = \lambda F_{\text{magnitude}} + (1 - \lambda)\,\dfrac{F_{\text{angle}} + \pi}{2\pi}$
The fusion weight parameter λ balances motion magnitude and directional information contributions, with validation experiments determining the optimal value as λ = 0.7. Values significantly lower than 0.7 inadequately capture motion intensity cues, while values substantially higher diminish sensitivity to flame directional patterns. The configuration of λ = 0.7 achieved the lowest validation loss and the best prediction accuracy. Following normalization of the directional component to [0, 1], the final four-channel tensor is obtained by concatenating $F_{\text{fused}}$ with the original RGB channels.
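A possible OpenCV realization of this fused motion channel, using the Farnebäck parameters listed above, is sketched below; the function name, grayscale inputs, and the stacking step in the final comment are illustrative assumptions.

```python
import cv2
import numpy as np

def fused_flow_channel(prev_gray: np.ndarray, curr_gray: np.ndarray, lam: float = 0.7) -> np.ndarray:
    """Compute the single fused motion channel from two consecutive grayscale frames."""
    # Dense Farnebäck optical flow with the parameters reported above
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    u, v = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(u ** 2 + v ** 2)               # F_magnitude: motion intensity
    angle = np.arctan2(u, v)                           # F_angle: motion direction in [-pi, pi]
    angle_norm = (angle + np.pi) / (2.0 * np.pi)       # normalize the directional component to [0, 1]
    return lam * magnitude + (1.0 - lam) * angle_norm  # weighted fusion with lambda = 0.7

# The four-channel input is then formed by stacking the fused channel with the RGB frame:
# four_channel = np.dstack([rgb_frame, fused_flow_channel(prev_gray, curr_gray)])
```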
When processing the input data, the original high-resolution video is first downsampled and frames are extracted. Then input frames are split into multiple patches, each processed through convolutional layers for local feature extraction before being linearly projected into an embedding vector, and position encoding is added. The embedded vector is passed through multiple layers of encoder layers to obtain the encoded feature representation. The decoder uses the query vector to interact with the output of the encoder in a single step of attention, and adopts a two-level fully connected layer for probabilistic space mapping to output the classification results.
During the above process, each processed four-channel image is divided into fixed-size patches. For a continuous-frame input $X \in \mathbb{R}^{B \times T \times 4 \times H \times W}$, where $B$ is the batch size, $T$ is the number of frames, $H$ is the image height, and $W$ is the image width, each image is divided into $\frac{H}{patch\_size} \times \frac{W}{patch\_size}$ patches of size $patch\_size^2$. Each patch extracts local features through two convolutional layers. The first layer uses a 3 × 3 convolution with 100 filters, followed by ReLU activation and batch normalization. The second layer uses a 3 × 3 convolution with 30 filters, followed by ReLU activation and batch normalization. The extracted features are mapped to the embedding space of dimension $d_{\text{model}}$ through a fully connected layer to obtain the patch embeddings. After the positional encoding is added to the patch embeddings, training dynamics under different input distributions are stabilized through layer normalization.
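A sketch of this patch embedding stage is given below; the padding choices and the tensor bookkeeping are assumptions made so the example runs end to end.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split four-channel frames into patches and embed each patch into d_model dimensions."""
    def __init__(self, patch_size: int = 16, d_model: int = 512):
        super().__init__()
        self.patch_size = patch_size
        self.local_features = nn.Sequential(
            nn.Conv2d(4, 100, kernel_size=3, padding=1), nn.ReLU(), nn.BatchNorm2d(100),
            nn.Conv2d(100, 30, kernel_size=3, padding=1), nn.ReLU(), nn.BatchNorm2d(30),
        )
        self.proj = nn.Linear(30 * patch_size * patch_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, 4, H, W), N = B*T frames
        p, n = self.patch_size, x.size(0)
        # split each frame into non-overlapping p x p patches
        patches = x.unfold(2, p, p).unfold(3, p, p)        # (N, 4, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, 4, p, p)
        feat = self.local_features(patches)                # per-patch local features: (N*num_patches, 30, p, p)
        emb = self.proj(feat.flatten(1))                   # linear projection to d_model
        return emb.reshape(n, -1, emb.size(-1))            # (N, num_patches, d_model)
```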

3.2.2. Dynamic Query Mechanism and Classification Output

Unlike traditional continuous-value regression, the prediction of steel carbon content is modeled as a classification task in this work. Since the carbon content value during the later stage of steelmaking ranges from 0.01 to 0.36, this range is discretized into 36 distinct categories with a step size of 0.01, thereby establishing a direct mapping between flame image sequences and carbon content intervals. This classification paradigm offers several advantages for the industrial scenario: it enhances error tolerance by reducing sensitivity to small predictive fluctuations within acceptable industrial measurement tolerances; it achieves operational alignment through output formats that match the discrete quality control standards used in actual steel production; and it improves prediction stability by constraining outputs within physically meaningful ranges. To realize this classification paradigm, the model employs a dynamic query mechanism, implementing single-step prediction through a learnable task-specific query vector. First, a learnable query vector $q \in \mathbb{R}^{1 \times d_{\text{model}}}$ is constructed, which is dynamically updated during the training process to capture task features. The decoder consists of $N$ layers, each containing three sublayers, namely target sequence self-attention, encoder–decoder attention, and a feedforward network. Because a single query vector is used, the length of the target sequence is 1, and the masked MHSA sublayer over the target sequence simplifies to a linear transformation of the query vector. In the encoder–decoder attention sublayer, the dynamic query vector acts as the query $Q$ and interacts with the global spatial–temporal features $K/V$ output by the encoder, generating a context vector by aggregating global spatial–temporal information. The FFN sublayer retains the same structural design as in the encoder. Each sublayer adopts residual connections and layer normalization to ensure optimization stability.
Finally, the task context vector is probabilistically mapped through two fully connected layers. The first fully connected layer reduces the dimensionality from $d_{\text{model}}$ and introduces a ReLU nonlinear activation, while the second fully connected layer maps the features to a 36-dimensional space (corresponding to the 36 carbon content categories). The final output is normalized by softmax to obtain the probability distribution $P \in \mathbb{R}^{36}$ over the categories, and the predicted category is determined by $\hat{y} = \arg\max P$, where $\hat{y}$ denotes the predicted category and $\arg\max P$ denotes taking the index corresponding to the maximum value in the probability distribution $P$, thus yielding the predicted carbon content value.
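A simplified sketch of the single-query decoding step and the classification head is shown below; the hidden width of the first fully connected layer is an assumed value, since the text specifies only the input and output dimensions.

```python
import torch
import torch.nn as nn

class SingleQueryClassifier(nn.Module):
    """Learnable single-query cross-attention followed by a two-layer, 36-way classification head."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, n_classes: int = 36, hidden: int = 256):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))           # learnable task-specific query
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                                  nn.Linear(hidden, n_classes))

    def forward(self, memory: torch.Tensor) -> torch.Tensor:            # memory: (B, S, d_model) encoder output
        q = self.query.expand(memory.size(0), -1, -1)                   # one query per sample
        ctx, _ = self.cross_attn(q, memory, memory)                     # aggregate global spatio-temporal context
        ctx = self.norm(q + ctx).squeeze(1)                             # residual + LayerNorm -> (B, d_model)
        return self.head(ctx)                                           # (B, 36) class logits

# Decoding a prediction back to a carbon content value (classes cover 0.01-0.36 in 0.01 steps):
# carbon = 0.01 + logits.argmax(dim=-1).float() * 0.01
```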

3.3. Evaluation Methods

The present study employs Top-1 classification accuracy [43], tolerance accuracy, precision [44], recall [44], weighted F1 score [45], cross-entropy loss [46], mean error, and standard deviation of error as comprehensive evaluation metrics. Top-1 classification accuracy measures the proportion of samples where the predicted class with the highest probability matches the true class label. It is defined as:
$\text{Accuracy} = \dfrac{1}{N}\sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i)$
where $N$ denotes the total number of samples, $y_i$ represents the true class label for the $i$-th sample, $\hat{y}_i$ is the predicted class label (the class with the highest probability), and $\mathbb{I}$ is the indicator function that returns 1 if $\hat{y}_i$ equals $y_i$, and 0 otherwise.
Tolerance accuracy evaluates the proportion of predictions where the absolute error between the predicted continuous value and the true continuous value falls within a specified tolerance threshold δ . For classification models, the predicted continuous value is derived from the predicted class index. The metric is formulated as:
$\text{Tolerance Accuracy}(\delta) = \dfrac{1}{N}\sum_{i=1}^{N} \mathbb{I}(|\hat{v}_i - v_i| \le \delta)$
where $v_i$ is the true continuous carbon content value (in percentage) for sample $i$, and $\hat{v}_i$ is the predicted continuous carbon content value. For classification outputs, $\hat{v}_i = v_{\min} + \hat{y}_i \times \Delta v$, with $v_{\min} = 0.01$, $\hat{y}_i$ the predicted class index, and $\Delta v = 0.01$ (class interval width). $\delta$ denotes the tolerance threshold (e.g., 0.005, 0.02, or 0.05), and $\mathbb{I}$ returns 1 if the absolute error $|\hat{v}_i - v_i|$ is within $\delta$, and 0 otherwise.
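Both accuracy metrics reduce to a few lines of NumPy, as sketched below with illustrative argument names.

```python
import numpy as np

def top1_accuracy(pred_classes: np.ndarray, true_classes: np.ndarray) -> float:
    """Fraction of samples whose predicted class index matches the true class index."""
    return float(np.mean(pred_classes == true_classes))

def tolerance_accuracy(pred_classes: np.ndarray, true_values: np.ndarray,
                       delta: float, v_min: float = 0.01, dv: float = 0.01) -> float:
    """Fraction of predictions whose decoded carbon value lies within +/- delta of the true value."""
    pred_values = v_min + pred_classes * dv            # map class indices back to carbon content values
    return float(np.mean(np.abs(pred_values - true_values) <= delta))

# Example with hypothetical arrays y_hat (class indices), y_true (class indices), v_true (continuous labels):
# acc = top1_accuracy(y_hat, y_true)
# tol = tolerance_accuracy(y_hat, v_true, delta=0.02)
```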
Precision and recall are computed for each carbon content range (class) to provide fine-grained performance analysis. Precision measures the proportion of correctly predicted positive instances among all predicted positives for class c :
$\text{Precision}_c = \dfrac{\text{TP}_c}{\text{TP}_c + \text{FP}_c}$
Recall measures the proportion of actual positive instances correctly identified for class $c$:
$\text{Recall}_c = \dfrac{\text{TP}_c}{\text{TP}_c + \text{FN}_c}$
where $\text{TP}_c$ (true positives) is the number of samples correctly classified into class $c$, $\text{FP}_c$ (false positives) is the number of samples from other classes misclassified as class $c$, and $\text{FN}_c$ (false negatives) is the number of samples from class $c$ misclassified into other classes. These metrics enable detailed analysis of model performance across different carbon content intervals.
The F1 score for class c harmonizes precision and recall into a single metric:
$F1_c = \dfrac{2 \times \text{Precision}_c \times \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$
The weighted F1 score aggregates class-specific F1 scores while accounting for class imbalance by weighting each class’s contribution by its sample proportion:
$\text{Weighted } F1 = \dfrac{1}{N}\sum_{c=1}^{C} n_c \times F1_c$
where $C$ is the total number of classes, $n_c$ is the number of samples in class $c$, and $N = \sum_{c=1}^{C} n_c$ is the total sample count.
Cross-entropy loss quantifies the dissimilarity between predicted probabilities and true labels. For a classification model with C classes:
$L_{CE} = -\dfrac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\,\log(\hat{p}_{i,c})$
where $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise, $\hat{p}_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$, $N$ is the batch size or total number of samples, and $\log$ denotes the natural logarithm. Lower values indicate better alignment between predictions and ground truth.
Mean error (ME) measures the average prediction bias for continuous carbon content values:
$\text{ME} = \dfrac{1}{N}\sum_{i=1}^{N}(\hat{v}_i - v_i)$
where $v_i$ and $\hat{v}_i$ are defined as in Equation (10). Positive values indicate systematic overestimation of carbon content, while negative values indicate underestimation.
Standard deviation of error ( σ e ) quantifies the dispersion of prediction errors around the mean error:
$\sigma_e = \sqrt{\dfrac{1}{N-1}\sum_{i=1}^{N}(e_i - \bar{e})^2}$
where $e_i = \hat{v}_i - v_i$ is the prediction error for sample $i$, $\bar{e} = \frac{1}{N}\sum_{i=1}^{N} e_i$ is the mean error (Equation (16)), and $N$ is the total number of samples. Lower values indicate more consistent predictions.
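Both error statistics can be computed together from the decoded predictions, for example:

```python
import numpy as np

def error_statistics(pred_values: np.ndarray, true_values: np.ndarray) -> tuple[float, float]:
    """Return the mean error (bias) and the sample standard deviation of the prediction errors."""
    errors = pred_values - true_values
    mean_error = float(errors.mean())        # positive: overestimation, negative: underestimation
    std_error = float(errors.std(ddof=1))    # ddof=1 gives the 1/(N-1) estimator used above
    return mean_error, std_error
```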

4. Experiment

This section conducts systematic experimental validation of the proposed Transformer-based method for predicting carbon content in the later stage of BOF steelmaking. Dataset construction and preprocessing processes are described in detail to ensure experimental conclusion reliability, followed by an explanation of the experimental configuration. Finally, the model’s performance is comprehensively evaluated through comparative experiments, ablation studies, and overall performance analysis.

4.1. Dataset Construction and Preprocessing

A video dataset was collected from actual production data, comprising 32 complete BOF blowing processes. Video data were acquired using a high-performance industrial-grade camera from Hikvision, specifically designed for harsh industrial environments. The industrial camera was permanently installed at a fixed position, with its distance, framing, focus, and exposure parameters locked after initial calibration to ensure uniform imaging conditions across all processes. The camera’s frame rate was set to 25 fps to ensure high image quality and reliable long-term operation while fully capturing all critical transient flame activities at the converter mouth. Each blowing process lasted approximately 10 to 20 min, yielding between 15,000 and 30,000 raw image frames with a resolution of 2560 × 1440 pixels, ensuring clear details of the flame. Carbon content labels were precisely annotated by experienced metallurgical experts at key time points during the later stage of BOF steelmaking, integrating video information, sub-lance measurements, and additional process parameters. Experts manually annotated 2–6 key time points per blowing process, corresponding to 2–6 video frames. Additionally, each process had one definitive sub-lance measurement at the endpoint. Continuous labels for all other frames were programmatically generated via cubic spline interpolation between these sparse, high-confidence reference points. The multi-source annotation approach integrating expert knowledge guarantees high-quality and reliable labels for supervised model training.
The raw images and labels were preprocessed to meet deep learning requirements. Discrete carbon content labels provided by experts were processed using cubic spline interpolation to generate a continuous and smooth label curve across the entire later stage of steelmaking, ensuring physically plausible variations. All flame images were resized to 128 × 528 pixels to preserve essential morphological and color features while reducing computational load. Pixel values were normalized to the [0, 1] range to facilitate model convergence.
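This label-generation step can be expressed with SciPy's CubicSpline, as in the sketch below; the key-frame indices and carbon values in the usage comment are hypothetical, and the clipping to the label range is an added assumption for safety.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def interpolate_labels(key_frame_indices, key_carbon_values, n_frames: int) -> np.ndarray:
    """Generate a continuous per-frame carbon content label curve from sparse reference points."""
    spline = CubicSpline(key_frame_indices, key_carbon_values)
    labels = spline(np.arange(n_frames))            # smooth curve through the expert/sub-lance points
    return np.clip(labels, 0.01, 0.36)              # keep labels inside the physically meaningful range

# Example: four hypothetical expert key frames plus the endpoint sub-lance measurement
# labels = interpolate_labels([0, 5000, 11000, 17999], [0.30, 0.18, 0.09, 0.044], 18000)
```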
A strict dataset partitioning strategy enables scientific evaluation of model performance. Data from 12 out of the 13 batches were allocated for model development and were randomly split into training and validation sets at an 8:2 ratio. Such random partitioning ensured consistent data distribution between the training and validation sets. Specifically, the training set contained approximately 372,000 image frames with corresponding carbon content labels while the validation set contained approximately 93,000 frames. One complete batch excluded from all training activities served as an independent test set for evaluating model generalization capability. The test set contains approximately 18,000 frames of continuous video data fully reflecting carbon content variations during the late stage of a single batch, providing reliable evidence for assessing model performance in practical applications.
All experiments were conducted on a high-performance computing platform. The hardware configuration employs NVIDIA H800 PCIe accelerators (NVIDIA Corporation, Santa Clara, CA, USA) equipped with 80 GB HBM3 high-bandwidth memory. The software environment was based on PyTorch 2.0 [47], Python 3.10.9 [48], and CUDA 12.6 [49] integrated with the cuDNN acceleration library achieving efficient execution of deep learning operations. Additional tools included NumPy 2.2.6 and Pandas 2.2.3 for data processing, and Matplotlib 3.10.3 and Seaborn 0.13.2 for visualization.

4.2. Model Configuration and Training Strategy

The proposed model uses a 6-layer Transformer encoder–decoder architecture, with each layer equipped with 8 parallel attention heads to enable multi-scale feature parallel processing while maintaining a feature space dimension of 512. The FFN adopts an expansion–compression structure, where the hidden layer dimension expands to 2048 and introduces nonlinearity through ReLU activation functions. Image feature extraction utilizes a patch embedding strategy, dividing standardized 128 × 528 pixel flame images into 16 × 16 pixel patches as fundamental feature units. The convolutional embedding module implements a two-stage feature transformation, with the first convolutional layer expanding the original four-channels to 100 feature channels for comprehensive low-level visual feature extraction. The second convolutional layer compresses to 30 feature channels, before final projection into a 512-dimensional feature space via a fully connected layer.
Table 1 summarizes key model parameter configurations. The classifier outputs 36 categories corresponding to carbon content intervals from 0.01 to 0.36 in 0.01 steps, balancing precision and training stability. The architectural hyperparameters were determined through an iterative process of preliminary ablation studies and validation on our dataset, aiming to balance model capacity, computational efficiency, and generalization performance. The selection of a 512-dimensional model embedding ($d_{\text{model}}$) and 8 attention heads follows common practice in medium-scale vision Transformers, providing a sufficient representational space while managing computational cost. The 6-layer encoder–decoder depth was chosen as a compromise; shallower networks (e.g., 4 layers) showed underfitting tendencies with higher training loss, while deeper networks (e.g., 8 layers) offered diminishing returns in accuracy at the cost of significantly increased training time and risk of overfitting on our dataset size. Similarly, the patch size of 16 × 16 pixels was selected to preserve fine-grained flame texture details crucial for discrimination, as larger patches led to a noticeable drop in feature resolution and prediction accuracy. These choices collectively represent a configuration optimized for our specific task and data characteristics.
The AdamW [50] optimizer was used with weight decay to prevent overfitting. Learning rate was set to 2 × 10−5, and the batch size was 16. Training lasted 100 epochs with cross-entropy loss. A dropout rate of 0.1 was applied across all Transformer and fully connected layers. Positional encoding provided spatial information to the model. Checkpoints were saved every 20 epochs for model selection. The validation set monitored performance, and the independent test set evaluated generalization.
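An illustrative training loop matching these settings is sketched below; the weight-decay value is an assumption, since only its use is reported.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs: int = 100, lr: float = 2e-5, weight_decay: float = 1e-2):
    """Training loop with the reported settings: AdamW, lr 2e-5, cross-entropy loss, 100 epochs,
    checkpoints every 20 epochs (batch size 16 is set in the DataLoader)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for frames, labels in train_loader:          # frames: (B, T, 4, H, W), labels: (B,) class indices
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)  # (B, 36) logits vs. true carbon content classes
            loss.backward()
            optimizer.step()
        if (epoch + 1) % 20 == 0:                    # periodic checkpoints for model selection
            torch.save(model.state_dict(), f"checkpoint_epoch_{epoch + 1}.pt")
```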
Model performance is comprehensively evaluated through multiple metrics to ensure reliability and thoroughness. Top-1 accuracy serves as the primary evaluation indicator, measuring the proportion of instances where the predicted class with the highest probability matches the true label. The ±0.02 tolerance accuracy provides a more practical evaluation criterion, representing the proportion of samples where the predicted carbon content deviates from the true value within ±0.02. The weighted F1 score offers a balanced assessment of overall model performance by integrating both precision and recall. Cross-entropy loss quantifies the discrepancy between predicted values and ground truth.

4.3. Experimental Results

This subsection reports the experimental results for the proposed Transformer-based method. The model’s performance is comprehensively evaluated through comparative experiments, ablation studies, and overall performance analysis.

4.3.1. Comparative Experiments

The proposed model was compared with various representative deep learning methods for spatio-temporal sequence modeling tasks, including ConvGRU [51], CNN-LSTM [52], and 3D-CNN [53]. ConvGRU combines convolutional layers for spatial feature extraction with Gated Recurrent Units for temporal modeling. The architecture processes four-channel input through 3 × 3 convolution kernels for input-to-state and state-to-state transitions. Extracted features feed into a three-layer ConvGRU network with 64-dimensional hidden layers, with a linear classifier outputting carbon content categories. The CNN-LSTM uses CNN for spatial feature extraction and LSTM for temporal modeling. The CNN component contains four convolutional blocks equipped with batch normalization and ReLU activation functions, followed by adaptive average pooling. Extracted spatial features are subsequently processed by a two-layer LSTM with 256 hidden units, employing dropout regularization for temporal modeling. The last time step’s output is used for classification. 3D-CNN directly learns spatio-temporal features from stacked frames through 3D convolution kernels. The architecture employs five 3D convolutional layers with 3 × 3 × 3 kernel sizes and channel progressions of 64, 128, 256, 512, and 1024. Each layer includes batch normalization, ReLU activation, and 3D max pooling.
All benchmark models used identical experimental configurations. Input preprocessing, data augmentation, training procedures, and evaluation metrics remained consistent across models. Benchmark models were fully optimized on the dataset to ensure fair comparison.
Table 2 shows the comprehensive performance comparison of different models on the validation set. The Transformer-based model achieves the best performance on all evaluation metrics. Specifically, the Transformer model reaches a top-1 classification accuracy of 90.31%, significantly outperforming ConvGRU (6.36%), CNN-LSTM (66.79%), and 3D-CNN (66.35%). The Transformer’s ±0.02 tolerance accuracy reached 96.82%, while other models ranged from 17.23% to 93.64%, demonstrating highly reliable predictions within acceptable deviation ranges. The weighted F1 score of 0.9035 exceeded those of ConvGRU, CNN-LSTM, and 3D-CNN by 0.8785, 0.2631, and 0.2514, respectively, indicating robust classification capability across different carbon content ranges. The Transformer also achieved the lowest cross-entropy loss of 0.4597, demonstrating minimal deviation between predictions and true carbon content values.
The performance gap stems from the Transformer architecture’s unique advantages in capturing complex spatio-temporal dependencies across frames. While ConvGRU and CNN-LSTM model temporal dynamics through recurrent networks, they suffer from sequential dependencies and gradient issues when processing multi-frame information, struggling to establish global cross-frame correlations. Though 3D-CNN jointly models spatial and temporal features through 3D convolution, its receptive field remains limited by kernel size, restricting long-range inter-frame dependency capture. Conversely, the Transformer architecture directly establishes connections between any image patches across frames through self-attention mechanisms, enabling effective focus on discriminative spatial regions and critical inter-frame evolution patterns.
Figure 3 illustrates training and validation loss curves over 100 epochs, revealing distinct convergence characteristics across models. The Transformer model demonstrates optimal convergence, with training loss rapidly decreasing from 3.233 to 0.676 within 30 epochs, then continuing stable optimization to converge at 0.034 without significant fluctuation. Validation loss shows stable decline from 2.982 to 0.543. Despite slight mid-training validation loss fluctuations, overall trends remain similar between training and validation losses, indicating good generalization without overfitting. CNN-LSTM exhibited relatively stable convergence. Training loss decreased from 3.403, achieving primary convergence within 50 epochs and stabilizing around 0.732. Validation loss decreased from 3.185 to 1.049. Though convergence was slower than the Transformer with some training-validation loss separation in later stages, the overall stable declining trend indicates effectiveness in processing spatio-temporal data. 3D-CNN’s convergence showed considerable volatility. Training loss decreased from 3.280 to 0.282, but with notable oscillations, particularly during epochs 20–40. This high variance indicates an inability to find stable, generalizable solutions. While 3D convolution effectively extracts local spatio-temporal features, performance remains limited by finite receptive fields, causing optimization instability. ConvGRU demonstrated the poorest convergence characteristics. Training loss minimally decreased from 3.584 to 3.284, with validation loss changing slightly from 3.579 to 3.300. The flat loss curves reflect insufficient learning capability.
Comprehensive analysis confirms the Transformer model’s superior performance in convergence speed, training stability, and generalization ability. The model rapidly converges to low loss values while maintaining stability throughout training, effectively avoiding overfitting. These excellent convergence characteristics ensure reliability and stability for practical industrial applications. Traditional ConvGRU, CNN-LSTM, and 3D-CNN models exhibit various limitations processing complex spatio-temporal features in converter flame images, further validating Transformer architecture superiority. All subsequent experiments are conducted to rigorously validate the proposed Transformer architecture.

4.3.2. Classification Versus Regression Paradigm Comparison

To verify the effectiveness of defining carbon content prediction as a classification task rather than a regression task, a controlled comparative experiment was designed. The same Transformer architecture, feature extraction backbone network and training settings were maintained, and a regression model variant was built by modifying only the output layer and loss function for comparison.
The classification model is described in Section 4. It uses a classification head that outputs 36 units and is optimized using cross-entropy loss. The regression model replaces the classification head with a regression head composed of a linear layer, a ReLU activation function, and a final linear output layer to predict the carbon content value, and is optimized using MSE loss. See Table 3 for the performance comparison of the two paradigms on the test set.
Table 3 shows that the performance of the two paradigms is significantly different. The accuracy of the classification paradigm within ±0.005 tolerance was 90.31%, much higher than 32.11% from the regression task. Under the tolerance level of ±0.02, the accuracy of the classification paradigm was 96.82%, which exceeded the regression paradigm by 11.27%. The accuracy of the two paradigms is similar under the loose tolerance of ±0.05, indicating that the regression model can match the performance of the classification model when the accuracy requirement is low.
The error distributions in Figure 4a,b show that the classification model produces a highly concentrated error distribution with a sharp peak at zero error, a mean error of 0.0007, and a standard deviation of 0.0117. In contrast, the error distribution of the regression model presents a roughly Gaussian contour with a mean error of −0.0020 and a standard deviation of 0.0139, indicating that the model’s predictions fluctuate more and that larger prediction errors occur. This difference can be attributed to the modeling approaches. The classification model discretizes the continuous carbon content values into a limited number of categories and explicitly optimizes the category boundaries through cross-entropy loss, so the model can more accurately capture changes near key thresholds. The regression model, by contrast, uses the mean squared error loss to fit the continuous values directly, which is easily affected by noise and outliers when facing complex nonlinear relationships, resulting in larger prediction deviations.
To sum up, the classification paradigm shows higher accuracy and stability in the prediction task of converter endpoint carbon content, especially suitable for industrial scenarios with strict accuracy requirements. Although the regression paradigm can achieve similar performance under loose tolerance, its error distribution is wide and the prediction consistency is poor. Therefore, the classification paradigm has more advantages in practical applications requiring high-precision control.

4.3.3. Evaluation of Four-Channel Input Effectiveness

To evaluate the contribution of spatial–temporal feature enhancement, controlled experiments were conducted comparing the proposed four-channel (RGB + optical flow) input against the traditional three-channel RGB input. Both models used identical Transformer architectures and training protocols, with the input channel count as the only variable, ensuring that any performance differences could be directly attributed to the inclusion of optical flow information.
Table 4 presents a detailed performance comparison of different input modalities on the test set. The four-channel model achieved 90.31% top-1 classification accuracy, representing a 4.89 percentage point improvement over the three-channel baseline’s 85.42%. The practically significant ±0.02 tolerance accuracy reached 96.82% with the four-channel model, outperforming the three-channel model by 3.26%. The weighted F1 score improved from 0.8512 to 0.9035 while cross-entropy loss decreased from 0.6124 to 0.4597. These consistent improvements demonstrate that explicit motion information provided by optical flow features significantly enhances the model representation of flame dynamic characteristics, enabling more accurate carbon content judgments.

4.3.4. Ablation Studies

Ablation experiments were designed to analyze the contribution of key components in the proposed model. Specifically, three different variants of the model are constructed. The first variant removes positional encoding to evaluate sequence order information impact. The second variant replaces multi-head attention with average pooling to assess attention mechanism necessity. The third variant eliminates the decoder and introduces a learnable CLS token for direct classification through the encoder output, in order to evaluate the role of the decoder structure in the whole architecture. The performance comparison results of each model on the test set are shown in Table 5.
The results show that model performance deteriorated after removing the positional encoding: top-1 classification accuracy decreased by 2.14%, ±0.02 tolerance accuracy decreased by 0.89%, and the cross-entropy loss increased to 0.5447. This indicates reduced spatial order perception; without explicit positional guidance, the model struggles to capture flame morphology evolution patterns effectively.
Replacing multi-head self-attention with average pooling caused severe performance degradation. The accuracy of top-1 decreased by 72.04%, the accuracy of ±0.02 tolerance decreased by 63.46%, and the cross-entropy loss increased to 6.0076. The significant loss confirms self-attention’s effectiveness in capturing inter-frame dependencies and regional relationships, while average pooling cannot achieve critical visual feature focus and integration.
Removing the decoder and introducing a learnable CLS token for classification resulted in comprehensive performance decline. Experimental results show that top-1 accuracy decreased by 3.18%, weighted F1 score decreased to 0.8717, and cross-entropy loss increased to 0.5503. This indicates that although the CLS token can achieve global information aggregation, the decoder enables more careful integration of multi-level features and optimization of classification decision boundaries, making it more suitable for serialized flame image classification tasks.
The ablation experiments verify the rationality of the proposed architecture design. Positional encoding enables spatial sequence perception for flame morphology evolution. Multi-head self-attention captures inter-frame dependencies and regional correlations as the core performance guarantee. The decoder enhances classification accuracy through multi-level feature fusion. The synergy of different components constitutes an efficient flame analysis framework, which effectively integrates the temporal and spatial characteristics for high-precision prediction of endpoint carbon content.

4.4. Overall Performance Analysis and Visualization on Test Set

This section provides a comprehensive evaluation of the model’s performance on a previously unseen and independent test set.
Table 6 shows the classification performance indicators of the model in different carbon content categories. The model demonstrates varying performance across different carbon content ranges. In the low carbon content range (0.0440–0.0829), the model achieves a good precision value of 0.84 but relatively lower recall of 0.57, resulting in an F1 score of 0.68. In the medium carbon content range (0.0829–0.1510), the performance decreases with a precision value of 0.43, recall of 0.36, and F1 score of 0.40. However, in the high-carbon content range (0.1510–0.2482), the model shows strong performance with a high recall of 0.85 and balanced precision of 0.59, achieving the best F1 score of 0.69 among all categories. The overall performance demonstrates the model’s capability to handle different carbon content levels, with particular strength in identifying high-carbon-content samples.
Figure 5 presents the distribution of prediction errors from the model on the test set. The histogram reveals a left-skewed distribution with the peak slightly shifted to around −0.01 error, where approximately 190 samples are concentrated, indicating a mild tendency toward underestimation. The majority of predictions, representing approximately 75 percent of the total, fall within the error range of plus or minus 0.05. The distribution exhibits an asymmetric pattern, characterized by a longer tail extending toward negative errors compared to positive ones. Despite the presence of some outliers around +0.20 to +0.25, the sharp peak near zero and the overall standard deviation of approximately 0.04 to 0.05 demonstrate the model’s generally reliable performance in predicting carbon content during the later stage of BOF steelmaking.
Figure 6 compares the predicted and true carbon content values during the later stage of BOF steelmaking, with the red dashed line indicating the true values and the blue scatter points representing the model predictions. The two agree closely, indicating that the model captures the characteristic decreasing trend of carbon content during the blowing process. Deviations from the true values are minimal in the medium-to-low carbon content region (0.05–0.15). Discrepancies remain in certain cases, including several outlier predictions approaching roughly 0.30 toward the end of the sequence. Overall, the model tracks the carbon content reliably throughout the later steelmaking stage, validating the capability of the Transformer-based approach for carbon content estimation from flame video at the converter mouth.
Furthermore, the model demonstrates strong potential for real-time deployment. The drop in carbon content during the critical endpoint phase, for example from approximately 0.0829 to 0.0440, is rapid; analysis of qualified heats shows that this transition takes on average only about 17 to 18 s. To support real-time monitoring, the model operates on a very short temporal window of three image frames sampled at five-frame intervals, corresponding to 0.4 s of video at 25 fps. The lightweight Transformer-based architecture executes a complete forward pass in well under 1 s on standard GPU hardware. During offline validation, processing tens of thousands of frames from an entire heat to generate a continuous carbon content curve took only a few seconds, confirming that the computational overhead per prediction is negligible. In summary, the model not only tracks carbon content accurately in the later stage of steelmaking but also meets the low-latency requirements of real-time inference, providing a reliable foundation for dynamic carbon content monitoring based on converter mouth flame video.
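A minimal sketch of such a low-latency inference loop is given below (OpenCV and PyTorch; the helper names, the use of the Farnebäck flow magnitude as the fourth channel, and the tensor layout passed to the model are illustrative assumptions rather than the authors' released code). It samples every fifth frame, keeps a sliding window of three sampled frames (0.4 s at 25 fps), stacks the optical-flow channel onto the RGB channels, and performs one forward pass per window.

import cv2
import numpy as np
import torch

FRAME_STEP, WINDOW = 5, 3        # sample every 5th frame, keep 3 samples (0.4 s at 25 fps)

def flow_channel(prev_gray, gray):
    # Dense Farneback optical flow; its magnitude (an assumption) serves as the 4th channel.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    mag = cv2.magnitude(flow[..., 0], flow[..., 1])
    return cv2.normalize(mag, None, 0.0, 1.0, cv2.NORM_MINMAX)

def build_input(frames):
    # frames: list of 3 HxWx3 BGR uint8 frames -> tensor of shape (1, 3, 4, H, W)
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    stacked = []
    for i, frame in enumerate(frames):
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
        flow = flow_channel(grays[max(i - 1, 0)], grays[i])
        stacked.append(np.dstack([rgb, flow]))               # HxWx4
    x = np.stack(stacked).transpose(0, 3, 1, 2)               # (3, 4, H, W)
    return torch.from_numpy(x).unsqueeze(0).float()

def predict_stream(model, video_path):
    # Yields one class-probability vector per 3-frame window of the flame video.
    cap = cv2.VideoCapture(video_path)
    window, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % FRAME_STEP == 0:
            window = (window + [frame])[-WINDOW:]
            if len(window) == WINDOW:
                with torch.no_grad():
                    yield frame_idx, model(build_input(window)).softmax(dim=-1)
        frame_idx += 1
    cap.release()

Per prediction, the dominant costs are the dense optical-flow computation and a single forward pass, consistent with the sub-second latency discussed above.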

5. Conclusions

Carbon content prediction is critical for precise endpoint control and quality improvement in BOF steelmaking. While industrial flame recognition has been studied to some extent, real-time carbon content prediction based on a Transformer architecture offers a promising route to intelligent visual analysis in metallurgical applications. This work proposes a Transformer-based model to predict carbon content during the blowing process in the later stage of BOF steelmaking, incorporating several key innovations. The traditional regression task is reformulated as a classification problem to enhance robustness and interpretability under challenging industrial conditions, and the input representation is enriched with four-channel tensors that combine RGB information and optical flow features, strengthening the model's capacity to capture dynamic flame motion patterns during steelmaking.
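To make the classification reformulation concrete, the sketch below discretizes the observed carbon range into the 36 output classes listed in Table 1 and converts predicted class probabilities back into a continuous value for curve plotting; the equal-width binning and the probability-weighted bin centers are simplifying assumptions for illustration, not necessarily the exact scheme used in this work.

import numpy as np

N_CLASSES, C_MIN, C_MAX = 36, 0.0440, 0.2482     # class count from Table 1, range from Table 6
edges = np.linspace(C_MIN, C_MAX, N_CLASSES + 1)
centers = 0.5 * (edges[:-1] + edges[1:])

def carbon_to_class(c):
    """Training label: index of the bin containing the measured carbon content."""
    return int(np.clip(np.digitize(c, edges) - 1, 0, N_CLASSES - 1))

def probs_to_carbon(probs):
    """Continuous estimate recovered from the predicted class distribution
    (probability-weighted bin centers; the argmax bin center is an alternative)."""
    return float(np.dot(probs, centers))

# Example: a confident prediction around 0.06 carbon content.
p = np.zeros(N_CLASSES); p[carbon_to_class(0.06)] = 1.0
print(carbon_to_class(0.06), round(probs_to_carbon(p), 4))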
Results from an independent test set confirm the effectiveness of the proposed method, particularly within the critical decarburization endpoint phase, and validate the model's potential to support intelligent BOF steelmaking. By encapsulating expert knowledge, the generated real-time carbon content curve can serve as a valuable reference for field operators in oxygen lance control, reducing reliance on expensive sub-lance measurements and demonstrating considerable practical value.
Currently, the model relies on human-annotated data for supervision, which may constrain scalability. Future research could explore self-supervised or weakly supervised learning frameworks to reduce annotation dependency. As Transformer architectures advance, broader applications and enhanced automation capabilities in industrial process monitoring are anticipated, effectively bridging computer vision with metallurgical process control.

Author Contributions

Conceptualization, M.F., L.S. and Y.L.; Methodology, H.Y.; Software, H.Y.; Validation, H.Y. and Z.W.; Formal analysis, H.Y.; Investigation, H.Y., M.F., W.L., L.S., Q.W., N.C., R.Z., Z.W., Y.L., Z.M. and J.W.; Resources, M.F., W.L. and L.S.; Data curation, H.Y., Z.W. and Y.L.; Writing—original draft, H.Y.; Writing—review & editing, M.F., W.L., L.S., Q.W., N.C., R.Z., Z.W., Z.M. and J.W.; Visualization, H.Y.; Supervision, M.F., W.L., L.S., Q.W., N.C., R.Z., Z.M. and J.W.; Project administration, M.F. and W.L.; Funding acquisition, M.F. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Science and Technology Major Project (2025ZD1602303), National Natural Science Foundation of China (U25A20433, 42401521), Joint Research Fund for Beijing Natural Science Foundation and Haidian Original Innovation (L232001), Henan Key Research and Development Program (241111320700), Guangdong Basic and Applied Basic Research Foundation (2024A1515011866, 2024A1515011480, 2025A1515011300), Central Guidance on Local Science and Technology Development Fund of Shanxi Province (YDZJSX20231D005, YDZJSX20231B017), Science and Technology Innovation Program of Xiongan New Area (2025XAGG0028), and National Key Research and Development Program of China (2023YFF0905903).

Data Availability Statement

The datasets presented in this article are not readily available because they contain confidential industrial flame video data from the steelmaking process, which is subject to proprietary restrictions under the collaboration agreement with the steel plant. Requests to access specific, non-confidential portions of the data for verification purposes should be directed to the corresponding author, Meixia Fu (mxfu1205@ustb.edu.cn), and will be considered on a case-by-case basis.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, H.; Wang, B.; Xiong, X. Basic oxygen furnace steelmaking end-point prediction based on computer vision and general regression neural network. Optik 2014, 125, 5241–5248. [Google Scholar] [CrossRef]
  2. Wang, Z.; Liu, Q.; Liu, H.; Wei, S. A review of end-point carbon prediction for BOF steelmaking process. High Temp. Mater. Process. 2020, 39, 653–662. [Google Scholar] [CrossRef]
  3. Li, G.H.; Liu, Q. Present status and prospect of BOF steelmaking process control. J. Iron Steel Res. Int. 2013, 25, 1–4. [Google Scholar]
  4. Feng, S.C.; Wang, Y.H.; Ding, R.F. Application status of end-point control technologies in converter steelmaking. Metall. Ind. Autom. 2016, 40, 1–6. [Google Scholar]
  5. Huang, J.X.; Jin, N.D. Static control predictive model for converter refining end-point. Steelmaking 2006, 22, 45–48. [Google Scholar]
  6. Tang, B.; Wang, X. Static model for converter steelmaking with limestone. J. Northeast. Univ. Nat. Sci. 2014, 35, 534–538. [Google Scholar]
  7. Xu, G.; Li, M.; Xu, J.; Jia, C.; Chen, Z. Control technology of end-point carbon in converter steelmaking based on functional digital twin model. Chin. J. Eng. 2019, 41, 521–527. [Google Scholar]
  8. Rout, B.K.; Brooks, G.; Rhamdhani, M.A.; Li, Z.; Schrama, F.N.H.; Overbosch, A. Dynamic model of basic oxygen steelmaking process based on multi-zone reaction kinetics: Model derivation and validation. Metall. Mater. Trans. B 2018, 49B, 537–557. [Google Scholar] [CrossRef]
  9. Dering, D.; Swartz, C.; Dogan, N. Dynamic modeling and simulation of basic oxygen furnace (BOF) operation. Processes 2020, 8, 483. [Google Scholar] [CrossRef]
  10. Kang, Y.; Zhao, J.X.; Li, B.; Ren, M.M.; Cao, G.; Yue, S.; An, B.Q. End-Point Prediction of Converter Steelmaking Based on Main Process Data. Steel Res. Int. 2024, 95, 2400151. [Google Scholar]
  11. Vortrefflich, W.; Vries, J. Maximizing BOF production capacity and producing cost efficient by using sublance based process control. Iron Steel Rev. 2010, 10, 94–100. [Google Scholar]
  12. Zhao, B.; Zhao, J.; Wu, W.; Zhang, F.; Yao, T. Research on prediction model of converter temperature and carbon content based on spectral feature extraction. Sci. Rep. 2023, 13, 14409. [Google Scholar] [CrossRef]
  13. Zhou, M.; Zhao, Q.; Chen, Y. Endpoint prediction of BOF by flame spectrum and furnace mouth image based on fuzzy support vector machine. Optik 2019, 178, 575–581. [Google Scholar] [CrossRef]
  14. Liu, K.; Liu, L.; He, P.; Liu, W. A new algorithm of endpoint carbon content of BOF based on off-gas analysis. Steelmaking 2009, 25, 33. [Google Scholar]
  15. Schlueter, J.; Odenthal, H.J.; Uebber, N.; Blom, H.; Morik, K. A novel data-driven prediction model for BOF endpoint. In Proceedings of the Iron & Steel Technology Conference, Pittsburgh, PA, USA, 6–9 May 2013; pp. 923–928. [Google Scholar]
  16. Liu, L.M.; Li, P.; Chu, M.; Gao, C. End-point prediction of 260 tons basic oxygen furnace (BOF) steelmaking based on WNPSVR and WOA. J. Intell. Fuzzy Syst. 2021, 41, 2923–2937. [Google Scholar] [CrossRef]
  17. Gao, C.; Shen, M.G. End-point prediction of basic oxygen furnace (BOF) steelmaking based on improved twin support vector regression. Metalurgija 2019, 58, 29–32. [Google Scholar]
  18. Bae, J.; Li, Y.; Ståhl, N.; Mathiason, G.; Kojola, N. Using Machine Learning for Robust Target Prediction in a Basic Oxygen Furnace System. Metall. Mater. Trans. B 2020, 51B, 1632–1645. [Google Scholar] [CrossRef]
  19. Laha, D.; Ren, Y.; Suganthan, P.N. Modeling of steelmaking process with effective machine learning techniques. Expert Syst. Appl. 2015, 42, 4687–4696. [Google Scholar] [CrossRef]
  20. Zhang, R.; Yang, J.; Wu, S.; Sun, H.; Yang, W. Comparison of the Prediction of BOF End-Point Phosphorus Content Among Machine Learning Models and Metallurgical Mechanism Model. Steel Res. Int. 2023, 94, 2200682. [Google Scholar] [CrossRef]
  21. Zhu, M.; Li, C.; Zhang, X.; Yang, Z. A New Method to Predict Endpoint Phosphorus Content During Converter Steelmaking Process via Industrial Data and Mechanism Analysis. Metall. Mater. Trans. B 2024, 55B, 4660–4675. [Google Scholar] [CrossRef]
  22. Ghalati, M.K.; Zhang, J.; El-Fallah, G.M.A.M.; Nenchev, B.; Dong, H. Toward learning steelmaking—A review on machine learning for basic oxygen furnace process. Mater. Genome Eng. Adv. 2023, 1, e6. [Google Scholar] [CrossRef]
  23. Wang, M.; Li, S.; Gao, C.; Yang, Y.; Ai, X. A static prediction model of end-point carbon content and temperature of BOF steelmaking based on automatic optimization. Ironmak. Steelmak. 2024, in press. [Google Scholar]
  24. Liu, Z.; Cheng, S.; Liu, P. Prediction model of BOF end-point P and O contents based on PCA–GA–BP neural network. High Temp. Mater. Process. 2022, 41, 505–513. [Google Scholar]
  25. Hubbeling, P.D.; Oostermeijer, G.A. Sublance and dynamic control in converter steelmaking. Iron Steel 2007, 42, 83–86. [Google Scholar]
  26. Yue, F.; Bao, Y.P.; Cui, H.; Gao, S.Y.; Li, B.H.; Zhang, J. Sub-lance control-based predication model for BOF end-point. Steelmaking 2009, 25, 38–40. [Google Scholar]
  27. Sun, S.; Dongsheng, L.; Pyke, N.; Boylan, K.; Wallace, G. Development of an offgas/model technology to replace sublance operation for KOBM endpoint carbon control at ArcelorMittal Dofasco. Iron Steel Technol. 2008, 5, 36–42. [Google Scholar]
  28. Chen, Z.X.; Liu, H.; Qi, L. Feature selection of BOF steelmaking process data by using an improved grey wolf optimizer. J. Iron Steel Res. Int. 2022, 29, 1205–1223. [Google Scholar] [CrossRef]
  29. Wang, H.B.; Xu, A.J.; Ai, L.X.; Tian, N.Y. Prediction of endpoint phosphorus content of molten steel in BOF using weighted K-means and GMDH neural network. J. Iron Steel Res. Int. 2012, 19, 11–16. [Google Scholar] [CrossRef]
  30. Feng, K.; Yang, L.; Su, B.; Feng, W.; Wang, L. An integration model for converter molten steel end temperature prediction based on Bayesian formula. Steel Res. Int. 2022, 93, 2100433. [Google Scholar]
  31. Jiang, S.L.; Shen, X.; Zheng, Z. Gaussian process-based hybrid model for predicting oxygen consumption in the converter steelmaking process. Processes 2019, 7, 352. [Google Scholar] [CrossRef]
  32. Wang, Z.L.; Bao, Y.P.; Gu, C. Convolutional Neural Network-Based Method for Predicting Oxygen Content at the End Point of Converter. Steel Res. Int. 2023, 94, 2200342. [Google Scholar]
  33. Gu, M.Q.; Xu, A.; Wang, H.; Wang, Z. Real-time dynamic carbon content prediction model for second blowing stage in BOF based on CBR and LSTM. Processes 2021, 9, 1987. [Google Scholar] [CrossRef]
  34. Liang, B.; Wang, K.; Li, X. A Deep Learning Method for the Endpoint Carbon Prediction in BOF Steelmaking Process. In Proceedings of the 2024 IEEE 13th Data Driven Control and Learning Systems Conference (DDCLS), Kaifeng, China, 17–19 May 2024; pp. 666–671. [Google Scholar]
  35. Khaksar Ghalati, M.; Hao, Z.D.; Zhang, J.; Dong, H. Deep Transformers for Analyzing BOF Steelmaking Data. Metall. Mater. Trans. B 2025, 56, 4201–4217. [Google Scholar]
  36. Lu, H.; Zhu, H.; Jiang, Z.; Li, H.; Yang, C. Assessment of Multiple Hybrid Modeling Approaches Combining Mechanistic and Machine Learning Methods for Endpoint Temperature Prediction in Electric Arc Furnace. Metall. Mater. Trans. B 2026, 1–16. [Google Scholar] [CrossRef]
  37. Yang, L.; Liu, H.; Chen, F. Just-in-time updating soft sensor model of endpoint carbon content and temperature in BOF steelmaking based on deep residual supervised autoencoder. Chemom. Intell. Lab. Syst. 2022, 231, 104679. [Google Scholar] [CrossRef]
  38. Chang, S.C.; Zhao, C.; Li, Y.; Zhou, M.; Fu, C.; Qiao, H. Multi-channel graph convolutional network based end-point element composition prediction of converter steelmaking. IFAC-PapersOnLine 2021, 54, 152–157. [Google Scholar] [CrossRef]
  39. Liu, C.; Tang, L.X.; Liu, J.Y. A stacked autoencoder with sparse Bayesian regression for end-point prediction problems in steelmaking process. IEEE Trans. Autom. Sci. Eng. 2019, 17, 550–561. [Google Scholar] [CrossRef]
  40. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  41. Sun, F.; Su, C.; Xiong, B.; Wang, Y. DCR-EFlow: Dynamic Correlation Recurrent Architecture for Optical Flow Estimation Based on Event Cameras. Intell. Comput. 2025, 4, 0243. [Google Scholar] [CrossRef]
  42. Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian Conference on Image Analysis, Halmstad, Sweden, 29 June–2 July 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 363–370. [Google Scholar]
  43. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar]
  44. Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240. [Google Scholar]
  45. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar] [CrossRef]
  46. Bishop, C.M.; Nasrabadi, N.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006. [Google Scholar]
  47. PyTorch. PyTorch 2.0 (Version 2.0), [Computer Software]; Meta AI, 2023. Available online: https://pytorch.org/ (accessed on 2 January 2026).
  48. Python Software Foundation. Python 3.10.9 (Version 3.10.9), [Computer Software]; Python Software Foundation, 2023. Available online: https://www.python.org/ (accessed on 2 January 2026).
  49. NVIDIA; Vingelmann, P.; Fitzek, F.H.P. CUDA, Release: 12.6 [Computer Software]. NVIDIA Corporation. 2023. Available online: https://developer.nvidia.com/cuda-toolkit (accessed on 2 January 2026).
  50. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  51. Ballas, N.; Yao, L.; Pal, C.; Courville, A. Delving deeper into convolutional networks for learning video representations. arXiv 2015, arXiv:1511.06432. [Google Scholar]
  52. Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
  53. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
Figure 1. System architecture of the improved Transformer model. The asterisk (*) denotes the Learnable Query Token.
Figure 2. Input four-channel tensor.
Figure 3. Training and validation loss curves of four compared models over 100 epochs. The top panel shows training loss and the bottom panel shows validation loss for the Transformer (blue), ConvGRU (orange), CNN-LSTM (green), and 3D-CNN (red) models. The vertical axis represents loss value, and the horizontal axis denotes epoch number.
Figure 4. (a) Error distribution of classification paradigm; (b) error distribution of regression paradigm. The vertical axis indicates frequency, and the horizontal axis denotes prediction error (Predicted − True). In both panels, the red dashed line marks zero error, while green and blue dashed lines indicate ±0.02 and ±0.05 error tolerance boundaries, respectively. Panel (b) also includes an orange dashed line for ±0.005 tolerance.
Figure 5. Prediction error distribution. The vertical axis represents frequency, and the horizontal axis denotes prediction error (Predicted − True). The red dashed line indicates zero error. The histogram shows the number of predictions falling within each error bin.
Figure 6. Predicted vs. true values scatter plot over time. The horizontal axis shows the true carbon content, and the vertical axis shows the predicted carbon content. Each blue dot represents a single prediction. The red dashed line indicates perfect prediction (y = x). The green dashed lines mark the ±0.02 tolerance boundaries, and the purple dashed lines mark the ±0.05 tolerance boundaries. The text in the top left reports performance metrics: Top-1 accuracy (0.1232), ±0.02 tolerance accuracy (0.4839), and ±0.05 tolerance accuracy (0.7581).
Table 1. Parameter configurations.

Component       | Parameter/Configuration | Value
Image Embedding | Patch Size              | 16 × 16
                | Conv Layers             | [4 → 100 → 30]
Transformer     | d_model                 | 512
                | Layers (N)              | 6
                | Attention Heads (h)     | 8
                | FFN Dimension           | 2048
Classifier      | Hidden Units            | 256
                | Output Classes          | 36
Training        | Dropout                 | 0.1
                | Loss Function           | CrossEntropy
                | Optimizer               | AdamW
                | Learning Rate           | 2 × 10⁻⁵
                | Batch Size              | 16
                | Epochs                  | 100
Table 2. Performance comparison on validation set.

Model       | Top-1 Classification Accuracy | ±0.02 Tolerance Accuracy | Weighted F1 Score | Cross-Entropy Loss
ConvGRU     | 6.360%                        | 17.23%                   | 0.0250            | 3.2999
CNN-LSTM    | 66.79%                        | 93.64%                   | 0.6404            | 1.0486
3D-CNN      | 66.35%                        | 87.28%                   | 0.6521            | 1.4959
Transformer | 90.31%                        | 96.82%                   | 0.9035            | 0.4597
Table 3. Classification vs. regression performance on validation set.

Paradigm       | ±0.005 Tolerance Accuracy | ±0.02 Tolerance Accuracy | ±0.05 Tolerance Accuracy | Mean Error | Std. Error
Regression     | 32.11%                    | 85.55%                   | 99.54%                   | −0.0020    | 0.0139
Classification | 90.31%                    | 96.82%                   | 99.26%                   | 0.0007     | 0.0117
Table 4. Performance comparison of different input modalities on validation set.

Input Modality                   | Top-1 Accuracy | ±0.02 Tolerance Accuracy | Weighted F1 Score | Cross-Entropy Loss
RGB (3 Channels)                 | 85.42%         | 93.56%                   | 0.8512            | 0.6124
RGB + Optical Flow (4 Channels)  | 90.31%         | 96.82%                   | 0.9035            | 0.4597
Table 5. Ablation study performance on validation set.

Model             | Top-1 Classification Accuracy | ±0.02 Tolerance Accuracy | Weighted F1 Score | Cross-Entropy Loss
Proposed model    | 90.31%                        | 96.82%                   | 0.9035            | 0.4597
w/o position code | 88.17%                        | 95.93%                   | 0.8821            | 0.5447
w/o attention     | 18.27%                        | 33.36%                   | 0.2028            | 6.0076
w/o decoder       | 87.13%                        | 96.15%                   | 0.8717            | 0.5503
Table 6. Model classification performance metrics.

Carbon Content Range | Precision | Recall | F1 Score | Samples
0.0440–0.0829        | 0.84      | 0.57   | 0.68     | 477
0.0829–0.1510        | 0.43      | 0.36   | 0.40     | 532
0.1510–0.2482        | 0.59      | 0.85   | 0.69     | 541