An Application of Decision Tree-Based Twin Support Vector Machines to Classify Dephosphorization in BOF Steelmaking

Abstract: Ensuring the high quality of end-product steel by removing phosphorus in the Basic Oxygen Furnace (BOF) is essential, since excess phosphorus leads to cold shortness. This article aims at understanding the dephosphorization process through end-point P-content in BOF steelmaking based on data-mining techniques. Dephosphorization is often quantified through the partition ratio ($l_p$), which is the ratio of wt% P in slag to wt% P in steel. Instead of predicting the values of $l_p$, the present study focuses on the classification of final steel based on slag chemistry and tapping temperature. This classification signifies different degrees (‘High’, ‘Moderate’, ‘Low’, and ‘Very Low’) to which phosphorus is removed in the BOF. Data of slag chemistry and tapping temperature collected from approximately 16,000 heats from two steel plants (Plant I and II) were assigned to four categories based on the unsupervised K-means clustering method. An efficient decision tree-based twin support vector machines (TWSVM) algorithm was implemented for category classification. Decision trees were constructed using three clustering concepts: Gaussian mixture model (GMM), mean shift (MS) and affinity propagation (AP). The accuracy of the predicted classification was assessed using the classification rate (CR). Model validation was carried out with a five-fold cross-validation technique. The fitted model was compared in terms of CR with a decision tree-based support vector machines (SVM) algorithm applied to the same data. The highest accuracy (≥ 97%) was observed for the GMM-TWSVM model, implying that by manipulating the slag components appropriately using the structure of the model, a greater degree of P-partition can be achieved in BOF.


Introduction
With an almost 100% increase in the price of iron ore over the past 5 years, the removal of phosphorus from these ores has become essential in order to maintain the persistent quality of steel [1]. Increased levels of phosphorus in steel can lead to cold shortness causing brittleness and poor toughness [2,3]. The process of phosphorus removal from iron ores is known as dephosphorization. In comparison to dissolved oxygen in liquid steel, iron oxide content in slag has shown a greater influence on dephosphorization for a given slag basicity and carbon content of steel. Dephosphorization has often been defined as (%P)/[%P], i.e., the ratio of slag/steel phosphorus distribution that frequently lies around the calculated equilibrium values for the metal/slag reactions involving iron oxide in slag [3,4].
Over the last few decades, ensuring high quality and productivity have motivated a substantial amount of research on phosphorus removal in steel based on various empirical and thermodynamic models [5][6][7][8][9]. Equilibrium relationships to estimate the effect of various slag components on phosphorus concentration were initially studied by Balajiva and Vajragupta in the 1940s on a small electric arc furnace (EAF) [5]. They reported that an increase in the concentration of CaO and FeO resulted in a positive influence on dephosphorization. In 1953, Turkdogan and Pearson discussed that the reactant concentration was not consistent in the changing external conditions, and therefore, decided to focus on estimating the equilibrium of the following reaction:

$$2[\mathrm{P}] + 5[\mathrm{O}] = (\mathrm{P_2O_5}) \qquad (1)$$
where [A] and (B) represent a species in the metal phase and slag phase, respectively [6]. The equilibrium constant of (1) was expressed as a function of the slag temperature T. Further, Suito and Inoue investigated the CaO-SiO2-MgO-FeO slag system, where they concluded that the phosphorus distribution ratio increases with an increasing concentration of CaO in the slag [7]. Their equation for the phosphorus partition ratio is written in terms of (%A), the percentage by weight of each slag component A. Moreover, Healy used thermodynamic data on phosphorus activity and phosphate free energy in a CaO-P2O5 binary system to develop the relationship shown in (5) [8].

$$\log \frac{(\%\mathrm{P})}{[\%\mathrm{P}]} = \frac{22350}{T} + 0.08(\%\mathrm{CaO}) + 2.5 \log(\%\mathrm{Fe_t}) - 16 \pm 0.4 \qquad (5)$$
The mathematical relationship in (5) estimates the phosphorus distribution between molten iron and complex slags of the CaO-FeO-SiO2 system and was extended to the CaO-Fet-SiO2 system. In 2000, Turkdogan assessed the activity coefficient of P2O5 in the slag, $\gamma_{\mathrm{P_2O_5}}$, for a wide range of CaO, FeO, and P2O5 concentrations, as given by (6) [9].
$$\log \gamma_{\mathrm{P_2O_5}} = -9.84 - 0.142(\%\mathrm{CaO} + 0.3\,\%\mathrm{MgO}) \qquad (6)$$
More recently, Chattopadhyay and Kumar applied multiple linear regression (MLR) to analyze data from two plants: one with low slag basicity (low temperature) and the other with high slag basicity (high temperature) [10]. They suggested that a significant improvement of P distribution can be obtained by reducing the phosphorus reversal during blowing and after tapping, and by reducing the tapping temperature. In 2017, Drain et al. reviewed 36 empirical equations on phosphorus partition and presented their own new equation based on regression [11]. They identified the effects of minor slag constituents, including Al2O3, TiO2 and V2O5. An increase in Al2O3 content was found to have a detrimental effect on (%P)/[%P] except for low oxygen potential conditions, whereas TiO2 and V2O5 were found to positively affect (%P)/[%P]. In an effort to understand the reaction kinetics and identify optimum treatment conditions, Kitamura et al. proposed a new reaction model for hot metal phosphorus removal by saturating the slag with dicalcium silicate in the solid phase and then applying the dissolution rate of lime to simulate laboratory-scale experiments [12]. Kitamura et al. also discussed a simulation-based multi-scale model for a hot metal dephosphorization process by multiphase slag, which was an integration of macro- and meso-scale models [13]. A coupled reaction model was used to define the reactions between liquid slag and the metal in the macro-scale model, whereas phase diagram data were applied and the P2O5 partition between solid and liquid slag was analyzed with thermodynamic data in the meso-scale model.
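As a small worked illustration of Healy's relationship (5), the sketch below evaluates its central estimate for one slag. It assumes the commonly cited form of the equation, with a 22350/T temperature term, T in kelvin, and base-10 logarithms. The function name and the slag values are hypothetical and chosen only for the example.

```python
import math


def healy_log_partition(temp_k, cao_wt_pct, fe_total_wt_pct):
    """Central estimate of log10((%P)/[%P]) from Healy's relationship.

    Assumes T in kelvin and base-10 logarithms; the quoted uncertainty
    of +/- 0.4 is not included in the returned value.
    """
    return (
        22350.0 / temp_k
        + 0.08 * cao_wt_pct
        + 2.5 * math.log10(fe_total_wt_pct)
        - 16.0
    )


# Hypothetical slag: 1650 degC tapping temperature, 45 wt% CaO, 20 wt% total Fe.
log_lp = healy_log_partition(temp_k=1650.0 + 273.15, cao_wt_pct=45.0, fe_total_wt_pct=20.0)
print(f"log10 partition ratio = {log_lp:.2f} (+/- 0.4)")
```

Evaluating the same function at a lower temperature returns a larger value, which is consistent with the observation that reducing the tapping temperature favors phosphorus partition.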
While kinetic models are generally better than thermodynamic models for predicting the phosphorus partition ratio and end-point phosphorus, the highly complex nature of the BOF process means that data-driven models have a better chance of achieving much higher prediction accuracy. Additionally, data-driven models are dynamic in nature, can accommodate new data sets, and are not limited by the test conditions or experimental conditions under which the model is developed.
In general, thermodynamic models are extremely beneficial for gaining an understanding of the dephosphorization process in BOF steelmaking, and in many cases they also provide accurate estimates of the dephosphorization index. However, the accuracy of such estimates greatly depends on the homogeneity of the slag compositions across different batches. Such homogeneity is highly unlikely in a BOF shop because of the high variability in the data that arises from the dynamic nature of the multiphase process, e.g., variation in iron ore quality, composition of coke, etc. Moreover, these models depend strongly on the original experimental data. Identifying the factors which can lead to a higher degree of dephosphorization, building new infrastructure, and updating the existing amenities based on these models to initiate and accelerate the phosphorus removal process can be both time consuming and financially burdensome.
On the other hand, the empirical models are useful to predict the dephosphorization measure corresponding to slag compositions and tapping temperatures only for a specific thermodynamic system. However, estimates may not be so precise for completely different systems. This implies that although a model accurately estimates phosphorus partition from one dataset, it may not perform as well on a new dataset. Most of these models, although they apply regression, do not estimate the slope parameters using parametric least-squares methods or non-parametric rank-based methods, thereby ignoring the concept of error variability [14,15]. Furthermore, the application of a multiple linear regression model requires certain criteria related to error distribution (e.g., homoscedasticity, normality and independence) to be met, which are often not verified during model development. In this context, the application of data-driven approaches like machine learning methods can be transformative, as these models have the potential to identify and utilize the inherent latent structures and patterns in the process based on available data, and thereby evolve accordingly.
Machine learning (ML) is a fast-growing area of research because the implemented algorithm is updated based on training data. Only a few ML techniques have been used to predict the end-point phosphorus content in the BOF steelmaking process. For example, a multi-level recursive regression model for complete end-point P content prediction was established by Wang et al. in 2014 based on a large amount of production data [16]. A predictive model based on principal component analysis (PCA) with a back propagation (BP) neural network was discussed by He and Zhang in 2018, where they predicted end-point phosphorus content in BOF based on BOF metallurgical process parameters and production data [17]. Multiple linear regression and generalized linear models were fitted to the BOF data from two plants for predicting $\log l_p$ by Barui et al. in 2019, where they discussed various verification and adequacy measures that need to be incorporated before fitting an MLR model [18].
Though not many articles have been published discussing the role of ML-based methods in end-point phosphorus content prediction in BOF steelmaking, there are a few significant studies which have demonstrated various ML-based approaches in end-point carbon content and temperature control in BOF. Wang et al. (2009) developed an input weighted SVM for end-point prediction of carbon content and temperature, with reasonable accuracy [19]. Improving on the work of Wang et al., Liu et al. (2018) used a least squares SVM method with a hybrid kernel to address the dynamic nature of problems in the steelmaking process [20]. More recently in 2018, Gao et al. used an improved twin support vector regression (TWSVR) algorithm for end-point prediction of BOF steelmaking, obtaining 96% and 94% accuracy for carbon content and temperature, respectively [21]. Though the applications of these models have not been explored specifically in the dephosphorization process, these models do indicate that ML-based algorithms have the potential for dealing with non-linear patterns in data associated with BOF steelmaking processes.
In this paper, an attempt was made to classify final steel based on $l_p = \log\{(\%\mathrm{P})/[\%\mathrm{P}]\}$ values, and then predict the classes to which the response may belong depending on slag chemistry and tapping temperature. The process parameters such as initial hot metal chemistry and slag adjustment could be used as inputs to the model. However, the final slag chemistry is highly correlated to the hot metal chemistry and fluxes added, through the BOF mass balance model. Therefore, a model built on final slag chemistry is very similar to a model built using initial hot metal chemistry, fluxes added and amount of oxygen blown. By creating categories based on $l_p$ values, the degree of phosphorus partition in the BOF process can be predicted; slag compositions belonging to the highest ordinal category would predict the lowest concentration of end-point phosphorus. In contrast to regression-based problems, a classification of $l_p$ can be useful when the quality of steel within specific thresholds can be assumed to be similar. The classes are created based on quartiles (percentiles) of $l_p$ and using the K-means clustering method.
Following the work of Dou and Zhang (2018), a decision tree twin support vector machine based on kernel clustering (DT2-SVM-KC) method is proposed in this paper for multiclass classification [22]. This approach to classification considers two non-parallel hyperplanes as opposed to the single hyperplane of traditional SVM. TWSVM obtains the two non-parallel planes by solving two smaller quadratic programming problems, and hence the computation cost in the training phase for TWSVM is reduced to roughly 25% of that of standard SVM [23,24]. On the other hand, decision trees based on recursive binary splitting are used extensively for classification purposes [25]. As a combination of TWSVM and decision tree, the proposed approach is appropriate for multiclass classification problems, offering better generalization performance with less computation time [22].
The paper is arranged in the following scheme. Section 2 highlights the theory behind the algorithm, as well as an in-depth explanation of the algorithm itself. In Section 3, the results are presented and interpreted. Furthermore, this section analyzes the main findings through the outcome of the results. In Section 4, future considerations and improvements are discussed.

Theory and Methodology
As mentioned earlier, phosphorus partition is measured by the natural logarithm of the ratio of wt% P in slag to wt% P in steel, and it is denoted by $l_p$. A larger $l_p$ value indicates a greater degree of phosphorus partition, resulting in steel with a lower content of phosphorus. As a result, the quality of steel is correlated to the $l_p$ value, and a model that can accurately predict this value would be sufficient to characterize the dephosphorization process. In this paper, a hybrid method combining a decision tree with TWSVM was considered for analysis, which could classify unlabeled test data into various dephosphorization categories. Python 3.7 was used to implement the proposed algorithm.

Nature of the Data
The proposed algorithm was constructed and tested on datasets obtained from two plants: plant I and plant II. Data from plant I (tapping temperature 1620-1650 °C and 0.088% P) consist of observations on nine features of slag chemistry from 13,853 heats to characterize $l_p$. On the other hand, data collected from plant II (tapping temperature 1660-1700 °C and 0.200% P) have seven slag chemistry features from 3084 heats. A detailed summary of the various features and $l_p$ values for both plants is presented in Table 1. All values (except $l_p$) are given in weight %.

Theoretical Model
The classification model comprises three phases: (1) initial labelling of data, (2) splitting the labeled data using a decision tree, and (3) training and testing of the TWSVM. The process flow is highlighted in Figure 1.
In the first phase, two unsupervised learning approaches, namely K-means clustering and quantile-based clustering, were incorporated to categorize the entire data set into four clusters based on the $l_p$ values [26]. The clusters were labelled as 0, 1, 2, and 3. The approaches are discussed below.

K-means clustering: Initially, the cluster centroids $\{\mu_1, \mu_2, \mu_3, \mu_4\}$ were randomly assigned. For each $l_p$ value $x_i$, $c_i = \arg\min_{j \in \{1,2,3,4\}} \mathrm{dist}(x_i, \mu_j)$ was computed, where $\mathrm{dist}(\cdot, \cdot)$ represents the Euclidean distance between two points. Each value in the dataset was assigned to a cluster based on the centroid of closest proximity. Let $S_j$ be the set of points assigned to the $j$-th cluster. The updated centroids $\{\mu'_1, \mu'_2, \mu'_3, \mu'_4\}$ were calculated as the means of the clusters, given as
$$\mu'_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i, \qquad j = 1, 2, 3, 4.$$
The centroid update steps were carried out iteratively until a convergence condition was met. Typically, convergence means that the relative difference between the centroids of two consecutive steps is less than some small pre-specified quantity $\epsilon$. Phase 1 of initial labelling of data using K-means clustering is shown in Figure 2.

Quantile-based clustering: The second method used for initial labelling of $l_p$ values is based on quantiles (percentiles). Each $l_p$ value was assigned to one of four groups, viz. Minimum-25th percentile, 25th-50th percentile, 50th-75th percentile, and 75th percentile-Maximum. Figure 3 shows an example of quartile-based clustering of $l_p$ values.

In the second phase of the algorithm, the labelled data was passed through a decision tree with two output nodes [25,26]. Three different criteria for splitting the decision tree were considered, namely, Gaussian mixture models (GMM), mean shift (MS), and affinity propagation (AP) [26][27][28][29][30]. K-means clustering is developed on a deterministic idea that each data point can belong to only a single cluster.
However, GMM assumes that each data point has certain probabilities of belonging to every cluster and that the data in each cluster follow a Gaussian distribution. The algorithm utilizes the expectation-maximization (EM) technique to estimate model parameters [27]. For a given data point, the algorithm estimates the probability of belonging to a specific cluster, and the point is assigned to the cluster for which this probability is maximum [28]. The next splitting algorithm, MS, clusters all the data points based on an attraction basin with respect to convergence of a point to a cluster center [29]. This method is an unsupervised learning approach that iteratively shifts a data point towards the point of highest density or concentration in the neighborhood. Unlike K-means clustering, the number of clusters does not need to be specified. In Python, this algorithm is controlled by a parameter called the kernel bandwidth, which generates a reasonable number of clusters based on the data. The final splitting algorithm tested was affinity propagation [30]. Similar to MS, AP does not require the number of clusters to be specified. This algorithm clusters points based on their relative attractiveness and availability. In the first step, the algorithm takes the negative squared difference, feature by feature, among all data points to produce an $n \times n$ similarity matrix, where $n$ is the number of data points. The similarity function $s(i, k)$ is defined as $s(i, k) = -\lVert x_i - x_k \rVert^2$, where $i$ and $k$ are the row and column indices, respectively. Based on this matrix, an $n \times n$ responsibility matrix with entries $r(i, k)$ is generated by taking the similarity value between two data points and subtracting the maximum of the remaining similarities in the same row. The responsibility matrix works as a helper matrix to construct the availability matrix, with entries $a(i, k)$, which indicates the relative availability of a data point to a particular cluster given the attractiveness received from the other clusters.
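In this matrix, the diagonal terms are updated using (12) and the rest of the terms using (13):
$$a(k, k) = \sum_{i' \neq k} \max\{0, r(i', k)\} \qquad (12)$$
$$a(i, k) = \min\Big\{0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\{0, r(i', k)\}\Big\} \qquad (13)$$
where $r(i, k)$ and $a(i, k)$ denote the entries of the responsibility and availability matrices, respectively. The final step to create the clusters is to construct a criterion matrix, with entries $c(i, k) = r(i, k) + a(i, k)$, which is the sum of the availability matrix and the responsibility matrix.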
Subsequently, the highest value in each row of the criterion matrix is located, and the data point indexed by its column is known as the exemplar of that row. Rows that share the same exemplar are in the same cluster.

To generate the decision tree, one of the clustering algorithms was applied on the entire dataset to produce two centroids acting as the basis for child nodes. The purpose of applying a clustering algorithm was to find the optimal split of one node with four labels into two nodes with two labels each, where the splitting criterion is the entropy among data points. For ease of computation, the initial clusters and child nodes are represented by their respective cluster centroids. This computation evaluates the difference between each initial cluster centroid and each child node centroid, and the assignment is carried out based on whichever combination produces the least entropy.
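The splitting step described above can be sketched as follows, under several assumptions. The example uses scikit-learn's `GaussianMixture` with two components to obtain the child-node centroids (`MeanShift` and `AffinityPropagation` from `sklearn.cluster` could be substituted). Because the exact entropy computation is not spelled out in the text, a summed centroid-to-centroid Euclidean distance is used here as a stand-in cost. The function name `split_labels` and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.mixture import GaussianMixture


def split_labels(X, labels, random_state=0):
    """Split four initial labels into two child nodes holding two labels each."""
    # A two-component GMM on all heats; its means act as child-node centroids.
    gmm = GaussianMixture(n_components=2, random_state=random_state).fit(X)
    child_centroids = gmm.means_

    # Represent every initial cluster by its centroid (mean feature vector).
    classes = [int(c) for c in np.unique(labels)]
    label_centroids = {c: X[labels == c].mean(axis=0) for c in classes}

    best_cost, best_split = np.inf, None
    # Enumerate every balanced split: `left` goes to child 0, the rest to child 1.
    for left in combinations(classes, 2):
        right = tuple(c for c in classes if c not in left)
        cost = sum(np.linalg.norm(label_centroids[c] - child_centroids[0]) for c in left)
        cost += sum(np.linalg.norm(label_centroids[c] - child_centroids[1]) for c in right)
        if cost < best_cost:  # stand-in for the least-entropy criterion
            best_cost, best_split = cost, (left, right)
    return best_split


# Toy data: four labelled groups, two on each side of a wide gap.
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
X_demo = np.vstack([rng.normal(c, 0.2, size=(50, 2)) for c in centers])
y_demo = np.repeat([0, 1, 2, 3], 50)
print(split_labels(X_demo, y_demo))
```

Enumerating only the balanced splits keeps the constraint of two labels per child node explicit, at the cost of not scaling to many labels.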
Once two child nodes are created with corresponding labels, they are passed through the TWSVM [22,23]. For generating the TWSVM, the data is split into 80% training and 20% testing data. A brief discussion on the working mechanism of TWSVM follows. Let $x \in \mathbb{R}^d$ denote the feature vector and $y$ denote the response variable. Let $T = \{(x_k, y_k) : k = 1, \ldots, m\}$ be the training set, where each response $y_k \in \{+1, -1\}$; +1 and −1 represent the binary classes to which each $x_k$ will be assigned. The matrices $A \in \mathbb{R}^{m_1 \times d}$ and $B \in \mathbb{R}^{m_2 \times d}$ represent the data points assigned to class +1 and −1, respectively, where $m_1$ and $m_2$ represent the number of data points assigned to +1 and −1. Using the training data, the goal of the TWSVM is to obtain two non-parallel hyperplanes, $x^T w_1 + b_1 = 0$ and $x^T w_2 + b_2 = 0$, each lying close to one class and far from the other. The hyperplanes are obtained by solving two quadratic programming problems (QPPs), TWSVM1 and TWSVM2, whose formulations contain parameters that quantify the trade-off between classification and margin error, the relative costs of misclassification, and an error vector associated with each sample. To simplify the solutions of the QPPs, Lagrangian multipliers were used; for example, TWSVM1 is solved through its corresponding Lagrangian. As the objective functions of the TWSVM are convex and the constraints are linear, the KKT conditions are necessary and sufficient for a solution of this type of problem. After the algorithm is trained with labeled data, the unlabeled testing data are passed through the TWSVM, which assigns each data sample to a class based on whichever hyperplane is closer with respect to perpendicular distance. The accuracy of the final model is measured by the classification rate ($CR$), which is the percentage of the test data correctly labeled. More specifically,
$$CR = \frac{\sum_{i=1}^{4} n_{ii}}{n} \times 100\%,$$
where $n_{ij}$ is the number of heats with actual group label '$i$' ($i = 1, 2, 3, 4$) and predicted group label '$j$' ($j = 1, 2, 3, 4$) in the test data and $n = \sum_{i} \sum_{j} n_{ij}$. For our data, $x$ represents the vector of slag chemistry values corresponding to a particular heat and $y$ takes −1 or +1 based on two classes of $l_p$ values.
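To make the two-hyperplane idea concrete, the sketch below fits a linear least-squares twin SVM, a closed-form variant that replaces the inequality-constrained QPPs with two linear systems. It is a stand-in chosen for brevity and is not the QPP-based TWSVM solved in this work. The function names, the penalty parameters `c1` and `c2`, the small `ridge` term, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def fit_lstsvm(A, B, c1=1.0, c2=1.0, ridge=1e-8):
    """Linear least-squares twin SVM; rows of A are class +1, rows of B class -1."""
    E = np.hstack([A, np.ones((A.shape[0], 1))])  # [A e1]
    F = np.hstack([B, np.ones((B.shape[0], 1))])  # [B e2]
    eye = ridge * np.eye(E.shape[1])  # tiny ridge for numerical stability
    # Plane 1 minimizes 0.5*||E u||^2 + 0.5*c1*||F u + e2||^2:
    # close to class +1 and at roughly unit functional distance from class -1.
    u1 = -np.linalg.solve(F.T @ F + E.T @ E / c1 + eye, F.T @ np.ones(F.shape[0]))
    # Plane 2 minimizes 0.5*||F v||^2 + 0.5*c2*||e1 - E v||^2:
    # close to class -1 and at roughly unit functional distance from class +1.
    u2 = np.linalg.solve(E.T @ E + F.T @ F / c2 + eye, E.T @ np.ones(E.shape[0]))
    return (u1[:-1], u1[-1]), (u2[:-1], u2[-1])


def predict(X, planes):
    """Assign +1 or -1 according to the hyperplane with the smaller perpendicular distance."""
    (w1, b1), (w2, b2) = planes
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)


def classification_rate(y_true, y_pred):
    """Percentage of test samples whose predicted label matches the actual label."""
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))


# Toy example with an 80/20 train/test split of two synthetic classes.
rng = np.random.default_rng(1)
A_all = rng.normal(2.0, 0.5, size=(100, 2))
B_all = rng.normal(-2.0, 0.5, size=(100, 2))
planes = fit_lstsvm(A_all[:80], B_all[:80])
X_test = np.vstack([A_all[80:], B_all[80:]])
y_test = np.array([1] * 20 + [-1] * 20)
print(f"CR = {classification_rate(y_test, predict(X_test, planes)):.1f}%")
```

A kernel version would replace the rows of `A` and `B` with kernel evaluations against the training set, which is how an RBF kernel would enter this formulation.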

Model Adequacy
Five-fold cross-validation was applied to the model. This is a resampling procedure used to evaluate the machine learning model performance on the training data. Following this procedure, the results were less biased towards an arbitrary selection of the test set. In the five-fold cross-validation step, the dataset is split into five groups, and the model is trained five times, each time on four groups with the remaining group held out from the training set and used as a test set. In this manner, each group has the opportunity to be the test set once, and therefore the bias of the performance estimate is reduced. The final accuracy presented in the results section is an average of the five accuracy values obtained in the five-fold cross-validation step. Finally, for both plants the data points were normalized to reduce the amount of computational power needed for training and to keep the data entropy in the same order of magnitude among all features.
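The five-fold procedure can be sketched with plain index bookkeeping, as below. The helper names and the `train_and_score` callback (a user-supplied function that trains on the given training indices and returns an accuracy for the held-out indices) are hypothetical and used only for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(train_and_score, n_samples, k=5, seed=0):
    """Hold each fold out once; return the mean score and the k individual scores."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        # Training indices are the union of the other k - 1 folds.
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k, scores


# Placeholder callback: reports the share of samples held out in each round.
mean_share, shares = cross_validate(lambda tr, te: len(te) / (len(tr) + len(te)), 100)
print(mean_share, shares)
```

Each sample appears in exactly one test fold, so the averaged score uses every heat for testing once.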

Descriptive Statistics
Mean, standard deviation (SD), minimum and maximum for the set of features (i.e., slag chemistry components) and $l_p$ values are presented in Table 1. Box plots for all the features and the $l_p$ value for plant I are presented in Figure 4. To maintain brevity, the box plots corresponding to the features for plant II are not provided. The box plots indicate that most of the features have a symmetric distribution except Fetotal, MnO and Al2O3. These plots serve as a visual aid to identify the range of values containing the middle 50% of the data for each feature. For example, the longer whisker of the box plot corresponding to Al2O3 indicates that the values are skewed and potentially have many outliers, while the middle 50% lies between 1.5% and 2% by weight. Table 2 shows the distribution of the $l_p$ values in each of the initially labeled clusters by K-means for both plant I and plant II. It is observed that each cluster has a distinct range of $l_p$ values since there is no overlap among one standard deviation intervals from the mean. For quantile-based clustering, each cluster has approximately 3563 (25%) observations for plant I and 771 (25%) observations for plant II. This signifies that the initial cluster labels have the potential to classify $l_p$ values into disjoint intervals, and therefore could be a reliable measurement to categorize features based on the degree of dephosphorization.
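As a rough illustration of the two initial labelling schemes summarized above, the sketch below assigns four ordered labels to a list of partition values, once with a plain one-dimensional K-means loop and once with quartile cut points. The function names, the absolute (rather than relative) convergence tolerance, and the optional `init` argument are assumptions made for this example. `statistics.quantiles` requires Python 3.8 or later.

```python
import random
import statistics


def kmeans_1d(values, k=4, init=None, tol=1e-6, max_iter=100, seed=0):
    """Lloyd's algorithm on scalar values; returns (labels, centroids), both ordered."""
    # Random initial centroids unless the caller supplies them (assumed option).
    centroids = list(init) if init is not None else random.Random(seed).sample(values, k)
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by absolute (1-D Euclidean) distance.
        labels = [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            updated.append(sum(members) / len(members) if members else centroids[j])
        shift = max(abs(new - old) for new, old in zip(updated, centroids))
        centroids = updated
        if shift < tol:
            break
    # Relabel so that label 0 has the lowest centroid and label k-1 the highest.
    order = sorted(range(k), key=lambda j: centroids[j])
    rank = {old: new for new, old in enumerate(order)}
    return [rank[lab] for lab in labels], [centroids[j] for j in order]


def quartile_labels(values):
    """Label each value 0-3 by the number of quartile cut points it exceeds."""
    cuts = statistics.quantiles(values, n=4)
    return [sum(v > q for q in cuts) for v in values]


# Hypothetical partition values, for illustration only.
lp_values = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9, 3.0, 3.1, 2.9, 4.0, 4.1, 3.9]
print(kmeans_1d(lp_values, init=[1.0, 2.0, 3.0, 4.0])[0])
print(quartile_labels(lp_values))
```

Quartile labels always produce equal-sized groups, whereas the K-means labels follow the natural gaps in the values, which is one way to read the difference between the two schemes.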

Model Hyper-parameter Selection
The model hyper-parameters are parameters that cannot be learnt by the model during the training phase. Hyper-parameters are supplied to the model by the user and tuned empirically, in most cases by a trial-and-error approach. The performance of the model is heavily dependent on the choice of hyper-parameters since they influence how fast the model learns and converges to a solution. For Twin-SVM, those hyper-parameters are epsilon ($\epsilon$) and the cost ($C$) function as defined in Section 2. $\epsilon$ regulates the distance from the decision boundary to the threshold at which a point belonging to a certain class and within the threshold should be penalized by the cost function. Consequently, the larger the cost function, the higher the penalty for a point within the threshold. A large value will hinder the model convergence whereas a smaller value may cause an over-fitted decision boundary. Other hyper-parameters also determine the model's ability to construct a linear or nonlinear structure. TWSVM can accommodate linear, polynomial, and radial basis function (RBF) kernels. These kernels are transformations applied to the dataset to produce a representation of the data in a different space where feature sets can be classified [25,26].
The hyper-parameters were selected using a trial-and-error approach, with the starting values, increments, and ending values presented in Table 3. From this table, the hyper-parameter values giving the best performance in terms of accuracy were chosen. The trial-and-error approach is not random, however, but follows the best practices in the field of machine learning. A comparison between linear and RBF kernel-based DT2-SVM-KC was carried out. RBF performed better than the linear kernel-based TWSVM in terms of accuracy (classification rate), which further indicates the strong presence of inherent non-linearity in the data obtained in BOF steelmaking. A third option considered was a polynomial kernel; however, it was not pursued because it resulted in numerical overflow even for a quadratic polynomial.
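A systematic version of the trial-and-error search can be sketched as an exhaustive sweep over a small grid. The grid values, the parameter names `kernel`, `c` and `epsilon`, and the `toy_score` placeholder below are hypothetical; in practice the scoring function would train the classifier and return its cross-validated classification rate, and the grid would follow the ranges of Table 3.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Return (best_params, best_score) over the Cartesian product of grid values."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(**params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(kernel, c, epsilon):
    """Placeholder score; stands in for a cross-validated classification rate."""
    return (1.0 if kernel == "rbf" else 0.0) - (c - 1.0) ** 2 - (epsilon - 0.1) ** 2


# Hypothetical search ranges, not the values of Table 3.
grid = {"kernel": ["linear", "rbf"], "c": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0]}
print(grid_search(toy_score, grid))
```

The sweep evaluates every combination once, so its cost grows multiplicatively with the number of values per hyper-parameter.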

Accuracy of Results
The results of the analysis of BOF data based on the decision tree twin SVM cluster-based algorithm are presented in this section. As mentioned previously, two methods of labeling $l_p$ values (quartile-based and K-means-based), along with three clustering algorithms that function as the split criterion for the decision tree, give a total of six different DT2-SVM-KC algorithms. The accuracy (%) for each of these algorithms is presented in Table 4. As shown in Table 5, K-means clustering as a label generator combined with the GMM-cluster-based DT2-SVM-KC provided the most accurate result, with accuracy around 98.03%.
A comparison between the GMM, MS and AP-based algorithms is presented in Figure 5. Results show that GMM performed better than MS and AP in terms of accuracy for each dataset. Figure 6 shows the accuracy obtained using K-means clustering vs. quartile clustering to generate initial labels of $l_p$. This figure shows that K-means clearly yields superior accuracy when compared with the quartile-based clustering method. The mean of the accuracy across the node-leaves for plant I is 78.77% and for plant II is 98.04%.

Justification for Twin SVM over other SVM models
To validate the use of a complex model such as Twin SVM, we compare its results with general SVM-based models. Results are shown in Figure 7. Further, Table 5 shows that twin SVM improves the accuracy of the binary classification problem by at least 15% for both plants I and II. The hyperparameters for the SVM were tuned to reach maximum possible accuracy and the decision tree was consistent with the best method used with Twin SVM (i.e., K-means as labeling generator and GMM as splitting criterion for the decision tree).

Discussion and Interpretation of Results
In this section, the results of the analysis are discussed. This section comprises two parts: first, a discussion of the performance of the algorithm, and second, an interpretation from an industrial perspective.

Algorithm Performance
Twin SVM outperforms SVM by at least 15% across all datasets. Mathematically, twin SVM solves two smaller quadratic programming problems rather than a larger one as SVM does. Therefore, twin SVM is tailored for binary classification, which also means that the decision tree boosts its accuracy. This is because the decision tree reduces a multi-class classification problem into small binary problems that can be solved by the twin SVM. In fact, the decision tree allows a greater number of classes as long as the node-leaves end up with two classes each (i.e., $2^n$ classes where $n \geq 2$). The downside of increasing the number of classes is that each node-leaf will have fewer data points to train on every time the number of classes is increased, resulting in less significant and biased results. Given the number of data points, four classes were found empirically to be optimal for this experiment.
With regard to the difference in accuracy across plants, it is worth noting that although plant I has about 4.5 times as many data points as plant II, the accuracy for plant I is lower. This is counterintuitive; however, two possible contributing factors could be suggested. First, plant I has more features (i.e., V2O5 and TiO2) than plant II, and second, the range of $l_p$ values from plant II is narrower than the one from plant I, as shown in Table 1. As a consequence, the classification for plant I is more difficult due to higher variability of the data. On the other hand, ore from plant I contains more phosphorus than that of plant II, which further plays a role in the reduction of accuracy for plant I data. These results suggest that the tuning of hyper-parameters should be performed separately on each plant's data; nevertheless, the algorithm remains the same.
Finally, as part of the data pre-processing stage, all data points were normalized. The purpose of normalizing the data was to reduce the computational power for training and testing. Unexpectedly, normalizing also had an effect on the accuracy of the algorithm, since an increase of at least 3 percentage points was noted for both plants. This effect was attributed to the fact that less variability increases the accuracy of the machine-learning model. By normalizing the data, features with high numerical values such as temperature are weighted the same as features with smaller numerical values such as MnO or CaO.
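The effect of putting all features on a common scale can be illustrated with a minimal sketch. The text does not state which normalization was used, so min-max scaling to the interval [0, 1] is assumed here; the function name and the sample rows (tapping temperature in °C and CaO in wt%) are hypothetical.

```python
def min_max_normalize(rows):
    """Rescale every feature (column) to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    # Guard against division by zero when a feature is constant.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / span for v, lo, span in zip(row, lows, spans)] for row in rows]


# Hypothetical heats: (tapping temperature in degC, CaO in wt%).
heats = [[1600.0, 40.0], [1700.0, 50.0], [1650.0, 45.0]]
print(min_max_normalize(heats))
```

After scaling, a 100 °C spread in temperature and a 10 wt% spread in CaO both span the same unit interval, so neither dominates a distance-based classifier.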

Application of the Results for Industry
From an industrial viewpoint, the method of initial labeling is crucial. It means that, given a certain slag chemistry, we can predict the level of phosphorus in the steel. K-means clustering labels the data based on the proximity of an unlabeled data point to the cluster center of labeled data points. This center is dynamically updated until the algorithm converges, but it always represents the mean of the data points within the cluster at any given time. For a metallurgist, the results of the algorithm will show how close the current batch is to a certain $l_p$ value (i.e., the center of the cluster). The advantage of using K-means over the quartile labelling approach is clearly demonstrated in terms of accuracy, as illustrated in Table 5. One explanation for such results is that K-means is group-based clustering, while quartile labeling considers only the relative position of a value with respect to the others.
The GMM cluster-based decision tree TWSVM algorithm designed in this research aims at higher flexibility and adaptability to real-world conditions. Given the values of slag chemistry for a particular heat in a BOF shop, our model will be able to predict into which of the four classes (i.e., 0-3) the batch will fall. For example, if the current batch belongs to class 0, the content of phosphorus in the resulting steel will be high and the corresponding output will be undesirable. The objective is to produce steel corresponding to the high classes (classes 2-3). The proposed algorithm has been shown to work with two different plants having different slag chemistries. The algorithm provides a general framework and requires training data from a specified plant in order to achieve optimal accuracy in classifying $l_p$ values.

Conclusion
A decision tree twin support vector machine based on kernel clustering (DT2-SVM-KC) machine-learning model is proposed in this paper. The model classifies a batch (obtained from a specific heat at a BOF shop) based on slag chemistry into one of four classes of $l_p$ values. Class 0 corresponds to the highest phosphorus content in the resulting steel, while class 3 corresponds to the lowest phosphorus content. The model is efficient and shows high accuracy with a relatively low computational requirement and cost. Even though further testing on different datasets is required, the model has shown consistent performance across Plant I and II. However, it was noticed that with an increase in the number of features and in the variability of the response variable $l_p$, the accuracy of the algorithm decreases. It is therefore recommended that a metallurgical model based on fundamental theory be used to rule out features that have no or negligible influence on the amount of phosphorus in the steel. Additionally, for samples with high variability in the response variable ($l_p$), an increase in the number of labels during the cluster labelling could be advantageous, provided there are enough data points corresponding to each node leaf. This is to avoid an under-fitted model with unreliable results.
Finally, one of the main considerations for industry applications is the interpretability of these results. In this paper, the results suggest that, given a data point from a heat, one can deduce a certain $l_p$ range for that batch and also understand the influence of the features on the results, such that the slag composition can be adjusted to reduce the amount of phosphorus in liquid steel for the next batch.