Using Statistical Modeling to Predict the Electrical Energy Consumption of an Electric Arc Furnace Producing Stainless Steel

The non-linearity of the Electric Arc Furnace (EAF) process and the correlated behavior of the process variables impose challenges that must be considered if one aims to create a statistical model that is relevant and useful in practice. Both the statistical modeling framework and the statistical tools used in the modeling pipeline must therefore be selected with these challenges in mind. To this end, a non-linear statistical modeling framework known as Artificial Neural Networks (ANN) has been used to predict the Electrical Energy (EE) consumption of an EAF producing stainless steel. The statistical tools Feature Importance (FI), Distance Correlation (dCor), and Kolmogorov–Smirnov (KS) tests are applied to investigate the most influential input variables as well as the reasons behind model performance differences when predicting the EE consumption of future heats. The performance of the best model, measured in kWh per heat, was comparable to that of the best model reported in the literature while requiring substantially fewer input variables.


Introduction
In light of the increased use of Electric Arc Furnaces (EAF) to produce steel, there is growing interest in ways of reducing the raw material and energy consumption of the process. Successful attempts will not only reduce the environmental impact but also improve the financial result for the company producing the steel. One common tool to evaluate new production strategies is modeling. Modeling is valuable because it enables process engineers to evaluate proposed changes without interfering with the process, which could otherwise disturb the steel plant supply chain. These modeling approaches can be, for example, physicochemical models using established relationships such as the mass and energy balance equations [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19]. Another approach is Computational Fluid Dynamics (CFD) modeling [20]. While these models are mainly based on physicochemical equations, there is another type of model that is purely based on data. These models are known as statistical models and have frequently been used to predict the Electrical Energy (EE) consumption of the EAF, per heat [21][22][23][24] or per ton of tapped steel [25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40][41][42].
Several gaps in previous attempts to model the EE consumption using statistical modeling have recently been highlighted [43]. One such gap is related to the black-box behavior of non-linear statistical models, such as Artificial Neural Networks (ANN), which undermines the trust process engineers place in the model. Other gaps are related to the statistical modeling framework itself, for example, the lack of a detailed description of the data and modeling pipeline, which includes variable selection, data collection, data cleaning, parameter-search, modeling, and evaluation of model robustness. Another gap is the lack of connection between metallurgical knowledge and statistical modeling, both of which must be carefully considered if one aims to create a statistical model that is of practical use in the steel plant. In some cases, this has led to a lack of relevant variables and, in other cases, to an overuse of variables. The aim of this article is to address these gaps in the context of the aforementioned statistical modeling of an EAF producing stainless steel.
ANN will be used to predict the EE consumption of an EAF producing stainless steel. This is a suitable algorithm for this prediction problem since some important physicochemical relations governing the EE consumption are non-linear. The choice of variables to model the EE consumption will be motivated using both metallurgical reasoning and statistical reasoning. First, the energy balance equation will be used to highlight the physicochemical relations between the variables governing the EE consumption [43]. Second, the expected correlative relation between the variables with respect to the EE consumption will be highlighted. The common thread in both approaches is the use of process knowledge in the reasoning.
The data and modeling pipeline will be explained in detail for reproducibility and for reference in future studies aiming to use statistical models to predict the EE consumption of the EAF. To validate the models and to address the black-box behavior of the ANN models, statistical analysis tools such as Feature Importance (FI), the Kolmogorov–Smirnov (KS) test, and Distance Correlation (dCor) will be used. These methods have not previously been used in the context of statistical modeling of the EE consumption in the EAF. The consequences and aspects of this approach will also be discussed.
The results of the modeling show that a large number of input variables is not necessary to create the best performing model; it is enough to select a subset from a set of well-chosen variables using process knowledge. Using dCor, KS tests, and FI as complementary tools can aid process engineers both in finding the most important variables with respect to model performance and in understanding why a model performs differently on previously unseen heats (test data).

EAF Process
The EAF is the main melting process in the mini-mill type of steel plant. It uses raw materials such as steel scrap and alloys to create molten steel for further processing in downstream processes in the steel plant.
The EAF process begins with the charging phase during which raw materials are added to the furnace. The melting phase starts when the electrodes are powered on and bored down into the raw material. This phase lasts until enough of the raw material has melted to make room for the second basket of raw material. This is followed by yet another melting phase. During the melting phases, burners are activated to remove cold spots, which facilitates an even melting behavior of the charged scrap. After most of the raw material is melted, the refining phase starts during which the steel is adjusted to a pre-specified composition. Additional raw material such as carbon and silicon are added in combination with oxygen lancing to facilitate exothermic chemical reactions. This reduces the amount of EE needed to heat the steel. Lastly, the steel is tapped into a ladle for further processing in the steel plant. Any necessary preparations are then made for the furnace, such as fettling of the refractories, before the next heat starts. This generalized EAF process is illustrated in Figure 1.

Energy Balance Equation
The energy balance equation of the EAF process makes it possible to express the EE consumption, E_El, as a sum of ingoing and outgoing energy factors (Equations (1)-(3)).
Each of the energy factors is related to physical and chemical entities (see Table 1). The ingoing and outgoing energy in the EAF process has previously been calculated [9,25,44-46]; the results are compiled in Table 2.
Table 2. Ingoing and outgoing energy in the EAF process as percentages of the energy sources and energy sinks, respectively. Values are computed from a synthesis of reported values [9,25,44-46].

The ingoing and outgoing energy of the EAF, as percentages of each total, provide guidance in choosing input variables for a statistical model predicting the EE consumption.

Non-Linearity
Some of the terms in the energy balance equation listed in Table 1 are related to non-linear physical phenomena in the process, for example E_Gas and E_Rad. Breaking down the Tap-to-Tap time (TTT) into its smaller components Charging, Melting, Refining, Extended Refining, Tapping, and Preparation, the process evidently becomes considerably more non-linear. The reason is that the energy losses through convection, conduction, and radiation differ greatly between these sub-processes. For example, the energy loss through radiation is much higher when the steel is molten, i.e., during Refining, compared to the energy loss during the first charging phase. Adding the dynamic timelines of each sub-process due to varying amounts of scrap, refining times, and various delays between heats, the physicochemical relation to the EE consumption becomes even more complex.
Because of the non-linear and non-normally distributed outcome of the process, linear statistical models, such as Multivariate Linear Regression (MLR), and linear statistical metrics, such as the Pearson correlation, are sub-optimal tools in the context of predicting the EE consumption.
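To illustrate why a linear metric can mislead here, the sketch below computes both the Pearson correlation and the distance correlation for a purely quadratic relation. It is an illustrative implementation following the standard double-centering definition of dCor, not the code used in this study; Pearson reports essentially zero while dCor detects the dependence.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D variables
    (sketch following the standard double-centering definition)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a = np.abs(x - x.T)                                  # pairwise distance matrices
    b = np.abs(y - y.T)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()    # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

# A purely quadratic relation: Pearson sees nothing, dCor does
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2
pearson = np.corrcoef(x, y)[0, 1]
dcor = distance_correlation(x, y)
print(round(pearson, 3), round(dcor, 3))
```

Because the relation is symmetric around zero, the linear covariance cancels out exactly while the distance-based statistic remains clearly positive.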

Statistical Modeling
Statistical modeling differs from other types of mathematical modeling, such as those based on physical and chemical relations; see the energy balance equation in Section 2.2 for an example of a physicochemical model. Physical models act directly upon the values of the input variables to predict the output variable. Statistical models, on the other hand, evaluate a value relative to the previously observed values of that input variable, in the context of the values of the other input variables. The value of the output variable is always a probability or the value with the highest probability. This means that statistical models do not adhere to the physicochemical relations between the variables. The connection between the physicochemical relations and the model is, by and large, dependent on the data quality, the variations in the data, and the correlative behavior between the variables. Hence, statistical models can seem counter-intuitive in the eyes of a practitioner of physicochemical-based modeling.
Data quality: Poor data quality will negatively impact the performance of a statistical model since uncertainties are imposed. There are many aspects to achieving good data quality. The variables must be correctly defined with respect to what they represent, which is not always the case because not everything that can go wrong can be accounted for. For example, if the furnace is considered closed when the roof is closed, then the charging time of the furnace could still be counted while the furnace is in the melting stage if the roof is not completely shut. Connecting plant events to logged data is not always trivial and fault-free. Other examples where errors are prevalent are online measurements, estimations of energies based on those measurements, and manual logging of data.
Data variability: Near constant variables add little value to a statistical model because the variations in the data are what the statistical model learns and then uses to make predictions on test data. Statistical models require variations in the data to function properly. This is perhaps one of the most counter-intuitive traits of statistical modeling since a constant term in an arbitrary physicochemical model is important for its prediction. For example, a statistical model for an EAF where the total amount of ingoing lance oxygen is constant will not receive any benefit from including that variable. This is in contrast to a physical model where the amount of added oxygen leads to exothermic chemical reactions.
Correlation: Strongly correlated variables are, in a statistical sense, redundant. Weakly correlated variables with respect to the output variable may also be redundant; the degree of redundancy is closely related to the correlation between the input variable in question and the other input variables. It may be that the input variable does not add any further information to the statistical model with respect to lowering the prediction error. Because of this, some variables that are considered important for predicting the EE consumption in a physical model may appear unimportant for the statistical model. On the one hand, correlation does not imply causation; on the other hand, statistical models cannot distinguish correlation from causation. However, correlation can point to areas where causation may exist. This stresses the importance of possessing domain-specific knowledge about the process to distinguish between causative and non-causative relations between the variables governing the statistical model. Three cases of correlation in the EAF process with respect to the EE consumption are presented below, which highlight the complex relations between the different entities governing the EE consumption. These cases are also illustrated in Figure 2.
First, the composition of the ingoing material affects the EE consumption through exothermic reactions. Here, the composition together with oxygen lancing gives rise to exothermic reactions, which in turn reduce the EE consumption.
Second, the various sub-processes of the EAF affect the EE consumption at different rates. For example, during the melting process, EE is added at a higher rate than during, for example, the refining process. Furthermore, the energy loss through radiation, convection, and conduction is more intense during the refining process, when most of the steel is melted. Adding all sub-processes together gives the total process time. The delays in the process are added to the sub-process where the delays occur. Thus, the time variables are both interrelated and have a complex relation to the EE consumption.
Third, the raw material types that are added to the EAF have a more complicated relation to the EE consumption. Raw material types correlate positively with the EE consumption since they directly add to the total weight of the ingoing material to be melted. The raw material types are also correlated with one another since, given a pre-determined charge weight, the raw material types combined must sum to that specific weight; if one raw material type is not available, another is used as a replacement. Raw material types can also contribute negatively to the EE consumption if the raw material contains C, Si, Al, Fe, or Cr, all of which react exothermically with oxygen. Variable selection comes down to how much additional useful information, with respect to predictive performance, the model gains after adding another input variable. The three points mentioned above also tie closely to the specific EAF where the data comes from. Even though the interrelations between the input variables and the EE consumption are similar between EAF furnaces, it is not possible to know a priori the specific impact a given set of input variables has on the predictive performance.

In the supervised statistical modeling framework, which will be used in this paper, each row of input data has a corresponding output value during the training phase and test phase. The supervised statistical modeling framework can be abstracted into the following steps.
1. Select a statistical model and values for the model-specific parameters.
2. Train the model using training data until the model accuracy converges, i.e., stops improving.
3. Test the model on previously unseen data.
4. Record the accuracy on the test data and evaluate its practical applicability.
5. If the accuracy on the test data is satisfactory, deploy the model into production.
6. Re-train the model if the model accuracy has deteriorated. This deterioration is bound to happen over time in a production setting, partly due to changes in the process.
To find the optimal combination of model parameters, multiple models are trained for each combination of parameters. This is known as grid-search or parameter-search.
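A grid-search of this kind can be sketched with scikit-learn; the parameter grid and the synthetic data below are purely illustrative and are not the grid of Table 7 or the plant data used in this study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Each combination of these parameters yields one trained model
param_grid = {
    "hidden_layer_sizes": [(8,), (16,)],
    "activation": ["tanh", "logistic"],
}
search = GridSearchCV(
    MLPRegressor(max_iter=2000, random_state=0),
    param_grid, cv=3, scoring="r2",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The cross-validated score for every combination is stored in `search.cv_results_`, which makes it straightforward to inspect how sensitive the model is to each parameter.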
There are also subcategories of supervised statistical models. In the scope of this paper, these are divided into linear and non-linear statistical models. Linear statistical models can learn linear relationships between the input variables and the output variable; one such example is Multivariate Linear Regression (MLR). Non-linear statistical models, on the other hand, can learn both linear and non-linear relationships between the input variables and the output variable; one such example is ANN. Both ANN and MLR have been commonly used to predict the EE consumption in the EAF [43]. While an ANN model can learn non-linear relations in the data, the model is almost impossible to interpret. This is commonly referred to as a black-box model and is one of the main reasons why such models are not widely accepted in practice in the context of steel process modeling. However, by using statistical analysis tools, it is possible for process engineers to get insights into which input variables are the most important for the model to make accurate predictions, as well as the reasons behind performance differences on test data. Furthermore, it is always possible to create a more complex statistical model for any given problem. Nevertheless, one should always strive for model parsimony, which is achieved by picking the simplest of a group of models achieving similar performance. Model complexity is related to both the number of input variables and the chosen type of statistical model.

Method
The workflow explained in this section is illustrated in Figure 3. The software used for all parts of the experiments was Python, provided by the Anaconda distribution. The hardware and software specifics are shown in Appendix A.1.

Furnace Information
The data used in this study comes from an Alternating Current (AC) EAF producing stainless steel. The furnace has a nominal capacity of 80 tons of molten steel, and the electrical system has a maximum power of 80 MW. The charged raw material is 100% scrap and alloys. A preheater, which uses part of the furnace off-gas enthalpy as energy source and heat transfer medium, is employed to remove entrained moisture and to burn off residual oil and grease in the scrap. Conventional oxy-fuel burners are used to facilitate an even temperature distribution during melting. Oxygen lancing is used to facilitate oxidation of elements such as C and Si. The number of heats produced per year is approximately 5000.

Variable Selection
The variable selection process is partly based on the discussion regarding the energy balance equation, the non-linearity, and the reasoning about the correlative relations of the EAF process with respect to the EE consumption (see Sections 2.2-2.4). The variables can be divided into four categories:
1. Time: The logged times for the whole process as well as the various EAF sub-processes are important for predicting the EE demand. Each process sub-stage, Charging, Melting, Extended refining, Refining, and Tapping, contributes differently to the heat loss. For example, the heat loss is higher when molten steel is present compared to when the first bucket of scrap is charged. Both the TTT and the Process Time were included. The total time imposed by delays, defined as the sum of the deviations from the nominal time of each EAF sub-process, was also included.
Some obvious correlations can be expected with regard to these time variables. As Charging, Melting, Extended refining, Refining, and Tapping make up the Process Time and the majority of the TTT, these variables are expected to be relatively highly correlated.
2. Chemical: Oxidation and the oxy-fuel burners can account for as much as 50% and 11% of the total ingoing energy, respectively (see Table 2). The contribution of the burners was accounted for by one variable, propane. The oxygen gas injected through the burners was not included due to a near-stoichiometric relationship between oxygen gas and propane (5:1). The wt.% of C, Si, Al, and O of the total charged metallic material and the oxygen lancing were also included to account for exothermic chemical reactions occurring during the process.
The wt.% Fe, Cr, and Ni of the total charged metallic material were included to account for possible deviations in the process due to the steel grade being produced.
The contents of the charged oxide-bearing raw material are, of course, expected to impact the thermodynamics and kinetics of each reaction. However, there are more complicated factors in play between the metallic elements in the steel melt and the oxides in the slag. To account for these complex effects, the following oxides, in wt.% of the total charged oxide-bearing raw material, were included: Cr2O3, MgO, CaO, FeO, SiO2, and Al2O3.
It is known by experience that the specific heat per volume unit of oxygen is inversely proportional to the total amount of lanced oxygen gas.
3. Charged material: The different material types are expected to contribute differently to the melting behavior and heat transfer of the scrap. Hence, all 8 of the available Material Types were included in the list of input variables. The Total Weight of the ingoing material was included because it is closely connected to the energy required to melt the steel. It was also divided into two separate variables, Metal Weight and Slag Weight, which represent the total weight of metallic material and the total weight of oxide-bearing raw material, respectively.
The sum of all Material Types equals the Total Weight. Hence, the Material Types are expected to be correlated with the Total Weight. Furthermore, the sum of the Slag Weight and the Metal Weight equals the Total Weight, so these are also expected to be correlated.
4. Energy: There are numerous energy variables that could be included in the list of input variables.
However, most are in one way or another connected to the already mentioned variables. For example, heat loss is linearly related to time and amount of ingoing scrap is linearly related to the heat required to heat and melt the scrap.
The preheating energy was included because the actual preheating is conducted before the start of the EAF process itself. In the steel plant of study, the scrap is preheated with partial off-gas from the EAF operation. Hydrocarbons such as grease and oil, and moisture entrained in the scrap, are burned off. Due to the varying amounts of moisture and hydrocarbons in the scrap, the preheating energy as an ingoing variable is therefore motivated.
EE is well defined in the transformer system and is subject to a negligible error. This means that the logged EE value can be trusted and is therefore also taken as the true value when training the statistical model.
The selected variables are shown in Table 3.
Table 3. The variables used in the models.

Delays (min): The sum of all delays, defined as all deviations from the nominal time of each sub-process.
Correlations between the variables in the 4 categories are also expected. For example, the EE demand should correlate with the Total Weight. Another example is the Material Types and the chemical compositions, as these material categories are partly based on chemical analyses.
The reason correlation is frequently referred to is that it is one of the main underlying factors connecting the input variables to the output variable in the trained statistical model. Furthermore, input variables that are correlated with each other "soak up" parts of the "correlation potential" with respect to the output variable. This is especially the case when the correlated input variables are also correlated with the output variable. A simple example is the correlative relationship between input variables A and B and an output variable C.
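This "soaking up" effect can be sketched numerically. In the illustrative example below (synthetic data, not plant data), B is almost a copy of A, so adding B to a model that already contains A barely changes the explained variance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 2000
A = rng.normal(size=n)
B = A + rng.normal(scale=0.1, size=n)     # B is strongly correlated with A
C = 2.0 * A + rng.normal(scale=0.5, size=n)

# Model using only A
r2_a = LinearRegression().fit(A.reshape(-1, 1), C).score(A.reshape(-1, 1), C)

# Model using both A and the redundant B
AB = np.column_stack([A, B])
r2_ab = LinearRegression().fit(AB, C).score(AB, C)
print(round(r2_a, 3), round(r2_ab, 3))
```

The gain from adding B is negligible even though B, viewed in isolation, correlates strongly with C; its correlation potential is already absorbed by A.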

Variable Batches
The variables in each variable batch in Table 4 are explicitly shown in Table 5.
The motivation behind using a setup of variable batches is to investigate the true impact of each variable type. The impact of a variable on the model outcome can be diminished if that variable is partly represented by another, i.e., correlated, variable (see Section 2.4). One clear example is the percentage of Cr and a Material Type containing high amounts of Cr. Likewise, the total process time is partly represented by the melting time because the melting time adds to the total process time.
The decision to divide the variables into the specific variable batches shown in Table 4 was two-fold. First, there are multiple variables that represent the same effect of a physical phenomenon on the EE consumption, as motivated previously. Second, the number of models to create given the number of combinations of individual variables is very large (2^35 ≈ 3.4 · 10^10), and it is not possible to model all variants within a reasonable time frame. Hence, bundling together variables that are of the same type, for example the Material Types, reduces the number of unique variable batch combinations to 64 (2^6).
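The reduction from individual variables to variable batches can be sketched with a simple enumeration; the group names below are illustrative placeholders, not the exact batch definitions of Table 4.

```python
from itertools import product

# Hypothetical variable groups; the base group is always included
groups = ["Time", "Chemical", "MaterialTypes", "Weights", "Oxides", "Energy"]

# Every on/off combination of the six groups gives one variable batch
batches = [
    ["Base"] + [g for g, used in zip(groups, mask) if used]
    for mask in product([False, True], repeat=len(groups))
]
print(len(batches))  # 2^6 = 64 combinations
```

Enumerating 64 batch combinations is tractable for a grid-search, whereas enumerating all 2^35 single-variable subsets is not.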
The base variables were included in all variable batches because they are believed to be strongly related to the EE consumption. The specific variables in the base variable group are shown in Table 5.

Selection of Test Data
The test data is commonly selected as a random sample of the complete data set. However, in that case the training data and the test data become chronologically intertwined. This shortcoming has been discussed in previous research [43]. From a practical process perspective, a statistical model will predict on data from heats that are produced after the training of the model, which means that these heats will always be from a future point in time. To account for this in the model evaluation, the test data should be selected in chronological order with respect to the training data. In the experiments, the test data will be from heats produced within 30 days after the last heat in the training data.
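A chronological split of this kind might be sketched as follows; the column name heat_start and the pandas-based layout are assumptions for illustration.

```python
import pandas as pd

def chronological_split(df, date_col="heat_start", test_days=30):
    """The last `test_days` of heats form the test set;
    everything earlier forms the training set."""
    df = df.sort_values(date_col)
    cutoff = df[date_col].max() - pd.Timedelta(days=test_days)
    train = df[df[date_col] <= cutoff]
    test = df[df[date_col] > cutoff]
    return train, test
```

Unlike a random split, every heat in the test set is guaranteed to lie after every heat in the training set, mimicking how the model would actually be used in production.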

Purpose
The purpose of data treatment is to clean, and possibly repair, erroneous data or remove data that is not part of regular production. Data treatment is a double-edged sword. On one hand, the aim of any given model is to predict well on as many heats as possible. On the other hand, including unrepresentative data makes it harder to optimize the model enough to make it practically useful.
There are two different categories of data treatment. One is domain-specific, where the values in the data are assessed against what is physically possible within the application of the model. For example, if the EAF has a maximum capacity of 80 t, then any value above that should be viewed as erroneous. The second category is statistical outlier detection, for example, removing all values that are more than 3σ from the mean of a normally distributed variable. However, it is important to consider that statistical outlier detection methods do not take the application domain into account. These methods must be used with caution and carefully assessed on a case-by-case basis.
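The two categories can be sketched side by side on synthetic data; the column name and the numerical values are assumptions, with the domain threshold borrowed from the 60 MWh limit used later in this paper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"EE_kWh": rng.normal(25000, 2000, size=1000)})

# Statistical rule: drop heats more than 3 sigma from the mean
# (implicitly assumes the variable is normally distributed)
mu, sigma = df["EE_kWh"].mean(), df["EE_kWh"].std()
stat_ok = df[(df["EE_kWh"] - mu).abs() <= 3 * sigma]

# Domain rule: the process engineers consider EE above 60 MWh abnormal
domain_ok = df[df["EE_kWh"] <= 60000]
print(len(stat_ok), len(domain_ok))
```

The two rules can disagree: the 3σ rule removes a fixed fraction of any distribution regardless of whether the values are physically plausible, which is exactly why domain knowledge must back up statistical filtering.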
The data used in the modeling was aggregated from two disparate data flows. The first data flow consists of raw material compositions, raw material types, total charge weight, metal and oxide bearing raw material weights, energy data, and added propane through burners and oxygen gas through lance. The second data flow consists of process time, times for different sub-processes during the EAF process, and delay times.
All data from the two data flows was gathered after 31 March 2016 and before 18 October 2018.

Domain-Specific Methods
Data flow 1: 12.1% of the raw material entries lacked oxide compositions. These entries were repaired using previously recorded oxide compositions for the same material entries. This is not expected to dramatically affect the performance of the models. An average of 176 kg of added material per heat, amounting to approximately 0.2% of the total weight, had not-available (NA) values for Rawtype, oxide composition, and metal composition. These raw material types were categorized into a variable called Type N. The total number of heats removed due to a raw material input with zero or negative weight was 59. The resulting number of heats at the end of this flow was 12,805.
Data flow 2: Only heats missing the logged events "heat started" and "heat ended" were removed. This was to safeguard against any complications such an error would give rise to in the rest of the entries. No other data treatment was performed at this stage. A total of 12,649 heats were gathered from this data flow.
Aggregated data: Aggregating the two data flows resulted in a total of 12,587 heats. The resulting number of heats is lower than in either data flow due to a mismatching overlap of heats. After the aggregation of the two data flows, the domain-specific data treatment rules shown in Table 6 were applied.
Table 6. The domain-specific cleaning rules that were applied to the data. The rules were applied to the training data and test data separately to simulate an applied data filtering system. Rules applied to the training data must be applied to the test data to safeguard against data points outside the scope of the statistical model.

Filter: Removal of all trial heats. Motivation: Trial heats are not part of regular production since their aim is primarily to investigate the properties of new scrap types.
Filter: Heats with EE above 60 MWh. Motivation: Identified as abnormal EE consumption by the process engineers.
Filter: Heats with total charged weight above 110,000 kg. Motivation: Physically impossible weight due to furnace size limitations.
Filter: 45 min < TTT < 180 min. Motivation: A TTT below 45 min is considered unusually short, and above 180 min is likely due to a longer delay in the process or a scheduled stop. Usually, the TTT is aimed at 60-70 min.
Filter: Delays < 180 min. Motivation: Heats with delays over 3 h are caused by longer stops due to, for example, broken equipment.
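Applied programmatically, rules of the Table 6 kind might look like the following sketch; all column names are assumptions introduced for illustration.

```python
import pandas as pd

def apply_domain_filters(df):
    """Apply Table 6 style cleaning rules (column names are assumptions)."""
    mask = (
        (~df["trial_heat"])                     # remove all trial heats
        & (df["EE_MWh"] <= 60)                  # abnormal EE above 60 MWh
        & (df["total_weight_kg"] <= 110_000)    # physically impossible weight
        & (df["TTT_min"] > 45)                  # unusually short heats
        & (df["TTT_min"] < 180)                 # long delays or scheduled stops
        & (df["delays_min"] < 180)              # stops due to e.g. broken equipment
    )
    return df[mask]
```

Keeping the rules in one function makes it easy to apply them identically to the training data and the test data, as the caption of Table 6 requires.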

Statistical Methods
Statistical data cleaning methods were not applied to the data set in this study. There are three reasons for this:
1. A more robust outlier detection algorithm was used in an attempt to remove outliers [47]. However, the resulting data set became too small after applying the algorithm to just a few of the variables.
2. It makes sense not to apply statistical outlier detection methods to some variables. For example, the wt% content of the charged material can vary tremendously depending on which scrap types are available. Since multiple steel grades are produced using different charging strategies, the wt% of the elements will also vary. Some stainless steel types need more wt% Cr and wt% Ni than others.
3. Most conventional outlier detection methods assume that the data is normally distributed, or normally distributed with various levels of skewness and kurtosis. This is not always the case in EAF production data, see Figure 4.

Artificial Neural Networks (ANN)
ANN, as mentioned earlier, is one type of non-linear statistical model framework. The idea is to connect the input variables to the output variable using a network of nodes that are fully connected [48]. The first layer is the input layer and the last layer is the output layer. The intermediate layers are called hidden layers. See Figure 5 for an illustration of a simple ANN.

Figure 5. An Artificial Neural Network (ANN) with two input nodes, one hidden layer with three nodes, and one output node. Each line drawn between the nodes illustrates the forward flow of calculations in the network.
The number of hidden layers and the number of nodes in each hidden layer determine the complexity of the model. In each of the hidden layers, and in the output layer, each node multiplies a weight with each of the values from the nodes in the previous layer, and the resulting values are summed together. This process can be mathematically expressed as

s_j = ∑_{i=1}^{P} w_{ij} x_i,

where P is the number of nodes in the previous layer, x_i is the value from the ith node in that layer, w_{ij} is the corresponding weight, and j denotes the jth node in the current layer. A function, known as the activation function, is then applied to s_j, resulting in the value that the current node sends forward in the network. The hyperbolic tangent (tanh) and the logistic sigmoid functions are two commonly used activation functions.
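The weighted-sum-plus-activation step can be sketched as a NumPy forward pass matching the small network of Figure 5. The weights below are random and purely illustrative; note also that a regression output layer would typically omit the final activation, which is kept here only to keep the sketch short.

```python
import numpy as np

def forward(x, weights, biases):
    """Forward pass of a fully connected network with tanh activations."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        # each node j sums its weighted inputs: s_j = sum_i W[i, j] * a[i] + b[j]
        a = np.tanh(a @ W + b)
    return a

rng = np.random.default_rng(3)
# 2 inputs -> 3 hidden nodes -> 1 output, as in Figure 5
weights = [rng.normal(size=(2, 3)), rng.normal(size=(3, 1))]
biases = [rng.normal(size=3), rng.normal(size=1)]
out = forward([0.5, -1.0], weights, biases)
print(out.shape)
```

Each layer is a single matrix multiplication, which is why the model complexity grows directly with the number of layers and nodes.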
During the training phase, the training data is sequentially fed into the network, upon which the network weights are updated in the direction that minimizes the loss function. The loss function can be, for example, the mean squared error (MSE) or R-square (R^2). As the output value is a function of all the weights in the network, it is also possible to express the loss function as a function of the values of those weights. The updating of the weights is also known as backpropagation, because the errors are propagated back through the network as the weights are updated. Since the loss space is a function of all weights in the network, finding a good enough local minimum requires a well-engineered algorithm. These types of algorithms are known as gradient-descent algorithms because their aim is to descend to the most optimal local minimum in the loss space [48].
The sequential feeding of training data, and the updating of the weights, is repeated for numerous iterations until the accuracy improvement vanishes. A validation set, a subsample of the training data, is also used during the training phase to ensure that the model does not overfit on the training data.
Overfitting means that the model has learned the training data so well that it fails to predict well on unseen data. This is one of the drawbacks of neural network models, owing to their ability to learn complex relations even when those relations are of no value to solving the prediction problem. The validation set is the data set that the model is benchmarked against during the training phase to account for this drawback. After the training phase is completed, the test data is predicted by the model and the model performance metrics are calculated.

Model Performance Metrics
To compare the performance of the models, R² and the regular error metric will be used. One should use the adjusted-R² formula when comparing R² between models with different numbers of input variables, because each added predictor slightly increases the R²-value for a fixed number of data points [49]. The adjusted-R², denoted R̄², can be written as

$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$

where R² is the standard R-square, n is the number of data points, and p is the number of input variables. The regular error metric is advisable over the absolute error metric, since an overestimated prediction is vastly different from an underestimated prediction in a practical steel plant context. The mean regular error is written as

$E_\mu = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)$

where y_i is the true value, ŷ_i is the predicted value, and i ∈ {1, 2, . . . , n}. The standard deviation of the error, E_σ, is derived from the same error terms in the ordinary way.
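The two metrics can be computed directly from their definitions; a minimal sketch:

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R-square for n data points and p input variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def regular_error(y_true, y_pred):
    """Signed (regular) error: a negative mean implies the model
    overestimates on average."""
    e = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return e.mean(), e.std()     # E_mu and E_sigma
```

Note that the signed errors can cancel out in `E_mu`, which is exactly why a systematic over- or underestimation shows up here while it would be hidden by an absolute error metric.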

Hyperparameter Optimization
Hyperparameter optimization aims to find the combination of parameters that creates the best model with respect to minimizing the model error [50]. A common method is to use a pre-specified grid of these parameters, where the framework trains one, or more, models for each combination of the parameters. This specific method is also known as grid-search. For the experiments in this paper, the parameters include model-specific parameters as well as domain-specific parameters, which together total 36,864 parameter combinations. See Table 7. Table 7. The parameters used in the grid-search, of which 2 · 3 · 48 = 288 are model-specific and 64 · 2 = 128 are domain-specific. A total of 288 · 128 = 36,864 parameter combinations.

• Activation function (2 values). Activation functions have different gradient intensity in the updating step of each iteration and different upper and lower bounds. This influences the training phase.
• Learning rate (3 values: 0.001, 0.01, …). A larger learning rate means that the gradient-descent algorithm takes a larger step in error space and may therefore miss more optimal local minimum points. Smaller steps, i.e., smaller learning rates, are preferable for finding more optimal local minimum points.
• Hidden node topology (48 values). An increasing number of layers and nodes means increased model complexity. Here, the goal is to investigate whether or not more hidden layers lead to a better model. One should always strive for model simplicity when two models are statistically equally good.
• Input variables (64 values; see Table 4). It is not possible to know, a priori, which combination of input variables is the most relevant for a statistical model to predict the output variable with high precision and accuracy. See Section 2.4 for reasoning.
• Validation set (2 values: ordered, randomized). The effect of a randomized validation set and an ordered validation set with respect to the rest of the training data will be investigated.
Parameters which were the same for all model combinations are explained below:
• Validation fraction: 0.2, which specifies the fraction of the training data used as the validation set.
• Gradient-descent algorithm: an adaptive learning rate optimization algorithm known as Adam. This algorithm has been shown to outperform other gradient-descent algorithms on a variety of models and datasets. The algorithm-specific parameter values were selected as recommended in the paper [51].
Each combination of parameters represents one trained model type. Furthermore, each model type was instanced 10 times to investigate the stability of the parameter selection and to reduce the impact of randomness. The metrics used to evaluate the model types based on the 10 instances are explained in Table 8. To determine stability, the idea was to keep R̄²_min and R̄²_max as close to R̄²_µ as possible. Hence, to filter out the best model type of each variable batch, the following algorithm was used.
1. Filter out any model that does not pass the condition R̄²_max − R̄²_min ≤ 0.05. The motivation for this specific condition was to ensure that only stable models were selected while still enabling at least one model type from each variable batch to pass.
2. Sort the models on decreasing R̄²_µ.
3. Pick the first model in the list.
The reason adjusted R-square was used for determining stability is that the metric indicates the goodness of fit of the model. The mean model error and the standard deviation of the model error were included to determine the best model among models that have close to equal R-square.
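The selection procedure can be sketched as follows. The dictionary layout and numbers are made up for illustration; each entry summarizes the 10 instances of one model type:

```python
# One entry per model type, summarizing its 10 instances.
model_types = [
    {"name": "A", "r2_min": 0.70, "r2_max": 0.78, "r2_mu": 0.74},  # unstable
    {"name": "B", "r2_min": 0.69, "r2_max": 0.73, "r2_mu": 0.71},
    {"name": "C", "r2_min": 0.68, "r2_max": 0.71, "r2_mu": 0.70},
]

# Step 1: keep only stable model types (max-min spread of at most 0.05).
stable = [m for m in model_types if m["r2_max"] - m["r2_min"] <= 0.05]

# Steps 2-3: sort on decreasing mean adjusted R2 and pick the first.
best = max(stable, key=lambda m: m["r2_mu"])
```

Here model A has the highest mean adjusted R² but is rejected as unstable, so model B is selected.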

Algorithmic Approach
In the algorithmic approach, process knowledge will not be the basis for choosing variable batches. Instead, the variables in each batch will be chosen based on their one-to-one correlation with the EE consumption. In this manner, the approach is mostly algorithmic except for the chosen setup of available variables, which is the same as in the domain-specific approach. The proposed algorithm is described below:
1. Calculate the pair-wise correlation values between the input variables and the output variable (EE consumption).
2. Use all input variables in the first model.
3. For each subsequent model, remove the input variable with the lowest correlation value with respect to the output variable.
Save the performance metric results from all models for further analysis.
In the experiments, dCor will be used as the correlation metric due to its ability to detect non-linear correlative relations between variables. dCor is explained in Section 3.4.5.
The rest of the model-specific and domain-specific parameters will be the same as in the domain-specific approach. This results in a total of 35 · 2 = 70 domain-specific parameter combinations and a total of 2 · 3 · 48 = 288 model-specific parameter combinations. Using the algorithmic approach, a total of 70 · 288 = 20,160 model types were trained. To filter out the best model for each variable batch, 10 instances of each parameter combination were created and the same algorithm as in the domain-specific approach was used.
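The backward-elimination loop above can be sketched as below. For a dependency-free illustration, the absolute Pearson correlation stands in for dCor here; the actual experiments use dCor (Section 3.4.5):

```python
import numpy as np

def variable_batches(X, y, names):
    """Yield variable batches: all variables first, then repeatedly
    drop the variable least correlated with the output y."""
    remaining = list(names)
    while remaining:
        yield list(remaining)
        cors = {n: abs(np.corrcoef(X[n], y)[0, 1]) for n in remaining}
        remaining.remove(min(cors, key=cors.get))

# Toy data: one informative variable and one pure-noise variable.
rng = np.random.default_rng(1)
y = rng.normal(size=200)
X = {"strong": y + 0.1 * rng.normal(size=200),
     "noise": rng.normal(size=200)}
batches = list(variable_batches(X, y, ["strong", "noise"]))
# The noise variable is eliminated first.
```

Each yielded batch corresponds to one model in the algorithmic approach, trained and evaluated with the same grid-search machinery as the domain-specific batches.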

Model Analysis
To ensure the reliability of a model, its transparency must be highlighted. To achieve this, three different statistical methods will be used. Two of these methods, the KS test and dCor, are model independent and point to reasons behind the model performance deviation between the training and test data. The third method, FI, is model dependent and explains how important each input variable is from the model's viewpoint. By using FI in combination with dCor and KS tests, the goal is to home in on the variables that are the strongest reason behind the model performance deviation between the training and test data.
dCor: Correlation metrics are used to investigate whether a variable is correlated with another variable. Correlated variables may or may not be causative. Nevertheless, two correlated variables may provide a hint to a possible causative relationship. Using domain-specific knowledge it is possible to assess the relative strength of the causative relation based on the correlation value, if there is evidence of such from physical deliberations.
The commonly used Pearson correlation metric only detects linear and monotonic relationships between two variables [52]. This severely limits the relevance of the metric within the scope of this study, since it is well known that the parameters governing the EAF process are non-linear with respect to the EE consumption (see Sections 2.2 and 2.3).
Alternative correlation metrics, such as dCor, can detect both linear and non-linear relationships between variables [53]. The resulting mathematical expression for dCor is similar to the Pearson correlation coefficient:

$\mathrm{dCor}(V_1, V_2) = \frac{\mathrm{dCov}(V_1, V_2)}{\sqrt{\mathrm{dVar}(V_1)\,\mathrm{dVar}(V_2)}}$

where dCov(V₁, V₂) is the distance covariance and dVar(V₁), dVar(V₂) are the distance variances of the random variables V₁ and V₂, respectively. The square roots of the latter are the distance standard deviations of V₁ and V₂.
To calculate dCor, the first task is to calculate the n × n distance matrices of each random variable, V₁ and V₂:

$a_{jk} = \|v_{1,j} - v_{1,k}\|, \quad b_{jk} = \|v_{2,j} - v_{2,k}\|$

where j, k = 1, 2, ..., n and ‖ · ‖ is the Euclidean norm. Using the distance matrices, the doubly centered distances are calculated as

$A_{jk} = a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}$

where ā_j· is the row mean, ā_·k is the column mean, and ā·· is the grand mean of the distance matrix for random variable V₁; B_jk is defined analogously for random variable V₂. The distance covariance is then calculated as the following arithmetic average:

$\mathrm{dCov}^2(V_1, V_2) = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} A_{jk} B_{jk}$

Analogously, the distance variances for V₁ and V₂ are

$\mathrm{dVar}^2(V_1) = \mathrm{dCov}^2(V_1, V_1), \quad \mathrm{dVar}^2(V_2) = \mathrm{dCov}^2(V_2, V_2)$

The dCor metric assumes values between 0 and 1, where 0 implies that the variables are independent and 1 implies that the variables are equal.
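For one-dimensional samples, the equations above translate directly into code; a minimal numpy sketch of the sample statistic:

```python
import numpy as np

def dcor(v1, v2):
    """Sample distance correlation between two 1-D samples."""
    v1 = np.asarray(v1, float)
    v2 = np.asarray(v2, float)
    a = np.abs(v1[:, None] - v1[None, :])    # n x n distance matrices
    b = np.abs(v2[:, None] - v2[None, :])
    # Doubly centered distances: subtract row and column means, add grand mean.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                   # squared distance covariance
    dvar2_1 = (A * A).mean()                 # squared distance variances
    dvar2_2 = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar2_1 * dvar2_2))
```

Unlike Pearson, this detects the purely non-linear relation y = x² on a symmetric interval, for which the Pearson coefficient is approximately zero.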
It has also been theoretically proven that dCor asymptotically detects deviations from independence. This implies that dCor is, at least in theory, a solid tool for detecting any relationship between two variables given that enough data is provided. Some alternative correlation metrics with similar characteristics to dCor are Hoeffding's D measure, the HHG (Heller, Heller and Gorfine) measure, and MI (Mutual Information) [52]. The choice of a particular correlation metric depends largely on the number of observations, the type of relationship one intends to identify, and the "nature" of the variables. This implicitly means that no single correlation metric is the "holy grail" of determining variable dependence [52]. It also means that a correlation metric may give low values if the pattern one intends to identify is challenging for the metric to detect. However, since the current study is limited to the statistical modeling of the EE consumption in the EAF, dCor is chosen as the preferred metric due to its ability to detect both linear and non-linear relationships. Furthermore, for large numbers of observations, dCor has proven to be reliable for many different types of dependencies [52].
dCor will be calculated for each input variable to the output variable for both the training and the test data. A change in dCor between the training and test data indicates that the relation between the input variable and the output variable has changed. The relationship between the input variables and the output variable is what a statistical model learns using the training data. Thus, the change in dCor between the training and test data highlights the variable as a reason behind the model performance deviation.
In the analysis, input variable pairs with dCor values at or above 0.1 will be referred to as having relatively high correlation values. The reasoning behind this limit is two-fold. First, most of the dCor values between the input variable pairs were shown to be lower than 0.1, while most of the expected correlations between the input variable pairs were higher than 0.1. Second, to the authors' knowledge, there is no clear guidance in the literature on what is considered a low or high dCor value.
KS tests: The KS-value is the maximum distance between two cumulative distribution functions (CDFs). The two CDFs are either from two samples of the same variable, or one is from a sample and the other is an ideal CDF, such as the CDF of the normal distribution. Furthermore, the KS test is a non-parametric statistical test, which means that it does not make any presumptions about the distributions governing the samples [54]. Variables from the EAF production are often from varying classes and superpositions of distributions, see Figure 4. The KS test is conducted by calculating the confidence, the p-value, of the KS-value under the null hypothesis, H₀, that the two samples are from the same distribution. The KS-value takes values between 0 and 1, where 0 indicates that the distributions are identical and 1 indicates that the distributions are completely different. The p-value is the probability of observing a KS-value at least as large as the one measured if the two samples were in fact drawn from the same distribution. Hence, a high KS-value with a low p-value strongly indicates that the two distributions underlying the two samples are different.
In the experiments, the two-sample KS test is of particular interest because it enables the study of the difference between two sampled distributions of the same variable. The two-sample KS statistic is

$D_{n_1, n_2} = \sup_x |F_{n_1}(x) - G_{n_2}(x)|$

where F_{n₁} and G_{n₂} are the two empirical distribution functions, n₁ and n₂ are the number of samples from the two distributions, respectively, x spans the total sample space, and sup is the supremum. The two-sample KS test is illustrated in Figure 6. The null hypothesis, which is that the samples come from the same distribution, is rejected if

$D_{n_1, n_2} > c(\alpha)\sqrt{\frac{n_1 + n_2}{n_1 n_2}}$

where α is the significance level and c(α) is the threshold value calculated using α and the cumulative KS distribution [55].
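The statistic itself is straightforward to compute from the two empirical CDFs; a minimal sketch:

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample KS-value: the supremum of |F_n1 - G_n2| taken
    over the pooled sample points."""
    x, y = np.sort(x), np.sort(y)
    pooled = np.concatenate([x, y])
    F = np.searchsorted(x, pooled, side="right") / len(x)  # empirical CDFs
    G = np.searchsorted(y, pooled, side="right") / len(y)
    return np.max(np.abs(F - G))
```

In practice, `scipy.stats.ks_2samp` computes both the KS-value and the associated p-value in one call.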
One drawback of the KS test is that it is sensitive to any difference between the two distributions. A large, but local, difference between the distributions will be captured by the KS test but may not necessarily be representative over the complete distribution space.
The two-sample KS test will be used to investigate the distribution difference between the training data and the test data for both the input variables and the output variable. Since the neural network weights are adapted to the training data, any significant difference in distribution between the training and the test data for any variable (with a larger effect for the most important variables) will affect the performance of the model on previously unseen data (test data). The aim is to pinpoint variables that are a probable cause of the performance difference rather than to prove specific performance changes caused by each variable. The latter approach is difficult, if at all doable, partly due to the complex interrelations between the variables in the data.
KS-values at or above 0.2 with p-values at or below 0.05 will be considered variables of interest in the analysis. The main reason is to avoid capturing all variables while still capturing variables that change significantly enough between the training and test data. The implications of this limit on the KS test will be discussed. Distribution plots for variables above the KS-value threshold can further help to identify whether the difference between the training and test data is significant enough.
FI: It is important to investigate how much each input variable affects the output variable. A useful way to do this is to use interpretable machine learning algorithms. One such algorithm is FI, which ranks each input feature, e.g., variable, in order of importance with respect to all predictions. Hence, FI is known as a global interpretable machine learning method. Each variable is permuted ("broken") one at a time, and the recorded error from the model on this new data set is compared with the error from the original, unpermuted, data set. If the error is higher, then the input variable is of some predictive importance with respect to the output variable. If the error is statistically indifferent, then one can conclude that the input variable is of little importance to the model [56]. The algorithm is described as follows:
1. Train a model and record its error L(X).
2. Permute one of the input variables, x_j, in the input matrix X.
3. Apply the permuted input matrix, X_j, to the trained model and record the error, L(X_j).
4. Repeat steps 2 and 3 for all input variables j ∈ 1, 2, ..., m.
5. Order all variables in order of decreasing L(X_j).
6. Normalize all L(X_j) by max(L(X_j)) (optional).
where L is the model loss function, X is the complete and non-permuted input data matrix, and m is the total number of variables.
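The six steps translate into a short routine; a sketch with MSE as the loss L and a stand-in predict function (any trained model would do):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each column of X is permuted,
    averaged over n_repeats permutations per variable."""
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)          # step 1: L(X)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                    # step 4: all variables
        losses = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # step 2: permute x_j
            losses.append(np.mean((y - predict(Xp)) ** 2))  # step 3: L(X_j)
        imp[j] = np.mean(losses) - base
    return imp                                     # steps 5-6 left to the caller

# Toy check: the output depends on column 0 only.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0]
imp = permutation_importance(lambda X: 3.0 * X[:, 0], X, y)
```

Permuting the irrelevant column leaves the predictions unchanged, so its importance is zero, while permuting the informative column inflates the loss substantially.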
FI will be applied to both the training data and the test data. FI on training data tells us how much the model relies on each input variable for making predictions on new data. On the other hand, FI on test data reveals how important each input variable is to the actual model performance on unseen data [57]. Hence, both approaches can be viewed as complementary tools for model transparency.
If the training and test data for a given variable come from the same distribution, then the FI value will be the same given that the number of data points is large enough. A larger deviation in FI indicates that the underlying relations between the input variables and the output variable have changed. FI will be calculated 20 times for each variable and data set to account for randomness in the permutations.
The presented values will be the mean FI-values for each variable.

Modeling
See Table 9 for the domain-specific cleaning results. Only 10% of the original data was removed after all cleaning steps. Table 9. The remaining data from the domain-specific cleaning rules that were applied to the data. The training and test data were selected as described in Section 3.2.3.

In a practical application context, only the best model would be selected. However, some interesting observations were made when analyzing the results from both the domain approach and the algorithmic approach. Therefore, a total of 6 models were selected for further analysis. D1 was selected because it uses all input variables and D64 was selected because it uses only the base variables in the domain approach. D15, D17, and D31 were selected because they are the top 3 best-performing models on the test data from the domain approach. A21 was selected because it is the best-performing model from the algorithmic approach. Hence, we refer to D15, D17, D31, and A21 as the best models. The true EE consumption is plotted against the predicted EE consumption by models D15 and A21 in Figure 7. The performance of the selected models is presented in Table 10. The performance of the best model from each variable batch is presented in Figures A1 and A2 for the domain approach and the algorithmic approach, respectively. A significant negative mean error on the test data compared to the training data can be observed for all selected models. This is also consistent for a majority of the best models from the variable batches, as can be seen in Figures A1 and A2. The minimum and maximum errors are in the ranges of −3.7 MWh to −4.8 MWh and 5.4 MWh to 6.6 MWh on the test data for the best models.
A comparison of the performance of the best model with the ANN models reported in the literature is shown in Table 11. The standard deviation of error is better than for all reported models evaluated on test data, while the mean error is worse. The minimum and maximum errors are comparable. However, the number of variables is significantly lower for the best model from the experiments done in this paper: 20 compared to 82 to 100 in the literature. Furthermore, the results from the models in the present paper are averages over 10 iterations. No such approach to combat model instability was taken for the models in the literature. In addition, one has to keep in mind that the reported ANN models evaluated on test data in the literature lack R² for all models and the percentage of cleaned data for all but one model [43]. The best model from the algorithmic approach is almost as good as the best model from the domain approach, with a difference in R̄²_µ of only 0.025. The model with the least number of input variables, D64, has an R̄²_µ that is only 0.001 higher than the R̄²_µ for the model with the highest number of input variables, D1.
The performance on the test data is better than on the training data for D15, D31, D64, and A21, significantly so for D31 and D64, where differences in R̄²_µ of 0.062 and 0.096 are observed. This is somewhat unexpected, since a model usually performs worse on test data.

Model Analysis
Assuming a threshold p-value of 0.05, all variables except Delays, Type B, Type C, and Type F are of particular interest for further analysis of the KS-values; see Table 12 and Figure 8.
The dCor values between the input variables and the EE consumption are shown in Table 12. Changes in dCor above 0.1 between the training and test data are observed for Delays, TTT, Total Weight, Propane, Process Time, Extended refining, C, Si, O, Metal Weight, and the Type C, Type E, and Type N raw materials.
Expected correlations between variables, as discussed in Section 2.4, can be observed in Table A4. dCor values among the time variables are higher than dCor values between time variables and other variable types. The intra-correlations for the metal compositions, oxide compositions, and raw material types are expected, since the sum of the compositions equals one and the sum of the raw material types is bound by the total weight. The inter-correlations between the metal compositions, oxide compositions, and raw material types indicate the complex relationship between these variables. Furthermore, the differences in dCor between the training and test data for all variables are explicitly shown in Table A3. While some larger changes in dCor can be observed for the time variables, the variables with the largest number of notable dCor changes are the weight, metal composition, oxide composition, and raw material type variables. The FI results are presented in Table 13. Delays, TTT, and Total Weight are the three most important variables for all models, except for model D15, with respect to the training data. Some significant changes in FI between the training and test data sets are also observed; these are marked with bold and underlined numbers in Table 13. Many of the involved variables are correlated with one another (see Tables A4 and A5); analyzing their combined impact on the EE consumption is too extensive for this article. The EE consumption has increased in the test data set, and so have the TTT, Process Time, Total Weight, and Metal Weight. The delay is about equal on average between the training and test data sets.

Grid-Search Metadata
The parameters from the models passing the R̄²_max − R̄²_min ≤ 0.05 criterion (see Section 3.4.3) and the parameters from the best models are summarized in Tables 14 and 15, respectively. Validating on chronologically ordered data during training is not always beneficial, as the ordered validation set is present in only 27% of the best models. The number of hidden layers, which is a proxy for model complexity, is equal to one for 87% of the best models. A similar reasoning can be made for the algorithmic approach, whose metadata is presented in the same tables.

Modeling
The best model from the domain approach has an R̄²_µ of 0.706, while the best model from the algorithmic approach has an R̄²_µ of 0.731. These values were calculated as the average of 10 model instances, and the difference is not large. This means that the algorithmic approach can be used as a tool to select variables that govern a model with top-tier accuracy. However, it is important to consider the domain-specific reasoning behind the initial selection of the 35 available input variables. It is, therefore, not suggested that one blindly select variables stored in an arbitrary EAF database and then perform the algorithmic variable selection explained in this paper. While this could work, it is not the intended use of the algorithmic approach.
Using a large-scale grid-search with over 57,000 model types, covering both a domain-specific approach and an algorithmic approach, the best model type achieved an R̄²_µ of 0.731, a mean error of −554 kWh/heat, a standard deviation of error of 1126 kWh, a minimum error of −3819 kWh/heat, and a maximum error of 5735 kWh/heat. The best model in the literature had a mean error of approximately zero, a standard deviation of error of 1300 kWh/heat, a minimum error of −3500 kWh/heat, and a maximum error of 6000 kWh/heat. The errors from both models are similar. However, the literature model did not report averages over 10 model iterations. Therefore, it is not possible to know whether its performance is stable or a matter of chance. The models in this paper were selected after a filtering criterion, R̄²_max − R̄²_min ≤ 0.05 (see Section 3.4.3). After this filtering, the best models based on the highest R̄²_µ on the test data were selected. In addition, the model from the literature did not report R̄²_µ and was only evaluated on 20 test data points, compared to 362 test data points for the best model produced in this paper. Assuming a production of 20 heats per day, this represents a model evaluation difference of approximately 17 production days. Furthermore, neither the data treatment specifics nor the percentage of cleaned data were specified for that model. Model reporting shortfalls, of which some are mentioned above, occur frequently in the available literature on statistical modeling to predict the EE consumption in the EAF. It is important to clearly describe all steps in the data- and modeling pipeline that impose changes to the end result. In the context of this paper, the end result is the best model. The best model presented in this paper is also less complex than the best reported model in the literature, even though both models perform similarly.
While the number of hidden layers is equal between both models, the model reported in the literature has 100 input variables compared to 20 input variables for the best model produced in this paper. In the interest of model parsimony, simplicity should be pursued when creating any model. On the same note, the model using 6 of the base variables has an R̄²_µ that is not significantly better (a difference of only 0.001) than the model using all input variables, and the best model uses only 20 variables. Hence, adding a lot of input variables does not necessarily create a model with better performance. One has to search for the optimal combination of variables using either a purely domain-specific approach or an algorithmic approach, such as the one presented in this paper. Tables 14 and 15 further support model simplicity, since models with 2 hidden layers are not preferred over models with 1 hidden layer with respect to model stability (the R̄²_max − R̄²_min ≤ 0.05 criterion, see Section 3.4.3) or with respect to the best model from each variable batch.
3 of the 6 selected models perform better on the test data than on the training data. This is somewhat counterintuitive, since models usually perform worse on test data. However, the selection of the best models from each variable batch was based on performance on the test data. The performance on the training data was not considered, since, in a practical environment, only the performance on test data is important.

Model Analysis
Investigating the dCor matrix for the training data, Table A4, one can observe many relatively high one-to-one correlations among the metal composition, oxide composition, and raw material type variables. This is in line with what is expected from a process-metallurgical perspective. Because of this trait, these variables share information with one another to a large extent. For example, Type D is relatively highly correlated with all metal composition and oxide composition variables. Simplified, Type D has the potential to account for the metal and oxide compositions, which renders those variables less important to the model's performance. This is the reason that adding a lot of input variables does not necessarily create a model with better performance, and the reason that variables that are important in a physical model may be next to worthless in a statistical model. Changes in dCor from the training data to the test data, see Table A3, indicate that the underlying relations between the variables have changed. This occurs less frequently for the time variables than for the metal composition, oxide composition, and raw material type variables, where changes to both lower and higher dCor are present. Due to the changes in dCor, changes in the model performance are also expected. However, analyzing a 36 × 36 correlation matrix in depth is both impractical and does not show which variables are the most important for the performance of the model. Table 12 shows the results from the KS tests performed on each variable for the training and test data sets. KS-values greater than or equal to 0.20 were considered significant changes, given that the p-value is 0.05 or less. In conjunction with FI for the selected models, see Table 13, it is possible to identify reasons behind the performance differences from training to test. Delays, TTT, Process Time, Total Weight, and Metal Weight are some of the most important variables for the 6 selected models.
Out of these variables, Total Weight and Metal Weight have KS-values above 0.20. Process Time, TTT, and Delays had KS-values of 0.18, 0.13, and 0.04, respectively. Furthermore, the KS-value for the EE consumption is 0.25, which indicates that the EE consumption has also changed from the training to the test data. However, KS-values only quantify a difference between the CDFs and do not indicate whether the variable has increased or decreased. Observing the variable distributions for the training and test data in Figure 8, it is clear that all of these variables, except Delays, have increased from the training data to the test data. Since the EE consumption increases with the amount of charged materials and with longer process times, and because these input variables are important for the models, it is highly probable that these variables give rise to the decrease in ∆_µ and the change in R̄²_µ for the 6 selected models.
Please note that the limitation of demarcating "interesting" variables to only those at or above a KS-value of 0.2 becomes clear when observing the clear and consistent increases in Process Time and TTT in Figure 8. Nevertheless, using the KS test in conjunction with FI and the additional distribution plots enabled the discovery of the implications of the changes in these variables on the model performance.
∆_µ is defined as the average error over all predictions across all 10 model instances. Each error is in turn defined as the true value minus the predicted value. Since all 6 selected models show a significantly negative ∆_µ on the test data, the models overestimate the EE consumption due to the changes in the variables with the highest FI, in particular Total Weight, Metal Weight, TTT, and Process Time. This behavior is not unexpected, since any statistical model predicts the value of highest probability with respect to the data it has used to adapt its parameters. In this case, the 10,966 training data points have adapted the ANN weights. Any change to the distribution of the variables with high FI will affect the model's predictions on data sampled from this aggregated distribution.

Conclusions
One of the aims of the present work was to address previously reported shortcomings in the statistical modeling approach to predicting the EE consumption of the EAF [43]. In addition to addressing these shortcomings with a detailed data- and modeling pipeline, the choice of statistical modeling framework and statistical analysis tools was adapted to the non-linearity of the EAF process and the complex correlative relations between the selected process variables. The main conclusions of the modeling approach, and suggested future work, may be summarized as follows:

• In the interest of model parsimony, increasing the number of input variables does not automatically increase model performance. This is observed both when comparing the best model from the experiments to the best model reported in the literature and when comparing model D64, which uses 6 variables, with model D1, which uses 35 variables. It is sufficient to choose several variables from an initial selection demarcated by the use of process knowledge.
• Similar performance was achieved by the best models from the algorithmic approach and the domain approach, respectively. Given a setup of variables selected with process expertise, an algorithmic approach can find an optimal selection of variables. This is important from a practical applicability standpoint, since less time needs to be invested in searching for the optimal number of variables and the optimal subset of variables.

•
Neural networks with many layers, which increase the model complexity, are not necessary to produce state-of-the-art model performance. This was observed both in the grid search performed in the experiments and when comparing the best model from the experiments with the best models reported in the literature.

•
Compared to the best model reported in the literature, the best model from the experiments has a slightly higher mean error but a lower standard deviation of the error. The minimum and maximum errors, −3.8 MWh and 5.7 MWh, respectively, are approximately the same as for the best model reported in the literature, but are quite high from a practical application standpoint.

Model Analysis
•
Using the KS test and FI as complementary tools, it was possible to identify an increase in Total Weight, Metal Weight, and Process Times as a highly probable cause of the performance change from the training to the test data for the 6 selected models.

•
An analysis of the ∆µ error of the selected models on the test data indicates that the models overestimate the EE consumption with regard to the change in the variables that are most important to the models, in particular Total Weight, Metal Weight, TTT, and Process Time.

•
High intra-correlation between the metal composition, oxide composition, and raw material type variables was found using dCor on the training data. This explains why some of these variables receive a high FI in some of the models. The changes in dCor from the training to the test data were also prominent for these variables and can explain part of the performance changes for the selected models that attribute higher FI to some of these metal composition, oxide composition, and raw material type variables.
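Both analysis tools referred to in the bullets above can be sketched in a few lines. The distance-correlation function below follows the standard double-centering formulation; the variable names, data, and distribution shift are illustrative assumptions, not values from the study:

```python
import numpy as np
from scipy.stats import ks_2samp

def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays (double-centering form)."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    a = np.abs(x - x.T)                                   # pairwise distance matrices
    b = np.abs(y - y.T)
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()     # double-centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(1)

# A non-linear dependence: Pearson correlation is near zero here, but dCor is not.
weight = rng.normal(100.0, 10.0, size=1000)               # hypothetical input variable
ee = (weight - 100.0) ** 2 + rng.normal(0.0, 5.0, 1000)   # hypothetical EE proxy
print("dCor:", distance_correlation(weight, ee))

# KS two-sample test: has a variable's distribution drifted between training
# and test heats? A small p-value flags a change, as for Total Weight above.
train = rng.normal(100.0, 10.0, size=800)
test = rng.normal(104.0, 10.0, size=200)                  # shifted mean
res = ks_2samp(train, test)
print("KS statistic:", res.statistic, "p-value:", res.pvalue)
```

Used together as in the analysis above, a high FI combined with a significant KS statistic for the same variable points to that variable's distribution shift as a likely driver of the train-to-test performance change.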

Future Work
Based on the present work carried out and the conclusions drawn, it is suggested to further investigate the following:

•
Use upsampling on the training data to obtain more data points that are outliers. This could potentially mitigate the problem with large maximum and minimum errors.
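A minimal sketch of this suggestion, assuming outlier heats are flagged by a 2σ threshold on the EE consumption (the threshold, target fraction, and data are hypothetical choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 6))                    # hypothetical input variables
y = rng.normal(25_000.0, 2_000.0, size=1000)      # EE consumption [kWh/heat]

# Flag heats whose EE consumption lies in the distribution tails.
mu, sigma = y.mean(), y.std()
outlier_idx = np.flatnonzero(np.abs(y - mu) > 2.0 * sigma)

# Resample the outlier heats with replacement until they make up ~10% of the set.
n_extra = max(0, int(0.1 * len(y)) - len(outlier_idx))
extra = rng.choice(outlier_idx, size=n_extra, replace=True)

X_up = np.vstack([X, X[extra]])
y_up = np.concatenate([y, y[extra]])
print(f"{len(y)} heats -> {len(y_up)} after upsampling")
```

Training the network on the upsampled set shows it the extreme heats more often, which could reduce the large minimum and maximum errors at the cost of some bias toward the tails.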

•
Develop a more advanced variable selection algorithm based on dCor that takes into account one-to-many correlations for each input variable. The current algorithm only considers the correlation between the EE and each input variable.

•
Apply dCor to groups of variables and the EE consumption instead of to variables in a 1-to-1 fashion. This feature of dCor should be explored further.

Funding:
This research was funded by "Hugo Carlssons Stiftelse för vetenskaplig forskning" in the form of a scholarship granted to the corresponding author.

Acknowledgments:
We want to thank the plant engineers Pär Ljungqvist, Christoffer Schmidt, and Jesper Janis at the Outokumpu Stainless Avesta mill for their support and data provisioning during this project.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: