The Steelmaking Process Parameter Optimization with a Surrogate Model Based on Convolutional Neural Networks and the Firefly Algorithm


Introduction
Under the pressure of fierce competition between steel companies, quality improvements in high-strength low-alloy (HSLA) steel products are constantly being pursued. Abnormal variations in upstream process parameters such as alloy composition can cause deviations in mechanical properties and thus lead to unsatisfactory quality and a high rejection rate. To reduce the rejection rate and improve the competitiveness of China Steel Corporation (CSC) products, we developed a dynamic process control system to predict and monitor the mechanical properties of products before they enter downstream production lines. If the predicted mechanical properties deviate too much from the usual level, the system can perform quality compensation by calculating and applying appropriate downstream process parameters, thereby meeting the final quality requirements. Production proceeds through a sequence of upstream and downstream lines, which include the crude making, hot rolling, cold rolling, and cold-rolled coating lines. Each line must meet its quality-level requirements, or else the finished products will not achieve their required qualities. In the crude making process, iron ore is first reduced to iron by mixing it with coal/coke and limestone in a blast furnace (BF); the iron is then converted into steel in a basic oxygen furnace (BOF). Hot rolling is a mill process in which the steel is rolled at a temperature above its recrystallization temperature. When steel is heated past its recrystallization point, it becomes more malleable and can be properly formed and shaped. This also makes it possible to produce larger quantities of steel. The steel is then cooled at room temperature, which "normalizes" it and removes the concern about stresses that arise in the material during quenching or work hardening. Cold rolling allows precise shapes to be created. Since the process is performed at room temperature, the steel does not shrink as it cools, as it does in hot rolling. The cold-rolled coating line applies a plastic coating to protect the steel surface.
Although CSC is a high-quality manufacturer of HSLA steel, even a small number of nonconforming products causes a great financial and reputational loss. Optimizing the steelmaking process parameters is therefore important for improving the quality of HSLA steel. Most manufacturing processes [1,2] require parameterization to achieve their optimal cost, quality, and other properties. The number of process parameters considered usually exceeds 10 or even 100, and current approaches to their optimization require many expensive and complex experiments. Although some physically precise simulation approaches have been developed, such as the finite element method [3] and the Taguchi method [4], these need many hours or even days of computation. Another approach is the use of surrogate models, which can effectively decrease the number of simulations needed for process parameter optimization.
A surrogate model is an easy-to-evaluate approximation of a high-fidelity product model [5][6][7][8][9]; such models have been created by using decision trees, artificial neural networks, radial basis functions, kernel smoothing, and stochastic processes [10]. Surrogate-based optimization searches for the optimal process parameters on an established surrogate model, and it requires an appropriate search mechanism to do so. Genetic algorithms [11] and model-based self-optimization [12] have been used to iteratively improve candidate process parameters.
We further developed a simplified 1D version of the original 2D Visual Geometry Group (VGG) 16 model [26,27] to establish a surrogate model between the process parameters and the product quality attributes, as shown in Figure 1. This 1D VGG model is called the simplified VGG (SVGG) model, and it can be regarded as a process for controlling HSLA product quality attributes through its parameter inputs. To meet the product quality requirements of HSLA, we applied the PSO, ABC, and firefly algorithms to search for the optimal adjustable process parameters and then compared their performance. In our experiments, 9000 samples (meeting the quality requirements) were used to train the five different methods for establishing the surrogate model, and 299 samples (not meeting the quality requirements) were used to evaluate the three bio-inspired process optimization search algorithms. The FA ultimately produced the best selection of adjustable process parameters. Our experimental results further showed that, for all test samples, the adjustable process parameters it selected brought the corresponding product attributes within the mechanical requirements of the product.
The main contributions of this paper are as follows:
1. We addressed the interesting problem of steelmaking process parameter optimization by proposing a simplified VGG model to build a surrogate model and then compared it with four other machine-learning methods.
2. We applied three different algorithms (PSO, ABC, and the FA) to search for optimal process parameters and then evaluated their performance. Our experimental results demonstrated that the FA can achieve high performance and outperforms the other methods.
The remainder of this paper is organized as follows. Section 2 reviews the machine-learning methods and bio-inspired algorithms. Section 3 describes our proposed method, which combines the simplified VGG model with the firefly algorithm to search for optimal process parameters. Section 4 presents our experimental results and discussions. Finally, conclusions and remarks are given in Section 5.

Surrogate Model
A production process can be represented as a function π : X → Y, in which X denotes the process parameters and Y denotes the product attributes. This function maps process parameter configurations x ∈ X to product attributes y ∈ Y, and it can be approximated by a surrogate model ϕ : X → Y fitted to all observations (X_i, Y_i). The construction of a surrogate model needs a fitness function to evaluate how well the model predictions match the observations. In general, we need an optimizer to obtain the optimal linear or nonlinear map. In this study, we use linear regression, random forests, support vector regression, neural networks, and convolutional neural networks to build surrogate models; the quality of the process parameter optimization depends on the accuracy of these models. These five methods are described in the following subsections.
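
Linear Regression
Linear regression models each product attribute as a linear combination of the process parameters, as in Equation (1):
$$Y_j = \sum_{i=1}^{30} \alpha_{i,j}\, f_i + \beta_j, \quad j = 1, 2, 3 \tag{1}$$
where $Y_j$ and $f_i$ are the j-th product attribute and the i-th dimension of the process parameters, respectively. The gradient descent method is used to determine the regression parameters $\alpha_{i,j}$ and $\beta_j$.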
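As an illustration of how gradient descent can determine the regression parameters α and β for one product attribute, the following minimal sketch fits a linear model by batch gradient descent on the mean square error. The function name, learning rate, epoch count, and toy data are assumptions made for this example and are not taken from the paper.

```python
def fit_linear_gd(X, y, lr=0.1, epochs=2000):
    """Fit y ~ sum_i alpha[i] * x[i] + beta by batch gradient descent on the MSE."""
    n, d = len(X), len(X[0])
    alpha, beta = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_a, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error of the current linear model on one sample.
            err = sum(a * f for a, f in zip(alpha, xi)) + beta - yi
            for k in range(d):
                grad_a[k] += 2.0 * err * xi[k] / n
            grad_b += 2.0 * err / n
        alpha = [a - lr * g for a, g in zip(alpha, grad_a)]
        beta -= lr * grad_b
    return alpha, beta


# Hypothetical toy data generated from y = 2*f1 - 1*f2 + 0.5 (not CSC data).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.25]]
y = [2.0 * f1 - 1.0 * f2 + 0.5 for f1, f2 in X]
alpha, beta = fit_linear_gd(X, y)
print(alpha, beta)
```

With three product attributes, the same fit would simply be repeated once per attribute.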

Random Forests
Random forests (RFs) are an ensemble learning method for classification or regression that constructs a multitude of decision trees at training time and outputs the mode (for classification) or the mean (for regression) of the predictions of the individual trees. In general, RFs outperform single decision trees because averaging many trees reduces overfitting. In the implementation of an RF algorithm, new training sets are drawn by randomly sampling the original training set with replacement. Each draw splits the data into an in-bag set, containing roughly two-thirds of the distinct samples, and an out-of-bag set made up of the remaining one-third. The out-of-bag samples serve as test samples, while each in-bag set is used to independently induce its own decision tree. The regression result $R(x_i)$ of each test sample $x_i$ is then calculated by Equation (2):
$$R(x_i) = \frac{1}{M}\sum_{m=1}^{M} y_m(x_i) \tag{2}$$
where M is the number of decision trees, and $y_m(x_i)$ is the decision value of the m-th decision tree for test sample $x_i$.
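The two ingredients described above, bootstrap sampling and averaging over trees, can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and the "trees" are stand-in functions rather than induced decision trees.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def forest_predict(trees, x):
    """Average the decision values of the M trees for one sample x."""
    return sum(tree(x) for tree in trees) / len(trees)


rng = random.Random(0)
in_bag, out_of_bag = bootstrap_split(9000, rng)
# For large n, about 1/e (roughly 37%) of the samples are never drawn.
oob_fraction = len(out_of_bag) / 9000

# Three stand-in "trees", each a plain function of x (illustrative only).
trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 6.0]
prediction = forest_predict(trees, x=None)
print(oob_fraction, prediction)
```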

Support Vector Regression
Support vector machines are statistical learning algorithms used for classification and regression. Support vector regression (SVR) solves regression problems through a nonlinear mapping function $\mu(x_i)$, which maps the original samples $x_i$ to a feature space of a higher dimension, and then uses linear regression in that space to compute the corresponding targets. Given a set of n training data $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ with $(x_i, y_i) \in \mathbb{R}^{d} \times \mathbb{R}$, where $x_i$ is the d-dimensional input vector and $y_i$ is its target, the following decision function can be defined:
$$f(x) = \omega \cdot \mu(x) + b \tag{3}$$
where · is the inner product, ω is the weight vector, and b is the bias parameter. SVR applies the structural risk minimization principle to generate the above decision function [see Equation (3)] by minimizing a regularized risk function, as presented in Equations (4) and (5):
$$R(C) = \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{n} L_\varepsilon\big(y_i, f(x_i)\big) \tag{4}$$
where:
$$L_\varepsilon\big(y, f(x)\big) = \begin{cases} 0, & |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases} \tag{5}$$
In SVR, Vapnik's ε-insensitive loss function is used to measure the empirical risk, where ε is the tube size, $\frac{1}{2}\|\omega\|^2$ is a regularization term used as a measure of the flatness or complexity of the function, and C is a regularization constant that describes the trade-off between the empirical risk and the regularization term.
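According to Wolfe duality and the saddle-point condition, the dual optimization problem of the aforementioned primal one is described by the following term:
$$\max_{\alpha, \alpha^*} \; -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,\mu(x_i)\cdot\mu(x_j) - \varepsilon\sum_{i=1}^{n}(\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i(\alpha_i - \alpha_i^*) \tag{6}$$
subject to $\sum_{i=1}^{n}(\alpha_i - \alpha_i^*) = 0$ and $0 \le \alpha_i, \alpha_i^* \le C$. The weight parameters are then described by
$$\omega = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)\,\mu(x_i) \tag{7}$$
where $\alpha_i$ and $\alpha_i^*$ are nonnegative Lagrange multipliers, which can be obtained by solving the convex quadratic programming problem. Finally, based on Equation (7) and the radial basis function (RBF) kernel trick, the decision function given by Equation (3) has the following specific form:
$$f(x) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^*)\,K(x_i, x) + b \tag{8}$$
where $K(x_i, x) = \exp\!\big(-\|x_i - x\|^2 / (2\sigma^2)\big)$, with σ representing the kernel parameter, and the bias b is computed from the support vectors whose multipliers lie strictly between 0 and C. Note that there are three hyperparameters to set: the regularization constant C, the RBF kernel parameter σ, and the width ε of the loss function.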
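To make the pieces of the SVR formulation concrete, the sketch below implements the ε-insensitive loss, an RBF kernel, and the kernelized decision function. The function names are hypothetical, and the kernel scaling exp(−‖a − b‖² / (2σ²)) is one common convention assumed here; the paper's exact scaling may differ. The coefficients stand for the differences of Lagrange multipliers and would normally come from a quadratic programming solver, which is not shown.

```python
import math


def eps_insensitive(y_true, y_pred, eps):
    """Vapnik's loss: zero inside the tube of half-width eps, linear outside."""
    return max(0.0, abs(y_true - y_pred) - eps)


def rbf_kernel(a, b, sigma):
    """RBF kernel with the assumed scaling exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svr_decision(x, support_vectors, coeffs, b, sigma):
    """f(x) = sum_i coeff_i * K(x_i, x) + b, with coeff_i = alpha_i - alpha_i*."""
    return sum(c * rbf_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, coeffs)) + b


inside = eps_insensitive(1.0, 1.05, eps=0.1)    # error within the tube
outside = eps_insensitive(1.0, 1.5, eps=0.1)    # error beyond the tube
f_value = svr_decision([0.0, 0.0], [[0.0, 0.0]], [2.0], b=0.5, sigma=1.0)
print(inside, outside, f_value)
```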

Multilayer Perceptron
The multilayer perceptron (MLP) is a powerful class of data-driven function approximators that represent information through a hierarchy of features. It follows a simple artificial neural network (ANN) structure, beginning with an input layer and ending with an output layer, with the intermediate layers known as hidden layers. By varying the number of hidden layers and the size of each, one can learn functions of arbitrary complexity. The input and output layer sizes are fixed, being determined by the dimensionality of the input features and the output targets. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLPs are trained with a supervised learning technique, namely backpropagation. We used an MLP with two hidden layers of 100 nodes each; the input layer has dimension 30 (the dimension of our process parameters), and the output layer has three nodes (the product attributes).
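The 30-100-100-3 layout can be illustrated with a forward pass written in plain Python. This is a sketch under assumptions: the ReLU activation, the random weight initialization, and the linear output layer are choices made for the example (the text only states that hidden nodes use a nonlinear activation), and no training is shown.

```python
import random


def relu(v):
    return [max(0.0, a) for a in v]


def dense(v, weights, bias):
    """One fully connected layer: weights has one row per output node."""
    return [sum(w * a for w, a in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
sizes = [30, 100, 100, 3]   # input, two hidden layers, three targets
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def mlp_forward(x):
    h = x
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))   # hidden layers (assumed ReLU)
    weights, bias = layers[-1]
    return dense(h, weights, bias)          # linear output for regression


y_hat = mlp_forward([0.5] * 30)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(len(y_hat), n_params)
```

Counting weights and biases, this layout has 13,503 trainable parameters.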

Bio-Inspired Search Algorithms
Over the last decade, modeling the behavior of social insects such as ants and bees has been used as a way of solving search problems. There are many different bio-inspired search algorithms, including particle swarm optimization, artificial bee colony, and the firefly algorithm, which have been widely used in numerical optimization [28], motion estimation [29], image thresholding [30,31], neural network parameter training [32], enhanced near field characteristic [33], simulation-driven spatial phase shifters [34], and electromagnetic band-gap resonator antenna [35].

Particle Swarm Optimization
Particle swarm optimization (PSO) is a metaheuristic, stochastic, population-based evolutionary optimization algorithm, and its standard form was initially developed by Kennedy and Eberhart [36]. It searches for an optimal solution by modeling a swarm in which each particle has a velocity and a position in the solution search space. The lower and upper bounds of each dimension of a particle are denoted in the algorithm by lb and ub. Each particle improves on the best solution found so far by iteratively updating its velocity and position in the search space, as described by Equations (9) and (10):
$$v_i(t+1) = \omega\, v_i(t) + c_1 r_1 \big(p_i(t) - x_i(t)\big) + c_2 r_2 \big(p_g(t) - x_i(t)\big) \tag{9}$$
$$x_i(t+1) = x_i(t) + v_i(t+1) \tag{10}$$
where $v_i(t)$ and $x_i(t)$ are the velocity and position of the i-th particle at iteration t, ω indicates the inertia weight, $c_1$ and $c_2$ are the learning rates, $r_1$ and $r_2$ are random numbers ranging from 0 to 1, and $p_i(t)$ and $p_g(t)$ represent the personal best and the global best, respectively.
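A compact sketch of the PSO velocity and position updates follows. The function name, the default values of the inertia weight and learning rates, the clamping of positions to [lb, ub], and the toy fitness are all assumptions made for illustration, not settings reported in the paper.

```python
import random


def pso(fitness, dim, lb, ub, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over [lb, ub]^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lb, ub) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for k in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + personal pull + global pull.
                vel[i][k] = (w * vel[i][k]
                             + c1 * r1 * (pbest[i][k] - pos[i][k])
                             + c2 * r2 * (gbest[k] - pos[i][k]))
                # Position update, clamped to the bounds lb and ub.
                pos[i][k] = min(ub, max(lb, pos[i][k] + vel[i][k]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


def toy_fitness(x):
    """Hypothetical objective with its maximum (zero) at x_k = 0.3."""
    return -sum((a - 0.3) ** 2 for a in x)


best, best_fit = pso(toy_fitness, dim=5, lb=-1.0, ub=1.0)
print(best, best_fit)
```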

Artificial Bee Colony Algorithm
The artificial bee colony (ABC) algorithm, proposed by Karaboga and Basturk [37], is a promising technique for solving real-world optimization problems. It models a colony of artificial bees containing three different groups: employed bees, onlookers, and scouts. The employed bees carry information about food sources and share it in the dancing area of the hive. The onlookers wait in the dancing area to receive this information from the employed bees, and they use it to select a food source with a probability based on the amount of food located at each source. The third kind of bee, the scout, carries out random searches for new food sources. An employed bee becomes a scout when its food source is abandoned and becomes an employed bee again as soon as it finds a new food source. Therefore, each cycle of the ABC algorithm contains three steps. First, employed bees are sent to the known food sources, and the amounts of nectar are calculated. After receiving that information, onlooker bees visit the food sources and provide updates. When the nectar at a food source is depleted, a scout is sent out to find a new food source.
Within the algorithm, the position of a food source $x_i$ represents a candidate solution to the optimization problem, and the amount of nectar at this food source is denoted as its fitness (fit). In general, the number of employed bees or onlookers is equal to the number of food sources. Initially, the ABC algorithm randomly generates a distributed initial population of K solutions, denoted by $P = \{x_1, x_2, \ldots, x_K\}$, where K denotes the number of employed bees or onlookers, and each solution $x_i$ (for i = 1, 2, ..., K) is a D-dimensional vector. During each execution cycle C (where C = 1, 2, ..., MCN, the maximum cycle number), the population of solutions is subjected to the search processes of the employed bees, onlookers, and scouts. An employed bee produces a new candidate solution (food source) $v_i$ from its current solution by using Equation (11):
$$v_{ij} = x_{ij} + \phi\,(x_{ij} - x_{kj}) \tag{11}$$
where $k \in \{1, 2, \ldots, K\}$ with $k \neq i$ and $j \in \{1, 2, \ldots, D\}$ are randomly selected indexes, and ϕ is a random number between −1 and 1.
If the fitness value of the new solution v i is greater than that of the previous solution x i , then the employed bee simultaneously remembers the new solution and abandons the old one; otherwise, it will retain the location of the old one in its memory.
When all employed bees have finished their search process, they bring back the information they have on the positions and nectar amounts of all food sources to the onlookers. Each of the onlookers then chooses a particular food source to visit, according to a probability proportional to the amount of nectar at that food source. This probability $p_i$ of selecting a food source $z_i$ is determined using Equation (12):
$$p_i = \frac{fit(z_i)}{\sum_{k=1}^{K} fit(z_k)} \tag{12}$$
In practice, each food source $z_i$ sequentially generates a random number between 0 and 1. If this random number is less than the probability $p_i$, an onlooker is sent to the food source and produces a new solution based on Equation (13):
$$v_{ij} = z_{ij} + \phi\,(z_{ij} - z_{kj}) \tag{13}$$
where $k \in \{1, 2, \ldots, K\}$ with $k \neq i$ and $j \in \{1, 2, \ldots, D\}$ are randomly selected indexes, and ϕ is a random number between −1 and 1.
If the fitness value of the new solution is greater than that of the old one, the onlooker memorizes the new solution and shares this information with the other onlookers. Otherwise, the new solution is discarded. This process is repeated until all onlookers have been distributed to food sources. If a food source cannot be improved upon within a predetermined number of trials (the limit value), it is abandoned, and the corresponding employed bee becomes a scout. This scout then goes on to discover a new food source to replace the abandoned solution $z_i$, as described by Equation (14):
$$z_{ij} = z_j^{\min} + \sigma\,(z_j^{\max} - z_j^{\min}) \tag{14}$$
where $z_j^{\min}$ and $z_j^{\max}$ are the lower and upper bounds of the j-th component of the solution, and σ is a random number ranging from 0 to 1. If the new solution is better than the abandoned solution $z_i$, the scout becomes an employed bee, and the new solution is retained.
The search processes of the employed, onlooker, and scout bees are repeated until the execution cycle equals MCN. Among all the solutions found so far, the one with the largest fitness is output by the ABC algorithm.
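The employed-bee, onlooker, and scout phases can be put together in a short sketch. Everything specific here is an assumption for illustration: the function name, the population size, the limit value, the clamping to bounds, and the toy fitness (which must be positive so that the selection probabilities are well defined). As a simplification, the scout's new source replaces the abandoned one unconditionally, while the best solution seen so far is stored separately.

```python
import random


def abc_search(fit, dim, lo, hi, n_sources=10, limit=50, mcn=200, seed=0):
    """Maximize a positive fitness `fit` with a basic artificial bee colony."""
    rng = random.Random(seed)

    def new_source():
        # Scout step: uniform random point inside the bounds.
        return [lo + rng.random() * (hi - lo) for _ in range(dim)]

    src = [new_source() for _ in range(n_sources)]
    val = [fit(s) for s in src]
    trials = [0] * n_sources
    b = max(range(n_sources), key=lambda i: val[i])
    best, best_val = src[b][:], val[b]

    def try_neighbor(i):
        # Perturb one random component relative to a random other source.
        k = rng.choice([m for m in range(n_sources) if m != i])
        j = rng.randrange(dim)
        phi = rng.uniform(-1.0, 1.0)
        cand = src[i][:]
        cand[j] = min(hi, max(lo, cand[j] + phi * (cand[j] - src[k][j])))
        f = fit(cand)
        if f > val[i]:                       # greedy selection
            src[i], val[i], trials[i] = cand, f, 0
        else:
            trials[i] += 1

    for _ in range(mcn):
        for i in range(n_sources):           # employed bees
            try_neighbor(i)
        total = sum(val)
        for i in range(n_sources):           # onlookers
            if rng.random() < val[i] / total:
                try_neighbor(i)
        for i in range(n_sources):           # scouts
            if trials[i] > limit:
                src[i] = new_source()
                val[i], trials[i] = fit(src[i]), 0
        b = max(range(n_sources), key=lambda i: val[i])
        if val[b] > best_val:
            best, best_val = src[b][:], val[b]
    return best, best_val


def toy_fit(x):
    """Hypothetical positive fitness with maximum 1.0 at x_k = 0.3."""
    return 1.0 / (1.0 + sum((a - 0.3) ** 2 for a in x))


best, best_val = abc_search(toy_fit, dim=5, lo=-1.0, hi=1.0)
print(best, best_val)
```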

Firefly Algorithm
The firefly algorithm (FA) was developed by Xin-She Yang at Cambridge University in 2008 [38][39][40][41]. It has three idealized rules. First, all fireflies are unisex, so each firefly is attracted to all other fireflies, regardless of their sex. Second, attractiveness is proportional to brightness; thus, for any two flashing fireflies, the less bright one will move toward the brighter one. If there is no brighter one, then that particular firefly will move randomly. To model firefly attractiveness, one should select a monotonically decreasing function of the distance $r_{i,j} = d(x_j, x_i)$ from the chosen (j-th) firefly $x_j$ to the target (i-th) firefly $x_i$. This is described by Equations (15) and (16), where $\beta_0$ is the attractiveness at $r_{i,j} = 0$, and γ is the light absorption coefficient at the source. The movement of firefly i when it is attracted to another, more attractive firefly j is determined by Equations (17) and (18). If a particular firefly $x_i$ is already the brightest (i.e., it is the one with the maximum fitness), then it will move randomly according to Equations (19) and (20), in which $rand_1 \sim U(0, 1)$ and $rand_2 \sim U(0, 1)$ are random numbers obtained from a uniform distribution.
The third rule is that the brightness of a firefly is affected or determined by the landscape of the fitness function ϕ(•). For maximization problems, the brightness I of a firefly at a particular location x can be chosen as a function I(x) that is proportional to the value of the fitness function ϕ(x).
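The three rules can be illustrated with the sketch below. It uses one commonly cited form of the firefly update, x_i ← x_i + β₀·exp(−γr²)·(x_j − x_i) + α·(rand − 0.5), which is an assumption and may differ from the exact equations used in this paper; the step size α, the default parameter values, the clamping to bounds, and the toy brightness function are likewise illustrative.

```python
import math
import random


def firefly_search(brightness, dim, lo, hi, n=10, mcn=50,
                   beta0=1.0, gamma=1.0, alpha=0.1, seed=0):
    """Maximize `brightness` over [lo, hi]^dim with a basic firefly algorithm."""
    rng = random.Random(seed)

    def clip(v):
        return min(hi, max(lo, v))

    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    light = [brightness(x) for x in xs]
    b = max(range(n), key=lambda i: light[i])
    best, best_light = xs[b][:], light[b]
    for _ in range(mcn):
        for i in range(n):
            brighter = [j for j in range(n) if light[j] > light[i]]
            if brighter:
                # Rule 2: move toward a randomly chosen brighter firefly.
                j = rng.choice(brighter)
                r2 = sum((a - c) ** 2 for a, c in zip(xs[i], xs[j]))
                beta = beta0 * math.exp(-gamma * r2)
                xs[i] = [clip(a + beta * (c - a) + alpha * (rng.random() - 0.5))
                         for a, c in zip(xs[i], xs[j])]
            else:
                # The brightest firefly takes a small random step.
                xs[i] = [clip(a + alpha * (rng.random() - 0.5)) for a in xs[i]]
            # Rule 3: brightness is given by the fitness landscape.
            light[i] = brightness(xs[i])
            if light[i] > best_light:
                best, best_light = xs[i][:], light[i]
    return best, best_light


def toy_brightness(x):
    """Hypothetical objective with its maximum (zero) at x_k = 0.3."""
    return -sum((a - 0.3) ** 2 for a in x)


best, best_light = firefly_search(toy_brightness, dim=5, lo=-1.0, hi=1.0)
print(best, best_light)
```

Because the best solution seen so far is stored separately, the random step of the brightest firefly can never make the returned result worse.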

Materials and Experimental Setup
All experiments were performed on a PC with an Intel Core i5 3.30 GHz CPU, 8 GB of RAM, and an NVIDIA GeForce GTX 1060 GPU. All machine-learning methods were implemented in Python in a multitask-learning setting, and each was then used individually to verify the performance of the process parameter optimization using four-fold cross-validation.
The 9299 HSLA samples were collected from the China Steel Corporation (Kaohsiung, Taiwan) from 2016 to 2010. Each sample has a 30-dimensional process parameter vector and three corresponding product quality attributes. The 9299 samples comprise 9000 training samples that meet the quality requirements (with two of every 10 samples used as validation samples) and 299 nonconforming samples that do not meet the requirements (used as test samples). In general, nonconforming samples are difficult to collect since CSC is a high-quality steel manufacturer. Each process parameter vector x includes temperature, steeling time, iron composition, and so on. The product attribute Y includes three quality attributes, i.e., yield stress, tensile stress, and plastic strain ratio. HSLA products must fall within the following quality ranges for these three attributes: yield stress [0, 185], tensile stress [270, ∞), and plastic strain ratio [39, ∞). For easy identification, the process parameters are represented by $x_i = (f_{i1}, f_{i2}, \ldots, f_{i30})$ for i = 1, 2, ..., 9000, and the product attributes are represented by $Y_i = (y_{i1}, y_{i2}, y_{i3})$ for the training samples. However, the additional 299 test samples have only 25 fixed process parameters, with the other five being adjustable parameters denoted by $f_{i2}$, $f_{i4}$, $f_{i6}$, $f_{i8}$, and $f_{i9}$ for the i-th test sample. Table 1 shows the data for five original training samples, where each includes the 30 dimensions of the process parameters, named $f_1, f_2, \ldots, f_{30}$, and the three product attributes $Y_1$, $Y_2$, and $Y_3$. The five adjustable process parameters are restricted to positive integers. In total, 9000 training samples were used to train the simplified VGG surrogate model, each including a complete set of process parameters and product attributes, whereas five of the process parameters in each of the 299 test samples were adjustable.

Simplified VGG-16 Convolutional Neural Networks
Convolutional neural networks (CNNs) are a class of deep neural networks that have been widely applied in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain-computer interfaces, and financial time series. A CNN consists of an input layer, hidden layers, and an output layer. In any feed-forward neural network, the middle layers are called "hidden" because their inputs and outputs are masked by the activation function and by the final convolution. In a CNN, these hidden layers include layers that perform convolutions. Typically, such a layer computes a dot product between the convolution kernel and its input, and its activation function is commonly a rectified linear unit (ReLU). This layer is followed by other layers, such as pooling layers, fully connected layers, and normalization layers.
VGG is a popular CNN model containing 13 convolutional layers and three fully connected layers, commonly applied to image classification and pattern recognition. However, it is not directly suitable for building our surrogate model. In this paper, we instead use a simplified VGG model for this task. As the traditional VGG-16 model has a 2D layer structure consisting of many convolution layers, max-pooling layers, and fully connected layers, it is difficult to directly apply it to our steelmaking process-optimization problem with its 1D mapping. Therefore, we modified this VGG-16 model into a 1D structure with 10 layers (seven for 1D convolutional layers and three for fully connected layers). The seven convolutional layers generate powerful features, and three fully connected layers use the extracted features for regression.
More precisely, the differences between the traditional VGG-16 model and our simplified VGG model are shown in Figure 2. The simplified VGG model removes the final soft-max procedure and maps the process parameters to a single product attribute. It is a one-dimensional CNN model composed of seven convolutional layers and three fully connected layers. The convolutional layers extract discriminative features, and the fully connected layers regress the product attributes when establishing the surrogate model. Our trained simplified VGG first uses three blocks, in each of which two or three convolutional layers are followed by max-pooling, to obtain feature maps with sizes of 3 × 32; these are then flattened into one-dimensional form. The subsequent fully connected layers are applied to regress the product attribute $Y_1$, $Y_2$, or $Y_3$.
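The stated 3 × 32 feature-map size can be checked with simple arithmetic. The sketch below assumes 'same'-padded convolutions (which keep the length unchanged) and max-pooling with window and stride 2; neither is stated in the text, and the ordering of the blocks as (2, 2, 3) convolutions is also an assumption, although two blocks of two and one block of three is the only way to reach seven convolutional layers. Under these assumptions the 30-dimensional input shrinks to 15, then 7, then 3.

```python
def trace_feature_lengths(input_len, convs_per_block, pool=2):
    """Return (n_convs, length_after_pooling) for each block."""
    length, trace = input_len, []
    for n_convs in convs_per_block:
        # Assumed 'same'-padded 1D convolutions leave the length unchanged,
        # so only the pooling step (window = stride = pool) shortens the map.
        length //= pool
        trace.append((n_convs, length))
    return trace


trace = trace_feature_lengths(30, convs_per_block=(2, 2, 3))
final_length = trace[-1][1]
flattened_size = final_length * 32   # 32 feature maps, as stated in the text
print(trace, flattened_size)
```

The flattened vector that feeds the fully connected layers would therefore have 96 entries.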
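In this method, the Adam algorithm [42] is used as the optimizer, with a loss function equal to the mean square error (MSE) between the actual outputs $\phi(x_l)$ of the surrogate model and the desired outputs $y_l$, as defined in Equation (21):
$$\mathrm{MSE} = \frac{1}{N}\sum_{l=1}^{N}\big(\phi(x_l) - y_l\big)^2 \tag{21}$$
where N is the number of training samples.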

Process Parameter Optimization Using the Firefly Algorithm
The process parameters of our test samples are divided into 25 fixed parameters and five adjustable ones, with the latter modified such that the product attributes meet the requirements. Therefore, we first assigned the five adjustable process parameters to the i-th firefly solution $x_i$, with the structure depicted in Figure 3. Next, we used the FA to search for the optimal solution $x_{best}$: we initialized N firefly solutions $x_i$, i = 1, 2, ..., N, and each solution was updated using the FA. The fixed process parameters were combined with each firefly solution $x_i$ to form the complete process parameter vector $X_i$. Using the built surrogate model ϕ(X), the corresponding product quality attributes, $Y_k = \phi(X_i)$ for k = 1, 2, and 3, could be calculated. We then defined a fitness function as the brightness of the FA for the purposes of searching for the optimum, as shown in Equation (23), where the product attributes $Y_{i1}$, $Y_{i2}$, and $Y_{i3}$ are the outputs of the surrogate model ϕ(x) with x as its input, and $m_{Y_i}$ and $\sigma_{Y_i}$ are the mean and the standard deviation of the i-th product attribute of the training data samples.

Table 1. X and Y are the sets of process parameters and product quality attributes. X contains the parameters of temperature, steeling time, iron composition, and so on. The product attribute Y includes three quality attributes, i.e., yield stress, tensile stress, and plastic strain ratio.

The steps of the proposed algorithm are described in detail as follows:
Step 1. Generate the initial solutions and set the hyper-parameters: In this step, the initial population of N solutions is generated, as denoted by $D = [x_1, x_2, \ldots, x_N]$, where $x_i = [f_{i2}, f_{i4}, f_{i6}, f_{i8}, f_{i9}]$ is the i-th firefly solution, and the values of $x_i$ are assigned from between −1 and 1. This step also assigns the parameters of the FA, which are σ, $\beta_0$, the MCN, and γ. The cycle counter l is set to 0.
Step 2. Firefly movement: Here, each firefly solution is combined with the fixed parameters into $X_i$, and its fitness value $Fitness(\phi(X_i))$ is computed as the corresponding brightness of the firefly. For each firefly solution $x_i$, this step randomly chooses another, brighter solution $x_j$ and then moves toward it according to Equations (24) and (25), where $u_{jk} \sim U(0, 1)$ is a random number ranging from 0 to 1, and $x_{ik}$ is the k-th element of solution $x_i$.
Step 3. Select the current best solution: This step picks the best solution from the solution set and represents it as $x_{best}$, as described by the following:
$$x_{best} = \arg\max_{x_i} Fitness(\phi(X_i)) \tag{26}$$
Step 4. Check the termination criterion: If the cycle number l is equal to the MCN, the algorithm finishes and outputs the best solution $x_{best}$. Otherwise, l is increased by one, the best solution $x_{best}$ takes a random walk according to Equation (27), and the algorithm returns to Step 2:
$$x_{best,k} = x_{best,k} + u_k, \quad k = 1, 2, \ldots, 5 \tag{27}$$
where $u_k$ is a random number ranging from 0 to 1.
The PSO and ABC algorithms are used to search for the optimal parameters in the same manner as the FA.
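To illustrate how a five-element firefly solution could be merged with the 25 fixed parameters before the surrogate is evaluated, the following sketch places the adjustable values at the positions of f2, f4, f6, f8, and f9. The helper names are hypothetical, and the surrogate and fitness used here are toy stand-ins, not the trained SVGG model or the paper's fitness function.

```python
ADJUSTABLE = (1, 3, 5, 7, 8)   # zero-based positions of f2, f4, f6, f8, f9


def assemble(fixed, adjustable):
    """Merge 25 fixed and 5 adjustable values into one 30-dimensional vector."""
    if len(fixed) != 25 or len(adjustable) != 5:
        raise ValueError("expected 25 fixed and 5 adjustable parameters")
    fixed_iter, adj_iter = iter(fixed), iter(adjustable)
    return [next(adj_iter) if pos in ADJUSTABLE else next(fixed_iter)
            for pos in range(30)]


def make_brightness(surrogate, fitness, fixed):
    """Build a function of the 5 adjustable values only, for a search algorithm."""
    def brightness(adjustable):
        return fitness(surrogate(assemble(fixed, adjustable)))
    return brightness


def toy_surrogate(X):
    """Stand-in for the trained surrogate; returns three fake attributes."""
    return [sum(X), max(X), min(X)]


def toy_fitness(Y):
    """Stand-in fitness; larger is better, maximum 0 when Y[0] equals 2800."""
    return -abs(Y[0] - 2800.0)


fixed = [100.0 + k for k in range(25)]
firefly_solution = [-1.0, -0.5, 0.0, 0.5, 1.0]
X_i = assemble(fixed, firefly_solution)
brightness = make_brightness(toy_surrogate, toy_fitness, fixed)
value = brightness(firefly_solution)
print(X_i[:10], value)
```

Wrapping the surrogate this way lets any of the three search algorithms operate on a five-dimensional search space while the surrogate still receives all 30 inputs.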

Training Mechanism by Using ML Methods
In our experiments, the training parameters and strategy of the simplified VGG model are listed in Table 2, where each entry is the optimal setting for the model. An initial learning rate of 0.001 and 250 training epochs were chosen in order to achieve adequate convergence. The Adam algorithm [25] was used as the optimizer and the mean square error as the loss function. A batch size of 50 was selected, and the average training time was 232.47 s. In order to evaluate the performance of a trained simplified VGG model, we computed the mean absolute error (MAE) between the predicted and actual product attributes, as defined in Equation (28):
$$\mathrm{MAE}_i = \frac{1}{N}\sum_{l=1}^{N}\big|Y_{il} - \hat{Y}_{il}\big| \tag{28}$$
where $Y_{il}$ is the i-th product attribute of the l-th test sample, $\hat{Y}_{il}$ is the corresponding prediction, and N is the number of test samples. The MAEs of $Y_1$, $Y_2$, and $Y_3$ are shown in Table 3, which reveals that the simplified VGG is the best, i.e., it captures the relationship between the process parameters and the product attributes most accurately.
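A per-attribute MAE can be computed as in the short sketch below. The function name and the numbers are made up for illustration; they are not measurements from the paper.

```python
def mae_per_attribute(y_true, y_pred):
    """Mean absolute error of each product attribute over all samples."""
    n_samples, n_attr = len(y_true), len(y_true[0])
    return [sum(abs(t[i] - p[i]) for t, p in zip(y_true, y_pred)) / n_samples
            for i in range(n_attr)]


# Hypothetical (Y1, Y2, Y3) values for two samples.
actual = [[150.0, 300.0, 45.0], [160.0, 310.0, 40.0]]
predicted = [[152.0, 297.0, 45.0], [158.0, 311.0, 42.0]]
maes = mae_per_attribute(actual, predicted)
print(maes)
```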

Experimental Results of the Process Parameter Optimizations
The FA was applied in order to search for the optimal adjustable process parameters of each of the 299 test samples, with the parameter settings of the algorithm given in Table 4. The FA is an iterative method, and a preassigned end condition is therefore needed. In this paper, we set the maximum number of iterations to 50 and the number of initial firefly solutions to 10. The average time spent searching for the optimal process parameters of each test sample was about 0.62 s. Additional details concerning the FA are shown in Figure 4. In our experiments, the 299 test samples were used to evaluate the performance of each method when searching for the optimal process parameters. The basis of evaluation is the success rate, i.e., how often the final product attributes fall within their required quality ranges. Our experiments used the following combinations of methods: RF+PSO, RF+ABC, RF+Firefly, SVGG+PSO, SVGG+ABC, and SVGG+Firefly, the results of which are shown in Table 5. There we see that all the resulting test sample product attributes met their product quality requirements when the SVGG+Firefly combination was used, with its success rate of 100% clearly outperforming the other combinations.
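The success rate can be sketched as the fraction of samples whose three predicted attributes all fall inside the quality ranges given for yield stress, tensile stress, and plastic strain ratio. The names below are hypothetical, and the four example predictions are invented for illustration.

```python
import math

# Quality ranges as stated for HSLA products: (lower bound, upper bound).
QUALITY_RANGES = (
    (0.0, 185.0),        # yield stress
    (270.0, math.inf),   # tensile stress
    (39.0, math.inf),    # plastic strain ratio
)


def meets_requirements(attributes):
    """True when every attribute lies inside its quality range."""
    return all(lo <= value <= hi
               for value, (lo, hi) in zip(attributes, QUALITY_RANGES))


def success_rate(predicted_attributes):
    """Fraction of samples whose attributes all meet the requirements."""
    passed = sum(meets_requirements(y) for y in predicted_attributes)
    return passed / len(predicted_attributes)


# Hypothetical (yield stress, tensile stress, plastic strain ratio) triples.
examples = [
    (180.0, 300.0, 40.0),   # passes
    (190.0, 300.0, 40.0),   # yield stress too high
    (150.0, 260.0, 45.0),   # tensile stress too low
    (100.0, 270.0, 39.0),   # passes, exactly on the lower bounds
]
rate = success_rate(examples)
print(rate)
```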

Conclusions
The optimization of process parameters is an important problem in steelmaking. In general, surrogate models can be applied to simulate the manufacturing process, and many different methods are available to build such models. In this paper, we proposed a simplified VGG model for this purpose and compared it with other machine-learning methods. On top of the surrogate model, different algorithms (PSO, ABC, and FA) were applied to obtain the optimal process parameters. In our experiments, we evaluated different combinations of trained surrogate models and search methods. Our proposed simplified VGG model proved to be the best method compared with linear regression, random forests, support vector regression, and the multilayer perceptron. The firefly algorithm is an effective search mechanism for optimizing the process parameters with the simplified VGG surrogate model. It achieved a success rate of 100% on the 299 test samples, which suggests that the approach has the potential to be applied effectively to process parameter optimization in other areas of product manufacturing. Furthermore, developing a more complex and effective CNN model to build surrogate models for other applications of process parameter optimization is an interesting direction for future work.