Novel Prediction Model for Steel Mechanical Properties with MSVR Based on MIC and Complex Network Clustering

Traditional mechanical properties prediction models are mostly based on experience and mechanism, which neglect the linear and nonlinear relationships between process parameters. Aiming at the high-dimensional data collected in the complex industrial process of steel production, a new prediction model is proposed. The multidimensional support vector regression (MSVR)-based model is combined with a feature selection method that involves maximum information coefficient (MIC) correlation characterization and complex network clustering. Firstly, MIC is used to measure the correlation between process parameters and mechanical properties, based on which a complex network is constructed and hierarchical clustering is performed. Secondly, we evaluate all parameters and select a representative one for each partition as the input of the subsequent model based on the centrality and influence indicators. Finally, an actual steel production case is used to train the MSVR prediction model. The prediction results show that our proposed framework can capture effective features from the full parameters in terms of higher prediction accuracy and is less time-consuming compared with the Pearson-based subset, full-parameter subset, and empirical subset input. The feature selection method based on MIC can dig out some nonlinear relationships which cannot be found by the Pearson coefficient.


Introduction
The level of the steel industry is an important indicator of a country's industrialization. At present, all sectors place increasingly stringent requirements on iron and steel products. The mechanical properties of steel can mean the difference between a long, efficient life in the most abrasive and wear-intensive applications and frequent or even catastrophic failure. Understanding these properties is essential because all production activities ultimately aim to satisfy the actual quality requirements. To maintain and improve product quality, energy efficiency, and economic profits, quality prediction and control based on mechanical properties are essential and have been investigated quite extensively in recent years [1]. Among numerous indicators, tensile strength, yield strength, and elongation are the most commonly used measurements of a product's mechanical properties, which are affected by a variety of factors [2]. However, the production process of steel products involves complex physical and chemical changes and intricate technological processes, which means that property prediction and control have always been a difficult problem in the metallurgical industry. In traditional practice, property prediction depends on experience and destructive tests, which are costly, time-consuming, and laborious. If the prediction could consider the relevant process parameters, and accordingly optimize the metal composition and process technology, it could greatly reduce the testing time and improve the production efficiency of iron and steel enterprises. Based on this idea, two main methods for that are the empirical and statistical of feature selection can be implemented as follows: (a) clustering all the process parameters and (b) selecting representative ones for each group.
Some researchers have explored the centrality and influence indicators in complex networks to reflect the importance of nodes in the network [21]. The patterns among nodes, including the differences and connections, can also be studied to find the key network participants [22]. However, the key parameters selected based on experience virtually ignore the parameter interactions such as the similarity between them and their importance in the network. Moreover, many feature extraction methods transform the original data set to another by recombining existing features into new features, which may destroy the original physical structure of data and cause the new features to lose their physical meaning. Therefore, based on the characteristics of the steel product data set, all variables can be clustered according to the correlation coefficient, and the relationships between them can be measured by the centrality and influence indicators, so as to complete the feature selection and obtain the input parameters for the subsequent learners.
With the continuous development of data mining technology, artificial intelligence methods such as neural network [23], fuzzy control [24], and expert system [25] have become more and more popular. Among them, support vector machine (SVM) is an efficient learning machine based on statistical learning theory and structural risk minimization principle proposed by Vapnik. It can deal with problems with multiple input and single output. However, problems in the steel production process often have multiple outputs which are not mutually independent. If multiple support vector machine regression (SVR) algorithms are used to estimate multiple output functions, each sample point cannot be treated equally, so the accuracy is poor. Therefore, in order to improve the accuracy of estimation and reduce the computational workload of multidimensional regression problems, multi-output support vector machine regression (MSVR) can be used for performance prediction for the steel products [26].
Motivated by the above considerations, we propose a novel prediction model for steel mechanical properties, with MSVR based on MIC and complex network clustering. In our model, we measure the correlation between features with MIC, employ hierarchical clustering analysis based on the complex network theory, quantitatively evaluate each feature by centrality and influence indicators, then choose a feature subset as a parameter input which could represent a large amount of information. The MSVR is used to predict the mechanical properties and its accuracy can verify our proposed framework. By the case analysis of the practical steel production data in a steel company in Central China, we compare our method with the full-parameter subset input, empirical subset input, and Pearson-based subset input. It turns out that our scheme has the lowest computational complexity and the highest prediction accuracy.
The remaining sections of this article are organized as follows: preliminaries about the correlation evaluation index, theory of complex network, and the performance prediction model are briefly introduced in Section 2; in Section 3, the detailed development of the proposed novel prediction model with MSVR based on MIC and complex network clustering is presented; in Section 4, an actual case of steel production is studied and the comparison analyses of prediction results are provided; and Section 5 gives conclusions.

Correlation Analysis Methods
Correlation analysis is a basic issue in statistics that aims to quantify the association between two variables from limited data; such associations can be divided into linear and nonlinear ones. Linear correlation refers to the case in which the output and input are in positive or inverse proportion. When two variables share a linear relationship, the Pearson correlation is the standard measure of dependence, but it is not applicable when relationships are highly nonlinear. A nonlinear correlation is more complex and may be formed by the superposition of a variety of functional relationships. Therefore, it is natural to ask how to measure statistical correlation in a way that treats relationships of different types equally.
As is well known, mutual information (MI) is widely employed to quantify associations regardless of the relationship type [27]. Although it was originally proposed for communication systems, MI has repeatedly proved applicable to various statistical problems. Measured in units known as "bits", MI determines how much information one variable reveals about another. The MI between two random variables $X$ and $Y$ is defined in terms of their joint probability distribution $p(X, Y)$ as
$$I(X; Y) = \sum_{x}\sum_{y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)}.$$
On the basis of MI, Reshef et al. proposed the maximal information coefficient (MIC) as a statistical measure of association [18]. Compared with MI, MIC captures a wider range of associations, both functional and not. In principle, MIC is based on the idea that if there is a certain relationship between two variables, a grid can be drawn on the scatter diagram of the two variables and the data can be partitioned to encapsulate this relationship.
To calculate the MIC of two variables, all grids up to the maximum resolution are explored and the largest possible mutual information is computed. Therefore, the heart of MIC is a naive mutual information estimate $I(x, y)$ computed using a data-dependent grid scheme. Let $n_x$ and $n_y$ denote the number of bins imposed on the $x$ and $y$ axes, respectively. The MIC grid scheme is chosen so that (i) the total number of bins $n_x n_y$ does not exceed some user-specified value $B$ and (ii) the value of the ratio
$$\mathrm{MIC}(X, Y) = \frac{I(x, y)}{Z}, \qquad Z = \log_2\big(\min(n_x, n_y)\big),$$
is maximized. The ratio computed using this data-dependent grid scheme is how MIC is defined.
Note that $B = n^{0.6}$, where $n$ is the sample size. $\mathrm{MIC}(X, Y)$ is always nonnegative, and $\mathrm{MIC}(X, Y) = 0$ only when $X$ and $Y$ are mutually independent. Besides, MIC values will be greater than zero when $X$ and $Y$ show any correlation, regardless of how nonlinear that relationship is. Moreover, the stronger the correlation, the larger the value of $\mathrm{MIC}(X, Y)$.
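The grid search behind MIC can be illustrated with a small sketch. The version below is a simplification and not the estimator of Reshef et al.: it only tries equal-frequency grids instead of optimizing the bin boundaries, so it generally underestimates the true MIC. The function names (`equal_frequency_bins`, `mutual_information`, `mic_sketch`) are hypothetical, and the budget $B = n^{0.6}$ is assumed.

```python
import math
from collections import Counter

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts.
    Tied values may fall into different bins (ties are broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def mutual_information(bx, by):
    """Naive MI estimate (in bits) from two sequences of bin labels."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )

def mic_sketch(x, y):
    """Maximize I / log2(min(n_x, n_y)) over grids with n_x * n_y <= n**0.6.
    Assumption: only equal-frequency grids are explored."""
    n = len(x)
    budget = n ** 0.6
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        bx = equal_frequency_bins(x, nx)
        for ny in range(2, int(budget // nx) + 1):
            by = equal_frequency_bins(y, ny)
            score = mutual_information(bx, by) / math.log2(min(nx, ny))
            best = max(best, score)
    return best
```

On a noiseless parabola such as `y = (x - 100) ** 2` over `x = 0, ..., 199`, this sketch returns a clearly positive score even though the relationship is not linear, which is the behaviour that motivates using MIC for feature screening.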

Complex Network Theory
A network consists of nodes that represent individual entities and the links between them. We are surrounded by all kinds of networks, including transportation networks, social networks, and manufacturing networks, and building a network is a good way of modeling such systems. Based on the finding that scale-free networks have the outstanding features of strong connectivity and survivability, Barabási and Albert further developed complex network theory as a tool of network science for studying network topology [28]. Increasing network sizes and nontrivial topological structures go together with the increasing richness and variety of the attribute information associated with the nodes in a network.
A complex network is an abstract model that maps a real complex system. It abstracts the entities in the complex system into nodes and the relationships between entities into edges. Networks can be divided into weighted and unweighted networks. The latter have a binary nature, in which the edges between nodes are either present or absent, while the former display a large heterogeneity in the capacity and intensity of the connections. The adjacency matrix is a binary square matrix with the same row and column labels, which is commonly used to represent the actual relationships and construct a complex network. Complex network theory is widely used to study the characteristics of various networks and further improve network performance. The relationships between nodes in the network can be studied quantitatively by centrality analysis, binary relationship research, block-modeling analysis, cohesive subgroup analysis, etc. [29,30].

Complex Network Clustering
Clustering, also known as transitivity, is a typical property of complex networks, where two nodes associated with a common node are likely to be similar. White et al. (1976) proposed the block-modeling theory [31], which can simplify the complex network according to the degree of associations between nodes. Specifically, the nodes are rearranged into blocks by clustering, and the basic characteristics of the whole network can be reflected by each block. Recently, some scholars combined the stochastic block model with clustering to define the relationship between nodes and find subgroups [32,33].
In particular, the first step of block-modeling is to partition the actors, that is, to divide them into different groups based on methods of clustering and scaling. The Convergent Correlation (CONCOR) procedure is a hierarchical clustering method for relational data that begins by forming a new square matrix of product-moment correlations between the columns (or rows) of the original data; it has been found to give results that are highly compatible with analyses and interpretations of the same data using the block-modeling approach [34]. CONCOR is an iterative convergence algorithm that measures the network structure by repeatedly calculating the correlation matrix. Each iteration of CONCOR contains a hierarchical clustering step to achieve the partition. According to the correlation matrix between nodes, the data set is divided into different levels, from which the tree clustering structure is obtained.
The purpose of complex network clustering is to find the subgroups existing in the whole network. According to the correlation, the nodes with high degree of similarity are automatically clustered into one group. Selecting the representative nodes for each group based on the importance and power indicators and eventually forming a representative node set will be better than picking up typical nodes in the whole network. The partition process of the block model is shown in Figure 1, where we can see that several scattered nodes are divided into 16 clusters according to their similarity. The similarity between nodes in one cluster is high, and the importance of each node can be evaluated.

Centrality Evaluation of Nodes in the Complex Network
In a complex network, the power and importance of each node mainly depend on its centrality and influence. Based on the actual relationship data, we measure the "power and status" of nodes by the following four commonly used indicators, namely degree, closeness, betweenness, and Katz centrality.

Degree Centrality
Degree centrality is defined as the number of links incident upon a node. If the network is directed, then two separate measures of degree centrality are defined, namely, in-degree and out-degree. In-degree is a count of the number of ties directed to the node and out-degree is the number of ties that the node directs to others. In many cases, the degree is the sum of in-degree and out-degree. This index reflects the "power" of a node in the network and nodes with high degree are more likely to be the center of the network.

Betweenness Centrality
Betweenness centrality is a way of detecting the amount of influence a node has over the flow of information in a graph. It is often used to find nodes that serve as a bridge from one part of a graph to another. For every pair of vertices in a connected graph, there exists at least one shortest path between the vertices such that either the number of edges that the path passes through (for unweighted graphs) or the sum of the weights of the edges (for weighted graphs) is minimized. The betweenness centrality of a vertex is the number of these shortest paths that pass through the vertex. This index measures each actor's ability to control resources. An actor that lies on the shortest paths between many other actor pairs may have a low degree, but it can still play an intermediary role and thus be a center of the network.

Closeness Centrality
Closeness centrality is a way of detecting nodes that are able to spread information very efficiently through a graph. The closeness centrality of a node is based on its farness, that is, its total distance to all other nodes: the smaller the farness, the higher the closeness. Nodes with a high closeness score have the shortest distances to all other nodes. This index therefore reflects the inverse distance of a node to the other nodes. If an actor is closer to other actors, it is easier for it to transmit information; therefore, it is more likely to be the center of the network.

Katz Centrality
In graph theory, the Katz centrality is used to measure the relative degree of influence of an actor within a social network. Unlike typical centrality measures which consider only the shortest path between a pair of actors, Katz centrality measures influence by taking into account the total number of walks between a pair of actors. Katz centrality computes the relative influence of a node within a network by measuring the number of the immediate neighbors and also all other nodes in the network that connect to the node under consideration through these immediate neighbors. This index considers the direct and indirect relationship between node and other nodes. The shorter the distance between node i and node j, the greater the impact of node i on node j.

Support Vector Regression
In contrast to simple linear regression, SVR gives us the flexibility to define how much error is acceptable in our model and will find an appropriate line (or hyperplane in higher dimensions) to fit the data. Specifically, set a threshold α and calculate the loss of data points when | f (x) − y| > α, supposing that the data points within the threshold are predicted accurately. One of the main advantages of SVR is that its computational complexity does not depend on the dimensionality of the input space. Additionally, it has excellent generalization capability, with high prediction accuracy.
The objective of SVR is to minimize the coefficients, more specifically the norm of the coefficient vector, rather than the error term, which is instead handled in the constraints: the absolute error is required to be less than or equal to a specified margin, called the maximum error, $\varepsilon$ (epsilon), which plays the role of the threshold $\alpha$ above. We can tune $\varepsilon$ to gain the desired accuracy of our model.
Suppose that $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, $y_i$ is the output of $x_i$, $d$ is the dimension, and $l$ is the number of samples. Given the training set $\{(x_i, y_i)\}_{i=1}^{l}$, the goal of SVR is to find an optimal function $f$ from the set of hypothesis functions by minimizing the error term. The optimal function $f$ is as follows, where $w$ is the weight vector and $b$ is the threshold:
$$f(x) = w^{T} x + b.$$
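The role of the error margin can be made concrete with the epsilon-insensitive loss that is commonly used in SVR: deviations inside the margin cost nothing, and only the excess beyond it is penalized. The following is a minimal sketch under the assumption of a linear predictor; the function names and the sample numbers are illustrative only and do not come from the paper.

```python
def linear_predict(w, b, x):
    """Linear predictor f(x) = w . x + b (assumed form of the regression function)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Illustrative values only: the first prediction falls inside the tube.
losses = epsilon_insensitive_loss([1.0, 2.0, 3.0], [1.05, 2.5, 2.0], epsilon=0.1)
```

A larger `epsilon` tolerates more deviation and typically yields a simpler model, while a smaller value forces the fit to follow the training data more closely.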

Multidimensional Support Vector Regression
Assume $Y = \{y_1, y_2, y_3, \ldots\}$ is the quality index set of steel products and $X = \{X_B, X_C, X_R\}$ is the process parameter set from three stages: steelmaking, continuous casting, and rolling. Each stage consists of many specific process parameters, such as $X_B = \{x_1, x_2, \ldots, x_a\}$, which means that the number of variables in the steelmaking stage is $a$. The mean absolute percentage error (MAPE) can be set as the algorithm evaluation index, and quality modeling that considers the effect of the process parameters on the quality indices can be abstracted as learning a mapping from the process parameter set $X$ to the quality index set $Y$.

Problem Description
The data collected from the intricate process of steel production are high-dimensional and coupled with each other. There are complex linear or nonlinear relationships between them; meanwhile, their impact on product quality is hereditary. If we use the full-parameter data to model, not only is the calculation complex, but also the modeling is often inefficient and cannot well reflect the real problem because of the redundant features. If the important and representative features can be selected from the high-dimensional process data to simplify the complex problem, the subsequent modeling will be simpler and the effect will be more obvious. The emphasis of this paper is how to select the representative feature subset from the full-feature set, and then predict the performance of steel products more accurately.
Taking the whole process of steel production as an example, the original data set $O$ is first cleaned and deduplicated, which means removing the parameters that are regarded as irrelevant to the mechanical properties, namely those whose MIC with the properties is less than 0.05. The remaining process parameters from three typical stages, namely steelmaking, continuous casting, and rolling, are defined as $F = \{X_B, X_C, X_R\} = \{x_1, x_2, x_3, \ldots, x_m\}$, and the numbers of parameters in the three stages are $\{a, b, c\}$, so that the total number of parameters is $a + b + c = m$ [35]. Define the mechanical property set $Y = \{y_1, y_2, y_3\}$, which contains three indicators: tensile strength, yield strength, and elongation. The purpose of this study is to use a feature selection method to obtain a representative, low-dimensional feature subset $X = \{x_1, x_2, x_3, \ldots, x_t\}$, $t \ll a + b + c$, from the high-dimensional variable set $\{X_B, X_C, X_R\}$, and then to perform the subsequent MSVR prediction modeling $Y^T = f(x_1, x_2, x_3, \ldots, x_t)$, which could simplify the calculation and improve prediction accuracy at the same time.

Model and Algorithm
Based on the relevant basic theories in Section 2 and the requirements in Section 3.1, we propose an algorithm that firstly uses MIC to measure the linear and nonlinear correlation relationships between high-dimensional parameters. Secondly, we construct a complex network, and quantitatively evaluate each feature by CONCOR clustering method and centrality and influence analysis. Eventually, we could obtain the feature subset that could represent the full parameters efficiently, which could be used as the input of MSVR and to predict the mechanical properties. In order to verify the effectiveness and feasibility of the algorithm, the full-parameter set, empirical subset, and the best feature subsets selected based on MIC and Pearson coefficients are used as input for MSVR respectively, and the method with the least error and the optimal feature subset could be obtained.
The model and algorithm of this paper can be divided into two parts. One is the prediction model based on MSVR; the other is the feature selection algorithm based on correlation measurement and the complex network, as shown in Figure 2.

Correlation Measurement
Suppose that $C(x_i, x_j)$ is the correlation coefficient between $x_i$ and $x_j$. In this paper, MIC is used to measure the linear and nonlinear correlation between attributes. In order to verify the representation effect of MIC, the Pearson coefficient between attributes is also calculated for modeling.
Create the correlation matrix $C$ from the correlation coefficients between features, and construct the complex network that characterizes the correlation between features. This matrix is symmetric, with ones on the diagonal.
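As an illustration of how such a matrix could be assembled, the sketch below fills a symmetric matrix with unit diagonal from any pairwise correlation function. The Pearson coefficient is used here only because it is short to write; a MIC estimator could be passed in its place. The helper names `pearson` and `correlation_matrix` and the toy columns are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns, corr):
    """Symmetric matrix C with C[i][i] = 1 and C[i][j] = corr(column i, column j)."""
    m = len(columns)
    C = [[1.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            C[i][j] = C[j][i] = corr(columns[i], columns[j])
    return C

# Toy feature columns (illustrative values, not plant data).
columns = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
C = correlation_matrix(columns, pearson)
```

Computing only the upper triangle and mirroring it keeps the matrix exactly symmetric and halves the number of correlation evaluations, which matters when the correlation function is as expensive as MIC.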

The Clustering Model Based on the Complex Network and CONCOR Algorithm
A complex network is constructed from the correlation matrix $C$, and the CONCOR algorithm is employed to build the block model. Starting from the initial correlation matrix, the CONCOR algorithm iteratively calculates the Pearson correlation coefficients of the correlation matrix and carries out the hierarchical clustering. The flow of the algorithm is shown in Algorithm 1. After CONCOR, the partition of features is obtained. Define the subgroups as $G = \{g_1, g_2, \ldots, g_t\}$, where $t$ is the number of subgroups, and $g_i = \{x_1, x_2, \ldots, x_j\}$, $i \le t$, $j \ll a + b + c$, where $j$ is the number of features in the subgroup $g_i$.
Algorithm 1
Input: the correlation matrix C1 and the partition level at which any pair of actors is aggregated.
Output: C2, which denotes the correlation coefficient matrix of C1, and the blocks, represented as a clustering dendrogram under different levels.
Step 1: Calculate C2, the Pearson correlation coefficient matrix of C1.
Step 2: The blocks are given for each level at which any pair of actors is aggregated. Carry out the hierarchical clustering from the maximum level, and combine the two features with the highest similarity. The similarities of partitions at the same level should all reach one corresponding value, and one feature can only exist in one group.
Step 3: Reduce the level by 1, which means reducing the corresponding similarity value of the clusters, and look among the unclustered features for those with the highest similarity to the clustered partitions; these features can form a cluster by themselves or be added to an existing partition.
Step 4: Iterate Step 3 until level = 1, when all features enter the same group.
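The core idea of CONCOR, repeatedly correlating the columns of a correlation matrix until the entries settle near +1 or -1 and then splitting on the sign, can be sketched as follows. This is the classic top-down bipartition form of CONCOR and is offered only as an illustration; it is not a line-by-line implementation of Algorithm 1, and the function names, the recursion on sub-matrices, and the `depth` parameter are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either sequence has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx * sy > 0 else 0.0

def column_correlations(M):
    """Matrix of Pearson correlations between the columns of M."""
    cols = list(zip(*M))
    return [[pearson(a, b) for b in cols] for a in cols]

def concor_split(C, max_iter=100, tol=1e-6):
    """One bipartition: iterate correlations until all entries approach +/-1,
    then split the nodes by the sign of their correlation with node 0."""
    M = C
    for _ in range(max_iter):
        M = column_correlations(M)
        if all(abs(abs(v) - 1.0) < tol for row in M for v in row):
            break
    first = [i for i, v in enumerate(M[0]) if v > 0]
    second = [i for i, v in enumerate(M[0]) if v <= 0]
    return first, second

def concor(C, labels, depth=1):
    """Recursively split blocks `depth` times (assumption: each block is
    re-clustered on its own sub-matrix); returns a list of label groups."""
    if depth == 0 or len(labels) < 3:
        return [labels]
    first, second = concor_split(C)
    if not first or not second:
        return [labels]
    blocks = []
    for part in (first, second):
        sub = [[C[i][j] for j in part] for i in part]
        blocks.extend(concor(sub, [labels[i] for i in part], depth - 1))
    return blocks
```

Each additional level of `depth` at most doubles the number of blocks, so the successive splits can be read as the levels of a clustering dendrogram.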

Feature Evaluation
Given the different partitions of different levels and the similarity complex network, we comprehensively evaluate the nodes in each subgroup for feature selection with four centrality and influence indicators that we mentioned before. What should be pointed out is that the measure of degree centrality is based on the weighted matrix which is the initial correlation matrix, while the measures of betweenness, closeness, and Katz centrality are based on the unweighted matrix which is the binarization of initial correlation matrix. Suppose that n denotes the number of nodes in the network.

Degree Centrality
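The absolute degree $C_{AD}(x_i)$ is the sum of the weights $w_{ij}$ between node $x_i$ and all other nodes, $C_{AD}(x_i) = \sum_{j \ne i} w_{ij}$, and the relative degree $C_{RD}(x_i)$ is the absolute degree divided by the maximum possible degree $(n - 1)$, that is, $C_{RD}(x_i) = C_{AD}(x_i)/(n - 1)$.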
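A direct translation of these two definitions for a weighted matrix might look like the sketch below. The function name and the toy weights are hypothetical; the diagonal is skipped on the assumption that self-correlations should not count toward a node's degree.

```python
def degree_centrality(W):
    """Absolute and relative degree from a symmetric weight matrix W.
    Assumption: diagonal entries (self-correlation) are excluded."""
    n = len(W)
    absolute = [
        sum(w for j, w in enumerate(row) if j != i) for i, row in enumerate(W)
    ]
    relative = [a / (n - 1) for a in absolute]
    return absolute, relative

# Toy weighted matrix (illustrative values only).
W = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
absolute, relative = degree_centrality(W)
```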

Betweenness Centrality
Define $g_{jk}$ as the number of geodesics between nodes $x_j$ and $x_k$, and $g_{jk}(x_i)$ as the number of those geodesics that pass through node $x_i$. The absolute betweenness centrality $C_{AB}(x_i) = \sum_{j<k} g_{jk}(x_i)/g_{jk}$ is the sum, over all pairs of nodes, of the probabilities that node $x_i$ lies on a shortest path between them. The relative betweenness $C_{RB}(x_i)$ is the absolute betweenness divided by the maximum possible betweenness $(n^2 - 3n + 2)/2$.
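For an unweighted, undirected network, these quantities can be computed with breadth-first searches that count shortest paths. The sketch below uses Brandes' accumulation scheme, which is one standard way of doing this and not necessarily the procedure used by the authors; the adjacency-dictionary input format and the function name are assumptions, and at least three nodes are required so that the normalizer is nonzero.

```python
from collections import deque

def betweenness_centrality(adj):
    """adj: dict mapping node -> set of neighbours (undirected, unweighted, n >= 3).
    Returns (absolute, relative) betweenness as dicts."""
    nodes = list(adj)
    n = len(nodes)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {v: 0 for v in nodes}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes inward.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    absolute = {v: score[v] / 2 for v in nodes}  # each pair was counted twice
    max_pairs = (n * n - 3 * n + 2) / 2
    relative = {v: absolute[v] / max_pairs for v in nodes}
    return absolute, relative
```

In a star-shaped network the central node lies on every shortest path between the leaves, so its relative betweenness reaches the maximum value of 1 while every leaf scores 0.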

Closeness Centrality
Define $\mathrm{Farness}(x_i)$ as the sum of the geodesic distances between node $x_i$ and all other nodes, where $d_{ij}$ is the geodesic distance between nodes $x_i$ and $x_j$. The absolute closeness centrality $C_{AP}(x_i)$ is the reciprocal of $\mathrm{Farness}(x_i)$. The relative closeness centrality $C_{RP}(x_i)$ is $C_{AP}(x_i)$ divided by the maximum possible closeness $1/(n - 1)$.
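Both path-based indicators can be obtained from breadth-first searches on the binarized network. The sketch below uses the standard Brandes accumulation for betweenness, which the text does not name, so it should be read as one possible implementation. It assumes a connected, undirected, unweighted graph with more than two nodes, given as a hypothetical adjacency dictionary.

```python
from collections import deque


def bfs(adj, source):
    """Distances, shortest-path counts, predecessors and visit order from source."""
    dist = {source: 0}
    sigma = {v: 0 for v in adj}
    sigma[source] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:  # v lies on a geodesic to w
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def closeness_and_betweenness(adj):
    """Relative closeness and relative betweenness of every node."""
    n = len(adj)
    closeness = {}
    betweenness = {v: 0.0 for v in adj}
    for s in adj:
        dist, sigma, preds, order = bfs(adj, s)
        closeness[s] = (n - 1) / sum(dist.values())  # (1/farness) / (1/(n-1))
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate pair dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                betweenness[w] += delta[w]
    max_betweenness = (n * n - 3 * n + 2) / 2
    # Each unordered pair was counted from both endpoints, hence the factor 2.
    relative = {v: b / 2 / max_betweenness for v, b in betweenness.items()}
    return closeness, relative


# Hypothetical star network: node 0 is linked to nodes 1, 2 and 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
close, betw = closeness_and_betweenness(star)
```

In the star example the hub lies on every geodesic between two leaves, so it receives the maximum relative betweenness and closeness, while the leaves have zero betweenness.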

Katz Centrality
Katz centrality measures influence by considering the direct and indirect support or attention between nodes. Define $S$ as a matrix of 0s and 1s that reflects the direct connections between actors, that is, paths of length 1: $S_{ij} = 1$ denotes that actor $j$ connects to actor $i$ directly. The sum of column $j$ represents the total number of times that actor $j$ connects to other actors by a path of length 1. Similarly, $(S^2)_{ij}$ is the number of paths of length 2 that connect actors $i$ and $j$, $(S^3)_{ij}$ the number of paths of length 3, and so on. Since the higher the power of the matrix $S$, the weaker the influence, an attenuation factor $\alpha$ is introduced to characterize this behavior. The value of $\alpha$ depends on the situation, with $1/\alpha \in (b, 2b)$. When $\alpha = 0$, the influence decays completely, and when $\alpha = 1$, it does not decay. For a matrix whose elements are nonnegative, a simple upper limit of the maximum eigenvalue $b$ is the maximum row sum.
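A minimal sketch of this attenuated path count is given below. It sums the column totals of $\alpha S + \alpha^2 S^2 + \dots$ term by term. The function name, the stopping tolerance, and the default choice of $1/\alpha = 1.5\,b$ (the midpoint of the interval, with $b$ taken as the maximum row sum) are assumptions for illustration, not values stated in the text. The matrix is assumed to contain at least one nonzero entry.

```python
def katz_centrality(s, alpha=None, tol=1e-12, max_iter=1000):
    """Column sums of alpha*S + alpha^2*S^2 + ... for a 0/1 matrix s."""
    n = len(s)
    if alpha is None:
        bound = max(sum(row) for row in s)  # upper limit of the largest eigenvalue
        alpha = 1.0 / (1.5 * bound)         # assumption: 1/alpha at the midpoint of (b, 2b)
    term = [1.0] * n                        # row vector of ones
    score = [0.0] * n
    for _ in range(max_iter):
        # term <- alpha * (term @ S): attenuated count of paths one step longer
        term = [alpha * sum(term[i] * s[i][j] for i in range(n)) for j in range(n)]
        score = [a + t for a, t in zip(score, term)]
        if max(term) < tol:
            break
    return score


# Hypothetical star network: actor 0 is linked to actors 1, 2 and 3.
star_matrix = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
katz = katz_centrality(star_matrix)
```

Because $1/\alpha$ exceeds the largest eigenvalue, the series converges and the loop stops once the newest term is negligible. In the star example the hub obtains a higher score than the leaves.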
Define P = [Degree, Betweenness, Closeness, Katz]. In order to eliminate the influence of dimension, we sort the values of each of the four indicators and obtain four ranking values that measure the comprehensive centrality and influence of each node. Define $R = [R_D, R_B, R_C, R_K, R_T]$, where $R_D$, $R_B$, $R_C$, and $R_K$ represent the ranking values on the four centrality indicators and $R_T$ represents the total ranking.
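The ranking step can be sketched as follows. The total ranking is taken here as the sum of the four individual ranks, which is consistent with the case study (ranks 1, 9, 1, and 1 give a total of 12). The tie-breaking rule, the function names, and the indicator values are assumptions for illustration.

```python
def rank_descending(values):
    """Rank 1 goes to the largest value; ties keep their original order (assumption)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def total_ranking(indicators):
    """indicators: dict mapping an indicator name to a list of node scores."""
    ranks = {name: rank_descending(vals) for name, vals in indicators.items()}
    n = len(next(iter(indicators.values())))
    total = [sum(r[i] for r in ranks.values()) for i in range(n)]
    return ranks, total


# Hypothetical indicator values for three nodes.
scores = {
    "degree": [0.9, 0.5, 0.7],
    "betweenness": [0.1, 0.4, 0.2],
    "closeness": [0.8, 0.6, 0.7],
    "katz": [0.9, 0.3, 0.5],
}
ranks, r_total = total_ranking(scores)
```

A smaller total indicates a node that ranks near the top on most indicators.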

Feature Selection
Suppose that $R^i_T = [R^i_{T_1}, R^i_{T_2}, \ldots, R^i_{T_p}]$ is the total ranking vector of the features in the subgroup $g_i$, where $p$ denotes the number of features in $g_i$. Select the feature with the top total ranking as the subgroup representative, namely $R^i_{T_q} = \min R^i_T$. In this way, explore all subgroups and obtain the feature set $\{x_1, x_2, x_3, \ldots, x_t\}$, where $t$ is the number of subgroups.
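The selection rule amounts to taking the minimum of the total ranking within each subgroup. A short sketch, with hypothetical feature indices and ranking values:

```python
def select_representatives(partitions, total_rank):
    """Return, for each subgroup, the feature with the smallest total ranking value."""
    return [min(group, key=lambda f: total_rank[f]) for group in partitions]


# Hypothetical partition of five features into two subgroups.
partitions = [[0, 1], [2, 3, 4]]
total_rank = {0: 12, 1: 30, 2: 50, 3: 41, 4: 47}
selected = select_representatives(partitions, total_rank)
```

If two features in a subgroup share the same total ranking, `min` keeps the first one listed, which is an arbitrary choice in this sketch.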

Mechanical Property Prediction Based on MSVR
The above work yields feature selection results based on the MIC and on the Pearson correlation characterization, respectively. Moreover, in order to verify the effect of our proposed method, the empirical subset and the full-parameter subset are used for comparative experiments. We use these four feature sets to construct the data sets for MSVR modeling, divide each into a training set and a test set, and perform cross-validation to verify the error. Note that even with the same correlation characterization, different partition levels yield different feature selection results and therefore different MSVR prediction results.

Case Study and Discussion
In order to test the feasibility and efficiency of the proposed prediction model, we collected a total of 1607 data samples covering the whole production process from a steel company in Central China and verified our model. The product is cold-rolled strip, and the steel grades selected in our experiment include DR01, DR02, DR04, DR06, DX51, DX52, DX53, SPCC, SPCD, SPCE, SPCF, SPCG, etc. The data come from four main processes: smelting, continuous casting, hot rolling, and cold rolling. The 211 original parameters influence each other and contain many linear and nonlinear relationships. The screening process is as follows: calculate the MIC values between the original parameters and the three mechanical properties, and remove the parameters that are essentially irrelevant to the properties, namely those whose MIC with the properties is less than 0.05. Finally, a total of 111 process parameters were obtained as the full-parameter subset. The number of parameters in each process stage is shown in Table 1 and Table 2.
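The screening rule can be sketched as a simple threshold filter on precomputed MIC values. The sketch assumes that a parameter is removed only when its MIC with all three properties is below 0.05, which is one reading of the rule. The parameter names and MIC values are hypothetical, and the MIC computation itself is not shown.

```python
def filter_by_mic(mic_to_properties, threshold=0.05):
    """Keep parameters whose MIC with at least one property reaches the threshold."""
    return [
        name
        for name, mic_values in mic_to_properties.items()
        if max(mic_values) >= threshold
    ]


# Hypothetical MIC values with (yield strength, tensile strength, elongation).
mic_table = {
    "param_a": (0.31, 0.28, 0.12),
    "param_b": (0.02, 0.04, 0.01),
    "param_c": (0.03, 0.09, 0.02),
}
kept = filter_by_mic(mic_table)
```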

Correlation Calculation and Partition Results
The distribution of MIC values among the 111 process parameters is shown in Figure 3. Nearly 50% of the MIC values are greater than 0.43 and 34 values are greater than 0.8, which indicates that there are non-negligible correlations between these features. It is necessary to mine these relationships and remove redundant features, so as to clarify the nature of the relationships between features and simplify the input data set of subsequent modeling.
We construct a complex network based on the MIC matrix and apply CONCOR to build a block model. The initial clustering level is set to 4, and Figure 4 shows the number of partitions under different clustering levels. The number of partitions gradually increases as the clustering level rises, and the clustering stops at level 9, where the number of partitions reaches its maximum of 71.
Combined with the partition results shown in Figure 5 for clustering levels 4 to 9, it can be seen that the higher the level, the more partitions there are. This is because each lower level of clustering builds on the level above it: lowering the level relaxes the similarity required within a group, which enlarges the partitions and therefore reduces their number. The first clustering is at level 9; the next clustering builds on level 9, expanding the members of each group and reducing the number of partitions. When the clustering level is 1, all features are in the same partition.
The clustering process of features 1, 8, 9, 72, and 73 is shown in Figure 6, and the MIC values between these 5 parameters are shown in Table 3.
Figure 6. Clustering process of features 1, 8, 9, 72, and 73 (the number in the ellipse is the maximum information coefficient (MIC) value between related parameters).

Feature Evaluation and Selection
As mentioned above, four centrality and influence indicators are selected to evaluate the importance of each parameter, and the parameters are ranked on each indicator. Table 4 shows the 20 features with the highest total ranking and their respective rankings on the four indicators. Table 5 shows detailed information on these 20 features, including the feature name and the cluster number, using the MIC-based model at level 4; the last column indicates whether the feature is selected in its cluster. The five rankings are highly related: parameters with high total rankings tend to rank near the top on each of the four separate indicators. For example, the process parameter "TI" ranks 1, 9, 1, and 1 on degree, betweenness, closeness, and Katz centrality, respectively, and its total ranking is 12, which means that this feature has greater influence and is the most representative in its partition.
Finally, feature selection is based on the partition results and the feature evaluation results. At each clustering level, the centrality and influence rankings of the features in each partition are compared, and the top-ranked feature is selected as the representative of the partition and as a member of the selected feature subset. For example, when the clustering level is 4, the 111 features are divided into 16 subgroups. The feature distribution of the first subgroup $g_1$ is shown in Figure 7.
$R^1_T$ = {…, 160, 226, 247, 175, 191, 183, 187, 191, 187, 167, 159, 165, 194, 196}
There are 15 features in subgroup $g_1$, and Figure 8 shows the total ranking scatter diagram of each feature. The rankings within the subgroup are relatively concentrated in the range [150, 200], which also supports the rationality of the clustering, that is, similar features should have similar rankings. The top-ranked feature is feature 93, whose total ranking is 159, so feature 93 is selected as the representative feature of subgroup $g_1$ and added to the final feature subset.
The rest can be deduced by analogy. Select the representative features of all partitions at level 4, and then move on to the other clustering levels. Finally, the feature subsets at levels 4 to 9 are obtained, as shown in Table 6. In addition, the clustering at level 9 is the most concise, with the fewest features and the highest correlation in each partition. Therefore, the representative features selected at this level can be expected to have the best prediction effect, which is confirmed in the following section.

MSVR Property Prediction Model
According to the feature selection results, the original sample data are divided into a training set and a test set at a ratio of 8:2 to train the MSVR model. Three mechanical properties are selected: lower yield strength, tensile strength, and elongation. The mean absolute percentage error (MAPE) is chosen as the evaluation index of the proposed algorithm. We calculate the three MAPE values and their average to represent the prediction accuracy. Four parameter sets are used as input in the comparison experiment: the MIC-based subset, the Pearson-based subset, the full-parameter subset, and the empirical subset.
Figure 9 compares the MAPE of the MIC-based subset, the full-parameter subset, and the empirical subset. Starting from level 5, all four prediction errors of our proposed algorithm (the MAPE values of the three mechanical properties and their average) are lower than those of the other two input sets. In addition, as the number of selected features increases from level 4 to level 9, the growth rate slows down and the prediction error decreases. At level 9, the number of features reaches its maximum of 71, while the four MAPE values all reach their lowest.
As shown in Figure 10, when the clustering level is 9, the prediction error of the optimal feature subset is significantly lower than that of the full-parameter and empirical subsets. Therefore, it can be concluded that the feature selection method proposed in this paper can select a small number of parameters from the full-parameter set to represent the whole, with a better prediction effect.
Figure 9. Error comparison between MIC-based feature selection and full-parameter, empirical subset modeling.
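The evaluation index used above can be computed as in the following sketch, which returns the MAPE of each mechanical property and their average. The function names, property keys, and numeric values are hypothetical, and the measured values are assumed to be nonzero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


def mape_report(actual_by_property, predicted_by_property):
    """MAPE of each property and the average over all properties."""
    per_property = {
        name: mape(actual_by_property[name], predicted_by_property[name])
        for name in actual_by_property
    }
    average = sum(per_property.values()) / len(per_property)
    return per_property, average


# Hypothetical measured and predicted values for two test samples.
measured = {
    "yield_strength": [100.0, 200.0],
    "tensile_strength": [400.0, 500.0],
    "elongation": [40.0, 50.0],
}
predicted = {
    "yield_strength": [110.0, 180.0],
    "tensile_strength": [380.0, 525.0],
    "elongation": [40.0, 50.0],
}
per_property, average = mape_report(measured, predicted)
```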

MSVR Prediction Results at Different Clustering Levels
In order to verify that the MIC-based feature selection method characterizes the nonlinear correlations between features more reasonably, we use the Pearson coefficient matrix to represent the initial correlation and compare the prediction error under the two correlation measures. The prediction error of "lower yield strength" (left) and the average error of the three mechanical properties (right) are shown in Figure 11. The overall error of the MIC-based feature selection method is lower than that of the Pearson-based method. As the clustering level increases, the difference in prediction accuracy between the two methods gradually becomes smaller. When the clustering level is 4, the prediction accuracy of the MIC method is 1.69% higher than that of the Pearson method and only one feature (feature 10) appears in both subsets, which means that the two similarity measures are quite different, with MIC performing better. Compared with the Pearson coefficient, MIC can thus capture both linear and nonlinear relationships between process parameters.
To sum up, the feature selection method based on MIC and complex network clustering can represent the global situation with fewer features and achieves a better prediction effect than the full-parameter subset. At the same time, compared with the empirical subset and the similarity measurement based on the Pearson coefficient, our model has higher prediction accuracy. It should be pointed out that whether the feature selection is based on MIC or on the Pearson coefficient, the prediction accuracy is higher than that of the full-parameter subset, which indicates that the original data set contains many linear and nonlinear relationships. If these are well mined and analyzed, the difficulty of subsequent modeling can be greatly reduced.