Metaheuristics and Support Vector Data Description for Fault Detection in Industrial Processes

Abstract: In this study, a fault detection system that combines Support Vector Data Description (SVDD) with metaheuristic algorithms is presented. The approach is applied to a real industrial process in which measured faults are scarce. The original contribution of this work is the industrial context of application and the comparison of swarm intelligence algorithms for optimizing the SVDD hyper-parameters. Four recent metaheuristics are compared in order to solve the corresponding optimization problem efficiently. These optimization techniques are then implemented for fault detection in a multivariate industrial process with imbalanced data. The numerical results are promising when the considered optimization techniques are combined with SVDD. In particular, the Spotted Hyena algorithm outperforms the other metaheuristics, reaching F1 score values near 100% in fault detection.


Introduction
Currently, machine learning and nature-inspired algorithms are being applied in several research fields to obtain optimal results. Some real applications include medical diagnosis based on the patient's symptoms [1], fraud detection in economic transactions [2], identification of investment patterns in order to buy and sell more efficiently [3], and image detection to predict city traffic, machine failure, and the design of autonomous vehicles, among others [4]. There are previous studies on fault detection and fault diagnosis based on Dynamic Weight Principal Component Analysis (PCA) [5], principal polynomial analysis [6], PCA and a Bayesian network [7], deep convolutional neural networks [8], and a Hidden Markov Model with a Bayesian network [9], among others. However, some of these approaches require statistical assumptions about the distribution of the process data. Support Vector Data Description (SVDD) [10] and Artificial Neural Network (ANN) algorithms share the same concept of using a linear learning model for pattern recognition. An ANN is trained with the gradient descent learning algorithm, which may converge to a local minimum, and it suffers from overfitting problems. On the other hand, SVDD tends to find a global solution during training, since the complexity of the model is taken into account as a structural risk in the SVDD formulation. An ANN minimizes only the empirical risk learned from the training samples, whereas SVDD considers both the empirical risk and the structural risk. Thus, SVDD training results show better generalization capability than those obtained with ANN.
In industry, machine learning classifiers are mainly implemented for fault detection. In [11,12], reviews of machine learning applications in the manufacturing industry are presented. The use of artificial neural networks in the modeling and optimization of processes is emphasized, as well as the application of SVM for quality assessment in industry. Another application of SVM, for early failure prediction in the oil and gas industry, is shown in [13]. In [14], a monitoring platform using an Artificial Neural Network and a Support Vector Machine is proposed and applied to performance prediction and health diagnosis of aeronautical engines. Nevertheless, these conventional algorithms are designed to perform two-class or multi-class classification tasks. Because of this, data containing information on all classes of the process are required. In practice, however, much more information is often available for one class than for the others. For example, if a relatively new machine is in operation, it is very likely that only the data corresponding to normal operation are available. In this study, a method for fault detection is proposed that can deal with processes or machines for which observations of faults are scarce in the training phase, by using the one-class classifier known as Support Vector Data Description (SVDD) [10]. The choice of hyper-parameters is one of the most important issues in SVDD. These parameters are associated with the hyper-sphere as well as with the kernel function chosen for the classification. In this study, these hyper-parameters are simultaneously optimized using metaheuristic techniques. Furthermore, a comparison among four optimization techniques is reported in order to evaluate the quality of the obtained results. The results are satisfactory when applied to fault detection, and they are obtained at low computational cost.
There are some studies covering different applications of the SVDD algorithm [15][16][17]; most of them optimize the hyper-parameters using approaches like grid search, which is computationally expensive. To obtain these parameters more efficiently, some authors have considered metaheuristic algorithms. Particle swarm optimization (PSO) and genetic algorithms are often found in the literature. Ref. [18] presents a brief general description of hyper-parameter optimization in Support Vector Machines (SVM). In this context, the ant colony algorithm is chosen in [19] for feature selection and parameter optimization in SVM for fault diagnosis. Ref. [20] used genetic algorithms to optimize the parameters of the kernel function. The optimization of the hyper-sphere parameters together with those associated with the Gaussian radial basis kernel function is studied in [21] using grid search. Ref. [22] obtained optimal results by combining grid search and PSO.
In the last few decades, researchers have developed several nature-inspired optimization algorithms that mimic biological behaviors or physical phenomena. Techniques based on swarm intelligence mimic the socially intelligent behavior of groups of species. These search algorithms start with a group of randomly generated solutions, generally called a population, which evolves over successive generations so that the population improves throughout the iterations. In this study, the performance of four of these metaheuristics in calculating the hyper-parameters is compared. Besides PSO, which has been widely used for this purpose, the algorithms considered in this study have demonstrated their efficiency in solving optimization problems applied to engineering. They are the Spotted Hyena Optimization (SHO) algorithm, the Krill Herd (KH) algorithm, and the Squirrel Search Algorithm (SSA).
The SHO is a recent metaheuristic based on swarms which mimics the social behavior of these animals in nature. Among the vast amount of metaheuristics, SHO has shown great advantages in the exploration of diverse search spaces compared to other state-of-the-art approaches [23]. SHO has been implemented to solve a wide range of engineering problems like prediction of materials resistance during cutting processes [24]. SHO is used in combination with neural networks to improve the prediction ability of these algorithms. In [25], two complex engineering problems with restrictions like the design of bar armor and the design of multiple disk clutch brakes are solved. The results show the efficiency of SHO in solving these problems compared to other optimization algorithms like PSO and ACO. It showed its applicability in environments of a high dimension with a low computational cost. Ref. [26] implements SHO for design optimization in aerodynamic surface and optical buffer problems where the obtained numerical results are then compared with algorithms like Grey Wolf Optimizer, Genetic Algorithm, and PSO, among others. Likewise, the binary version of SHO has been developed and used to build wrapping approaches for feature selection in different sets of UCI data [27]. Moreover, SHO has been hybridized with other algorithms like PSO to improve its abilities to solve various engineering problems [28]. Recently, Ref. [29] implements an SHO version able to deal with multi-objective problems for the prediction of characteristics in gene selection through its combination with machine learning algorithms like SVM.
The KH algorithm is a recent nature-based algorithm inspired by the individual krill herding behavior. This algorithm was introduced in [30]. The KH algorithm works to achieve the minimum distance between individual krill and the nearest food. This algorithm has been successfully used in solving many problems of numerical optimization, electric and power system problems, text grouping, breast cancer detection, and the training of neural networks [31][32][33]. More recently, the possibility of applying this algorithm for the grouping of text documents is studied in [34], while the same problem is addressed in [35] using a hybrid algorithm based on KH. The grouping based on the KH algorithm is proposed in [36] for the network of wireless sensors. A method for diagnosing bearing failure based on KH and a kernel extreme learning machine is proposed in [37]. An improvement of KH applied to fault diagnosis with a Support Vector Machine to solve the power transformers' fault diagnosis problem based on the analysis of dispersed gases is presented in [38].
The SSA is a metaheuristic approach based on the behavior of flying squirrels, a diversified group of nocturnal tree rodents that are highly adapted to gliding locomotion. The SSA mimics the dynamic food search behavior of flying squirrels and their form of locomotion, known as gliding, which is a very effective mechanism for traveling long distances. The algorithm has been recently proposed [39,40]. Moreover, it has shown good performance in some applications. For example, in [41], a multi-objective method based on SSA is applied to optimize a backpropagation artificial neural network (BPNN) in order to tune the main parameters of a continuous galvanization process for advanced DP steels. In [42], SSA is used to optimize a complex problem in which combined heat and energy distribution for various regions is modeled integrating renewable energy sources. In [43], a hybrid algorithm based on the combination of SSA and invasive weed optimization is proposed. This algorithm is combined with the Support Vector Machine and the deterministic maximum likelihood algorithm to classify air quality levels. Recently, a chaotic SSA variant for the optimal scheduling of multiple tasks in an infrastructure-as-a-service cloud environment was reported in [44].
The main contribution of this paper is a fault detection system that combines SVDD with optimization methods, namely SHO, KH, SSA, and the well-known PSO. The effectiveness of these metaheuristics in optimizing the parameters of the hyper-sphere, as well as those associated with the Gaussian radial basis kernel used in SVDD, is compared. As will be shown, promising results are obtained when the mentioned swarm intelligence algorithms are combined with SVDD and applied to fault detection in a real industrial problem.
The rest of the paper is organized as follows: theoretical background is presented in Section 2, the proposed methodology for fault detection is described in Section 3, and this is followed by the industrial application section and finally some conclusions and future work are discussed in the last section.

Support Vector Data Description
Tax and Duin [10] proposed the SVDD classification method, which determines a closed boundary around the data set of a given class: a hyper-sphere characterized by a center $a$ and a radius $R \geq 0$ that separates the inner region with high data density from the outer region with low density. The data that lie exactly on the boundary of the hyper-sphere are called the support vectors, while those outside are the outliers. Let $\{x_i : i = 1, 2, 3, \ldots, N\}$, with $\|x\|^2 = x \cdot x$, be the set of column vectors forming the training set for which a description must be specified, and assume that the $x_i$ show variances in all given directions. The hyper-sphere is obtained by minimizing an error function that reduces the possibility of accepting outliers. Following [10], this function is defined as:
$$\min_{R,\,a,\,\xi} \; F(R, a, \xi) = R^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad \|x_i - a\|^2 \leq R^2 + \xi_i, \;\; \xi_i \geq 0, \qquad (1)$$
where $\xi_i$ are the slack variables that penalize the largest distances, and $C$ is a control parameter of the trade-off between the volume of the hyper-sphere and the errors [45]. Applying the Lagrange multipliers method, the following equation arises:
$$L(R, a, \xi, \alpha, \gamma) = R^2 + C \sum_{i} \xi_i - \sum_{i} \alpha_i \left[ R^2 + \xi_i - \left( \|x_i\|^2 - 2\, a \cdot x_i + \|a\|^2 \right) \right] - \sum_{i} \gamma_i \xi_i, \qquad (2)$$
with $\alpha_i \geq 0$ and $\gamma_i \geq 0$ as Lagrange multipliers. The dual formulation of (2) can be obtained by solving the KKT conditions, which reads
$$\max_{\alpha} \; \sum_{i} \alpha_i\, (x_i \cdot x_i) - \sum_{i,j} \alpha_i \alpha_j\, (x_i \cdot x_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_{i} \alpha_i = 1.$$
In case a given $x_i$ satisfies $\|x_i - a\|^2 < R^2 + \xi_i$, then $\alpha_i = 0$; otherwise, when $x_i$ satisfies $\|x_i - a\|^2 = R^2 + \xi_i$, the corresponding Lagrange multiplier $\alpha_i$ is strictly greater than zero.
The vectors $x_i$ corresponding to $\alpha_i > 0$ are the only ones needed to characterize the data set, and they are called the support vectors of the description [46]. When there is a new vector $z$, its distance to the center of the sphere can be computed. If this distance is smaller than or equal to $R$, $z$ is accepted as a new vector in the description of the data, that is,
$$\|z - a\|^2 = (z \cdot z) - 2 \sum_{i} \alpha_i\, (z \cdot x_i) + \sum_{i,j} \alpha_i \alpha_j\, (x_i \cdot x_j) \leq R^2.$$
Note that, if the inner product in Equation (2) is replaced with a kernel function $K(x_i, x_j)$, a description for nonlinear data sets can then be obtained. In this manner, data are mapped to a higher-dimensional feature space by means of the kernel function, and this makes the nonlinear data separable [47]. Thus, the problem can be reformulated as follows:
$$\max_{\alpha} \; \sum_{i} \alpha_i\, K(x_i, x_i) - \sum_{i,j} \alpha_i \alpha_j\, K(x_i, x_j) \quad \text{subject to} \quad 0 \leq \alpha_i \leq C, \;\; \sum_{i} \alpha_i = 1,$$
where the $\alpha_i$ remain the Lagrange multipliers and $K(x_i, x_j)$ is a kernel function used as a functional mapping.
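To make the kernel description concrete, the following is a minimal Python sketch, not the implementation used in this study. It assumes a Gaussian kernel of the form exp(-||x - y||^2 / s^2) and restricts itself to the special case C >= 1, in which no slack variable is active. Under that assumption the dual can be approximated with a plain Frank-Wolfe iteration, which is a solver chosen here for brevity. The names `rbf`, `fit_svdd`, `dist2`, and `accepts` are hypothetical.

```python
import math

def rbf(x, y, s):
    # Gaussian kernel; the scaling exp(-||x - y||^2 / s^2) is an assumption.
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / s ** 2)

def dist2(z, model):
    """Squared kernel distance from z to the centre of the hyper-sphere."""
    cross = sum(a * rbf(z, x, model["s"])
                for a, x in zip(model["alpha"], model["X"]))
    return rbf(z, z, model["s"]) - 2.0 * cross + model["const"]

def fit_svdd(X, s, iters=500):
    """Approximate the dual multipliers for the special case C >= 1."""
    n = len(X)
    K = [[rbf(xi, xj, s) for xj in X] for xi in X]
    alpha = [1.0 / n] * n  # feasible start: non-negative, sums to 1
    for t in range(iters):
        # Gradient of sum_i a_i K_ii - sum_ij a_i a_j K_ij with respect to a_i.
        grad = [K[i][i] - 2.0 * sum(alpha[j] * K[i][j] for j in range(n))
                for i in range(n)]
        far = max(range(n), key=lambda i: grad[i])  # farthest training point
        step = 2.0 / (t + 2.0)                      # usual Frank-Wolfe step
        alpha = [(1.0 - step) * a for a in alpha]
        alpha[far] += step
    const = sum(alpha[i] * alpha[j] * K[i][j]
                for i in range(n) for j in range(n))
    model = {"X": X, "alpha": alpha, "s": s, "const": const}
    # Without slack, R^2 is the largest squared distance of a training point.
    model["R2"] = max(dist2(x, model) for x in X)
    return model

def accepts(z, model):
    """True if z falls inside the description (normal operation)."""
    return dist2(z, model) <= model["R2"]
```

For example, after `m = fit_svdd(X, s=2.0)` on a small cluster of points, `accepts(z, m)` compares the kernel distance of a new vector `z` with the fitted radius. Handling C < 1, where some training points are allowed outside the sphere, would require a solver that also enforces the upper bound on each multiplier.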

Spotted Hyena Optimizer (SHO)
Proposed by Dhiman and Kumar [48], SHO is a recent optimization technique that mimics the behavior of spotted hyenas when hunting. Spotted hyenas are social animals that hunt in groups of trusted friends and rely on their great ability to recognize prey. These groups can include up to 100 hyenas, which is why the hunting method tends to be very effective and yields results in short periods of time. The main stages of this algorithm are searching for, encircling, and attacking the prey, in addition to other prey-seeking behaviors of spotted hyenas. In SHO, there is a leading search agent, which is assumed to know the location of the prey. The other agents update their positions to form groups of friends around the leader. Next, the mathematical models corresponding to these stages are described.

Encircling Prey
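In this stage, the current best candidate solution is treated as the prey, and the other hyenas update their positions around it. Following [48], the mathematical model corresponding to this behavior is as follows:
$$\vec{D}_h = \left| \vec{B} \cdot \vec{P}_p(x) - \vec{P}(x) \right|, \qquad \vec{P}(x+1) = \vec{P}_p(x) - \vec{E} \cdot \vec{D}_h,$$
$$\vec{B} = 2 \cdot \vec{rd}_1, \qquad \vec{E} = 2 \vec{h} \cdot \vec{rd}_2 - \vec{h}, \qquad \vec{h} = 5 - \left( x \cdot \frac{5}{Max_{Iteration}} \right),$$
where $x$ is the current iteration, $\vec{P}_p$ is the position of the prey, $\vec{P}$ is the position of a hyena, $\vec{D}_h$ is the distance between them, and $\vec{B}$ and $\vec{E}$ are coefficient vectors. $\vec{h}$ decreases linearly from 5 to 0 over the maximum number of iterations. $\vec{rd}_1$ and $\vec{rd}_2$ are random vectors in [0, 1].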

Hunting
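Given that spotted hyenas hunt in "trusted friend" groups, the search agents must form clusters around the best agent. Following [48], the equations below model such behavior:
$$\vec{D}_h = \left| \vec{B} \cdot \vec{P}_h - \vec{P}_k \right|, \qquad \vec{P}_k = \vec{P}_h - \vec{E} \cdot \vec{D}_h, \qquad \vec{C}_h = \vec{P}_k + \vec{P}_{k+1} + \cdots + \vec{P}_{k+N},$$
where $\vec{P}_h$ defines the position of the first best hyena, $\vec{P}_k$ gives the position of the rest of the hyenas, $\vec{C}_h$ is the resulting cluster of solutions, and $N$ refers to the number of hyenas, calculated as follows:
$$N = count_{nos}\left( \vec{P}_h, \vec{P}_{h+1}, \vec{P}_{h+2}, \ldots, (\vec{P}_h + \vec{M}) \right),$$
where $\vec{M}$ is a random vector in [0, 1].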

Attacking the Prey
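Following [48], the mathematical formulation for attacking the prey reads
$$\vec{P}(x+1) = \frac{\vec{C}_h}{N},$$
where $\vec{C}_h$ is the cluster of the $N$ best solutions formed in the hunting stage, and $\vec{P}(x+1)$ saves the best solution and updates the positions of the other search agents according to the position of the best search agent.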

Searching for Prey (Exploration)
The vector $\vec{B}$ previously defined provides random values for the exploration during all iterations. Therefore, this mechanism effectively allows for avoiding local optima even in the final iterations.
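The stages described above can be illustrated with a short Python sketch. It is a simplified reading of SHO rather than the reference implementation of [48]: the group of trusted friends is formed here by pairing each agent with a fixed number of randomly chosen companions (the `cluster` parameter, which is an assumption), and positions are clipped to the search bounds. The function name `sho_minimize` and all default values are hypothetical.

```python
import random

def sho_minimize(f, dim, lo, hi, n_agents=20, n_iter=100, cluster=5, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_agents)]
    best = min(pop, key=f)[:]            # leader, assumed to know the prey
    for it in range(n_iter):
        h = 5.0 - it * (5.0 / n_iter)    # decreases linearly from 5 towards 0
        new_pop = []
        for agent in pop:
            # Hunting: the agent and random companions form a group (assumption).
            members = [agent] + rng.sample(pop, cluster - 1)
            c_h = [0.0] * dim
            for p_k in members:
                for d in range(dim):
                    b = 2.0 * rng.random()            # B = 2 * rd1
                    e = 2.0 * h * rng.random() - h    # E = 2h * rd2 - h
                    d_h = abs(b * best[d] - p_k[d])   # distance to the leader
                    c_h[d] += best[d] - e * d_h       # encircling move
            # Attacking: the new position is the group average, kept in bounds.
            new_pop.append([min(hi, max(lo, c / cluster)) for c in c_h])
        pop = new_pop
        cand = min(pop, key=f)
        if f(cand) < f(best):            # keep the best solution found so far
            best = cand[:]
    return best, f(best)
```

A call such as `sho_minimize(lambda p: sum(v * v for v in p), dim=2, lo=-10.0, hi=10.0)` returns the best position found and its objective value. Because `h` shrinks over the iterations, the moves around the leader become smaller, which shifts the search from exploration to exploitation.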

Krill Herd Algorithm (KH)
The KH algorithm is inspired by the behavior of krill, small crustaceans that live underwater. These crustaceans have the ability to form large swarms to avoid predators. The fitness function in the KH algorithm used to solve global optimization problems is based on the density of the swarm and the location of the food. Each krill migrates toward the area of highest density and, at the same time, continues to search for the places that contain the most food. Increasing density and foraging are thus the means by which the krill are eventually driven toward the global optimum.
During the moving process, each krill moves towards the best option based on three essential movements: (i) movement generated by other krill; (ii) food search activity; (iii) physical diffusion.
The equation describing the krill moving process is as follows:
$$\frac{dX_i}{dt} = N_i + F_i + D_i,$$
where $N_i$ is the motion induced by other krill, $F_i$ is the food search motion, and $D_i$ is the random diffusion of the ith krill individual. The direction of the induced motion, $\alpha_i$, is determined by the following parts: a target effect, a local effect, and a repulsive effect. For a krill individual, this movement can be defined as
$$N_i^{new} = N^{max} \alpha_i + \omega_n N_i^{old},$$
where $N^{max}$, $\omega_n$, and $N_i^{old}$ denote the maximum induced speed, the inertia weight, and the last motion, respectively.
The food searching motion is influenced by two components: the food location and the previous experience about the food location. For the ith krill, this motion can be expressed as follows:
$$F_i = V_i \beta_i + \omega_f F_i^{old},$$
where $V_i$ is the feeding speed, $\omega_f$ is the inertia weight, $F_i^{old}$ is the last feeding motion, and $\beta_i$ is the foraging direction, which combines the two components mentioned above [30]. The physical diffusion is essentially a random process. This motion can be computed based on a maximum diffusion speed $D^{max}$ and a random directional vector $\delta$ as follows:
$$D_i = D^{max} \delta.$$
The position in KH from $t$ to $t + \Delta t$ is given by the following equation:
$$X_i(t + \Delta t) = X_i(t) + \Delta t \, \frac{dX_i}{dt}.$$
The interested reader is referred to [30] for more detailed information about the KH algorithm.
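As an illustration of how the three motions combine into a position update, the following Python sketch implements a reduced variant of KH. It is not the full algorithm of [30]: the induced direction keeps only the attraction toward the best krill, the foraging direction keeps only the attraction toward a food location computed as an inverse-fitness-weighted centre of mass (which assumes a non-negative objective), and the inertia schedule and speed constants are assumed values. The name `kh_minimize` is hypothetical.

```python
import math
import random

def kh_minimize(f, dim, lo, hi, n_krill=20, n_iter=100,
                n_max=0.01, v_f=0.02, d_max=0.005, c_t=0.5, seed=0):
    rng = random.Random(seed)
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_krill)]
    N = [[0.0] * dim for _ in range(n_krill)]   # induced motion N_i
    F = [[0.0] * dim for _ in range(n_krill)]   # foraging motion F_i
    best = min(X, key=f)[:]
    dt = c_t * dim * (hi - lo)                  # time step scaled by the range

    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v] if norm > 0.0 else [0.0] * len(v)

    for it in range(n_iter):
        w = 0.9 - 0.8 * it / n_iter             # inertia weight (assumed schedule)
        # Food location: centre of mass weighted by inverse fitness (needs f >= 0).
        wts = [1.0 / (f(x) + 1e-12) for x in X]
        total = sum(wts)
        food = [sum(wt * x[d] for wt, x in zip(wts, X)) / total
                for d in range(dim)]
        for i in range(n_krill):
            alpha = unit([b - x for b, x in zip(best, X[i])])  # target effect only
            beta = unit([c - x for c, x in zip(food, X[i])])   # food attraction only
            for d in range(dim):
                N[i][d] = n_max * alpha[d] + w * N[i][d]
                F[i][d] = v_f * beta[d] + w * F[i][d]
                D = d_max * rng.uniform(-1.0, 1.0)             # random diffusion
                step = dt * (N[i][d] + F[i][d] + D)            # dX_i/dt times dt
                X[i][d] = min(hi, max(lo, X[i][d] + step))
        cand = min(X, key=f)
        if f(cand) < f(best):
            best = cand[:]
    return best, f(best)
```

The local and repulsive effects between neighbouring krill, as well as the genetic operators of the original method, are omitted here to keep the sketch short.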

Squirrel Search Algorithm SSA
Recently, Ref. [39] proposed an innovative nature-inspired algorithm for optimization, the Squirrel Search Algorithm, which has been very efficient in solving unconstrained numerical optimization problems. The algorithm mimics the strategies of flying squirrels in searching for food sources and escaping predators. Summer and winter phases are considered, since the motion dynamics differ depending on the season. This strategy allows the algorithm to escape from local minima, thus raising the likelihood of reaching the global optimum. The algorithm considers a certain number of flying squirrels in a forest. It is assumed that each squirrel is located on a tree. Each squirrel searches for food by gliding among trees looking for the best food source, and there are three types of trees: normal trees (no food), oak trees (acorn nuts food source), and hickory trees (hickory nuts food source). It is supposed that there is a population of $N$ flying squirrels in the forest: one on the hickory tree, $N_{fs}$ on acorn trees (with $1 \leq N_{fs} \leq N$), and the rest on normal trees. Each squirrel is represented by a vector with $D$ components corresponding to the dimension of the problem. Initially, the flying squirrels are placed at random positions to start the algorithm, and the location of the squirrels can be represented by the following expression:
$$FS = \begin{bmatrix} FS_{1,1} & FS_{1,2} & \cdots & FS_{1,D} \\ FS_{2,1} & FS_{2,2} & \cdots & FS_{2,D} \\ \vdots & \vdots & & \vdots \\ FS_{N,1} & FS_{N,2} & \cdots & FS_{N,D} \end{bmatrix}.$$
Since each row represents one squirrel, this matrix can be initialized randomly using uniformly distributed numbers in (0, 1), scaled between the lower and upper bounds of each dimension, $FS_L$ and $FS_U$, respectively. Evaluating the fitness at the location of each squirrel in the full matrix gives a fitness vector with the values of the objective function. This vector is sorted in ascending order in order to identify the best value, associated with the best food source (hickory tree, $F_h$), the next food sources (acorn trees, $F_a$), and the squirrels on normal trees ($F_n$), in such a manner that each squirrel can be identified.
When foraging squirrels do not run into a predator, they look for better food sources in the forest, which implies that $F_h$ remains unchanged. The destination of $F_a$ is $F_h$, and the destination of $F_n$ is chosen at random between $F_a$ and $F_h$. When they find a predator, they are forced to seek shelter in a random location. Their behavior can be mathematically described as follows:
Case 1. The flying squirrels on acorn trees move toward the hickory tree, according to the following equation:
$$F_a^{t+1} = \begin{cases} F_a^{t} + d_g \, G_c \, (F_h^{t} - F_a^{t}), & \text{if } r \geq P_{dp} \\ \text{random location}, & \text{otherwise} \end{cases}$$
Case 2. Some of the flying squirrels on normal trees move to the acorn trees looking for better food, and some that have already been fed move to the hickory tree in order to store food. The new locations read:
$$F_n^{t+1} = \begin{cases} F_n^{t} + d_g \, G_c \, (F_a^{t} - F_n^{t}), & \text{if } r \geq P_{dp} \\ \text{random location}, & \text{otherwise} \end{cases} \qquad F_n^{t+1} = \begin{cases} F_n^{t} + d_g \, G_c \, (F_h^{t} - F_n^{t}), & \text{if } r \geq P_{dp} \\ \text{random location}, & \text{otherwise} \end{cases}$$
where $r \in (0, 1)$ is a random number, $P_{dp}$ represents the predator appearance probability, $t$ is the current iteration, $G_c$ is a constant, and $d_g$ is the gliding distance. The detailed calculation of these parameters is introduced in [39]. The season changes that help the algorithm to escape from local optima are considered by calculating the season constant $S_c^t$ and its minimum value $S_{min}$, which are defined in [39] in terms of the maximum number of iterations $T$ and the current iteration $c$. The condition $S_c^t < S_{min}$ is then checked. If it holds, the flying squirrels are relocated to new random positions within the search bounds, following the relocation rule given in [39]. All parameters suggested in [39] have been used.
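The two cases above can be sketched in Python as follows. This is an illustrative simplification and not the implementation of [39]: the seasonal monitoring step is left out, the gliding distance is drawn uniformly from an assumed interval, and the default values of `g_c`, `p_dp`, and `n_acorn` are assumptions that should be checked against [39]. The name `ssa_minimize` is hypothetical.

```python
import random

def ssa_minimize(f, dim, lo, hi, n_squirrels=20, n_acorn=3, n_iter=100,
                 g_c=1.9, p_dp=0.1, seed=0):
    rng = random.Random(seed)

    def random_location():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    def glide(src, dst):
        # Move towards dst if no predator appears, otherwise flee at random.
        if rng.random() >= p_dp:
            d_g = rng.uniform(0.5, 1.11)      # gliding distance (assumed range)
            return [min(hi, max(lo, s + d_g * g_c * (t - s)))
                    for s, t in zip(src, dst)]
        return random_location()

    pop = [random_location() for _ in range(n_squirrels)]
    for _ in range(n_iter):
        pop.sort(key=f)                       # ascending: best food source first
        hickory, acorns = pop[0], pop[1:1 + n_acorn]
        new_pop = [hickory]                   # F_h remains unchanged
        new_pop += [glide(a, hickory) for a in acorns]          # Case 1
        for normal in pop[1 + n_acorn:]:                        # Case 2
            target = rng.choice(acorns) if rng.random() < 0.5 else hickory
            new_pop.append(glide(normal, target))
        pop = new_pop
    best = min(pop, key=f)
    return best, f(best)
```

Keeping the squirrel on the hickory tree unchanged acts as elitism, so the best objective value never gets worse from one iteration to the next.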

Particle Swarm Optimization
Particle Swarm Optimization (PSO) is a population-based search algorithm that simulates the social behavior of birds, bees, or a school of fish [49]. In this stochastic search technique, a multidimensional vector represents a particle in a multidimensional search space. These particles move toward the next position depending on the velocity vector associated with them. The velocity of a particle is updated based on its current velocity and the best position it has explored so far. The algorithm also uses the global best solution concept to obtain the optimal solution. At each iteration, the global best solution is recorded and updated [50].
Let $y_i = (y_{i1}, y_{i2}, \ldots, y_{id})^t$ be the i-th particle of the swarm in a d-dimensional search space, with a corresponding velocity $v_i = (v_{i1}, v_{i2}, \ldots, v_{id})^t$. Thus, the equations that govern the particle's movements read
$$v_i(k+1) = w \, v_i(k) + c_1 \varphi_1 \left( p_{ibest} - y_i(k) \right) + c_2 \varphi_2 \left( p_{gbest} - y_i(k) \right), \qquad y_i(k+1) = y_i(k) + v_i(k+1),$$
where $k$ denotes the iteration, $c_1$ and $c_2$ are acceleration coefficients, and $\varphi_1$ and $\varphi_2$ are random variables with uniform distribution in [0, 1]. $p_{ibest}$ and $p_{gbest}$ are the best local and global particle positions so far, respectively. $w$ denotes the inertia weight, which controls the effect of the previous velocity vector on the new one.
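A minimal Python sketch of this update rule is given below for illustration. The values of `w`, `c1`, and `c2`, the clipping of positions to the search bounds, and the name `pso_minimize` are assumptions made for the example and are not taken from this study.

```python
import random

def pso_minimize(f, dim, lo, hi, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    y = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [p[:] for p in y]                 # best position of each particle
    g_best = min(p_best, key=f)[:]             # best position of the swarm
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                phi1, phi2 = rng.random(), rng.random()
                # Inertia term plus attraction to local and global bests.
                v[i][d] = (w * v[i][d]
                           + c1 * phi1 * (p_best[i][d] - y[i][d])
                           + c2 * phi2 * (g_best[d] - y[i][d]))
                y[i][d] = min(hi, max(lo, y[i][d] + v[i][d]))
            if f(y[i]) < f(p_best[i]):
                p_best[i] = y[i][:]
                if f(p_best[i]) < f(g_best):
                    g_best = p_best[i][:]
    return g_best, f(g_best)
```

In the hyper-parameter tuning setting, each particle would encode one candidate pair of SVDD hyper-parameters, and `f` would return a quantity to be minimized, for example one minus the F1 score.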

Methodology for Fault Detection
The methodology presented in this study starts with the pre-processing of data, that is, cleaning and structuring the data to obtain a matrix of m observations with n process variables. Once this is done, the different metaheuristics are implemented to optimize the training of a one-class classifier (SVDD), that is, to find the hyper-parameters that improve the ability of SVDD to detect faults in industrial processes where little information about faults is available. In particular, the exploration and exploitation capabilities of metaheuristic algorithms are used to tune the hyper-parameter C of SVDD and the hyper-parameter s of the RBF kernel (16). Once the SVDD training is optimized, the classifier is ready to monitor new data and detect possible faults in the multivariate industrial process. Figure 1 shows the process of the described methodology. In order to compare performance during the optimization and training of SVDD, the metaheuristic algorithms SHO, KH, SSA, and PSO, which have shown great efficiency in solving engineering problems, are implemented. Thus, four approaches arise and need to be compared: (1) SHO-SVDD, (2) KH-SVDD, (3) SSA-SVDD, and (4) PSO-SVDD. The metric under consideration is the well-known F1 score (17), because it takes into account both precision (Pr) and sensitivity, or recall (Re). These quantities are computed from the true and false positives (TP and FP) and negatives (TN and FN), see (18) and (19). This metric is highly relevant in the detection of faults in industrial processes. The F1 score lies between 0 and 1, where a value of 1 means that the algorithm detects the faults without errors. The candidate solutions generated by the different metaheuristics (particles, groups of hyenas, etc.) for both hyper-parameters (C and s) are used in the training and testing phases of SVDD, yielding several values of the F1 score. Therefore, the different metaheuristics adjust their solutions through the iterations in order to maximize the F1 score value:
$$F1\ score = \frac{2 \cdot Re \cdot Pr}{Re + Pr} \qquad (17)$$
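For illustration, the F1 score can be computed from true and predicted labels as in the Python sketch below. It uses the standard definitions of precision, TP / (TP + FP), and recall, TP / (TP + FN). Treating the fault class as the positive class, labeled 1, is an assumption of this example, as is the function name `f1_score`.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score for the class labeled `positive` (assumed here to be 'fault')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0                    # no correct detection: score is zero
    pr = tp / (tp + fp)               # precision
    re = tp / (tp + fn)               # recall (sensitivity)
    return 2.0 * re * pr / (re + pr)
```

A metaheuristic that minimizes its objective can use `1.0 - f1_score(y_true, y_pred)` as the fitness of a candidate pair of hyper-parameters.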

Industrial Application
The four approaches described above are applied to the injection moulding process of car pedals in a local automotive company. The implementation is carried out on a plastic injection machine that can produce four pieces per cycle and saves the relevant information of each run, i.e., the parameters used to produce the pieces are automatically stored in a database. Subsequently, the pieces are classified as good or bad through several quality tests carried out by the technical staff. The injection moulding process involves 36 variables, listed in Table 1. This data set was selected by the expert personnel of the process and consists of variables of diverse nature, such as temperatures, pressures, and times, corresponding to several operational features of the machine, such as heating power, mold position, and injection time. Since this machine started operating only recently, the primary data provided by the company include 154 observations of the normal operation mode. Once the data are cleaned and structured, the training of the one-class classifier takes place. After training, 64 new observations provided by the company are tested in order to determine the capability of the system. This new data set includes observations corresponding to machine faults, that is, measurements where the injection process does not meet the quality standards. Then, the F1 score metric is calculated, and the metaheuristics tune the SVDD hyper-parameters in order to achieve a better F1 score. The parameters of the different metaheuristics were selected according to the recommendations of their authors, and the corresponding values are specified in Table 2. The number of iterations was selected through experimental testing as the maximum number of iterations that still produced changes in the results. The search range for the hyper-parameters is from 0.001 to 600. 
Table 3 shows the values and descriptive statistics of the F1 score metric and the computational times corresponding to 30 runs of each of the four approaches. As can be seen, the SHO algorithm reaches the highest mean F1 score (0.9702) with a small variability, i.e., a standard deviation (std) equal to 0.0074. Furthermore, this approach presents the highest F1 score values in 80% of the experiments, while SSA, KH, and PSO achieve similar performance in only 16, 13, and 0% of them, respectively. In addition, SHO presents the lowest computational time in 100% of the runs, with execution times two to three times shorter than those of SSA and PSO, and the minimum variability with respect to the remaining approaches. In order to confirm the statistical significance of the results, a Mardia test is first performed to determine whether the data follow a multivariate normal distribution. Since the data set is not normally distributed, the non-parametric Kruskal-Wallis ANOVA test is used. Figures 2 and 3 show the corresponding box plots for the implemented approaches. As can be seen in Figure 2, SHO reaches the highest median F1 score value with low variability. The remaining approaches present equal median F1 score values; however, KH shows the highest variability in its results. Although SSA sometimes yields results similar to those of SHO, these are so sporadic throughout the runs that the box plot shows them as outliers. Figure 3 clearly shows that SHO achieves the best computational times, since the highest outliers of SHO are below the lowest value reported by KH. Tables 4 and 5 present the numerical results of the ANOVA test, where in both cases (F1 score and computational time) the p-value is much lower than the significance level of 0.05. Thus, the differences among the results obtained by the four approaches are statistically significant. 
On the other hand, the high error in the sum of squares (SS) in the F1 score analysis shows that there is a great variability that cannot be explained by the predictors. Since considerable differences have been detected with the Kruskal-Wallis ANOVA, a post hoc Tukey test is performed to determine which mean differs most significantly. This analysis compares the mean of each treatment with the mean of every other treatment, and it is considered the best available method when confidence intervals are desired (for details, refer to [51]). Figures 4 and 5 show the graphs of the estimates and comparison intervals. Each group mean is represented by a small circle, and the interval is represented by a line extending out from the circle. Two group means are significantly different if their intervals are disjoint, and they are not significantly different if their intervals overlap. As can be seen, the interval of the SHO approach does not overlap with any other in either case (F1 score and computational time), i.e., the Tukey test concludes that there is a significant difference between the SHO algorithm and the remaining approaches. This shows that SHO-SVDD offers a higher performance for fault detection in the plastic injection machine (an F1 score of 0.9702 on average) with a shorter computational training time. Finally, the overfitting effect is analyzed to test the generalization performance of SHO-SVDD. The complete data set (normal and mixed operation samples) is used to implement a K-fold cross validation procedure. Specifically, 5-fold (test 1) and 10-fold (test 2) cross validations are computed 30 times each, and the mean F1 score is calculated, i.e., a total of 150 and 300 training-test experiments are computed, respectively. The obtained mean F1 score values are 0.9701 for test 1 and 0.9714 for test 2; both are very close to the mean F1 score of 0.9702 obtained in the previous analysis. 
Considering this, it is concluded that the SHO-SVDD approach for fault detection presents a good generalization for this particular data set.
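As a reference for the non-parametric comparison used above, the Kruskal-Wallis H statistic can be computed as in the following Python sketch. It ranks the pooled observations, uses average ranks for ties, and applies the usual tie correction. It returns the statistic together with the degrees of freedom of the chi-square approximation and does not compute the p-value. The function name `kruskal_wallis_h` is hypothetical, and this is not the code used to produce Tables 4 and 5.

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for a list of samples, one per group."""
    pooled = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of the ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = (12.0 / (n * (n + 1))
         * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
         - 3.0 * (n + 1))
    correction = 1.0 - tie_term / (n ** 3 - n)
    if correction == 0.0:                     # all observations are identical
        return 0.0, len(groups) - 1
    return h / correction, len(groups) - 1
```

With four approaches and 30 runs each, `groups` would hold four lists of 30 F1 scores (or computational times), and the statistic would be compared against a chi-square distribution with three degrees of freedom.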

Conclusions and Future Work
In this paper, a methodology for fault detection in multivariate processes is presented. This proposal is able to work with industrial processes and machines that have only recently started operating. It can deal with scarce data about faults through the implementation of a one-class classifier (SVDD). This algorithm is optimized during its training stage, that is, its hyper-parameters are tuned using a recent metaheuristic based on the behavior of spotted hyenas (SHO). In order to evaluate its computational performance, SHO has been compared with three other metaheuristics (KH, SSA, and PSO) that have been successfully implemented for solving several engineering problems. The approaches were tested using data taken from a plastic injection machine in the automotive industry. The results after 30 runs show that SHO reaches higher values of the F1 score metric, that is, it shows higher performance for fault detection, in considerably shorter computing time than the rest of the considered approaches. Furthermore, a non-parametric statistical analysis was performed to assess the statistical significance of the superior performance of SHO. The Kruskal-Wallis test and the post hoc analysis show that there is a statistically significant difference between SHO and the other metaheuristics. Moreover, SHO-SVDD was tested for generalization performance using 5-fold and 10-fold cross validation. The obtained mean F1 score values were 0.9701 and 0.9714, respectively, which are very close to the mean F1 score of 0.9702 obtained in the main analysis. Considering this, it is concluded that the SHO-SVDD method generalizes well in classifying this data set. Based on this, the SHO-SVDD approach seems to be the best of the four for fault detection in this particular industrial application. 
As future work, some feature selection approaches will be implemented to obtain a subset of variables that allows for performing a fault diagnosis in order to trace back operational issues or other problems in the process.

Conflicts of Interest:
The authors declare no conflict of interest.