Anomaly Detection in Automotive Industry Using Clustering Methods—A Case Study

: In automotive industries, pricing anomalies may occur for components of different products, despite their similar physical characteristics, which raises the total production cost of the company. However, detecting such discrepancies is often neglected since it is necessary to ﬁnd the problems considering the observation of thousands of pieces, which often present inconsistencies when speciﬁed by the product engineering team. In this investigation, we propose a solution for a real case study. We use as strategy a set of clustering algorithms to group components by similarity: K-Means, K-Medoids, Fuzzy C-Means (FCM), Hierarchical, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Self-Organizing Maps (SOM), Particle Swarm Optimization (PSO), Genetic Algorithm (GA) and Differential Evolution (DE). We observed that the methods could automatically perform the grouping of parts considering physical characteristics present in the material master data, allowing anomaly detection and identiﬁcation, which can consequently lead to cost reduction. The computational results indicate that the Hierarchical approach presented the best performance on 1 of 6 evaluation metrics and was the second place on four others indexes, considering the Borda count method. The K-Medoids win for most metrics, but it was the second best positioned due to its bad performance regarding SI-index. By the end, this proposal allowed identify mistakes in the speciﬁcation and pricing of some items in the company.


Introduction
Efficiency in the cost management at the automotive companies is extremely important to generate competitive products in a global open market. Financial and managerial successes are linked to a fast adaption to current daily challenges, with free competition. Innovative products with reduced launch times are frequently requested, and customers perceive a high quality. Moreover, flexibility and competitive prices are prerequisites to survive in the automotive business. These challenges are increasingly present in the industry and mainly influence a company's responsiveness, versatility, speed and adaptability to changes. Simultaneously, the industry is always looking for cost reduction [1][2][3]. The work developed by Pan and Lu [15] proposed constructing a model for grouping parts to replace the conventional manufacturing process by additive manufacturing (AM). The current evaluations of (AM) are manually done, which reduces the scope of the application. In this model, the grouping of components is done automatically by clustering algorithms for other manufacturing processes by additive manufacturing (AM).
On Zhong, Xu and Pan [16] developed a non-threshold consensus model that combines the minimum cost and maximum consensus-increasing for multi-attribute large-group decision-making (MALGDM). The main differences between traditional clustering techniques is that instead of using a predefined threshold and a maximum number of iterations, a termination index is developed to terminate the consensus reaching process (CRP), considering different consensus metrics.
In addition, Kong et al. [17] introduced a two-mode modularity clustering method with new similarity measures for parts and machines using an ordinal part-machine matrix. The proposed method considers both incidences, transition between parts and machines and find optimal clusters. The method produces reasonable cell formation solutions in terms of several performance measures.
Bodendorf, Merkl and Franke [18] make a literature review on intelligent cost estimation methods for parts to be procured in the manufacturing industry is carried out by text mining. Consequently, in this paper, approaches derived from Multitask Learning and Explainable Machine Learning can be found. A combination of methods considered most suitable for predictive analytics to estimate procurement costs is presented.
Finally, Chan, Lu and Wangon [19] developed a new cost estimation framework based on big data analytics tools. The manufacturing cost associated with a new job can be estimated based on similar ones in the past. The new framework is implemented and demonstrated for additive manufacturing, where the similarities of the 3D geometry of parts and printing processes are established by identifying relevant features.
In this sense, a challenge for identifying pricing anomalies is to monitor whether the classifications (labels) available in the registration data are adequate for the correct grouping of parts, which should have similar manufacturing costs. In the case study addressed here, two classifications previously existing in the database were evaluated, the NCM and the class number. The NCM, which is an acronym for Mercosur Common Nomenclature, is a hierarchical code of categories. Brazilian taxes, descriptions, and rates are attached to each NCM code [20][21][22]. The class number is data created by product engineering when creating new parts. The product engineers freely choose the definition of this number. Consequently, it presents a subjective character, being susceptible to different interpretations. Both labels may show inconsistencies during the sample analysis.
In this investigation, we address an actual case study from an automaker. Nowadays, there is an immediate need to expand the company's focus to include analyzing large sample sets, which can be multidimensional and complex for manual analysis.
This way, clustering techniques are used to identify pricing anomalies in similar parts, using their specifications as a basis. Clustering methods are tools used on many research fronts, including data mining so that significant knowledge is extracted from apparently unstructured samples [23][24][25][26]. In addition, clustering is a way to group data and identify patterns coherently and unsupervised [27,28]. Among the different techniques, the literature points out those computational intelligence techniques are an alternative to traditional methods such as K-Means in real problems. However, the current knowledge about this task is not enough to define the best algorithm for all cases [10,23,29].
To the best of our knowledge, the use of clustering methods for anomaly detection considering components of an automaker is an unprecedented proposal as a case study. Therefore, this study contributes to an understanding of complex data related to group parts without knowing the classification in advance. It also extends the experience to the existing knowledge through previous research.
This paper was organized as follows. Section 2 presents a description of each technique used; Section 3 shows the metrics used for cluster quality evaluation, while Section 4 presents the database, results, and critical analysis. The conclusions are in Section 5.

K-Means
The K-Means is a well-known clustering algorithm because it is easy to implement and understand, requests low computational effort, and presents fast convergence [10,30]. It has been adapted to many challenging problems in distinct domains [31,32].
The K-Means initialization starts with the random generation of the centroids' position. The data are allocated to each group, and then a given sample will belong to the cluster with the shortest distance to the respective centroid. After this phase, the new positions of the centroids are recalculated, and they move to the geometric center of the cluster formed in the current iteration [9].
The algorithm runs to minimize the sum of the internal distance (Sum of Squares Within Clusters-SSE) between the data and the centroids by Equation (1). In this equation, the measure dist is the Euclidean distance given by Equation (2): where c k is the centroid of cluster k, x i is the i-th object of the k-th cluster, K is the number of clusters, and n q is the number of elements on each cluster and where d is each number of dimensions of each object on the data set, and D represents the last dimension of each object [10].
where n k is the number of objects in the respective cluster. This process is repeated until a stop criterion is reached. The most used are the number of iterations or a predefined limit of a similarity measure.
We highlight that the strong dependence on the initialization is widely known. In addition, the user must specify the number of clusters a priori [25,33].

K-Medoids
Another method of partial grouping is K-Medoids [34], in which a data set X is stored in K clusters. This algorithm minimizes the differences between each object in a cluster and its representative object (centroid), similar to the K-Means.
Initially, centroids need to be defined, in this case, known as Medoids. The most common way to proceed with this step is to determine as centroids n k samples from the data group, drawn randomly. This is done to minimize the effects of K-Means random initialization [35].
Thus, it is expected to find a single partition of the data in the K clusters. Each one has a more representative point: the most centrally located concerning some measure, for example, Euclidean distance [36]. Such an idea reduces noise and discrepancies, making the method more robust than K-Means [35]. The execution steps are like the K-Means case.

Fuzzy C-Means (FCM)
Fuzzy C-Means (FCM) has become an essential method with many grouping-clustering applications for real-world problems [37]. Among the diffuse clustering methods, FCM is one of the best known for its simplicity and efficiency. However, it shows some weaknesses, particularly its tendency to fall to local lows at minimum local points.
This algorithm treats clusters as flexible groups in which each data object has presented a degree of association. These degrees are evaluated between 0 and 1, with a high value representing a high degree of similarity between the current object under study and the group [38].
According to [39], FCM has shown important characteristics: the number of groups can be defined automatically; it works well with overlapping data and is robust for initialization, noise, and outliers.
The FCM defines an association matrix U = u ij being N the number of objects. Note that each pattern is named i for each cluster j and n k is the number of objects in the respective cluster.
The objective of the method is to minimize the cost function J m described in Equation (4): in which . it is usually the Euclidean norm, and m is the inaccuracy coefficient provided by the user. After that, the degree of relevance and the central positions are calculated by Equations (5) and (6):

Hierarchical Clustering
Hierarchical clustering algorithms generate groups at successive levels, in which the current grouping process is created based on the previous hierarchical level. Unlike the partial approach, hierarchical clustering does not need to specify the number of clusters in advance [13].
However, the computational cost required to perform these methods is high, which is unfeasible for a large data set. Hierarchical approaches create a dendrogram structure, consisting of a tree structure representing the hierarchical sequence of nested partitions of the data set [14]. The procedure creates successive cluster levels, in which the current group is based on the solution obtained in the previous level [40].
There are two types of hierarchical algorithms: divisive and agglomerative [41,42]. In the divisive case, the process begins with all data elements in a single cluster divided into smaller clusters based on proximity until the criteria related to the total number of clusters is obtained [41]. In the agglomerative approach, every element is initially an individual cluster. Figure 1 shows simulated results of divisive and agglomerative methods.

Hierarchical Clustering
Hierarchical clustering algorithms generate groups at successive levels, in which the current grouping process is created based on the previous hierarchical level. Unlike the partial approach, hierarchical clustering does not need to specify the number of clusters in advance [13].
However, the computational cost required to perform these methods is high, which is unfeasible for a large data set. Hierarchical approaches create a dendrogram structure, consisting of a tree structure representing the hierarchical sequence of nested partitions of the data set [14]. The procedure creates successive cluster levels, in which the current group is based on the solution obtained in the previous level [40].
There are two types of hierarchical algorithms: divisive and agglomerative [41,42]. In the divisive case, the process begins with all data elements in a single cluster divided into smaller clusters based on proximity until the criteria related to the total number of clusters is obtained [41]. In the agglomerative approach, every element is initially an individual cluster. Figure 1 shows simulated results of divisive and agglomerative methods.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Density-based clustering algorithms are usual in the literature [43]. The best-known algorithm of this category is the Density-Based Spatial Clustering of Applications with Noise-DBSCAN [44]. The method can find clusters of an arbitrary shape, detecting them through high-density spheres and merging the nearby spheres forming clusters.
The DBSCAN uses a simple estimate of the minimum density level, based on a threshold for the number of neighbors, minPts, within the radius (with an arbitrary distance). Objects with more neighboring minPts within that radius (including the query point) are the central point.
The DBSCAN intends to find areas that satisfy this minimum density separated by lower density areas. For efficiency reasons, DBSCAN does not estimate density between points. Instead, all neighbors within the radius by a central point are considered part of the same cluster (known as direct attainable density). If any of these neighbors is chosen again as a central point, their neighborhoods will be included transitively (reachable density). Non-essential points in this set are called boundary points, and all points in the same set are connected to density. Points that are not accessible by density are considered noise and do not belong to any group [45].

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Density-based clustering algorithms are usual in the literature [43]. The best-known algorithm of this category is the Density-Based Spatial Clustering of Applications with Noise-DBSCAN [44]. The method can find clusters of an arbitrary shape, detecting them through high-density spheres and merging the nearby spheres forming clusters.
The DBSCAN uses a simple estimate of the minimum density level, based on a threshold for the number of neighbors, minPts, within the radius ε (with an arbitrary distance). Objects with more neighboring minPts within that radius (including the query point) are the central point.
The DBSCAN intends to find areas that satisfy this minimum density separated by lower density areas. For efficiency reasons, DBSCAN does not estimate density between points. Instead, all neighbors within the radius ε by a central point are considered part of the same cluster (known as direct attainable density). If any of these neighbors is chosen again as a central point, their neighborhoods will be included transitively (reachable density). Non-essential points in this set are called boundary points, and all points in the same set are connected to density. Points that are not accessible by density are considered noise and do not belong to any group [45]. Figure 2 illustrates the working principle of the DBSCAN and the elements. The minPts parameter is four, and the circles indicate the radius ε. Note that N is a noise point, A is a central point and, B and C are border points.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 6 of 23 Figure 2 illustrates the working principle of the DBSCAN and the elements. The minPts parameter is four, and the circles indicate the radius . Note that N is a noise point, A is a central point and, B and C are border points.

Self Organizing Maps (SOM)
Self-Organizing Maps (SOM) are artificial neural network architectures that are adjusted using unsupervised learning methods to cluster data sets, including missing values. They can be used to analyze and visualize complex multivariable data with multiple parameters [46].
The network presents advantages over other multivariate approaches because it can deal with nonlinearities in a system [47,48]. Because it can be elaborated using data without mechanical knowledge of the system, it can deal with noisy, irregular, or missing samples. In addition, it is easy to update and interpret information with many variables or parameters using visualization resources [49][50][51].
The SOM network is a matrix of M = m × m of artificial neurons. If these m 2 neurons are arranged in a grid or a plane, the network is called two-dimensional since it maps high-dimensional input vectors to a two-dimensional surface. For a given network, the input vector x = (x1, x2, …, xn) has a fixed dimension n. These n components are connected to each neuron in the matrix. A synaptic weight wij is defined for connecting the i-th component of the input vector to the j-th neuron. Therefore, a n-dimensional vector wj of synaptic weights is associated with each neuron j [52]. Equation (7) shows the function of the closest distance to the winning neuron j: where wij are the weights at the initial iteration = 0, which are small randomly generated values, is the number of the current iteration, and xj is a randomly selected vector from the training data set. Equation (8) shows the update function of the winning weight vectors and their neighbors: where ( ) is the function of the learning rate that decreases exponentially with the iterations; it is calculated by Equation (9): where 0 is the initial learning rate and the number of iterations, both user-defined data. In Equation (10), the neighborhood order function can be observed:

Self Organizing Maps (SOM)
Self-Organizing Maps (SOM) are artificial neural network architectures that are adjusted using unsupervised learning methods to cluster data sets, including missing values. They can be used to analyze and visualize complex multivariable data with multiple parameters [46].
The network presents advantages over other multivariate approaches because it can deal with nonlinearities in a system [47,48]. Because it can be elaborated using data without mechanical knowledge of the system, it can deal with noisy, irregular, or missing samples. In addition, it is easy to update and interpret information with many variables or parameters using visualization resources [49][50][51].
The SOM network is a matrix of M = m × m of artificial neurons. If these m 2 neurons are arranged in a grid or a plane, the network is called two-dimensional since it maps high-dimensional input vectors to a two-dimensional surface. For a given network, the input vector x = (x 1 , x 2 , . . . , x n ) has a fixed dimension n. These n components are connected to each neuron in the matrix. A synaptic weight w ij is defined for connecting the i-th component of the input vector to the j-th neuron. Therefore, a n-dimensional vector w j of synaptic weights is associated with each neuron j [52]. Equation (7) shows the function of the closest distance to the winning neuron j: where w ij are the weights at the initial iteration t = 0, which are small randomly generated values, t is the number of the current iteration, and x j is a randomly selected vector from the training data set. Equation (8) shows the update function of the winning weight vectors and their neighbors: where α(t) is the function of the learning rate that decreases exponentially with the iterations; it is calculated by Equation (9): where α 0 is the initial learning rate and T the number of iterations, both user-defined data. In Equation (10), the neighborhood order function can be observed: d 0 being the initial topological neighborhood.

Particle Swarm Optimization (PSO)
The Particle Swarm Optimization (PSO) algorithm is the most used, popular, and famous swarm-inspired method for optimization [10]. It was inspired by the social behavior of animals that behave with flock characteristics, simulating their collective intelligence [13,53]. Beyond the optimization applications with real data, PSO has been widely used in binary, combinatorial, and clustering optimization problems [9]. In optimization problems, positions represent a candidate solution to the current task. However, in clustering, this depends on the coding and parameters [25].
The learning process of the particles comes from two sources: their own experience, called cognitive learning, and the combined learning of the whole cluster (or a topological neighborhood) called social learning. The first case is represented by the best position of the particle (pBest) reached until the current iteration, and the best position represents social learning achieved considering, for example, the entire population (gBest). Together, cognitive and social learning are used to calculate the velocity of particles and their next position [13].
The most used encoding scheme for partial clustering considers a particle a complete candidate solution [10]. In this case, the agent contains the spatial coordinates of all n k centroids concatenated in a vector. The PSO can be applied directly, using the position and velocity, according to Equations (11) and (12), respectively: where ω is the inertia weight, x i is the position of the particle, v i is the speed, pBest i is the best position ever found by each particle, gBest is the best position found by the group, q 1 and q 2 are two previously defined constants, and r 1 and r 2 are two numbers randomly generated in the interval [0, 1]. The algorithm is often initialized by randomly scattering particles over the search space. The same process generates the initial speeds, but they are initialized equal to zero in some other cases. The inertia weight ω is often less than one, and it is used to avoid divergence in the response. It is also common to limit the speed of the particles to an

Genetic Algorithm (GA)
Evolutionary computing draws ideas from evolutionary biology to develop search and optimization techniques to solve complex problems. Evolutionary algorithms are rooted in the Darwinian Theory of the evolution of species. Darwin proposed that a population of individuals capable of reproducing is conditioned to (genetic) variation followed by natural selection results in new populations increasingly adapted to their environment. This proposal was very radical when it was first formalized at the end of the 1850s. It suggested that a simple reproduction process with variation and selection would be sufficient to produce complex life forms [54][55][56].
The Genetic Algorithm (GA) models genetic evolution, in which the characteristics of individuals are expressed using genotypes. The leading operators are selection (to model the survival of the fittest), recombination through the application of a crossover operator (to model reproduction), and mutation. The goal of the mutation is to introduce new genetic material into an individual, that is, to add diversity to the genetic characteristics of the population [57]. As in the PSO, an individual (chromosome or agent) presents a complete candidate solution for a given problem [58].
The chromosome based on gene configuration defines the genotypes. Each chromosome is evaluated by a cost function (fitness). A certain number of chromosomes were defined at the start parameter composes the first generation. Based on the optimization of the problem, the following generations are created based on selection, crossover, and mutation operators [59].
In the crossover, genes are exchanged on a specific chromosome, generating new individuals with another configuration. Figure 3 presents the classic one-point crossover [54,60].
Appl. Sci. 2021, 11, x FOR PEER REVIEW 8 The chromosome based on gene configuration defines the genotypes. Each chr some is evaluated by a cost function (fitness). A certain number of chromosomes defined at the start parameter composes the first generation. Based on the optimizati the problem, the following generations are created based on selection, crossover, and tation operators [59].
In the crossover, genes are exchanged on a specific chromosome, generating new viduals with another configuration. Figure 3 presents the classic one-point crossover [54 Mutation operation corresponding to a spurious gene change after crossover, lustrated in Figure 4.

Differential Evolution (DE)
Differential Evolution (DE) is another technique inspired by the evolution of sp In the literature, it has proved to be a viable candidate for optimization and clust problems. In this case, agents are called vectors and are stochastically generated as in and GA [61].
The positions of the vectors provide valuable information about the cost fun scenario. As long as a good uniform randomization method is used to build the i population, the initial agents provide a good representation of the entire search s with relatively large distances between them. As the search progresses, the distance tween the vectors decrease, with everyone converging on the same solution. The m tude of the initial distances between individuals is influenced by the population size inversely proportional way [57].
In DE, a population of NP candidate solutions indicated as xi,G, where denotes agent and represents the generation of the population, is used. As in GA, the oper are crossover, mutation, and selection [62].
The optimization process starts with selecting a vector, named target vector three others were chosen at random: xr1, xr2, and xr3. In the next step, the mutation ope is applied, generating a new donor vector vi,G through Equation (13): Mutation operation corresponding to a spurious gene change after crossover, as illustrated in Figure 4.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 8 The chromosome based on gene configuration defines the genotypes. Each chro some is evaluated by a cost function (fitness). A certain number of chromosomes w defined at the start parameter composes the first generation. Based on the optimizatio the problem, the following generations are created based on selection, crossover, and tation operators [59].
In the crossover, genes are exchanged on a specific chromosome, generating new viduals with another configuration. Figure 3 presents the classic one-point crossover [54, Mutation operation corresponding to a spurious gene change after crossover, a lustrated in Figure 4.

Differential Evolution (DE)
Differential Evolution (DE) is another technique inspired by the evolution of spe In the literature, it has proved to be a viable candidate for optimization and cluste problems. In this case, agents are called vectors and are stochastically generated as in and GA [61].
The positions of the vectors provide valuable information about the cost func scenario. As long as a good uniform randomization method is used to build the in population, the initial agents provide a good representation of the entire search sp with relatively large distances between them. As the search progresses, the distance tween the vectors decrease, with everyone converging on the same solution. The ma tude of the initial distances between individuals is influenced by the population size i inversely proportional way [57].
In DE, a population of NP candidate solutions indicated as xi,G, where denotes agent and represents the generation of the population, is used. As in GA, the opera are crossover, mutation, and selection [62].

Differential Evolution (DE)
Differential Evolution (DE) is another technique inspired by the evolution of species. In the literature, it has proved to be a viable candidate for optimization and clustering problems. In this case, agents are called vectors and are stochastically generated as in PSO and GA [61].
The positions of the vectors provide valuable information about the cost function scenario. As long as a good uniform randomization method is used to build the initial population, the initial agents provide a good representation of the entire search space, with relatively large distances between them. As the search progresses, the distances between the vectors decrease, with everyone converging on the same solution. The magnitude of the initial distances between individuals is influenced by the population size in an inversely proportional way [57].
In DE, a population of NP candidate solutions indicated as x i,G , where i denotes each agent and G represents the generation of the population, is used. As in GA, the operators are crossover, mutation, and selection [62].
The optimization process starts with selecting a vector, named target vector, and three others were chosen at random: x r1 , x r2 , and x r3 . In the next step, the mutation operator is applied, generating a new donor vector v i,G through Equation (13): where F is a user-defined constant from [0, 2], and i = 1, . . . , NP, r 1 , r 2 , r 3 ∈ {1, . . . , NP}. Then, a crossover operator, also called recombination, incorporates successful solutions into the population. The trial vector u i,G is created by the binomial combination of the target vector x i,G and the donor vector v i,G elements. Each element of the trial vector comes from x i,G with crossover probability C r ∈ [0, 1] selected along with the population, as in Equation (14) [62]: Finally, the selection operator differs from GA. In this case, the target vector x i,G is compared with the trial vector u i,G, and the corresponding vector with the best fitness value is taken into next-generation, agreed approach, as in Equation (15):

Clustering Metrics
A critical step on the encoding success of clustering algorithms is the metrics that evaluate the quality of the formed groups, which are focused on similarity or dissimilarity distances. Towards this statement, it can be found that metrics focused on similarity optimize the cluster cohesion. On the other hand, methods that work with dissimilarity aim to present as much as possible more separate groups.

Sum of Squared Errors (SSE)
The Sum of Squared Errors (SSE) is a measure of group compaction. It is a minimization index since the smaller the value, the cluster is more compact [63][64][65].
Consider X = {x 1 , x 2 , . . . , x n } as a dataset with n samples. Suppose that the samples in X present rigid labels belonging to clusters k without overlap, being K = {c 1 , c 2 , . . . , c k } the centroids. The clustering algorithm seeks to find the ideal partition P = {P 1 , P 2 , . . . , P m }, positioning the k centroids iteratively [61]. This metric can be defined by Equation (1) [10], being the same used as the objective function to the K-Means.:

Sum of Squares within Clusters (SSW)
Sum of Squares Within Clusters (SSW) is a metric of group compaction and a minimization index. What sets it apart from the SSE is the non-square result of the summation equation [10,13,14]. This measure is dated by Equation (1), as showed in the previous Section 2.

Sum of Squares between Clusters (SSB)
The Sum of Squares Between Clusters (SSB) measures how outlined the groups are. It is a maximization index. The higher the value, the more distinct the groups formed by the respective algorithms [66,67]. It is a criterion for measuring the separation of groups and can be calculated by Equation (16): where c k is the centroid of cluster k and c 1 are the other clusters [10].

Calinski-Harabasz Index (CH)
The Calinski-Harabasz index (CH) is a composition between the SSW and SSB, considering the number of centroids and the data. It is a maximization index [14,[61][62][63]. Equation (17) shows the relationship of the elements exposed above: The CH metric creates a relationship between SSW, SSB, the total number of elements in the sample, and k, the number of centroids [67].

WB Index
The WB index is also a composition between the SSW and SSB indexes, adding a relationship between these metrics and the number of centroids. However, in this case, it uses only the number of centroids. It is a minimization metric [14,[66][67][68].
Equation (18) presents the relation of the elements, as established by [67]:

Silhouette Index (SI)
The Silhouette Index (SI) is a metric used to assess cluster validity. Its value can vary from (−1 < SI < 1). It measures how similar observed data is to other observations in its group compared to the samples allocated to the group closest to it. The SI values close to −1 indicate the data was erroneously entered in the target group, while values close to zero show the data could be in its target group as well as in some other group. SI values close to 1 indicate that the data are correctly allocated [10,69,70].
Equation (19) presents the formula for calculating the SI Index: where N is the number of clusters, and a i is given by Equation (20): 20) and N k is the number of objects in a specific cluster, b i is given by Equation (21): It means that a i is the mean distance between object i and all other data points in the same cluster, and b i is the minimum mean distance of object i to all points in any additional cluster h, of which this object does not belong.

Case Study, Results, and Discussions
This work was developed in an actual data set supplied by an automotive industry from Ponta Grossa-PR, Brazil, that cannot be nominated due to confidentiality aspects agreed between parts. There is an immediate need to expand the company's focus to include the analysis of large sample sets. The best practice towards cost efficiency on the supply chain of components can be replicated, multidimensional and complex for manual analysis.

Industry Data
The data collection stage was initiated throughout a meeting with a multi-functional team of the company, including specialists from the following areas: manufacturing, engineering, product engineering, purchasing, quality, materials, logistics, production, and supplier development. In this event, the problem was discussed together with the business development team. The objective was grouping parts by similarity for future cost comparison, following the steps above: Step 1-Load the complete SAP system database: at this step, the industry material planners identified all items by part number and checked by stock movements over the last three months. Note that just recently purchased parts with interest in cost analysis are included on the initial data set; Step 2-Select continuous quantitative data: at this step, product engineering, manufacturing engineering, and accounting departments were invited to discuss the attributes of each part number downloaded. Therefore, clarifications on the nature of the data could be discussed for further classification considering quantitative and qualitative aspects. Observe that this case study only considers quantitative data; Step 3-Eliminate components with missing data: some selected parts were incomplete. We highlight that most of the clustering does not work well with incomplete data. Therefore, part numbers with an incomplete data set were suppressed; Step 4-Consider NCM (Mercosur Common Nomenclature) as the lower limit of the number of groups to be created: clustering techniques' challenge is determining the number of clusters in advance. Based on this premise, several discussions with manufacturing and product engineering teams were carried out to identify the potential interval of dataset labels. The conclusion was that the NCM could be a good reference as lower limit (200) and engineering class number as the upper limit (621).
The first loading of systemic information raised 4267 items with 14 dimensions, among which the multi-functional team selected the following dimensions: Weight, Length, Width, Height, Volume, and Density. At the end of the 3rd step, 2765 parts were included in the database, which presented six dimensions for each sample.
As guided by the multi-functional team in the 4th and 5th steps, 200 different NCM codes and 621 different engineering class codes were used as a reference for defining the number of groups. With this background, it was expected that this study could show an ideal number of groups between 200 and 621.
For this reason, a search of the number of clusters was made with the following pre-established quantities: 60, 160, 260, 360, 460, 560, 660, 760, and 860, as most of the addressed algorithms requires the number of groups a priori. Values of 200 and 621 were purposely not included because manual analysis found inconsistencies in both cases.

Pre Processing Stages
Before the first run of the corresponding algorithms, it is necessary to perform a statistical exploration analysis of the data set. We noted some key points that interfere with the overall process quality of machine learning techniques as described below.
A logarithmic scaling resulting in the following weight data distribution in Figure 5 was applied to avoid a biased response.

Pre Processing Stages
Before the first run of the corresponding algorithms, it is necessary to perform tistical exploration analysis of the data set. We noted some key points that interfere the overall process quality of machine learning techniques as described below.
A logarithmic scaling resulting in the following weight data distribution in Fig  was applied to avoid a biased response. This process was replicated to the remaining dimensions selected with similar r compared to the distribution before and after scaling.
Another point observed was that the corresponding dimensions presented a numerical magnitude difference, and this implies a loss of significance to dimension small magnitudes generating tendentious results. A normalization process known score was applied to avoid this behavior, which is defined by Equation (22): This process was replicated to the remaining dimensions selected with similar results compared to the distribution before and after scaling.
Another point observed was that the corresponding dimensions presented a high numerical magnitude difference, and this implies a loss of significance to dimension with small magnitudes generating tendentious results. A normalization process known as z-score was applied to avoid this behavior, which is defined by Equation (22): where x j is the input data, x j and σ j are the samples mean and the standard deviation of the j-th attribute, respectively. The transformed dimension presents a zero mean and a variance of 1.
The last preprocessing step used in this work was the PCA (Principal Component Analysis), a method based on the projection of data in a smaller subspace. This process aims at the possibility of reducing the number of dimensions while maintaining its significance in this projected subspace. Dimensional reduction techniques lessen the computational cost in multidimensional problems without harming the result due to preserving the importance of the original dimensions.
The well-known Iris dataset was used to validate the option for the PCA application. This dataset contains three labeled classes of flowers: Setosa, Versicolour, and Virginica, including 50 samples each, presenting four attributes: sepal length, sepal width, petal length, and petal width. Therefore, we consider the following scenarios: (a) Four dimensions sepal length, sepal width, petal length, and petal width without PCA application; (b) PCA application using only the first principal components; (c) PCA application using only the first and second principal components.
Then, we applied the nine clustering algorithms described in Section 3, generating Table 1, which presents the percentage of correct classifications: for each scenario: Note a slight variation from each of the scenarios chosen, which sustain the application of PCA to the Iris dataset. Indeed, in scenarios (b) and (c), the algorithms often overcame their performances considering the total number of dimensions.
After this validation, we applied the PCA to the automotive dataset. It was observed that each of the six principal components shows the following respective significances: 0.57, 0.18, 0.13, 0.09, 0.05, and 0. Note that the four first components present 95% of the energy of the signal.

Computational Results
In this section, we present the results achieved by the nine algorithms described in Section 2 [71]. We present the results separated by an evaluation metric. For the stochastic methods, we present the best of 30 independent simulations. Figure 6 presents the SSE index evolution throughout a different quantity of groups formed.
Appl. Sci. 2021, 11, x FOR PEER REVIEW 13 of 23 Graphical analysis shows a clear disadvantage of the DBSCAN algorithm. It is because of the input parameters, which cannot be translated as noise. For this purpose, even isolated data from the predefined radius must be considered as a group. Due to the distribution of the data in the dimensional space, many items fell into this situation. In the analysis of the groups, the large number of single items within several clusters generated by DBSCAN was precise.
It is possible to observe that the FCM algorithm did not perform better than the others. Again, this result can be attributed to the input data distribution, which caused the algorithm to fall to a local minimum, harming the result.
Another point to highlight is the positive result of the curves for the K-Medoids, Hi- Graphical analysis shows a clear disadvantage of the DBSCAN algorithm. It is because of the input parameters, which cannot be translated as noise. For this purpose, even isolated data from the predefined radius must be considered as a group. Due to the distribution of the data in the dimensional space, many items fell into this situation. In the analysis of the groups, the large number of single items within several clusters generated by DBSCAN was precise.
It is possible to observe that the FCM algorithm did not perform better than the others. Again, this result can be attributed to the input data distribution, which caused the algorithm to fall to a local minimum, harming the result.
Another point to highlight is the positive result of the curves for the K-Medoids, Hierarchical, and K-Means algorithms with a distance from the other curves and the corresponding proximity.
In Figure 7 is plotted the SSW index. Graphical analysis shows a clear disadvantage of the DBSCAN algorithm. It is because of the input parameters, which cannot be translated as noise. For this purpose, even isolated data from the predefined radius must be considered as a group. Due to the distribution of the data in the dimensional space, many items fell into this situation. In the analysis of the groups, the large number of single items within several clusters generated by DBSCAN was precise.
It is possible to observe that the FCM algorithm did not perform better than the others. Again, this result can be attributed to the input data distribution, which caused the algorithm to fall to a local minimum, harming the result.
Another point to highlight is the positive result of the curves for the K-Medoids, Hierarchical, and K-Means algorithms with a distance from the other curves and the corresponding proximity.
In Figure 7 is plotted the SSW index.  In the same way as it was observed in the SSE, for the SSW index, there is a clear advantage to the K-Medoids, Hierarchical, and K-Means algorithms, and a disadvantage to the DBSCAN and FCM algorithms, for the reasons already discussed. Figure 8 presents the results for the SSB index. In the same way as it was observed in the SSE, for the SSW index, there is a clear advantage to the K-Medoids, Hierarchical, and K-Means algorithms, and a disadvantage to the DBSCAN and FCM algorithms, for the reasons already discussed. Figure 8 presents the results for the SSB index. It is essential to point out that the SSB goal is to maximize the distance between centroids. This means that better distinct groups are formed, as the centroids are far apart.
The same reason explains the advantage of DBSCAN as the disadvantage found in the SSW for this algorithm. Many centroids ended up being formed with unique elements, which increased the sum of the Euclidean distances between the centroids of the groups included.
At the other end of the exact Figure, there is a disadvantage in the design of the groups formed in applying SOM and FCM algorithms. The performance highlights K- It is essential to point out that the SSB goal is to maximize the distance between centroids. This means that better distinct groups are formed, as the centroids are far apart.
The same reason explains the advantage of DBSCAN as the disadvantage found in the SSW for this algorithm. Many centroids ended up being formed with unique elements, which increased the sum of the Euclidean distances between the centroids of the groups included.
At the other end of the exact Figure, there is a disadvantage in the design of the groups formed in applying SOM and FCM algorithms. The performance highlights K-Medoids, Hierarchical, and K-Means. Figure 9 presents the results for the CH index. It is essential to point out that the SSB goal is to maximize the distance between centroids. This means that better distinct groups are formed, as the centroids are far apart.
The same reason explains the advantage of DBSCAN as the disadvantage found in the SSW for this algorithm. Many centroids ended up being formed with unique elements, which increased the sum of the Euclidean distances between the centroids of the groups included.
At the other end of the exact Figure, there is a disadvantage in the design of the groups formed in applying SOM and FCM algorithms. The performance highlights K-Medoids, Hierarchical, and K-Means. Figure 9 presents the results for the CH index.  Remembering that the higher the CH, the better the quality of the clusters, or else, the more cohesive and outlined are the groups formed. This metric is calculated by Equation (18), representing a relationship between SSB, SSW, amount of data (n), and the number of centroids (k). In this way, equating the correlation at once to maximizes the SSB and minimizes the SSW. Once again, we observe the same tendency regarding the previous metrics. Figure 10 shows the results for the WB index. Remembering that the higher the CH, the better the quality of the clusters, or else, the more cohesive and outlined are the groups formed. This metric is calculated by Equation (18), representing a relationship between SSB, SSW, amount of data ( ), and the number of centroids ( ). In this way, equating the correlation at once to maximizes the SSB and minimizes the SSW. Once again, we observe the same tendency regarding the previous metrics. Figure 10 shows the results for the WB index. From Figure 10, we observe that the evolution curve of the WB index decreases significantly from the 360 groups formed for the different algorithms. This means that there is no significant reduction in the value of the index starting from this point. This index From Figure 10, we observe that the evolution curve of the WB index decreases significantly from the 360 groups formed for the different algorithms. This means that there is no significant reduction in the value of the index starting from this point. This index uses a ratio between SSW and SSB, which may indicate the groups' formed quality, as much as the cohesion of the groups (SSW) and the distinction between the groups (SSB). Note that for the other metrics, this point is not easy to identify.
As mentioned, the multi-functional team of the company indicated the number of groups among the interval 200 to 621. The best performances were reached by the Hierarchical model, followed by K-Medoids and K-Means. Figure 11 shows the SI index. From Figure 10, we observe that the evolution curve of the WB index decreases significantly from the 360 groups formed for the different algorithms. This means that there is no significant reduction in the value of the index starting from this point. This index uses a ratio between SSW and SSB, which may indicate the groups' formed quality, as much as the cohesion of the groups (SSW) and the distinction between the groups (SSB). Note that for the other metrics, this point is not easy to identify.
As mentioned, the multi-functional team of the company indicated the number of groups among the interval 200 to 621. The best performances were reached by the Hierarchical model, followed by K-Medoids and K-Means. Figure 11 shows the SI index. Considering the Silhouette metric, the value is between [−1, 1], and the closer to 1, the more precise the formation of the groups. Then, we observe an advantage of the Hierarchical algorithm and a worse performance for FCM. The other algorithms have a more stabilized behavior concerning this metric, obtaining intermediate values. There is a significant difference in results and an inconclusive definition of the ideal number of centroids.
For each metric and algorithm application was checked an overall performance (Table 2), considering each result. The first place was awarded 9 points, the second 8, and so on, until the last place winner received 1 point. On overall performance, the Hierarchal, was the general winner, slightly superior to K-Medoids, even though K-Medoids was the first in 4 of 6 different indexes when using the Borda count method [72]. It is an important finding since the literature strongly indicates that these classic methods tend to reach an inferior performance to more recent and sophisticated models, such as DBSCAN, FCM, and SOM. We also highlight that K-Medoids presented a lousy performance for SI, which led to a worse position in the ranking than the Hierarchical approach. Moreover, it seems to be clear that the advantages in their use are in the initialization and as actual data instead of a randomly generated centroid, as the ranking position of the K-Medoids is higher than the K-Means.
In this sense, we can resort to the no-free lunch theorem, since this investigation presents some particular premises, which are not usually addressed in the benchmark or other real-world problems. We expected that the FCM and SOM might achieve the best performances due to their complex clustering schemes. Once again, we can state that more than one algorithm should be used for clustering tasks.
The bio-inspired algorithm (GA, PSO, and DE) presented similar performances, highlighting the GA. Despite showing global search potential, they could not find the optimum global point in this study, and probably fell in local minimum points. Other bio-inspired approaches can be found in the literature, which can improve and expand the results reached in this investigation [73][74][75][76][77][78].
Another remark found in the literature is the computational cost regarding the hierarchical clustering method. We highlight that the Hierarchical approach is a deterministic one that does not depend on initialization. Therefore, a stochastic method like K-Medoids may request more aggregate computational effort, depending on the database, since it needs 30 independent runs.
It is essential to calculate multiple metrics, as only one might be insufficient to conclude the comparative analysis. It is also crucial since only one metric may not indicate the correct number of final clusters.

PCA Assessment for the Best Algorithms
The last analysis was performed to verify the influence of PCA on the results. The PCA potentially changes the classification based on the Borda count method ( Table 2). In this sense, we ran the top 3 classified algorithms following the same premises of Section 4.3, considering the 6 original numerical dimensions, without PCA: K-Means, K-Medoids, and Hierarchical. Figure 12 shows the values obtained for the same 6 indexes from Figures 6-11, presenting the corresponding behavior before and after PCA application. On the right, we plot the same curves of the aforementioned figures, considering these 3 algorithms.
Initially, we highlight that the numerical values in the ordered axis are not comparable with and without PCA, since the number of dimensions is different, so the metrics' values are also distinct.
For SSE, SSW, and SI, one can observe the same tendency in the curve, revealing that the use of PCA did not change the comparative performance. Considering the SSB, CH, and WB indexes, we noted that the Hierarchical method increased its performance, overcoming the K-Means with PCA, and presenting a close behavior regarding the K-Medoids. i. 2021, 11, x FOR PEER REVIEW 18 of 23 Figure 12. Shape comparison before and after the application of PCA to each Index. Figure 12. Shape comparison before and after the application of PCA to each Index.
In this regard, we elaborate a new ranking based on the Borda count method in Table 3 considering only Hierarchical, K-Means and K-Medoids:  Table 3 reveals the same tendency as Table 2. The K-Medoids and Hierarchical methods stood out, but the PCA favored the performance of the last. Due to the small number of contender algorithms, the lousy performance of K-Medoids and the outstanding performance of Hierarchical regarding the SI-index are barely visible. Because of this, the high number of wins achieved by the K-Medoids increased its position in the ranking. Therefore, it is likely that the ranking in Table 2 should not be much modified in the top positions with or without PCA, corroborating the findings presented in Section 4.2.
In general, it can be graphically verified that the influence of the PCA application does not interfere in the shape of the curve for the different indexes and algorithms despite slight ranking changes. Moreover, we noted a decrease in the average running time of K-Means and K-Medoids of around 2% and 8% regarding the Hierarchical algorithm. We advocate using PCA and more than one clustering method in real applications, especially for large datasets.

Company Feedback on Results
At the end of the computational step, with the clustering results, and after verifying the best performance of the Hierarchical, K-Medoids, and K-Means algorithms, the authors return to the automotive company to evaluate the findings. A meeting was held with the product engineering, manufacturing engineering, controllership, and purchasing teams to analyze the results critically. It was detected that three driver seats were in a cluster while the other was in a single group. When observing the original dataset, we found that the information in one of the dimensions of the single-seat was wrong, which led to a distortion in the results.
Other similar errors have been observed, mainly in small components whose basic geometric dimensions X, Y, and Z sometimes correspond to the packaging itself and not the actual dimensions of the part. In other cases, it was observed as a set of pieces and not as a single sample. A similar situation has occurred with the weight of the components. Once the data were corrected, the new groups formed were coherent.
Finally, we highlight that this investigation is a case study that tabulates results and displays a rank on the performance between 9 clustering algorithms considering six indexes primarily found in the literature. Although we are not introducing new or improved clustering techniques, clustering methods to identify price anomalies in an automaker is a relevant contribution to the field.

Conclusions and Future Work
In the automotive industry, it is widespread to develop new products with similar characteristics designed by different development teams, which increases the complexity of parts communization abroad for all products. Due to the high volume of data generated by multiple teams, that do not use data mining tools, this investigation can press the production cost by the supply chain increase of complexity. In this context, the present work used clustering tools to recognize the similarity between pieces of real databases