Benchmarking Studies Aimed at Clustering and Classiﬁcation Tasks Using K-Means, Fuzzy C-Means and Evolutionary Neural Networks

: Clustering is a widely used unsupervised learning technique across data mining and machine learning applications and ﬁnds frequent use in diverse ﬁelds ranging from astronomy, medical imaging, search and optimization, geology, geophysics, and sentiment analysis, to name a few. It is therefore important to verify the effectiveness of the clustering algorithm in question and to make reasonably strong arguments for the acceptance of the end results generated by the validity indices that measure the compactness and separability of clusters. This work aims to explore the successes and limitations of two popular clustering mechanisms by comparing their performance over publicly available benchmarking data sets that capture a variety of data point distributions as well as the number of attributes, especially from a computational point of view by incorporating techniques that alleviate some of the issues that plague these algorithms. Sensitivity to initialization conditions and stagnation to local minima are explored. Further, an implementation of a feedforward neural network utilizing a fully connected topology in particle swarm optimization is introduced. This serves to be a guided random search technique for the neural network weight optimization. The algorithms utilized here are studied and compared, from which their applications are explored. The study aims to provide a handy reference for practitioners to both learn about and verify benchmarking results on commonly used real-world data sets from both a supervised and unsupervised point of view before application in more tailored, complex problems.


Introduction
The concept of machine intelligence (M.I.) has been studied and expanded upon over the course of many years. There are a wide variety of applications that the various methods and algorithms belonging to the discipline can be applied to, and as technology grows and improves, more techniques have become feasible to implement given a spurt in computational capacity. The ability to classify data into groups of interest has a wide range of applications in many fields from medical imaging to astronomy. In addition to that, it is particularly important to businesses that rely on field data to make decisions about resource allocation. It is thus important to verify their effectiveness to ensure the validity of the end results generated from metrics returned by the algorithms. This work attempts to explore classification and clustering algorithms, with a focus on the successes and limitations for each. To do so, a set of benchmarking data sets of varying sizes and complexities are taken and run through implementations of the algorithms, and the performance indices are calculated and recorded for later analysis. The algorithms explored in this work are K-Means and Fuzzy C-Means as well as an evolutionary feed-forward neural network with a fully connected topology particle swarm optimization (PSO), which serves as the weight update scheme. K-Means is a widely known algorithm and has been utilized in many different situations. While it is known for being flexible and simple to implement, it is extremely sensitive to its initialization parameters and tends to become stuck in local optima [1]. The Fuzzy C-Means (FCM) algorithm takes a soft clustering approach, utilizing a membership function that allows a data point to belong to multiple clusters at once. Similar to K-Means however, models can often become stagnant and fall into local optima [2]. Branching out somewhat from clustering algorithms is the classifier utilizing a feed-forward neural network with the PSO optimizer instead of traditional gradient descent. Neural networks can have promising performance for pattern recognition and classification problems, as they are able to approximate a function that can represent the patterns found in the input data [3]. Unless adequate care is taken, however, the network can end up over-fitting or under-fitting the data, and its optimizer may be trapped in local minima. To remedy this, it is necessary to properly propagate the node weights at the various iterations of the training process. This is achieved by using a gradient-free swarm algorithm viz. the PSO, which fine-tunes the network weights. While PSO can be used in conjunction with other algorithms, it is also able to act as a standalone optimizer. PSO works well for finding optimization solutions globally, as it can sample a large region of the search space within a given computational budget, but it must be parametrized properly to avoid trapping in a local optimum [4]. There are a number of parameters that influence the functionality of the algorithm, and while this can allow it to be reworked to better fit an individual data set, incorrect tuning can also cause its performance to suffer. Thus, hybridizing PSO with another classifier algorithm can help eliminate the weaknesses of both [1] for computational applications as in [5].
There have been several studies that have looked at integrating swarm optimizers within feed-forward neural networks in order to maximize performance gains irrespective of the nature of the cost function across a variety of swarm intelligence algorithms [6][7][8].
In general, swarm optimizers are a collection of guided random search techniques that draw inspiration from natural or physical processes and use multi-agent group dynamics to formulate laws of motion and convergence [9][10][11][12][13][14]. These algorithms are distributed in nature and are well-known for their efficacy in approximately solving real-world engineering problems such as job shop scheduling [15], vehicle routing [16,17], digital filter design [18,19], portfolio optimization [20,21], and more.
The goal of this work is to explore the performance of these algorithms and comment on the successes and limitations of each, and by doing so, form a better understanding of how they can influence the results so that when they are applied to real world problems, the meaning behind the results is better illuminated. A flowchart representing the organization of the paper can be found in Figure 1. The subsequent sections are structured as follows: Section 2.1 details the K-Means algorithm. Section 2.2 does the same for the Fuzzy C-Means algorithm. Section 2.3.1 describes a feed-forward neural network. Section 2.3.2 describes the PSO algorithm. Section 3 details the experimental setup. Section 4 covers the results and discussion of the experimentation, and finally, Section 5 contains the conclusions reached and possible future work in the area.
Objective of the research: To put the main objective of the experiment; in other words, the goal is to take a more intimate look at the set of algorithms and attempt to understand not just how they function, but what that functionality means for the situations to which they are applied. Every algorithm has strengths and weaknesses that can be studied and worked around, but by understanding these strengths and weaknesses, a new perspective can be gained on how they can apply to the results. While it is easy enough to apply the algorithm and take the results at face value, understanding how those results came about allows us to make more sense of how the data set is interacting with the algorithm and if any biases or misrepresentations are influencing our ability to understand the model. This is not an easy task, however. Not all algorithms can be broken apart easily and making sense of any information gleaned from doing so can be difficult given the complexity that goes into these algorithms. Furthermore, the data being analyzed play a large part in this process, so a decent understanding of the data sets is also incredibly important. It is hoped that this work will serve as a tutorial for practitioners looking at data clustering and will influence their choice of model selection, parameter choices, and expected results for certain popular data sets. Objective of the research: To put the main objective of the experiment; in other words, the goal is to take a more intimate look at the set of algorithms and attempt to understand not just how they function, but what that functionality means for the situations to which they are applied. Every algorithm has strengths and weaknesses that can be studied and worked around, but by understanding these strengths and weaknesses, a new perspective can be gained on how they can apply to the results. While it is easy enough to apply the algorithm and take the results at face value, understanding how those results came about allows us to make more sense of how the data set is interacting with the algorithm and if any biases or misrepresentations are influencing our ability to understand the model. This is not an easy task, however. Not all algorithms can be broken apart easily and making sense of any information gleaned from doing so can be difficult given the complexity that goes into these algorithms. Furthermore, the data being analyzed play a large part in this process, so a decent understanding of the data sets is also incredibly important. It is hoped that this work will serve as a tutorial for practitioners looking at data clustering and will influence their choice of model selection, parameter choices, and expected results for certain popular data sets.

K-Means Algorithm
The K-Means algorithm is a well-known clustering algorithm that has been widely applied in many different computational problems [5,22,23]. It is an iterative algorithm, which means that it runs continuously from a given starting point and generates new approximate solutions by adding improvements to the results from previous runs. As K-Means is a clustering algorithm, its goal is to assign all data points from a given data set into a predefined number of clusters. To do this, it attempts to minimize the Euclidean distance between each of the data points belonging to the distinct cluster groups. The Euclidean distance between two points (in Euclidean space) is simply the length of the line segment between the two chosen points. It can be computed from the Cartesian

K-Means Algorithm
The K-Means algorithm is a well-known clustering algorithm that has been widely applied in many different computational problems [5,22,23]. It is an iterative algorithm, which means that it runs continuously from a given starting point and generates new approximate solutions by adding improvements to the results from previous runs. As K-Means is a clustering algorithm, its goal is to assign all data points from a given data set into a predefined number of clusters. To do this, it attempts to minimize the Euclidean distance between each of the data points belonging to the distinct cluster groups. The Euclidean distance between two points (in Euclidean space) is simply the length of the line segment between the two chosen points. It can be computed from the Cartesian coordinates of the chosen points using the Pythagorean theorem and is also the Pythagorean distance.
p, q are two points in the Euclidean n-space, and q i and p i are the Euclidean vectors initiated at a chosen origin. By minimizing the Euclidean distance between points in the group, a higher similarity score can be determined from the groupings. Furthermore, the points in the cluster groups are represented by the cluster centroid, which can be determined by calculating the average of all of the points in the cluster, much like how the center of mass is computed in an n-body problem. Given this context, it can be said that the K-Means algorithm attempts to determine the positions of the centroids that give us the best results in terms of grouping data points that share similar signatures or that are members of the same cluster.
There are many benefits to using the K-Means algorithm [24]. First off, it is comparatively simple to implement and understand. Given that there are only a handful of calculations needed for the vanilla version of the algorithm and that those calculations are simple to both understand and implement, the algorithm can be put together in a functional way much more easily than many other similar algorithms. Second, execution is fast. As one may expect from the simple design, the iterations of the algorithm can be run very quickly, leading to quick convergence. K-Means is also a very flexible algorithm. Its structure makes data sets of all sizes usable for the purpose of clustering, making it perfect for situations where many data sets of varying sizes and complexities need to be worked with.
K-Means does have its drawbacks, however [24]. The algorithm does not determine the optimal set of clusters on its own, relying on the user to provide them. It is also very sensitive to the initialization of parameters. Changes in how the cluster centroids are initialized or how the data set is ordered or formatted can end up changing the results put forward by the algorithm. The algorithm also assumes that the clusters in the data will be spherical, which means that its effectiveness is much more limited when working with data sets with varied cluster shapes.
The standard implementation scheme of the K-Means algorithm is structured below [1]: 1. Randomly initialize the k cluster centroids in space.
2. Repeat until a termination condition is satisfied. Assign each data point in space to the cluster centroids that have the closest distance to the point. The Euclidean distance of data point y p to the centroid in d-dimensional space is given as: Recalculate the cluster centroids based on the definition of centroid. As such, the centroid for cluster j is determined as follows: where C j is the subset of data points belonging to the cluster j, and n j is the number of data points in this cluster. The clustering process terminates when one of the following conditions is satisfied: 1.
The number of iterations exceeds a predefined maximum.

2.
When change in the cluster centroids is negligible.

3.
When there is no cluster membership change.
An interested reader may take a look at J. MacQueen's seminal work on the methods for the classification and analysis of multivariate observations [5] for a deeper understanding of the foundations of the K-Means algorithm.

Fuzzy C-Means Algorithm
Similar to the K-Means algorithm, the Fuzzy C-Means (FCM) algorithm attempts to group a set of N objects into C clusters [25,26]. In other words, the objective is to divide the object set D = {D 1 , D 2 , D 3 , . . . D N } into C clusters (1 < C < N) with = { 1 , 2 , 3 , . . . , c } representing the cluster centers. Every data point being inves-tigated is assigned to a cluster, the centroids of which are randomly initialized. This assignment is done using the membership function µ ij defined in [2]: d ij = x i − y j represents the distance between the ith center and the jth data point, d rj = x r − y j is the distance between the rth center and the jth data point, and m [1, ∞) is a fuzzifier. The FCM algorithm uses an iterative gradient descent, an optimization algorithm used to find the extremum of a function, to compute centroids. The centroids, in turn, are updated using the following equation [27]: The objective function to be minimized by the algorithm can be formulated as the sum of membership weighted Euclidean distances: Through the recursive calculation of Equations (5) and (6), the algorithm can be stopped once a termination condition is met. Similar to other algorithms that utilize gradient descent, FCM can stagnate to local optima in a multidimensional search space. Therefore, the utilization of a stochastic optimization approach to compute cluster centroids can help mitigate this problem to a great extent [28,29].

Feed Forward Neural Network with a Particle Swarm Optimization (PSO) Driven Weight Update Scheme
For this work, an implementation of a neural network using particle swarm optimization as the weight optimization function was utilized, and while the two are different topics, as they are intrinsically linked in this work; this section will contain descriptions of both.

Feed-Forward Neural Network
A neural network consists of several connected functional units or neurons that come together as parallel information processing units in order to solve classification and regression tasks. Working together, these units can separate the input space into a discrete number of classes, i.e., they can approximate the underlying function that maps inputs to outputs. For this work, a feed-forward neural network is used. This is a multi-layered network of neurons consisting of an input and output layer of nodes along with several hidden layers that link them. The hidden layers learn the intermediate mappings that eventually are propagated to form the final mapping in the form of the output. Within a hidden layer, each neuron is connected to the outputs of a subset (or all) of the neurons in the previous layer each multiplied by a number called a weight. Neural networks learn by changing their weights to approximate a function representative of the input patterns. Instead of traditional gradient descent, the PSO algorithm is used as the optimizer to fine tune the network weights as the neural network learns the optimal decision boundary on the benchmarking data sets. A simple model of the discussed network utilizing a fully connected PSO is shown in Figure 2.
hidden layer, each neuron is connected to the outputs of a subset (or all) of the neurons in the previous layer each multiplied by a number called a weight. Neural networks learn by changing their weights to approximate a function representative of the input patterns. Instead of traditional gradient descent, the PSO algorithm is used as the optimizer to fine tune the network weights as the neural network learns the optimal decision boundary on the benchmarking data sets. A simple model of the discussed network utilizing a fully connected PSO is shown in Figure 2.
(a) (b) Figure 2. (a) A feed-forward neural network with n hidden layers, p input variables, and q output variables, with weights W; (b) fully connected topology [30] in PSO where each particle is connected to all other particles in the swarm (shown using 12 particles).
While there are various types of propagation in neural networks, the feed-forward network only passes the data between nodes in one direction, toward the output nodes. In this type of algorithm, the first step is to randomly initialize the weights before tuning the values to minimize the error. This is followed by passing the data forward between layers, applying the weights and evaluating the predictions. These steps will repeat for each layer. The output for the layers in the network can be represented by the following equation [31]: In this equation is the input and is the activation function in layer . The output of said layer, , becomes the input for the subsequent layer. Once trained, the While there are various types of propagation in neural networks, the feed-forward network only passes the data between nodes in one direction, toward the output nodes. In this type of algorithm, the first step is to randomly initialize the weights before tuning the values to minimize the error. This is followed by passing the data forward between layers, applying the weights and evaluating the predictions. These steps will repeat for each layer. The output for the layers in the network can be represented by the following equation [31]: In this equation X i is the input and f i is the activation function in layer i. The output of said layer, X i+1 , becomes the input for the subsequent layer. Once trained, the network should be able to approximate a function that can fulfil the design goal for the algorithm [31].
There are several problems that can arise if the network is not properly trained. Things such as overfitting, results being trapped by local minima, and the vanishing gradient problem can all affect the results depending on how the algorithm is structured. For this reason, many techniques such as altering the weight initialization technique have been used to help mitigate some of these issues [31].

Particle Swarm Optimization (PSO)
The particle swarm optimization (PSO) algorithm is a metaheuristic global optimization paradigm that has received prominence over the last two decades owing to its easy implementation with efficient performance over multidimensional problems that cannot be solved using traditional deterministic algorithms. Its design was inspired by the movement of a flock of birds [1]. For the scope of the algorithm, each "bird" would be represented by a particle, while the "flock" is represented by the swarm. To utilize the swarming mechanisms of the PSO, each particle is initialized by the algorithm and is then placed in the search space. After this initialization, the program is run iteratively, with each particle utilizing a velocity calculation to determine how they "fly" through the search space. In this computation, three different parameters are used. First, the inertial weight of the particle, which represents how much the previous direction of movement influences the next. Second, the social weight, which determines how much the global best position influences the particles' movement. Finally, the cognitive weight, which determines how much the location of an individual particle's personal best location influences its movement. The equations for calculating the velocity and position after each iteration is as follows [2]: C 1 and C 2 are constants representing the social and cognitive weights, r 1 and r 2 are independent and identically distributed random numbers between 0 and 1. x ij and v ij represent the position and velocity, respectively, of the ith particle in ith dimension, whereas p ij (t) and p gj (t) are the personal best (pbest) and global best (gbest) positions. In the velocity update scheme, ω represents the inertial weight of the ith particle, and the following two terms add in guided vectors towards areas of attraction (also known as local attractors) in the particle's direction of movement. The personal best update uses a greedy update scheme with regard to a cost minimization goal, modeled using the following equation: In this equation, set f is the cost, and p i is the personal best of the particle. The global best p g is the element with the lowest cost result from the set of personal bests of a particular particle. While PSO can cover a large hyperplane of the search space, finding the right values of the parameter set that lead to the best outcome is a time-consuming task that requires careful analysis.

Parameter Settings for PSO
In our analysis of the clustering benchmarks using the K-Means implementation, the centroids were determined through a randomly selected group of points. To ensure adequate variability in the starting points, each run of the algorithm had its starting points determined independently, and the mean and standard deviation of the performance metrics were reported. The K-Means algorithm expects the number of clusters to be set beforehand, so care was taken to do the same for each of the data sets considered. The Fuzzy C-Means implementation is similar to K-Means in how starting centroid locations needs to be determined before the model begins to run. Just as with K-Means, the centroids were randomly chosen; however, unlike K-Means, the starting centroids were randomly chosen locations within the search space of the data set instead of from the existing points. In addition to this, the membership matrix needed to be initialized. Since each data point has three factors determining membership with all clusters, each point was given a corresponding list of factors and was randomly assigned one cluster to have a strong correlation with. The parameters for the feed-forward neural network with particle swarm optimization (FNN-PSO) for the experiments were the inertia weight ω and the cognitive and social parameters C0 and C1. For the inertia weight, an upper bound and a lower bound were chosen to be the range determinants, and during execution, a randomly generated value inside this range was chosen. This was repeated for each particle in the swarm during the optimization phase of the neural network. To ensure optimal results for all benchmarking data sets, the inertia weight range and the social and cognitive parameters were tailored to each individual data set. Table 1 contains the values used for each data set. The parameter selection was conducted by linearly decreasing the inertia weight between the stated ranges in Table 1 and by conducting a grid search with increments of 0.1 between 0 and 1 to determine the optimum cognitive and social factor values. The neural network parameters also needed to be initialized for each run. To ensure that the data set could be properly processed, the number of input nodes was set to the number of attributes for each individual data set, and to compliment this, the number of output nodes was set to one less than the number of input nodes. As for the number hidden layers, a static 20 was used for all of the data sets. This was done to add a bit more consistency to the repeated runs of each data set. While there is overlap between many of the factors utilized for this experiment, given how sensitive the implementation was to each of them, adequate fine tuning was needed to ensure that the algorithm generated the best results. The best example for this collection is the Seeds data set. Unlike K-Means, the neural network with PSO did not have the number of categories/clusters initialized at the start, so the algorithm needed to generate the categories for each run. The Seeds data set contains three categories, and while the outside of this the data itself does not appear to be much more complex than the other sets, obtaining a properly usable set of models from the initialization was much harder for this data set than it was for the other data sets. The Seeds data set took longer than any other data set to find the proper parameters for, and it still was the data set that the model struggled the most with to obtain accurate results. The further the initialization parameters strayed from their optimal values, the more difficult it was to obtain accurate results from the model. At times, the model would end up with a prediction of the incorrect number of categories, usually ending up stagnated on one prediction category.

Data Sets
The data sets utilized for the benchmarking are five well known ones. All but the Iris data are taken from the UCI Machine Learning Repository [32], with the Iris data set being taken from the scikit-learn library [33]. These are described below: 1.
R. A. Fisher's Iris data set covering three species of Iris flowers (Setosa, Versicolor, and Virginica) containing a total of 150 instances with 4 attributes each. The attributes are as follows: sepal length, sepal width, petal length and petal width. The first two attributes i.e. Sepal Length and Sepal Width are shown in Figure 3.
The data sets utilized for the benchmarking are five well known ones. All but the Iris data are taken from the UCI Machine Learning Repository [32], with the Iris data set being taken from the scikit-learn library [33]. These are described below: 1. R. A. Fisher's Iris data set covering three species of Iris flowers (Setosa, Versicolor, and Virginica) containing a total of 150 instances with 4 attributes each. The attributes are as follows: sepal length, sepal width, petal length and petal width. The first two attributes i.e. Sepal Length and Sepal Width are shown in Figure 3.  3. Seeds data set, which is used to identify 3 different varieties of wheat: Kama, Rosa, and Canadian. The data set contains 210 instances with 7 attributes each. The attributes are as follows: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. The first two attributes i.e. Area and Perimeter are shown in Figure 5.  The data sets utilized for the benchmarking are five well known ones. All but the Iris data are taken from the UCI Machine Learning Repository [32], with the Iris data set being taken from the scikit-learn library [33]. These are described below: 1. R. A. Fisher's Iris data set covering three species of Iris flowers (Setosa, Versicolor, and Virginica) containing a total of 150 instances with 4 attributes each. The attributes are as follows: sepal length, sepal width, petal length and petal width. The first two attributes i.e. Sepal Length and Sepal Width are shown in Figure 3.  3. Seeds data set, which is used to identify 3 different varieties of wheat: Kama, Rosa, and Canadian. The data set contains 210 instances with 7 attributes each. The attributes are as follows: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. The first two attributes i.e. Area and Perimeter are shown in Figure 5.

3.
Seeds data set, which is used to identify 3 different varieties of wheat: Kama, Rosa, and Canadian. The data set contains 210 instances with 7 attributes each. The attributes are as follows: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. The first two attributes i.e. Area and Perimeter are shown in Figure 5.
The data sets utilized for the benchmarking are five well known ones. All but the Iris data are taken from the UCI Machine Learning Repository [32], with the Iris data set being taken from the scikit-learn library [33]. These are described below: 1. R. A. Fisher's Iris data set covering three species of Iris flowers (Setosa, Versicolor, and Virginica) containing a total of 150 instances with 4 attributes each. The attributes are as follows: sepal length, sepal width, petal length and petal width. The first two attributes i.e. Sepal Length and Sepal Width are shown in Figure 3.  3. Seeds data set, which is used to identify 3 different varieties of wheat: Kama, Rosa, and Canadian. The data set contains 210 instances with 7 attributes each. The attributes are as follows: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, and length of kernel groove. The first two attributes i.e. Area and Perimeter are shown in Figure 5.

4.
Mammographic Mass data set containing 961 instances and is used to classify data into two categories, benign and malignant. Each instance contains 6 attributes based on BI-RADS data and the patient's age. The attributes are as follows: BI-RADS assessment, age, shape, margin, density, and severity. The first two attributes i.e., BI-RADS Assessment and Age are shown in Figure 6.
4. Mammographic Mass data set containing 961 instances and is used to classify data into two categories, benign and malignant. Each instance contains 6 attributes based on BI-RADS data and the patient's age. The attributes are as follows: BI-RADS assessment, age, shape, margin, density, and severity. The first two attributes i.e. BI-RADS Assessment and Age are shown in Figure 6.

Missing Value Replacement:
In the cases of the Breast Cancer data set and the Mammographic Mass data set, the data contained instances where some attribute values were missing and thus needed to be dealt with for the algorithms to function. To this end, the average value for each attribute was calculated, and any missing values were replaced with that average.

Performance Indices
Measuring the performance indices allows us to determine the effectiveness of a given model to correctly classify instances in each data set. The clustering indices are as follows:  Accuracy: This is also known as the Rand index. The accuracy is described as the percentage of correct decisions made by the algorithm. The formula for accuracy is as follows: In this equation, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and finally, FN is the number of false negatives. A true positive is the outcome in binary classification when the model under test correctly predicts the existence of a condition. A true negative is the outcome when the same model correctly predicts the absence of a condition. A false positive represents an error in binary classification tasks such that the result falsely indicates the presence of a certain condition.

5.
Sonar Data set, which consists of 208 instances with 60 attributes each. The instances are classified into either mines or rocks. The attributes represent the energy within a specific frequency band. The first two attributes i.e. Frequency Band 1 and Frequency Band 2 are shown in Figure 7.
4. Mammographic Mass data set containing 961 instances and is used to classify data into two categories, benign and malignant. Each instance contains 6 attributes based on BI-RADS data and the patient's age. The attributes are as follows: BI-RADS assessment, age, shape, margin, density, and severity. The first two attributes i.e. BI-RADS Assessment and Age are shown in Figure 6.

Missing Value Replacement:
In the cases of the Breast Cancer data set and the Mammographic Mass data set, the data contained instances where some attribute values were missing and thus needed to be dealt with for the algorithms to function. To this end, the average value for each attribute was calculated, and any missing values were replaced with that average.

Performance Indices
Measuring the performance indices allows us to determine the effectiveness of a given model to correctly classify instances in each data set. The clustering indices are as follows:  Accuracy: This is also known as the Rand index. The accuracy is described as the percentage of correct decisions made by the algorithm. The formula for accuracy is as follows: In this equation, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and finally, FN is the number of false negatives. A true positive is the outcome in binary classification when the model under test correctly predicts the existence of a condition. A true negative is the outcome when the same model correctly predicts the absence of a condition. A false positive represents an error in binary classification tasks such that the result falsely indicates the presence of a certain condition.

Missing Value Replacement:
In the cases of the Breast Cancer data set and the Mammographic Mass data set, the data contained instances where some attribute values were missing and thus needed to be dealt with for the algorithms to function. To this end, the average value for each attribute was calculated, and any missing values were replaced with that average.

Performance Indices
Measuring the performance indices allows us to determine the effectiveness of a given model to correctly classify instances in each data set. The clustering indices are as follows: • Accuracy: This is also known as the Rand index. The accuracy is described as the percentage of correct decisions made by the algorithm. The formula for accuracy is as follows: In this equation, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and finally, FN is the number of false negatives. A true positive is the outcome in binary classification when the model under test correctly predicts the existence of a condition. A true negative is the outcome when the same model correctly predicts the absence of a condition. A false positive represents an error in binary classification tasks such that the result falsely indicates the presence of a certain condition. On the other hand, a false negative is an error that implies that the result incorrectly fails to indicate the presence of some condition, but it is indeed present.
• F-Score: Another means of calculating accuracy. It is calculated using the precision and recall of a model the formula for F-Score is as follows: • Inter-Cluster Distance: The sum of the distances between each cluster centroid. Larger values indicate a greater separation between clusters, meaning less overlap between the clusters in the model. The formula for inter-cluster distance is as follows [34]: where c j represents the centroid of cluster j. • Intra-Cluster Distance: The sum of the distances between individual data points and their respective parent centroid. Smaller values indicate more compact clusters and are therefore desired. The formula for intra-cluster distance is as follows [34]: k is the number of clusters, c j represents the centroid j, and n refers to the number of data instances. The section x i − c j 2 represents the distance between data instances and their respective centroid. • Quantization Error: The sum of the distances between data points and their parent centroid divided by the total number of data points belonging to the cluster and then summed over all clusters and averaged for all of the clusters in the model. The formula for quantization error is as follows [34]: In this equation, k is the number of clusters, c j is the centroid of cluster j, N j represents the number of data points in cluster j, and finally, the section x i − c j 2 represents the distance between data instances and their respective centroid. As the PSO neural network is not a clustering algorithm and rather performs a classification task; it uses different indicators to measure performance. As discussed before, these include the global best loss and indices such as F-score and accuracy.
The algorithms were implemented in Python 3.9.1 and were executed with an AMD Ryzen 5 3600 6-Core Processor at 3.6 GHz. The experimental results for 50 trials are tabulated and are then analyzed. Table 2 details the data sets used in this work. Table 2. Data Set Information.

K-Means Implementation
For the K-Means implementation, the vanilla version of the algorithm was not altered. Before the start of the algorithm, the data set needed to be split into the base data and the classifications associated with each data point. To achieve this, two containers were set, with the first containing a list of the attribute values for each data point with any missing values replaced, and the second containing a list of the corresponding classifications. From this point, the centroids were determined. Originally, the centroids were statically chosen from the data set, but this method was replaced with a randomly selected set of points. While the original method was able to return predictable results, the goal of the experiment was to study the algorithm's overall performance across a range of choices, and by varying the initialization of the centroids, a much larger range of the attribute space was available to be investigated. The algorithm was set to terminate after a set number of iterations were completed if there were no significant performance gains.
After the algorithm finished each of its runs, the final positions of the centroids were utilized to calculate the performance indices for the run. Once calculated, the results were stored in a separate table to later be collected and averaged. The mean value and the standard deviation over 50 runs were calculated and stored for comparison. The results for the K-Means runs can be found in Table 3.

Trapping in Local Optima
For the K-Means algorithm and many other clustering algorithms, the starting points can have a strong influence on the performance of the model. Just as there are good starting points that allow the algorithm to find appropriate centroids to represent the clusters, there are also starting points that cause the algorithm to incorrectly model the data. The Iris data, as an example, is quite susceptible to poorly initialized centroids. Figure 8 shows a model with centroids that correctly represent the data, and Figure 9 shows a model where the starting centroids cause the model to incorrectly model the data. missing values replaced, and the second containing a list of the corresponding classifications. From this point, the centroids were determined. Originally, the centroids were statically chosen from the data set, but this method was replaced with a randomly selected set of points. While the original method was able to return predictable results, the goal of the experiment was to study the algorithm's overall performance across a range of choices, and by varying the initialization of the centroids, a much larger range of the attribute space was available to be investigated. The algorithm was set to terminate after a set number of iterations were completed if there were no significant performance gains.
After the algorithm finished each of its runs, the final positions of the centroids were utilized to calculate the performance indices for the run. Once calculated, the results were stored in a separate table to later be collected and averaged. The mean value and the standard deviation over 50 runs were calculated and stored for comparison. The results for the K-Means runs can be found in Table 3.

Trapping in Local Optima
For the K-Means algorithm and many other clustering algorithms, the starting points can have a strong influence on the performance of the model. Just as there are good starting points that allow the algorithm to find appropriate centroids to represent the clusters, there are also starting points that cause the algorithm to incorrectly model the data. The Iris data, as an example, is quite susceptible to poorly initialized centroids. Figure 8 shows a model with centroids that correctly represent the data, and Figure 9 shows a model where the starting centroids cause the model to incorrectly model the data. In the dimensions of the data, the grouping on the top left of the graph represents a single cluster, while the larger mass is split into the remaining two. However, in a situation where the initialization places two of the centroids in the upper left group, there is a chance that the algorithm will assign two of the clusters to the upper left grouping and leave the remaining centroid to represent the larger grouping in the lower right. This can In the dimensions of the data, the grouping on the top left of the graph represents a single cluster, while the larger mass is split into the remaining two. However, in a situation where the initialization places two of the centroids in the upper left group, there is a chance that the algorithm will assign two of the clusters to the upper left grouping and leave the remaining centroid to represent the larger grouping in the lower right. This can come as a result of the K-Means algorithm's weakness with regard to getting stuck on local minima [1]. The random initialization of the points is what was used to help mitigate some of the sensitivity to different initialization.

Fuzzy C-Means Implementation
As with the K-Means algorithm, the base functionality of the algorithm was not altered. Before processing, the data were split into two parts. The first part contained the data points themselves, and the other contained their respective classifications. From there, the centroids were initialized. Unlike K-Means, for the Fuzzy C-Means, implementation of the points was not initialized from points in the data set. Instead, the data were used to set up the range of space where the values are located, and a random set of positions in that space were chosen to act as the centroids. Additionally, unlike K-Means, the membership matrix needed to be initialized. To do so, a membership array was created for each data point in the set, with a number of values equal to the number of classifications/clusters. After the matrix itself was created, each data point was randomly assigned one cluster to have a strong relationship with. As with the other algorithms, the randomness introduced in the initialization facilitated the study of the algorithm's performance over a larger range of the attribute space. The algorithm was set to terminate after a set number of iterations were completed if there were no significant performance gains.
After each run of the algorithm was completed, the final positions of the centroids were used to calculate the various performance indices of the run. After being calculated, the results were stored in a separate table for later collection and averaging. The mean value and standard deviation of over 50 runs were calculated and stored for comparison. The results from the FCM runs can be found in Table 4.

Feed-Forward Neural Network Tuned by a Particle Swarm Optimization Algorithm
While the results of the K-Means and FCM algorithms were promising, to better understand the grouping of the benchmarking data from a predictive analysis point of view, a feed-forward neural network using a PSO optimizer was implemented. Using a supervised learning algorithm such as this that functions quite differently than the clustering algorithms utilized in the experiment allowed us to look at the problem with a different perspective. The results were good, as was expected, but the range of values were notably different than the ones returned by the other algorithms, and as such, different deductions could be made about the functionality of both types of algorithms that were utilized.
As for the implementation itself, similar to the K-Means implementation, the data needed to be split into groups of the data and the classifications. The training data and labels were set up with the raw data and the categories, respectively, and any fixes/replacement of values to the data set were applied. Once finished, the values for all of the necessary nodes were initialized. For this implementation, the dimensionality of the number of inputs, outputs, and hidden nodes were computed, and care was taken to fit the network with the data. The number of input nodes was set to match the number of attributes for the data set being analyzed, and the number of output nodes was set to be one less than the number of input nodes. For the number of hidden nodes, the number was set to 20 for each data set.
Furthermore, the data sets were split into training and testing groupings, with 70% of the data used to train the model and the remaining 30% used to evaluate the accuracy of the model. After each of the runs, the accuracy, global best loss, and F-score were calculated and stored for later use. The results from the 50 runs were stored, and the mean and standard deviation were calculated and stored in Table 5.

Results and Related Discussion
Tables 3-5 contain the results from the three algorithms used in this experiment. They contain the mean and standard deviation of the performance metrics calculated over 50 runs. Figure 10 plots the accuracy of the K-Means algorithm over its iterations, and Figure 11 plots the clustering model using K-Means in three dimensions. Figure 12 plots the accuracy of the FCM algorithm over its iterations, and Figure 13 plots the clustering model using FCM in three dimensions. Figure 14 Figure 11. Visualization of the cluster centers using K-Means on experimental data sets.            Seeds Figure 21. Accuracy of the algorithms on the Breast Cancer data set.     The K-Means algorithm was the first algorithm that was implemented for the experiment, so it was often used as a baseline for comparing the other algorithms. While the performance did not excel in many categories, the results were all within the expected range based on results from other similar experiments and in the literature [2]. One potential problem to note is that the standard deviation for the performance metrics from the Iris results are noticeably higher than the results from the other data sets.   The K-Means algorithm was the first algorithm that was implemented for the experiment, so it was often used as a baseline for comparing the other algorithms. While the performance did not excel in many categories, the results were all within the expected range based on results from other similar experiments and in the literature [2]. One potential problem to note is that the standard deviation for the performance metrics from  Figure 24. Accuracy of the algorithms on the Sonar data set.
The K-Means algorithm was the first algorithm that was implemented for the experiment, so it was often used as a baseline for comparing the other algorithms. While the performance did not excel in many categories, the results were all within the expected range based on results from other similar experiments and in the literature [2]. One potential problem to note is that the standard deviation for the performance metrics from the Iris results are noticeably higher than the results from the other data sets. This was found to be due to the algorithm's tendency to become stuck at local minima [1], which is detailed more in Section 5. The results from the Fuzzy C-Means algorithm have averages that are more or less similar to the results from K-Means. However, the standard deviation for those values is much lower. Despite randomly initializing the starting centroids, the model always seemed to converge to positions that effectively presented the same performance metrics. It can be inferred that the 'fuzzy' aspect of the model is what is contributing to this, allowing for any particular point to belong to each cluster centroid to a certain degree, but when it came time to classify, the highest weighted relationship with a centroid (no matter how small the difference between the various weights) was what determined which cluster the point belonged to. Fuzzy C-Means also did not seem to share the issue for the Iris data set with K-Means. This could imply that the membership matrix also allows better exploration of the data sets in certain conditions.
As for the neural network, while it does not share most of the performance metrics with the other two algorithms, accuracy and F-Score, which it does share, seem to have a significant improvement on most of the data sets. The Mammographic Mass and Sonar data sets in particular have a notable improvement over the two clustering algorithms. The most likely factor in this is the difference in how the algorithm determines the classification for the data points. As the neural network does not rely on the Euclidean distance to determine which points belong to which category and instead uses the approximation function generated by the network, the overlapping of the cluster areas that caused problems for the clustering algorithms is not apparent. It is also worth noting however that the performance of the Seeds data set was notably worse for the network. As shown in Figure 14, the Seeds data set struggles to converge, unlike most of the data sets. Looking at the standard deviations for the accuracies may offer some insight into this discrepancy. The network appears to have more volatility in its predictions then both clustering algorithms, and so it is possible that the setup of the nodes and activation function adds some problematic handling of the data that was not apparent in K-Means or Fuzzy C-Means.

Concluding Remarks and Observations from Related Studies
The goal of this work was to investigate the strengths and weaknesses of several clustering and classification algorithms to help validate the performance of each of them and to gain a better understanding of where they might be applicable. The simpler and relatively computationally cheap K-Means is well suited for general tasks but tends to fall short on more complex data sets. Fuzzy C-Means is similarly suited for general tasks and has the bonus of being able to better handle the ambiguity between clusters using a probabilistic membership matrix. Still, the limitations of utilizing a deterministic cluster update scheme can leave it falling short in data sets with overlapping clusters. Furthermore, both algorithms can end up trapped in local optima, creating an inaccurate model. The same problem may be looked at from a classification standpoint, where a feed-forward neural network optimized by PSO performs reliably on all of the data sets considered. The sensitivity to the initialization parameters implies that care must be taken to ensure that the algorithm returns meaningful results for any individual data set. However, there is no guarantee that even the most optimal set of parameters will result in a perfect classification, as that would depend on the complexity of the underlying mapping represented by the dataset.
While there is still work that can be conducted in this area, the experiments done in this paper have achieved some of the insights that were hoped for. While the clustering data sets performed admirably on some data sets, others were only slightly better than guesswork. This shows that while spatially grouping data can be adequate in some situations, it can only go so far. Furthermore, the improved exploration provided by the "Fuzzy" aspects of FCM did allow it to perform more consistently and even better in some situations than K-Means. Still, this strength did not allow it to overcome the fundamental weakness in spatially grouped data with overlapping clusters. The FNN-PSO had its strengths as well, relying not on spatial clustering, but approximating an equation to represent the data set, meaning overlapping data was not an issue. It did however fall short in the complexity of the implementation, especially with regard to the Seeds data set, where even the most optimal parameters that were found did not allow it to properly approximate the data. Additionally, as with the Breast Cancer data set, it did not perform much better than the simpler clustering algorithms despite this extra complexity. Going forward, it will be important to keep these things in mind, as knowing how the data set works is just as important as finding an algorithm that works well for the data set.
Future work in this area would include moving beyond benchmarking data sets and into more complex data in order to apply some of the insights gained from these introductory experiments. Furthermore, exploring the functionality of the algorithms on a step-by-step basis could give further insights into the exact strengths and weaknesses of the algorithms and their applicability to a large variety of interesting problems.

Related Work and Performance
While the results from the experiments here have given some of the desired insights, to validate some of these results, it is productive to compare them to the results reported in other, similar experiments. Below is a list of works that cover clustering and classification tasks utilizing the same data sets investigated in this paper.

Observations on Iris
In Soni and Patel [35], the K-medoids or partitioning around medoid method was utilized for benchmarking purposes. The K-medoids method was proposed to help reduce the effects of outliers in the K-Means algorithm. In this technique, a set number of random points is chosen before the distances between the data points, and the centroids are calculated. The results of clustering accuracy vs. true classes for the K-means algorithm was 88.7%, and for K-medoids, it was 92%. Bação, Lobo and Painho [36] investigated self-organizing maps (SOM) on the Iris data set. Self-organizing maps were utilized here in a clustering context, and function similar to K-means in how the data patterns were mapped to neurons in an n-dimensional grid and were then grouped based on distance in the output space. The results from this study show that self-organizing maps can perform well and are less prone to local optima than K-means. Essentially, this method allows earlier exploration of the search space and can avoid issues that K-means may have where the data have multiple optima.

Observations on Breast Cancer
Bayrak, Kirci and Ensari [37] cover the utilization of support vector machines and artificial neural networks on the Wisconsin Breast Cancer (Original) data set. Support vector machines are often used for pattern recognition problems and function by separating the classes with a hyperplane consisting of support vectors of critical samples from each of the classes. The results for the best implementation of a SVM put the classification accuracy at about 96.9957%. In Lavanya and Rani [38], the classification and regression trees (CART) method was tested on the Breast Cancer (Original) data set. This method uses recursive partitioning in a regression tree algorithm hybridized with bagging, a bootstrap aggregation method. Based on the experiments, this hybrid method was able to give a classification accuracy of about 97.85%.

Observations on Seeds
In Eldem [39], a deep neural network is utilized to test the effectiveness of a model trained on the Seeds data set. For this experiment, both the base data set and a synthetic data set were utilized to test the performance of the model. Multiple models were trained using a combination of these two data sets. First off, a model with only the base data set was trained on a 70-30 split, one trained was on the synthetic data set and tested on the base data set, and finally, a model utilizing a combination of both the base and synthetic data sets combined and utilizing a 70-30 train-test split was determined. The model utilizes categorical cross entropy as its loss function. The results show a 100% success rate across its tests, including the tests utilizing the synthetic data set.

Observations on Mammographic Mass
Williamson, Vijayakumar and Kadam [40] investigate the performance of the random forest algorithm on the Mammographic Mass data set. Before the model was run, multiple setup steps were utilized. First, missing value handling was completed using a K-nearest neighbors implementation; then, the values of the data set were standardized using minmax normalization, and finally, the chi-square test of independence was used to determine which attributes should be used for testing. After these pre-processing steps, a classification model using the random forest algorithm was created. With the proposed method, an accuracy of 84.7% was recorded.

Observations on Sonar
Mosavi, Khishe and Ebrahimi [41] utilized the Sonar data set to benchmark performance of a variety of online multi kernel classifier (OMKC) algorithms. In essence, the functionality of the algorithms proposed for this paper are based on the idea of training the multi kernel classifier and optimizing the linear combination simultaneously. The proposed variations in structure serve to improve both the accuracy and efficiency of operation and utilize stochastic methods for updating the kernel weights. The results of these various models have ranges in accuracy between 61.11% and 79.95%, with the best performance belonging to the implementation utilizing the stochastic selection of the kernels and the stochastic updating of the kernel weights.
While the methods used vary widely, the range of values produced from these works are in range with the ones produced by the experiments in this paper. For reference, one may also look at [42][43][44][45] as implementations of a related nature for benchmarking classification and clustering jobs.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The publicly available datasets analyzed in this work are accessible from UCI Machine learning Repository at https://archive.ics.uci.edu/ml/index.php (accessed on 10 February 2021) and the scikit-learn library at https://scikit-learn.org/stable/ (accessed on 10 February 2021).