Article

A k-Means Algorithm with Automatic Outlier Detection

Department of Mathematics, University of Connecticut, Storrs, CT 06268, USA
Electronics 2025, 14(9), 1723; https://doi.org/10.3390/electronics14091723
Submission received: 6 March 2025 / Revised: 17 April 2025 / Accepted: 22 April 2025 / Published: 23 April 2025
(This article belongs to the Special Issue Knowledge Engineering and Data Mining, 3rd Edition)

Abstract

Data clustering is a fundamental machine learning task found in many real-world applications. However, real data usually contain noise or outliers. Handling outliers in a clustering algorithm can improve the clustering accuracy. In this paper, we propose a variant of the k-means algorithm that provides data clustering and outlier detection simultaneously. In the proposed algorithm, outlier detection is integrated with the clustering process and is achieved via a term added to the objective function of the k-means algorithm. The proposed algorithm generates two partition matrices: one provides cluster groups and the other can be used to detect outliers. We use both synthetic data and real data to demonstrate the effectiveness and efficiency of the proposed algorithm and show that its clustering performance is better than that of other similar methods.

1. Introduction

Data clustering or cluster analysis refers to the process of dividing a set of objects into groups or clusters such that objects in the same cluster are similar to each other and objects from different clusters are quite distinct [1,2,3]. Data clustering has found applications in a wide range of areas [4]. For example, data clustering has been used to select representative samples of insurance policies for building predictive models [5].
In recent decades, many clustering algorithms have been developed by researchers from different fields [2,3,6]. Among these clustering algorithms, the k-means algorithm is one of the oldest and most commonly used [7,8]. The k-means algorithm is also considered one of the top ten algorithms in data mining [9]. Given a set of n data points X = {x_1, x_2, ..., x_n}, the k-means algorithm tries to divide the dataset into k clusters by minimizing the following objective function:
P(U, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \|x_i - z_l\|^2,    (1)
where U = (u_{il})_{n×k} is an n × k partition matrix, Z = {z_1, z_2, ..., z_k} is a set of cluster centers, and ‖·‖ is the L_2 norm or Euclidean distance. The partition matrix U satisfies the following conditions:
u_{il} \in \{0, 1\}, \quad i = 1, 2, \ldots, n, \; l = 1, 2, \ldots, k,
\sum_{l=1}^{k} u_{il} = 1, \quad i = 1, 2, \ldots, n.
In the k-means algorithm, k is the desired number of clusters specified by the user. The optimization problem of minimizing (1) is NP-hard [10]. To minimize the objective function, the k-means algorithm starts from k initial cluster centers selected randomly from the dataset and then alternately updates the partition matrix U and the cluster centers Z until some stopping criterion is met.
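To make this alternating procedure concrete, the following minimal Java sketch (an illustration only, not the implementation used in this paper; the class and method names are hypothetical) assigns each point to its nearest center and recomputes each center as the mean of its assigned points until the partition stabilizes.

```java
import java.util.Random;

// A minimal k-means sketch: alternately assign each point to its nearest center
// and recompute each center as the mean of the points assigned to it.
class KMeansSketch {
    // Squared Euclidean distance between two points of equal dimension.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int t = 0; t < a.length; t++) { double diff = a[t] - b[t]; s += diff * diff; }
        return s;
    }

    // Returns the cluster label of every point after at most maxIter iterations.
    static int[] cluster(double[][] x, int k, int maxIter, long seed) {
        int n = x.length, d = x[0].length;
        Random rng = new Random(seed);
        double[][] z = new double[k][];
        for (int l = 0; l < k; l++) z[l] = x[rng.nextInt(n)].clone();  // random initial centers
        int[] label = new int[n];
        for (int iter = 0; iter < maxIter; iter++) {
            // Update the partition matrix U: each point joins its closest center.
            boolean changed = false;
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int l = 1; l < k; l++)
                    if (dist2(x[i], z[l]) < dist2(x[i], z[best])) best = l;
                if (best != label[i]) { label[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break;  // partition is stable, objective (1) cannot improve
            // Update the centers Z: mean of the points assigned to each cluster.
            double[][] sum = new double[k][d];
            int[] count = new int[k];
            for (int i = 0; i < n; i++) {
                count[label[i]]++;
                for (int t = 0; t < d; t++) sum[label[i]][t] += x[i][t];
            }
            for (int l = 0; l < k; l++)
                if (count[l] > 0)
                    for (int t = 0; t < d; t++) z[l][t] = sum[l][t] / count[l];
        }
        return label;
    }
}
```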
One drawback of the k-means algorithm is that it does not have a built-in mechanism to detect outliers, and its result is sensitive to outliers. Improving the k-means algorithm to handle outliers and noisy data has the following benefits. First, considering outliers in the clustering process can improve the clustering accuracy of the k-means algorithm [11]. Second, outlier detection is an important data analysis task in its own right, with applications in intrusion detection and fraud detection [12]. As a result, several clustering algorithms based on the k-means algorithm have been developed to handle outliers. See, for example, [11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26].
The existing clustering algorithms for handling outliers can be divided into two categories: multi-stage algorithms and single-stage algorithms. As its name indicates, a multi-stage algorithm performs clustering and detects outliers at multiple stages. Examples of multi-stage algorithms include [11,13,15,16,18,20]. The Outlier Removal Clustering (ORC) algorithm [11] is one of the earliest approaches proposed to enhance k-means by addressing the impact of outliers. Unlike multi-stage algorithms, single-stage algorithms integrate outlier detection into the clustering process. Examples of single-stage algorithms include [12,14,21,22,23,24,25,26,27].
Among the single-stage algorithms, the algorithms proposed in [14,27] are extensions of the fuzzy c-means algorithm. Others are variants of the well-known k-means algorithm. For example, the ODC (Outlier Detection and Clustering) algorithm [12], the NEO-k-means (non-exhaustive overlapping k-means) algorithm [22], the k-means-- algorithm [21], and the KMOR (k-means clustering with outlier removal) algorithm [23] are extensions of the k-means algorithm that can provide data clustering and outlier detection simultaneously. A major difference between the KMOR algorithm and the other k-means variants (i.e., ODC, NEO-k-means, and k-means--) is that KMOR is based on the idea of an outlier cluster, which was introduced by [28]. In KMOR, the outlier cluster is assumed to have a constant distance from every data point. This constant distance is chosen to be the scaled average of the squared distances between all normal data points and their corresponding cluster centers.
However, one drawback of the KMOR algorithm is that it requires two parameters to control the number of outliers. Setting values for these parameters poses some challenges for real-world applications. In this paper, we propose the KMOD (k-means clustering with outlier detection) algorithm, which removes the drawback of the KMOR algorithm. In the proposed KMOD algorithm, the scaled average squared distance is incorporated into the objective function in an elegant way. There are two major differences between the KMOD algorithm and the KMOR algorithm:
  • KMOD requires only one parameter to control the number of outliers, while KMOR requires two parameters to do this.
  • In KMOD, outliers are used to update cluster centers with lower weights, while in KMOR, outliers are not used to update cluster centers.
Similarly to ODC, NEO-k-means, k-means--, and KMOR, the KMOD algorithm can also provide data clustering and outlier detection simultaneously.
The remaining part of this paper is organized as follows. In Section 2, we provide brief descriptions of some clustering algorithms that can conduct data clustering and outlier detection simultaneously. In Section 3, we present the KMOD algorithm in detail. In Section 4, we demonstrate the effectiveness of the KMOD algorithm using both synthetic data and real data. In Section 5, we conclude the paper with some remarks.

2. Related Work

In this section, we provide a review of the clustering methods that can detect outliers during the clustering process. In particular, we review the ODC algorithm, the NEO-k-means algorithm, the k-means-- algorithm, and the KMOR algorithm.

2.1. The ODC Algorithm

The ODC (Outlier Detection and Clustering) algorithm proposed by [12] is a modified version of the k-means algorithm. In the ODC algorithm, a data point whose distance to its centroid is at least γ times the average distance is considered an outlier. At each iteration, the average distance is calculated as follows:
d_{avg} = \frac{\sum_{l=1}^{k} \sum_{x \in C_l} \|x - z_l\|}{k\,|X|},
where k is the desired number of clusters, C_l is the lth cluster, X is the dataset after outliers are removed, ‖·‖ is the Euclidean distance, and {z_1, z_2, ..., z_k} is a set of cluster centers.
Although the ODC algorithm is a modified version of the k-means algorithm, the ODC algorithm does not have an objective function that incorporates outliers explicitly. In fact, the ODC algorithm uses the following S S E / S S T ratio to measure the quality of the clustering results:
\frac{SSE}{SST} = \frac{\sum_{l=1}^{k} \sum_{x \in C_l} \|x - z_l\|^2}{\sum_{l=1}^{k} \sum_{x \in C_l} \|x - \bar{x}\|^2},
where C_l is the lth cluster and \bar{x} is the mean of all the points in C_1 ∪ C_2 ∪ ... ∪ C_k. For a fixed k, a lower value of the ratio indicates a better result. Note that data points that are considered outliers are excluded from the calculation of SSE/SST.
In the ODC algorithm, outlier detection is not reversible: if a data point is labeled an outlier at a particular iteration, it cannot be reclassified as a normal point in later iterations.
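The outlier rule above can be illustrated with a short Java sketch (an illustration under the description in this subsection, not the authors' ODC code). It reuses the dist2 helper from the k-means sketch in the Introduction, and the method and variable names are hypothetical.

```java
// Sketch of ODC's outlier rule for one iteration: a point whose distance to its
// assigned centroid is at least gamma times d_avg is flagged as an outlier, and the
// flag is never removed in later iterations.
class OdcSketch {
    static void flagOutliers(double[][] x, double[][] z, int[] label,
                             boolean[] outlier, double gamma) {
        // Average distance d_avg over the points that are still considered normal.
        double sum = 0;
        int kept = 0;
        for (int i = 0; i < x.length; i++) {
            if (outlier[i]) continue;  // previously flagged points are excluded from d_avg
            sum += Math.sqrt(KMeansSketch.dist2(x[i], z[label[i]]));
            kept++;
        }
        double davg = sum / (z.length * kept);  // denominator k * |X| as in the formula above
        // Flag points whose distance is at least gamma times the average distance.
        for (int i = 0; i < x.length; i++) {
            if (!outlier[i] && Math.sqrt(KMeansSketch.dist2(x[i], z[label[i]])) >= gamma * davg) {
                outlier[i] = true;  // irreversible in ODC
            }
        }
    }
}
```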

2.2. The NEO-k-Means Algorithm

The NEO-k-means (non-exhaustive overlapping k-means) algorithm proposed by [22] can identify outliers during the clustering process. Its objective function is defined as
F(U, Z) = \sum_{j=1}^{k} \sum_{i=1}^{n} u_{ij} \|x_i - z_j\|^2,
where k is the desired number of clusters, Z = {z_1, z_2, ..., z_k} is a set of cluster centers, and U = (u_{ij})_{n×k} is a binary partition matrix. The matrix U satisfies the following conditions:
\mathrm{Tr}(U^T U) = (1 + \alpha) n, \qquad \sum_{i=1}^{n} I\left( \sum_{j=1}^{k} u_{ij} = 0 \right) \le \beta n,
where Tr(·) is the trace of a matrix and I(·) is an indicator function.
In addition to the desired number of clusters k, NEO-k-means requires another two parameters: α and β . The parameter α controls the degree of overlap among clusters. The parameter β controls the number of outliers. The maximum number of data points that can be considered outliers is β n . The objective function of the NEO-k-means algorithm is equivalent to that of the k-means algorithm when α = 0 and β = 0 .
The NEO-k-means algorithm minimizes the objective function iteratively. Given a fixed set of cluster centers, the algorithm first calculates the nk distances between every data point and every cluster center. It then updates the binary matrix U by assigning the n − βn data points that are closest to their respective nearest centers to those centers. Next, it creates αn + βn more assignments by taking the αn + βn smallest distances among the remaining point-center pairs. Given the binary matrix, the algorithm updates the cluster centers to the means of the clusters. The above process is repeated until some stopping criterion is met.
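A hedged Java sketch of this assignment step is given below (again reusing the dist2 helper from the earlier k-means sketch). It follows the two-pass description above with hypothetical names and is not the reference NEO-k-means implementation; points left with no assignment are the detected outliers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of one NEO-k-means assignment step. u[i][j] = true means point i is
// assigned to cluster j; a point left with no assignment is treated as an outlier.
class NeoKMeansSketch {
    static boolean[][] assign(double[][] x, double[][] z, double alpha, double beta) {
        int n = x.length, k = z.length;
        boolean[][] u = new boolean[n][k];
        // Distance from every point to every center, and each point's closest center.
        double[][] dist = new double[n][k];
        int[] closest = new int[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++) {
                dist[i][j] = KMeansSketch.dist2(x[i], z[j]);
                if (dist[i][j] < dist[i][closest[i]]) closest[i] = j;
            }
        // First pass: the n - ceil(beta*n) points closest to their own best center
        // are assigned to that center; the remaining points may end up unassigned.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> dist[i][closest[i]]));
        int firstPass = n - (int) Math.ceil(beta * n);
        for (int r = 0; r < firstPass; r++) u[order[r]][closest[order[r]]] = true;
        // Second pass: ceil(alpha*n + beta*n) extra assignments taken from the
        // smallest remaining (point, cluster) distances.
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++)
                if (!u[i][j]) pairs.add(new int[] { i, j });
        pairs.sort(Comparator.comparingDouble((int[] p) -> dist[p[0]][p[1]]));
        int extra = (int) Math.ceil(alpha * n + beta * n);
        for (int r = 0; r < Math.min(extra, pairs.size()); r++)
            u[pairs.get(r)[0]][pairs.get(r)[1]] = true;
        return u;
    }
}
```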

2.3. The k-Means-- Algorithm

The k-means-- algorithm proposed by [21] provides data clustering and outlier detection simultaneously. The objective function of this algorithm is defined as
E(X, Z, L) = \sum_{x \in X \setminus L} d(x \mid Z)^2,
where X is the dataset, L is the set of outliers, Z is the set of cluster centers, and
d(x \mid Z) = \min_{z \in Z} d(x, z).
Here, d ( · , · ) is a distance function. This algorithm requires two parameters, k and l, which specify the desired number of clusters and the desired number of top outliers, respectively.
In the k-means-- algorithm, an iterative procedure similar to that of the k-means algorithm is used to minimize the objective function. In each iteration, the l top outliers are removed from the clusters before the cluster centers are updated. The l top outliers are defined to be the data points that have the l largest distances from their centers.
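The following Java fragment sketches one such iteration under the description above (hypothetical names, reusing the dist2 helper from the k-means sketch); it is not the original k-means-- code.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of one k-means-- iteration: assign points to their nearest centers, mark
// the l points with the largest distances as outliers, and update the centers from
// the remaining points only.
class KMeansMinusMinusSketch {
    static void step(double[][] x, double[][] z, int[] label, boolean[] outlier, int l) {
        int n = x.length, k = z.length, d = x[0].length;
        double[] dmin = new double[n];
        for (int i = 0; i < n; i++) {
            label[i] = 0;
            for (int c = 1; c < k; c++)
                if (KMeansSketch.dist2(x[i], z[c]) < KMeansSketch.dist2(x[i], z[label[i]]))
                    label[i] = c;
            dmin[i] = KMeansSketch.dist2(x[i], z[label[i]]);
        }
        // The l points furthest from their centers are the current outliers.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -dmin[i]));
        Arrays.fill(outlier, false);
        for (int r = 0; r < Math.min(l, n); r++) outlier[order[r]] = true;
        // Update the centers using the non-outlier points only.
        double[][] sum = new double[k][d];
        int[] count = new int[k];
        for (int i = 0; i < n; i++) {
            if (outlier[i]) continue;
            count[label[i]]++;
            for (int t = 0; t < d; t++) sum[label[i]][t] += x[i][t];
        }
        for (int c = 0; c < k; c++)
            if (count[c] > 0)
                for (int t = 0; t < d; t++) z[c][t] = sum[c][t] / count[c];
    }
}
```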

2.4. The KMOR Algorithm

Like the k-means-- algorithm, the KMOR (k-means with outlier removal) algorithm proposed by [23] can also conduct data clustering and outlier detection simultaneously. Let k be the desired number of clusters. Motivated by the fuzzy clustering algorithm proposed by [27], the KMOR algorithm divides a dataset into k + 1 groups, including k clusters and a group of outliers.
Mathematically, KMOR aims to minimize the following objective function:
P(U, Z) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u_{il} \|x_i - z_l\|^2 + \gamma\, u_{i,k+1} D(U, Z) \right)
subject to the following condition:
\sum_{i=1}^{n} u_{i,k+1} \le n_0,
where 0 ≤ n_0 < n and γ ≥ 0 are parameters, ‖·‖ is the L_2 norm, Z = {z_1, z_2, ..., z_k} is a set of cluster centers, and U = (u_{il})_{n×(k+1)} is an n × (k + 1) binary partition matrix (i.e., u_{il} ∈ {0, 1}) such that
\sum_{l=1}^{k+1} u_{il} = 1, \quad 1 \le i \le n,
and
D(U, Z) = \frac{1}{n - \sum_{i=1}^{n} u_{i,k+1}} \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il} \|x_i - z_l\|^2.
The two parameters n_0 and γ work together to control the number of outliers. The maximum number of data points that can be considered outliers is n_0. When γ = 0, the number of outliers produced by the algorithm will be n_0. When γ is large, the number of outliers will be 0. As a result, the objective function of KMOR is equivalent to that of k-means when n_0 = 0 or γ → ∞.
The KMOR algorithm iteratively minimizes the objective function. Given a binary matrix U, the cluster centers are updated to be the means of the clusters. Given a set of cluster centers, the data points are assigned to the clusters and the group of outliers as follows. If a data point's squared distance to its nearest center is greater than γD(U, Z), the point is considered an outlier. If the number of such points is more than n_0, then only the n_0 points that are furthest from their centers are assigned to the group of outliers. If a data point's squared distance to its nearest center is not greater than γD(U, Z), then the data point is assigned to its nearest cluster. This process is repeated until some stopping criterion is met.
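As an illustration of this assignment rule (a sketch under the description above, not the published KMOR implementation; names are hypothetical, and dist2 is again the helper from the k-means sketch), the following Java fragment flags candidate outliers by comparing squared distances with the threshold γD(U, Z) and keeps at most n_0 of them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of the KMOR assignment step: a point whose squared distance to its nearest
// center exceeds gamma * D(U, Z) is a candidate outlier; if more than n0 points
// qualify, only the n0 furthest ones are placed in the outlier group.
class KmorSketch {
    static void assign(double[][] x, double[][] z, int[] label, boolean[] outlier,
                       double gamma, int n0, double dUZ /* current value of D(U, Z) */) {
        int n = x.length, k = z.length;
        double[] dmin = new double[n];
        for (int i = 0; i < n; i++) {
            label[i] = 0;
            for (int c = 1; c < k; c++)
                if (KMeansSketch.dist2(x[i], z[c]) < KMeansSketch.dist2(x[i], z[label[i]]))
                    label[i] = c;
            dmin[i] = KMeansSketch.dist2(x[i], z[label[i]]);
        }
        // Candidate outliers: squared distance above the threshold gamma * D(U, Z).
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            outlier[i] = false;
            if (dmin[i] > gamma * dUZ) candidates.add(i);
        }
        // Keep at most n0 of them, taking the furthest points first.
        candidates.sort(Comparator.comparingDouble((Integer i) -> -dmin[i]));
        for (int r = 0; r < Math.min(n0, candidates.size()); r++) outlier[candidates.get(r)] = true;
    }
}
```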

3. The KMOD Algorithm

In this section, we present the KMOD (k-means with outlier detection) algorithm, which is an extension of the KMOR algorithm described in the previous section.
Let X = {x_1, x_2, ..., x_n} be a dataset containing n data points, each of which is described by d numerical attributes. Let k be the desired number of clusters. Let U = (u_{il})_{n×(k+1)} be an n × (k + 1) binary partition matrix such that
u_{il} \in \{0, 1\}, \quad 1 \le i \le n, \; 1 \le l \le k + 1,
\sum_{l=1}^{k+1} u_{il} = 1, \quad 1 \le i \le n.
The binary matrix U has k + 1 columns. The last column of U is used to indicate whether a data point is an outlier; that is, u_{i,k+1} = 1 if x_i is an outlier and u_{i,k+1} = 0 otherwise. If u_{i,k+1} = 0, then u_{il} = 1 for some l ∈ {1, 2, ..., k}, where l is the index of the cluster to which x_i belongs. The binary matrix U divides the dataset X into k + 1 groups, which include k clusters and one group of outliers.
Let V = (v_{il})_{n×k} be an n × k binary matrix such that
v_{il} \in \{0, 1\}, \quad 1 \le i \le n, \; 1 \le l \le k,
\sum_{l=1}^{k} v_{il} = 1, \quad 1 \le i \le n.
Unlike U, the binary matrix V has k columns. The binary matrix V is a partition matrix that assigns all data points in X into the k clusters. If v i l = 1 , then the data point x i is assigned to the lth cluster.
The objective function of the KMOD algorithm is defined as
P(U, V, Z) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u_{il} \|x_i - z_l\|^2 + u_{i,k+1} \frac{\gamma}{n} \sum_{l=1}^{k} \sum_{j=1}^{n} v_{jl} \|x_j - z_l\|^2 \right) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u_{il} \|x_i - z_l\|^2 + u_{i,k+1} D(V, Z) \right),    (12)
where γ ≥ 0 is a parameter, Z = {z_1, z_2, ..., z_k} is a set of cluster centers, ‖·‖ is the L_2 norm, and D(V, Z) is the scaled average squared distance between all data points and their corresponding cluster centers, i.e.,
D(V, Z) = \frac{\gamma}{n} \sum_{l=1}^{k} \sum_{j=1}^{n} v_{jl} \|x_j - z_l\|^2.
After some algebraic manipulation, we can write the objective function (12) as follows:
P(U, V, Z) = \sum_{i=1}^{n} \sum_{l=1}^{k} \left( u_{il} + v_{il}\, \gamma p_0 \right) \|x_i - z_l\|^2,    (14)
where
p_0 = \frac{1}{n} \sum_{j=1}^{n} u_{j,k+1},
that is, p_0 is the proportion of outliers. From Equation (14), we can see the following. If x_i is not an outlier, then u_{il} = v_{il} = 1 for some l ∈ {1, 2, ..., k}, and the contribution of x_i to the objective function is
(1 + \gamma p_0) \|x_i - z_l\|^2.
If x_i is an outlier, then u_{il} = 0 for all l = 1, 2, ..., k and v_{il} = 1 for some l ∈ {1, 2, ..., k}. In this case, the contribution of x_i to the objective function is
\gamma p_0 \|x_i - z_l\|^2.
Thus, we can rewrite the objective function as
P(U, V, Z) = \sum_{i \in I_N} (1 + \gamma p_0) \|x_i - z_{m_i}\|^2 + \sum_{i \in I_O} \gamma p_0 \|x_i - z_{m_i}\|^2 = \sum_{i \in I_N} \|x_i - z_{m_i}\|^2 + \gamma p_0 \sum_{i=1}^{n} \|x_i - z_{m_i}\|^2,    (15)
where m_i is the index of the center to which x_i is closest, and I_N and I_O are the index sets of normal points and outliers, respectively. From Equation (15), we can see that a point will be considered an outlier if the reduction in the first term is higher than the increase in the second term.
The goal of the KMOD algorithm is to find U, V, and Z such that the objective function defined in Equation (12) is minimized. Like the k-means algorithm, the KMOD algorithm starts with a set of k initial cluster centers and then keeps updating U, V, and Z alternately until some stopping criterion is met. The rules used to update U, V, and Z are summarized in the following three theorems.
Theorem 1.
Let Z = Z^* and U = U^* be fixed. If at least one data point is an outlier, i.e., u^*_{i,k+1} = 1 for some i, then the binary matrix V that minimizes the objective function (12) is given as follows:
v_{jl} = \begin{cases} 1, & \text{if } \|x_j - z^*_l\| = \min_{1 \le s \le k} \|x_j - z^*_s\|, \\ 0, & \text{otherwise}, \end{cases}
for j = 1, 2, ..., n and l = 1, 2, ..., k. If no data points are outliers, i.e., u^*_{i,k+1} = 0 for all i = 1, 2, ..., n, then the objective function (12) is independent of V.
Proof. 
When Z = Z^* and U = U^*, the objective function (12) becomes
P(U^*, V, Z^*) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u^*_{il} \|x_i - z^*_l\|^2 + u^*_{i,k+1} \frac{\gamma}{n} \sum_{l=1}^{k} \sum_{j=1}^{n} v_{jl} \|x_j - z^*_l\|^2 \right).
If u^*_{i,k+1} = 0 for all i = 1, 2, ..., n, then the objective function is independent of V. If u^*_{i,k+1} = 1 for some i, then the objective function is minimized if the following function
D(V, Z^*) = \frac{\gamma}{n} \sum_{l=1}^{k} \sum_{j=1}^{n} v_{jl} \|x_j - z^*_l\|^2
is minimized. Since the rows of V are independent of each other, D(V, Z^*) is minimized if
\sum_{l=1}^{k} v_{jl} \|x_j - z^*_l\|^2
is minimized for all j = 1, 2, ..., n. Since v_{jl} ∈ {0, 1}, the above quantity is minimized if
v_{jl} = \begin{cases} 1, & \text{if } \|x_j - z^*_l\| = \min_{1 \le s \le k} \|x_j - z^*_s\|, \\ 0, & \text{otherwise}. \end{cases}
This proves the theorem.    □
Theorem 2.
Let V = V^* and Z = Z^* be fixed. Then, the partition matrix U that minimizes the objective function (12) is given as follows:
u_{il} = \begin{cases} 1, & \text{if } \|x_i - z^*_l\|^2 = \eta_i, \\ 0, & \text{otherwise}, \end{cases}
for i = 1, 2, ..., n and l = 1, 2, ..., k, where
\eta_i = \min\left\{ \|x_i - z^*_1\|^2, \ldots, \|x_i - z^*_k\|^2, D(V^*, Z^*) \right\}.
For l = k + 1, we have u_{i,k+1} = 1 - \sum_{s=1}^{k} u_{is}.
Proof. 
Since V = V^* and Z = Z^* are fixed and the rows of the partition matrix U are independent of each other, the objective function P(U, V^*, Z^*) is minimized if, for each i = 1, 2, ..., n, the following function
\sum_{l=1}^{k} u_{il} \|x_i - z^*_l\|^2 + u_{i,k+1} D(V^*, Z^*)    (18)
is minimized. Note that u_{il} ∈ {0, 1} for l = 1, 2, ..., k + 1 and
\sum_{l=1}^{k+1} u_{il} = 1.
Equation (18) is minimized if
u_{il} = \begin{cases} 1, & \text{if } \|x_i - z^*_l\|^2 = \eta_i, \\ 0, & \text{otherwise}, \end{cases}
for l = 1, 2, ..., k. This completes the proof.    □
Theorem 3.
Let U = U^* and V = V^* be fixed. Then, the cluster centers Z that minimize the objective function (12) are derived as follows:
z_{ls} = \frac{\sum_{i=1}^{n} u^*_{il} x_{is} + \left( \sum_{i=1}^{n} u^*_{i,k+1} \right) \frac{\gamma}{n} \sum_{j=1}^{n} v^*_{jl} x_{js}}{\sum_{i=1}^{n} u^*_{il} + \left( \sum_{i=1}^{n} u^*_{i,k+1} \right) \frac{\gamma}{n} \sum_{j=1}^{n} v^*_{jl}} = \frac{\sum_{j=1}^{n} \left( u^*_{jl} + v^*_{jl} \frac{\gamma}{n} \sum_{i=1}^{n} u^*_{i,k+1} \right) x_{js}}{\sum_{j=1}^{n} \left( u^*_{jl} + v^*_{jl} \frac{\gamma}{n} \sum_{i=1}^{n} u^*_{i,k+1} \right)}    (19)
for l = 1, 2, ..., k and s = 1, 2, ..., d.
Proof. 
When V = V^* and U = U^*, the objective function (12) becomes
P(U^*, V^*, Z) = \sum_{i=1}^{n} \left( \sum_{l=1}^{k} u^*_{il} \|x_i - z_l\|^2 + u^*_{i,k+1} \frac{\gamma}{n} \sum_{l=1}^{k} \sum_{j=1}^{n} v^*_{jl} \|x_j - z_l\|^2 \right).
The above equation is minimized if its derivative with respect to z_{ls} is equal to zero for all l = 1, 2, ..., k and s = 1, 2, ..., d; that is,
\frac{\partial P}{\partial z_{ls}} = \sum_{i=1}^{n} \left( -2 u^*_{il} (x_{is} - z_{ls}) - 2 u^*_{i,k+1} \frac{\gamma}{n} \sum_{j=1}^{n} v^*_{jl} (x_{js} - z_{ls}) \right) = 0,
which leads to Equation (19). This completes the proof.    □
Note that we can also use Equation (14) to prove Theorem 3 and we can write Equation (19) as follows:
z_{ls} = \frac{\sum_{j=1}^{n} \left( u^*_{jl} + v^*_{jl} \gamma p_0 \right) x_{js}}{\sum_{j=1}^{n} \left( u^*_{jl} + v^*_{jl} \gamma p_0 \right)},
where p_0 is the proportion of outliers found in the dataset. From the above equation, we see that the center of a cluster is calculated as a weighted average of the data points that are closest to it, where a normal point has a weight of 1 + γp_0 and an outlier has a weight of γp_0. Thus, the impact of outliers on the cluster centers is also reduced in the KMOD algorithm.
The pseudo-code of the KMOD algorithm is shown in Algorithm 1. The KMOD algorithm requires four parameters: k, γ, δ, and N_max. Parameter k is the desired number of clusters for a particular dataset. Parameter γ is a positive real number used to control the number of outliers. In general, a smaller value for γ will result in more outliers. If γ is close to zero, then all data points will be assigned to the outlier group. The last two parameters, δ and N_max, are used to terminate the iterative process. Some default values of γ, δ, and N_max are shown in Table 1.
The difference between KMOD and KMOR is that the former uses outliers with low weights to update the cluster centers but the latter does not use outliers to update centers. In addition, KMOD does not require users to set a maximum number of outliers. Instead, the number of outliers is controlled by parameter γ .
Algorithm 1: The KMOD Algorithm (the pseudo-code is presented as a figure in the published article).
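As an illustration only, the following Java sketch implements the alternating updates of Theorems 1-3 together with the stopping parameters δ and N_max; class and variable names are hypothetical, and the reference implementation is the jclust library cited in Section 4. The dist2 helper is the one from the k-means sketch in the Introduction.

```java
import java.util.Random;

// A minimal sketch of the KMOD iteration implied by Theorems 1-3 (illustration only).
class KmodSketch {
    // Returns the (k+1)-group partition: label[i] in 0..k-1 for clusters, k for the outlier group.
    static int[] cluster(double[][] x, int k, double gamma, double delta, int maxIter, long seed) {
        int n = x.length, d = x[0].length;
        Random rng = new Random(seed);
        double[][] z = new double[k][];
        for (int l = 0; l < k; l++) z[l] = x[rng.nextInt(n)].clone();  // random initial centers
        int[] uLabel = new int[n];   // partition U: cluster index, or k for the outlier group
        int[] vLabel = new int[n];   // partition V: nearest cluster for every point
        double prev = Double.MAX_VALUE;
        for (int iter = 0; iter < maxIter; iter++) {
            // Update V (Theorem 1): every point is assigned to its nearest center.
            for (int i = 0; i < n; i++) {
                vLabel[i] = 0;
                for (int l = 1; l < k; l++)
                    if (KMeansSketch.dist2(x[i], z[l]) < KMeansSketch.dist2(x[i], z[vLabel[i]]))
                        vLabel[i] = l;
            }
            // D(V, Z): scaled average squared distance to the nearest centers.
            double dvz = 0;
            for (int i = 0; i < n; i++) dvz += KMeansSketch.dist2(x[i], z[vLabel[i]]);
            dvz *= gamma / n;
            // Update U (Theorem 2): a point whose nearest squared distance exceeds D(V, Z)
            // joins the outlier group; otherwise it joins its nearest cluster.
            double obj = 0;
            int outliers = 0;
            for (int i = 0; i < n; i++) {
                double dmin = KMeansSketch.dist2(x[i], z[vLabel[i]]);
                if (dmin > dvz) { uLabel[i] = k; outliers++; obj += dvz; }
                else { uLabel[i] = vLabel[i]; obj += dmin; }
            }
            // Update Z (Theorem 3): weighted means with weight 1 + gamma*p0 for normal
            // points and gamma*p0 for outliers, where p0 is the proportion of outliers.
            double p0 = (double) outliers / n;
            double[][] sum = new double[k][d];
            double[] weight = new double[k];
            for (int i = 0; i < n; i++) {
                double w = (uLabel[i] == k ? 0.0 : 1.0) + gamma * p0;
                weight[vLabel[i]] += w;
                for (int t = 0; t < d; t++) sum[vLabel[i]][t] += w * x[i][t];
            }
            for (int l = 0; l < k; l++)
                if (weight[l] > 0)
                    for (int t = 0; t < d; t++) z[l][t] = sum[l][t] / weight[l];
            // Stop when the objective (12) no longer changes by more than delta.
            if (Math.abs(prev - obj) < delta) break;
            prev = obj;
        }
        return uLabel;
    }
}
```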

4. Numerical Experiments

In this section, we conduct numerical experiments based on both synthetic data and real data to show the effectiveness of the KMOD algorithm. All algorithms were implemented in Java and the source code is available at https://github.com/ganml/jclust, accessed on 21 April 2025.

4.1. Validation Measures

In our experiments, we used the following validation measures to measure the accuracy of the clustering results: the adjusted Rand index (ARI) [29], the normalized mutual information (NMI) [30], and the classifier distance [12]. The first two measures are commonly used to measure the accuracy of clustering algorithms when the labels of a dataset are known. The classifier distance is used to measure the accuracy of outlier detection when outlier information is provided.
Let C = {C_1, C_2, ..., C_{k_1}} be a partition found by a clustering algorithm and let B = {B_1, B_2, ..., B_{k_2}} be the true partition. Let n_{ij} = |C_i ∩ B_j|, n_{i·} = |C_i|, n_{·j} = |B_j|, and let n be the total number of data points. Then, the adjusted Rand index is defined as follows:
R = \frac{ \binom{n}{2} \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \binom{n_{ij}}{2} - \sum_{i=1}^{k_1} \binom{n_{i\cdot}}{2} \sum_{j=1}^{k_2} \binom{n_{\cdot j}}{2} }{ \frac{1}{2} \binom{n}{2} \left[ \sum_{i=1}^{k_1} \binom{n_{i\cdot}}{2} + \sum_{j=1}^{k_2} \binom{n_{\cdot j}}{2} \right] - \sum_{i=1}^{k_1} \binom{n_{i\cdot}}{2} \sum_{j=1}^{k_2} \binom{n_{\cdot j}}{2} }.
It can be seen that the value of R ranges from −1 to 1. A higher ARI value indicates a more accurate result.
The normalized mutual information is defined as follows:
\mathrm{NMI} = \frac{2 I(C, B)}{H(C) + H(B)},
where H(·) denotes the entropy of a partition and I(·,·) denotes the mutual information score between two partitions. Mathematically, they are defined as
H(C) = -\sum_{i=1}^{k_1} \frac{n_{i\cdot}}{n} \log \frac{n_{i\cdot}}{n},
H(B) = -\sum_{j=1}^{k_2} \frac{n_{\cdot j}}{n} \log \frac{n_{\cdot j}}{n},
and
I(C, B) = H(C) + H(B) + \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} \frac{n_{ij}}{n} \log \frac{n_{ij}}{n}.
The NMI has a value in the range [ 0 , 1 ] . Similar to the ARI, a higher value of the NMI indicates a more accurate result.
The classifier distance is defined as the Euclidean distance between a binary classifier and the perfect classifier on the Receiver Operating Characteristic (ROC) graph:
D = \sqrt{ r_{FP}^2 + \left( 1 - r_{TP} \right)^2 },
where
r_{FP} = \frac{n_{FP}}{n_{FP} + n_{TN}}, \qquad r_{TP} = \frac{n_{TP}}{n_{TP} + n_{FN}}.
Here, n_{TP}, n_{FP}, n_{FN}, and n_{TN} denote the numbers of true positives, false positives, false negatives, and true negatives, respectively. A point is labeled positive if it is detected as an outlier and negative otherwise, and the label is true if it agrees with the ground truth and false if it does not. A lower value of the classifier distance indicates a more accurate result.
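A small Java sketch of this computation is given below (hypothetical inputs and names, not part of the paper's evaluation code): truth[i] indicates whether point i is a ground-truth outlier, and pred[i] indicates whether the algorithm flagged it as one.

```java
// Sketch of the classifier-distance computation from the confusion counts.
class ClassifierDistanceSketch {
    static double classifierDistance(boolean[] truth, boolean[] pred) {
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < truth.length; i++) {
            if (pred[i] && truth[i]) tp++;            // detected outlier, truly an outlier
            else if (pred[i] && !truth[i]) fp++;      // detected outlier, actually normal
            else if (!pred[i] && truth[i]) fn++;      // missed outlier
            else tn++;                                // normal point left alone
        }
        double rFP = fp + tn == 0 ? 0 : (double) fp / (fp + tn);  // false positive rate
        double rTP = tp + fn == 0 ? 0 : (double) tp / (tp + fn);  // true positive rate
        // Euclidean distance to the perfect classifier (rFP = 0, rTP = 1) on the ROC plane.
        return Math.sqrt(rFP * rFP + (1 - rTP) * (1 - rTP));
    }
}
```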

4.2. Experimental Setup

We compared the performance of KMOD with four other similar algorithms described in Section 2. All of these algorithms require some major parameters, such as the number of outliers and the distance threshold, to determine outliers. For KMOD, KMOR, and ODC, the distance threshold is required. For KMOR, k-means--, and NEO-k-means, the number of outliers is required. Table 2 shows six configurations of the parameters for these algorithms, denoted by letters from A to F. For example, the distance threshold was set to integers from 2 to 7. The number of outliers was set as a percentage of the total number of data points, ranging from 0% to 20%. These parameter values cover a wide range, which allows us to see the performance of the algorithms in different settings. Note that, for the NEO-k-means algorithm, we set α = −β because we did not allow clusters to overlap in our experiments.
Since all the algorithms start with initial cluster centers randomly selected from the dataset, we ran each algorithm 100 times with different sets of initial cluster centers. In all runs, we set the desired number of clusters to the true number of clusters.

4.3. Results on Synthetic Data

We created two synthetic datasets to test the performance of the KMOD algorithm. The two synthetic datasets are shown in Figure 1. The first dataset has two clusters and six outliers. The two clusters in this dataset contain 40 and 60 points, respectively. The second dataset has eight clusters and sixteen outliers. Each cluster in the second dataset contains 100 data points.
Figure 2 shows the boxplots of the performance measures produced by 100 runs of the five algorithms on the first synthetic dataset. The parameter configurations are shown in Table 2. From the boxplots of the ARI and the NMI, we can see that the KMOD algorithm and the ODC algorithm produced 100% accurate clusters most of the time. The other three algorithms produced many partially accurate clusters. Looking at the boxplots of the classifier distance, we see that the KMOD algorithm performed the best among the five algorithms. In most cases, KMOD produced a classifier distance of zero. The k-means--, NEO-k-means, and ODC algorithms produced volatile results for some parameter configurations. Since this synthetic dataset contains only 106 data points, all algorithms converged quickly in most cases.
Figure 3 shows the boxplots of the performance measures obtained from 100 runs of the five algorithms on the second synthetic dataset. The boxplots of the ARI and the NMI show that the five algorithms have a similar median performance in most cases. For parameter configuration A, the KMOD algorithm obtained the best performance and the ODC algorithm had the worst performance. The boxplots of the classifier distance show that, for parameter configurations A and B, the KMOD algorithm achieved the best performance. From the boxplots of the runtime, we can see that NEO-k-means is the slowest algorithm. The other four algorithms have similar runtimes.
From the boxplots of the classifier distances shown in Figure 3c, we can see some interesting patterns regarding the performance of these algorithms in terms of outlier detection. The k-means-- algorithm and the NEO-k-means algorithm use a percentage of the total number of data points to control the number of outliers. From parameter configuration A to F, this percentage increases from 0% to 20%. We can see that the median classifier distance of k-means-- and NEO-k-means decreases from parameter configuration A to F. This pattern indicates that a large percentage parameter value is needed to include all the outliers. The pattern of KMOR is not monotonic because the number of outliers is controlled by the interaction between two parameters. The pattern of ODC shows jumps and does not change gradually because the data points identified as outliers in ODC cannot become normal points again. The pattern of KMOD shows gradual changes when parameter γ increases from 2 to 7. The median classifier distance was the lowest when γ was set to 2. The experiment on the second synthetic dataset shows that using γ = 2 as the default parameter value is a good choice.

4.4. Results on Real Data

We obtained several real datasets from the UCI Machine Learning Repository [31] to compare the performance of the KMOD algorithm and the other four algorithms. Table 3 presents some information about these real datasets. The real datasets have different sizes and different dimensionalities.
The wine dataset contains records from three different origins, which can be used as classes for the purpose of cluster analysis. The gesture phase dataset contains records extracted from videos of people gesticulating. The anuran calls dataset contains records created by segmenting 60 audio recordings that belong to four different families, eight genera, and ten species. In our experiments, we divided this dataset into 10 clusters, each of which represents a species. The shuttle dataset is a relatively large dataset that contains 58,000 records. The shuttle dataset consists of a training set and a test set. In our experiments, we used the training set, which contains 43,500 records from seven classes.
Since the real datasets do not have labels for outliers, we assigned the outliers found by the clustering algorithms to their nearest cluster centers before calculating the two performance measures: the ARI and the NMI. As a result, we did not calculate the classifier distances for real datasets.
Figure 4 shows the boxplots of the performance measures produced by 100 runs of the five algorithms on the wine dataset. The boxplots of the ARI and the NMI show that the KMOD algorithm and the KMOR algorithm had a similar performance on the wine dataset. However, the KMOD algorithm outperformed the KMOR algorithm when parameter configurations A to C were used. Since the wine dataset is small, all algorithms converged quickly in most cases.
Figure 5 shows the boxplots of the performance measures obtained from 100 runs of the five algorithms on the gesture phase dataset. From the boxplots of the ARI and the NMI, we see that all five algorithms achieved a similar performance on this dataset in terms of accuracy. Since the gesture phase dataset is relatively large, we can see the runtime differences in the boxplots of the runtime. The NEO-k-means algorithm was the slowest of the five algorithms. If we look at the median runtime, we can see that the KMOD algorithm was the fastest of the five algorithms.
Figure 6 shows the boxplots of the performance measures produced by 100 runs of the five algorithms on the anuran calls dataset. From the boxplots of the ARI and the NMI, we see that the KMOD algorithm and the KMOR algorithm have a similar average performance. When the parameter configuration F was used, the average accuracy of the k-means-- algorithm and the NEO-k-means algorithm decreased. The boxplots of the runtime show that the NEO-k-means algorithm was the slowest algorithm. The average runtime was similar for the other four algorithms.
Figure 7 shows the boxplots of the performance measures produced by 100 runs of the five algorithms on the shuttle dataset. From the boxplots of the ARI and the NMI, we can see that the KMOD algorithm achieved the best overall performance across the different parameter configurations. The performance of k-means--, NEO-k-means, and ODC varied a lot across different parameter configurations.
From the boxplots of the runtime, we can see that the KMOD algorithm was the fastest of the five algorithms when applied to this dataset, which is a relatively large dataset. We expect the KMOD algorithm to be faster than the KMOR algorithm because the latter sorts distances to find outliers to satisfy the constraints imposed by the algorithm. In the KMOD algorithm, distance sorting is not required.
In summary, the experiments on synthetic data showed that the KMOD algorithm is able to cluster data and detect outliers simultaneously. The experiments on real data showed that the KMOD algorithm is able to achieve a better overall performance and is faster than other algorithms. The experiments on real data also showed that the KMOD algorithm is not sensitive to the parameter used to control the number of outliers.

5. Conclusions

In this paper, we proposed a KMOD algorithm that is able to perform data clustering and outlier detection simultaneously. In the KMOD algorithm, outlier detection is a natural part of the clustering process. This is achieved by introducing an “outlier” cluster that contains all the points that are considered outliers. The KMOD algorithm produces two partition matrices: one partition matrix divides a dataset into k clusters and an “outlier” cluster and the other partition matrix divides a dataset into k clusters without labeling outliers. As a result, the KMOD algorithm can be used to detect outliers and can also be used like a normal clustering algorithm.
An advantage of the KMOD algorithm is that it requires only one parameter to control the number of outliers. This parameter is intuitive in that it is a distance threshold to determine whether a data point is an outlier. As a result, setting a value for this parameter is straightforward for any dataset.
We compared the KMOD algorithm with four other similar algorithms (i.e., k-means--, KMOR, NEO-k-means, and ODC) using both synthetic datasets and real datasets. The test results show that the KMOD algorithm outperformed the other algorithms in terms of accuracy and speed. In particular, the KMOD algorithm was less sensitive than the other algorithms to the parameter used to control the number of outliers.
The KMOD algorithm is a variant of the k-means algorithm. As a result, the KMOD algorithm inherits the advantages as well as the disadvantages of the k-means algorithm. The KMOD algorithm is simple to implement and can be used in applications where the k-means algorithm is used. Like the k-means algorithm, the KMOD algorithm is also sensitive to initial cluster centers. Other efficient cluster center initialization methods [32] can be used to improve the KMOD algorithm.
It is worth mentioning that outliers can also be handled before clustering algorithms are applied. For example, Fränti and Yang [33] proposed a data cleaning procedure that employs a medoid shift to reduce noise in the data prior to clustering. This method processes each data point by computing its k-nearest neighbors (k-NN) and replacing the point with the medoid of its neighborhood. Another example in this line of research is the mean-shift outlier detection method proposed by Yang et al. [34].

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Gan, G. Data Clustering in C++: An Object-Oriented Approach; Data Mining and Knowledge Discovery Series; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
  2. Aggarwal, C.C.; Reddy, C.K. (Eds.) Data Clustering: Algorithms and Applications; CRC Press: Boca Raton, FL, USA, 2013. [Google Scholar]
  3. Ezugwu, A.E.; Ikotun, A.M.; Oyelade, O.O.; Abualigah, L.; Agushaka, J.O.; Eke, C.I.; Akinyelu, A.A. A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Eng. Appl. Artif. Intell. 2022, 110, 104743. [Google Scholar] [CrossRef]
  4. Oyewole, G.J.; Thopil, G.A. Data clustering: Application and trends. Artif. Intell. Rev. 2022, 56, 6439–6475. [Google Scholar] [CrossRef] [PubMed]
  5. Gan, G. Application of data clustering and machine learning in variable annuity valuation. Insur. Math. Econ. 2013, 53, 795–801. [Google Scholar] [CrossRef]
  6. Xu, R.; Wunsch, D. Clustering; Wiley-IEEE Press: Hoboken, NJ, USA, 2008. [Google Scholar]
  7. Jain, A. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  8. Ikotun, A.M.; Ezugwu, A.E.; Abualigah, L.; Abuhaija, B.; Heming, J. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Inf. Sci. 2023, 622, 178–210. [Google Scholar] [CrossRef]
  9. Wu, X.; Kumar, V. (Eds.) The Top Ten Algorithms in Data Mining; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 2009. [Google Scholar]
  10. Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The planar k-means problem is NP-hard. Theor. Comput. Sci. 2012, 442, 13–21. [Google Scholar] [CrossRef]
  11. Hautamäki, V.; Cherednichenko, S.; Kärkkäinen, I.; Kinnunen, T.; Fränti, P. Improving K-means by Outlier Removal. In Proceedings of the 14th Scandinavian Conference on Image Analysis, Joensuu, Finland, 19–22 June 2005; pp. 978–987. [Google Scholar]
  12. Ahmed, M.; Naser, A. A novel approach for outlier detection and clustering improvement. In Proceedings of the 2013 IEEE 8th Conference on Industrial Electronics and Applications (ICIEA), Melbourne, Australia, 19–21 June 2013; pp. 577–582. [Google Scholar]
  13. Jiang, M.; Tseng, S.; Su, C. Two-phase clustering process for outliers detection. Pattern Recognit. Lett. 2001, 22, 691–700. [Google Scholar] [CrossRef]
  14. Rehm, F.; Klawonn, F.; Kruse, R. A Novel Approach to Noise Clustering for Outlier Detection. Soft Comput. 2007, 11, 489–494. [Google Scholar] [CrossRef]
  15. Jiang, S.Y.; An, Q. Clustering-Based Outlier Detection Method. In Proceedings of the Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Jinan, China, 18–20 October 2008; Volume 2, pp. 429–433. [Google Scholar]
  16. Zhou, Y.; Yu, H.; Cai, X. A Novel k-Means Algorithm for Clustering and Outlier Detection. In Proceedings of the Second International Conference on Future Information Technology and Management Engineering, Sanya, China, 13–14 December 2009; pp. 476–480. [Google Scholar]
  17. Zhang, K.; Hutter, M.; Jin, H. A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data. In Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Bangkok, Thailand, 27–30 April 2009; pp. 813–822. [Google Scholar]
  18. Pamula, R.; Deka, J.; Nandi, S. An Outlier Detection Method Based on Clustering. In Proceedings of the Second International Conference on Emerging Applications of Information Technology, Kolkata, India, 18–20 February 2011; pp. 253–256. [Google Scholar]
  19. Lei, D.; Zhu, Q.; Chen, J.; Lin, H.; Yang, P. Automatic K-Means Clustering Algorithm for Outlier Detection. In Information Engineering and Applications; Zhu, R., Ma, Y., Eds.; Lecture Notes in Electrical Engineering; Springer: London, UK, 2012; Volume 154, pp. 363–372. [Google Scholar]
  20. Jayakumar, G.S.D.S.; Thomas, B.J. A New Procedure of Clustering Based on Multivariate Outlier Detection. J. Data Sci. 2013, 11, 69–84. [Google Scholar] [CrossRef]
  21. Chawla, S.; Gionis, A. k-means--: A unified approach to clustering and outlier detection. In Proceedings of the 2013 SIAM International Conference on Data Mining, Austin, TX, USA, 2–4 May 2013; SIAM: Philadelphia, PA, USA, 2013. Chapter 20. pp. 189–197. [Google Scholar]
  22. Whang, J.; Dhillon, I.S.; Gleich, D. Non-exhaustive, Overlapping k-means. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, 30 April–2 May 2015. [Google Scholar]
  23. Gan, G.; Ng, M.K.P. k-Means Clustering with Outlier Removal. Pattern Recognit. Lett. 2017, 90, 8–14. [Google Scholar] [CrossRef]
  24. Zhang, Z.; Feng, Q.; Huang, J.; Guo, Y.; Xu, J.; Wang, J. A local search algorithm for k-means with outliers. Neurocomputing 2021, 450, 230–241. [Google Scholar] [CrossRef]
  25. Olukanmi, P.; Nelwamondo, F.; Marwala, T.; Twala, B. Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities. Neural Comput. Appl. 2022, 34, 5939–5958. [Google Scholar] [CrossRef]
  26. Liao, J.; Wu, X.; Wu, Y.; Shu, J. K-NNDP: K-means algorithm based on nearest neighbor density peak optimization and outlier removal. Knowl.-Based Syst. 2024, 294, 111742. [Google Scholar] [CrossRef]
  27. Dave, R.; Krishnapuram, R. Robust clustering methods: A unified view. IEEE Trans. Fuzzy Syst. 1997, 5, 270–293. [Google Scholar] [CrossRef]
  28. Dave, R.N. Characterization and detection of noise in clustering. Pattern Recognit. Lett. 1991, 12, 657–664. [Google Scholar] [CrossRef]
  29. Hubert, L.; Arabie, P. Comparing Partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
  30. Vinh, N.X.; Epps, J.; Bailey, J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
  31. Frank, A.; Asuncion, A. UCI Machine Learning Repository. 2010. Available online: http://archive.ics.uci.edu/ml (accessed on 21 April 2025).
  32. Celebi, M.E.; Kingravi, H.A.; Vela, P.A. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. 2013, 40, 200–210. [Google Scholar] [CrossRef]
  33. Fränti, P.; Yang, J. Medoid-Shift for Noise Removal to Improve Clustering. In Artificial Intelligence and Soft Computing; Springer International Publishing: Cham, Switzerland, 2018; pp. 604–614. [Google Scholar] [CrossRef]
  34. Yang, J.; Rahardja, S.; Fränti, P. Mean-shift outlier detection and filtering. Pattern Recognit. 2021, 115, 107874. [Google Scholar] [CrossRef]
Figure 1. Two synthetic data sets with outliers. The first dataset contains 106 points with two clusters and six outliers (denoted by cross signs). The second dataset contains 816 points with eight clusters and sixteen outliers (denoted by cross signs in a square).
Figure 2. Boxplots of the performance measures produced by 100 runs of the algorithms on the first synthetic dataset: (a) ARI; (b) NMI; (c) classifier distance; (d) runtime. The runtime is measured in seconds.
Figure 3. Boxplots of the performance measures produced by 100 runs of the algorithms on the second synthetic dataset: (a) ARI; (b) NMI; (c) classifier distance; (d) runtime. The runtime is measured in seconds.
Figure 4. Boxplots of the performance measures produced by 100 runs of the algorithms on the wine dataset. (a) ARI; (b) NMI; (c) runtime. The runtime is measured in seconds.
Figure 5. Boxplots of the performance measures produced by 100 runs of the algorithms on the gesture phase dataset. (a) ARI; (b) NMI; (c) runtime. The runtime is measured in seconds.
Figure 6. Boxplots of the performance measures produced by 100 runs of the algorithms on the anuran calls dataset. (a) ARI; (b) NMI; (c) runtime. The runtime is measured in seconds.
Figure 7. Boxplots of the performance measures produced by 100 runs of the algorithms on the shuttle dataset. (a) ARI; (b) NMI; (c) runtime. The runtime is measured in seconds.
Table 1. Default values for some parameters required by the KMOD algorithm.

Parameter    Default Value
γ            2
δ            10^{-6}
N_max        100
Table 2. Six sets of parameter configurations for the five algorithms.

Algorithm      Parameter              A     B       C       D       E      F
KMOD           γ                      2     3       4       5       6      7
ODC            p                      2     3       4       5       6      7
NEO-k-means    α                      0     −0.01   −0.03   −0.05   −0.1   −0.2
               β                      0     0.01    0.03    0.05    0.1    0.2
KMOR           p_0 (n_0 = p_0 n)      0     0.01    0.03    0.05    0.1    0.2
               γ                      2     3       4       5       6      7
k-means--      p_0 (l = p_0 n)        0     0.01    0.03    0.05    0.1    0.2
Table 3. Summaries of the real datasets. Here, n, d, and k denote the number of records, the number of attributes, and the number of classes of the corresponding dataset, respectively.

Dataset          n        d     k
Wine             178      13    3
Gesture phase    1747     18    5
Anuran calls     7195     22    10
Shuttle          43,500   9     7