The three-way k-means clustering algorithm integrates three-way decision theory with the k-means algorithm and uses a pair of sets to represent each cluster, which lets it deal effectively with uncertainty in the data. However, like the k-means algorithm, the three-way k-means method is still sensitive to the initial cluster centers and can easily become trapped in a local optimum. To solve this problem, we present an improved three-way k-means clustering algorithm that combines three-way k-means with a random probability selection strategy and the pheromone feedback mechanism of the ant colony algorithm. The sensitivity of the three-way k-means clustering algorithm to the initial cluster centers is reduced through continuous iterative updating, so that the clustering results are less likely to fall into a local optimum. The weights of the core region and the fringe (boundary) region are adjusted dynamically to avoid the influence of manually set parameters on the clustering results.
3.1. Random Probability Selection Strategy
The ant colony algorithm simulates the foraging behavior of ants in nature and was first applied to the traveling salesman problem through the analogy of a pheromone mechanism. It was inspired by the way ants find the shortest path while searching for food. In the ant colony algorithm, the process of ants searching for food sources can be viewed as continuous clustering. Based on the current pheromone amounts, ants make a random choice according to probability and allocate samples to the cluster centers. The larger the pheromone amount between a sample and a cluster center, the greater the probability that the sample will be assigned to that cluster. During clustering, the probability that a sample is assigned to a cluster center is calculated from the pheromone amount between the sample and the cluster center and from the heuristic function. The random probability selection strategy improves the effectiveness of the algorithm, supports its convergence, and helps keep it from falling into a local optimum. The probability that the ant assigns a sample to a cluster center is calculated as:
$$p_{ij}(t) = \frac{[\tau_{ij}(t)]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{s=1}^{k} [\tau_{is}(t)]^{\alpha}\,[\eta_{is}]^{\beta}} \qquad (7)$$

where $\eta_{ij}$ is a heuristic function with $\eta_{ij} = 1/d_{ij}$, $d_{ij}$ represents the distance from sample $x_i$ to cluster center $v_j$, and $k$ is the number of clusters. $\tau_{ij}(t)$ represents the pheromone concentration between sample $x_i$ and cluster center $v_j$. The pheromone concentration is distributed between the samples and the clusters, and the initial pheromone concentration is 1. $t$ is the number of iterations, $\alpha$ is the pheromone importance factor, and $\beta$ is the heuristic importance factor. To increase the diversity of the search and speed up convergence, an ant first randomly selects a sample $x_i$ as the starting point, and Formula (7) is then used to calculate the probability $p_{ij}$ that the sample belongs to each cluster center $v_j$. The cluster of sample $x_i$ is determined by roulette-wheel selection. This process is repeated for another sample until all samples have been traversed, which forms a solution.
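The assignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the default values of `alpha` and `beta`, the small constant that guards against division by zero, and the reciprocal-distance heuristic are all assumptions made for the example.

```python
import math
import random


def assignment_probabilities(sample, centers, pheromone_row, alpha=1.0, beta=2.0):
    """Probability of assigning one sample to each cluster center.

    pheromone_row[j] is the pheromone between this sample and center j.
    alpha and beta defaults are illustrative assumptions.
    """
    scores = []
    for center, tau in zip(centers, pheromone_row):
        distance = math.dist(sample, center)
        eta = 1.0 / (distance + 1e-12)  # assumed heuristic: closer centers score higher
        scores.append((tau ** alpha) * (eta ** beta))
    total = sum(scores)
    return [s / total for s in scores]


def roulette_select(probabilities, rng):
    """Pick an index with chance proportional to its probability."""
    threshold = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probabilities):
        cumulative += p
        if threshold < cumulative:
            return index
    return len(probabilities) - 1  # guard against floating-point round-off


def build_solution(samples, centers, pheromone, rng, alpha=1.0, beta=2.0):
    """One ant visits every sample in random order and assigns it to a cluster."""
    order = list(range(len(samples)))
    rng.shuffle(order)  # random starting sample and visiting order
    labels = [None] * len(samples)
    for i in order:
        probs = assignment_probabilities(samples[i], centers, pheromone[i], alpha, beta)
        labels[i] = roulette_select(probs, rng)
    return labels
```

Passing an explicit `random.Random` instance as `rng` keeps a run reproducible, which is convenient when comparing the solutions built by different ants.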
In the ant colony algorithm, the objective function evaluates the solutions formed by all ants after an iteration is completed, and only the clustering result of the ant with the best objective function value is retained. We construct the fitness function from an intra-cluster cohesion function and an inter-cluster dispersion function, so that objects in the same cluster are as similar as possible and objects in different clusters are as different as possible. The intra-cluster cohesion function $J$, Equation (8), represents the sum of the distances between each sample $x_i$ and its cluster center $v_j$ and is used to evaluate the degree of cohesion. It uses two weights, one for the core region and one for the fringe region. Dynamically adjusting these two weights avoids the influence of changes in the number of samples in the core region and the fringe region, and also avoids the influence on the cluster centers of differences in the distance distribution. In this paper, we assume that the two weights satisfy a pair of equations expressed in terms of the number of samples in the core region and the number of samples in the fringe region.
The quality of a clustering result is determined by the intra-cluster distance and the inter-cluster distance: the smaller the intra-cluster distance and the larger the inter-cluster distance, the smaller the value of the objective function and the better the clustering result. The inter-cluster dispersion function, Equation (10), is defined in terms of $v_i$ and $v_j$, the cluster centers of clusters $C_i$ and $C_j$, respectively.
Based on the intra-cluster cohesion function (8) and the inter-cluster dispersion function (10), we construct a fitness function to optimize three-way k-means.
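As a rough illustration of how a cohesion term and a dispersion term can be combined into a single score, consider the Python sketch below. Several choices in it are assumptions made only for the example and may differ from the paper's Equations (8) to (11): the weights `w_core` and `w_fringe` are passed in as plain parameters, dispersion is taken as the sum of pairwise distances between centers, and the fitness is taken as the ratio of cohesion to dispersion so that a smaller value is better.

```python
import math


def cohesion(core, fringe, centers, w_core, w_fringe):
    """Weighted sum of sample-to-center distances.

    core[j] and fringe[j] are the lists of samples in the core and fringe
    regions of cluster j. The weights are supplied by the caller (assumption).
    """
    total = 0.0
    for j, center in enumerate(centers):
        core_sum = sum(math.dist(x, center) for x in core[j])
        fringe_sum = sum(math.dist(x, center) for x in fringe[j])
        total += w_core * core_sum + w_fringe * fringe_sum
    return total


def dispersion(centers):
    """Sum of distances between every pair of distinct centers (assumed form)."""
    return sum(
        math.dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )


def fitness(core, fringe, centers, w_core, w_fringe):
    """Assumed ratio form: tight clusters and distant centers give a small value.

    Requires at least two distinct centers so that the dispersion is non-zero.
    """
    return cohesion(core, fringe, centers, w_core, w_fringe) / dispersion(centers)
```

With this ratio form, shrinking the distances inside clusters or pushing the centers apart both lower the score, which matches the qualitative behavior described for the objective function.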
While searching for food sources, ants release pheromone on the paths they travel. The higher the pheromone concentration on a path, the shorter that path is, and the more ants walk along a path, the higher the pheromone concentration on it becomes. Each ant tends to move in the direction with the highest pheromone concentration, and the continuously strengthened pheromone attracts more ants, which forms a positive feedback mechanism. Over time, the pheromone on poor paths is not reinforced and, as it evaporates, it loses its attraction, which forms a negative feedback mechanism.
The positive feedback mechanism attracts more ants to the current optimal path, accumulates more pheromone on it, increases the probability that other ants choose it, narrows the scope of the search, and promotes the convergence of the clustering algorithm. The negative feedback mechanism counteracts the positive feedback: it keeps too many ants from being drawn to the current optimal path, which would otherwise cause the algorithm to fall into a local optimal solution. By using both the positive and the negative feedback of pheromones, the ant colony algorithm avoids falling into a local optimal solution and increases the diversity of solutions.
Pheromone updating in the ant colony algorithm uses the overall information of the ant colony. After an ant releases pheromone, the pheromone remaining on the path gradually evaporates, which makes the path choices of the next generation of ants more robust both globally and locally. Therefore, once all the ants have completed a cycle, a global update of the residual pheromone is carried out. The pheromone update formula in the ant colony algorithm is as follows:
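$$\tau_{ij}(t+1) = (1-\rho)\,\tau_{ij}(t) + \Delta\tau_{ij}$$

where the parameter $\rho$ indicates the degree of volatilization of the pheromone, $\tau_{ij}(t)$ represents the pheromone concentration between a sample and a cluster center at iteration $t$, and $\Delta\tau_{ij}$ represents the increment in pheromone.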
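The global update can be illustrated with a short Python sketch. The evaporation-plus-increment structure follows the description above. The form of the increment is not specified in this section, so the deposit used here (a constant `q` divided by the best fitness value, added only to the sample-cluster pairs chosen by the best ant) is an assumption, as are the function name and default parameter values.

```python
def update_pheromone(pheromone, best_labels, best_fitness, rho=0.1, q=1.0):
    """Evaporate every entry, then reinforce the pairs used by the best ant.

    pheromone[i][j] is the pheromone between sample i and cluster j.
    best_labels[i] is the cluster chosen for sample i by the best ant.
    The deposit q / best_fitness is an assumed increment rule.
    """
    updated = []
    for i, row in enumerate(pheromone):
        new_row = []
        for j, tau in enumerate(row):
            deposit = q / best_fitness if best_labels[i] == j else 0.0
            new_row.append((1.0 - rho) * tau + deposit)
        updated.append(new_row)
    return updated
```

Because a smaller fitness value yields a larger deposit under this rule, better solutions reinforce their assignments more strongly, while unused pairs only lose pheromone through evaporation.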
3.2. The Improved Three-Way k-Means Algorithm
Because the clustering results of the standard three-way k-means algorithm depend on the selection of the initial centers, they easily become trapped in a local optimum. To overcome this problem, this paper presents an improved three-way k-means algorithm that integrates the ant colony algorithm with three-way k-means. An original element of this paper is that the cluster centers obtained by the ant colony algorithm are supplied to three-way k-means, which compensates for the weakness that the three-way k-means clustering algorithm inherits from the random selection of cluster centers.
Figure 3 presents a flowchart of the proposed algorithm.
As described in Section 3.1, an ant randomly selects a sample in the sample space as the starting point. The probability that a sample is assigned to a cluster is obtained from the amount of pheromone between the sample and the cluster center, and the sample is allocated to a cluster by roulette-wheel selection. The ant then selects another sample, and so on until all samples are assigned, at which point one iteration is complete and a solution is formed. The optimal solution is identified using the value of the objective function. The pheromone reflects the overall information of the ant colony: after each cycle it partly evaporates and is updated globally, so that the next generation of ants chooses paths more robustly both globally and locally. The specific steps of the algorithm are shown in Algorithm 2:
The complexity of Algorithm 2 is as follows. Lines 3 to 11 find the support of each cluster; the time complexity of this step depends on $n$ and $m$, the number of elements and the number of attributes, respectively. Line 12 separates the core regions from the support sets using centroid perturbation analysis. Lines 13 and 14 perform the update. The overall time complexity of Algorithm 2 additionally depends on $t$, the number of iterations.
Algorithm 2: The improved three-way k-means based on the ant colony algorithm.
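The ant colony stage that supplies cluster centers to three-way k-means can be sketched end to end in Python. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 2: the function name, the default parameters, the use of a plain sum of sample-to-center distances as a stand-in objective, and the deposit rule `1 / cost` are all hypothetical choices, and the three-way step that splits clusters into core and fringe regions is not shown.

```python
import math
import random


def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))


def aco_initial_centers(samples, k, n_ants=10, n_iter=20,
                        alpha=1.0, beta=2.0, rho=0.1, seed=0):
    """Search for k cluster centers with a simple ant colony loop."""
    rng = random.Random(seed)
    n = len(samples)
    pheromone = [[1.0] * k for _ in range(n)]  # initial concentration is 1
    centers = rng.sample(samples, k)           # random starting centers
    best_cost, best_centers = math.inf, centers

    for _ in range(n_iter):
        iter_best = None
        for _ant in range(n_ants):
            # Each ant assigns every sample by roulette-wheel selection.
            labels = []
            for i, x in enumerate(samples):
                weights = [
                    pheromone[i][j] ** alpha
                    * (1.0 / (math.dist(x, centers[j]) + 1e-12)) ** beta
                    for j in range(k)
                ]
                labels.append(rng.choices(range(k), weights=weights)[0])
            # Recompute centers from this assignment; keep the old center
            # for a cluster that received no samples.
            new_centers = []
            for j in range(k):
                members = [x for x, lab in zip(samples, labels) if lab == j]
                new_centers.append(centroid(members) if members else centers[j])
            # Stand-in objective (assumption): total sample-to-center distance.
            cost = sum(math.dist(x, new_centers[lab])
                       for x, lab in zip(samples, labels))
            if iter_best is None or cost < iter_best[0]:
                iter_best = (cost, labels, new_centers)

        # Only the best ant of the iteration is retained.
        cost, labels, centers = iter_best
        if cost < best_cost:
            best_cost, best_centers = cost, centers
        # Global update: evaporation plus a deposit on the best ant's pairs.
        for i in range(n):
            for j in range(k):
                deposit = 1.0 / (cost + 1e-12) if labels[i] == j else 0.0
                pheromone[i][j] = (1.0 - rho) * pheromone[i][j] + deposit
    return best_centers, best_cost
```

In this sketch the returned centers would then serve as the initial centers of three-way k-means in place of randomly selected ones.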