Dynamic Sub-Swarm Approach of PSO Algorithms for Text Document Clustering

Text document clustering is a data mining technique used in many real-world applications, such as information retrieval from IoT sensor data, duplicate content detection, and document organization. Swarm intelligence (SI) algorithms are well suited to complex text document clustering problems compared with traditional clustering algorithms. Previous studies show that, among SI algorithms, particle swarm optimization (PSO) provides an effective solution to text document clustering problems. However, PSO still needs improvement to avoid problems such as premature convergence to local optima. In this paper, an approach called dynamic sub-swarm PSO (subswarm-PSO) is proposed to improve the results of PSO for text document clustering and to avoid local optima by improving the global search capability of PSO. The results of this approach were compared with the standard PSO algorithm and the K-means algorithm, using the purity evaluation metric on six benchmark data sets. The experimental results show that our proposed subswarm-PSO algorithm achieves the highest purity compared with the standard PSO and traditional K-means algorithms, while its execution time is slightly lower than that of the standard PSO algorithm.


Introduction
In recent years, there has been a tremendous increase in the volume of text documents on the Internet and in modern applications, which affects the text analysis process (i.e., text feature selection, text document clustering, text categorization, etc.) [1,2]. Text document clustering is the unsupervised classification of textual documents into clusters based on content similarity [3]. The most popular clustering algorithms are K-means and its variants, as K-means is a simple and widely used unsupervised partitioning algorithm [4]. Extracting relevant information from data is a challenging task that requires fast, high-quality text document clustering algorithms. In this study, swarm intelligence (SI) algorithms are applied to text document clustering, and one of the best-performing swarm algorithms is improved to raise the quality of the results.
SI algorithms consist of simple, unintelligent agents that follow a few simple rules to accomplish very complex tasks. Compared with traditional clustering algorithms, SI algorithms are well suited to complex text document clustering problems [5], and PSO in particular provides an effective solution to text document clustering [6]. However, the PSO algorithm often suffers from premature convergence to a local optimum [7]. This is because PSO initializes its particles only at the start of the algorithm, after which all particles move toward the best solution found so far and search around that local area. If the initialization does not cover the region containing the global solution, there is no mechanism to re-explore undiscovered regions later in the run. To improve the global search capability of the PSO algorithm for text document clustering, this paper proposes a dynamic subswarm-PSO algorithm [8]. The proposed algorithm reinitializes a number of the worst-fitness particles in each iteration. Six data sets, created from BBC sports news [9], 20 newsgroups [10], and scientific papers [11], were used in this study, with purity as the evaluation metric. The experimental results show that our proposed subswarm-PSO algorithm outperforms the standard PSO and traditional K-means algorithms, and its average execution time is slightly lower than that of the standard PSO algorithm.
The next section provides an overview of research focused on improvements to the PSO algorithm. Section 3 describes the text document clustering process. Section 4 outlines the existing standard algorithms for text document clustering. Section 5 describes our proposed sub-swarm approach to the PSO algorithm. Section 7 describes the experimental conditions. Section 8 presents the experimental results and a detailed discussion. Section 9 summarizes the present research work.

Related Work
This section reviews previous research (Table 1) on the PSO algorithm with different sub-swarm optimization techniques.

Table 1. Sub-swarm PSO approaches in the literature (authors, approach, and purpose/results).

• J. Liang et al. [12]. Approach: dynamic multi-swarm PSO (DMS-PSO), in which the swarm population is divided into many small swarms that are frequently regrouped according to various regrouping schedules, exchanging information among the swarms. Result: improved performance over the PSO algorithm.
• Xu, Y. et al. [13]. Approach: a three sub-swarm discrete PSO that divides the whole swarm into three sub-swarms; one sub-swarm evolves with the standard PSO model, the second with a social-only model, and the third with a cognition-only model. Result: finds the best fitness more quickly and precisely than discrete PSO.
• Y. Liu et al. [14]. Approach: a modified multi-swarm PSO using sub-swarms and a multi-swarm scheduler. Result: monitors and controls each sub-swarm using rules to solve discrete problems.
• Ye Wenxing et al. [15]. Approach: a novel multi-swarm PSO with a dynamic learning strategy, in which each sub-swarm is divided into ordinary particles (focused on exploitation) and communication particles (focused on exploration). Result: solves complex problems effectively compared with other algorithms.
• C. Qiu et al. [16]. Approach: a novel multi-swarm PSO for feature selection that splits the PSO population into several small sub-swarms, each updating its positions under the guidance of the local best particles of its own sub-swarm. Result: finds feature subsets with high classification accuracy and fewer features.
• X. Xia et al. [17]. Approach: a dynamic multi-swarm global PSO in which the entire population is divided into a global sub-swarm (focused on exploitation) and multiple dynamic sub-swarms (focused on exploration). Result: better performance on complex problems and avoidance of premature convergence on multimodal problems.

Text Document Clustering
This section describes the main process of text document clustering. This process includes text document collection, pre-processing, document representation, clustering, and cluster validation as shown in Figure 1 [18].
Pre-processing is an important step for enhancing the performance of the clustering algorithm. As shown in Figure 1, after the required raw text documents are collected, text pre-processing cleans them by applying natural language processing techniques such as tokenization, stop-word removal, stemming, and term weighting to remove unwanted data. Pre-processing then turns the clean documents into a t × d term-document matrix, where t is the number of unique terms in the document collection and d is the number of documents. The text document clustering algorithms (PSO and subswarm-PSO) use this matrix directly to partition the document data sets into meaningful sub-collections.
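To make the term-document construction concrete, the sketch below builds a small t × d count matrix in Python; the toy documents and the three-word stop list are invented for illustration, and a real pipeline would also apply stemming and term weighting as described above:

```python
import re

STOP_WORDS = frozenset({"the", "a", "is"})  # tiny illustrative stop list

def build_term_document_matrix(docs):
    """Return (terms, matrix): a t x d matrix of raw term counts,
    where rows are the t unique terms and columns are the d documents."""
    tokenized = []
    for doc in docs:
        tokens = [w for w in re.findall(r"[a-z]+", doc.lower())
                  if w not in STOP_WORDS]
        tokenized.append(tokens)
    terms = sorted({t for tokens in tokenized for t in tokens})
    matrix = [[tokens.count(term) for tokens in tokenized] for term in terms]
    return terms, matrix

docs = ["The cat sat", "A cat and a dog", "The dog is fast"]
terms, matrix = build_term_document_matrix(docs)  # 5 terms x 3 documents
```

Each matrix row can then serve as the coordinate vector of a term, or each column as the vector of a document, for the distance computations used by the clustering algorithms.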
As shown in Figure 1, to validate the quality of results from text document clustering algorithms, a few cluster evaluation metrics can be used. The next subsection explains the evaluation metrics in detail.

Evaluation Metrics
This subsection describes the evaluation metrics for text document clustering. Clusters can be evaluated using metrics such as purity, homogeneity, accuracy, completeness, entropy, F-measure, V-measure, and the adjusted Rand index (ARI). In this study, we use purity to evaluate the quality of the text document clustering algorithms.

Purity
Purity measures whether each cluster contains documents from a single category. The purity value ranges from 0 to 1, and ideal clusters obtain a purity of 1. Purity is computed by dividing the number of correctly assigned documents by the total number of documents N, as shown in Equation (1) [19,20]:

purity(W, C) = (1/N) Σ_k max_j |w_k ∩ c_j|, (1)

where C = {c_1, c_2, . . . , c_j} is the set of classes and W = {w_1, w_2, . . . , w_k} is the set of clusters.
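A minimal Python sketch of this purity computation (counting, for each cluster, the documents of its majority category) might look as follows; the cluster and category labels are illustrative:

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Purity per Equation (1): for each cluster w_k, count the documents of
    its majority class c_j, sum these counts, and divide by N."""
    n = len(true_labels)
    members = {}
    for cluster, category in zip(cluster_labels, true_labels):
        members.setdefault(cluster, []).append(category)
    correct = sum(Counter(cats).most_common(1)[0][1]
                  for cats in members.values())
    return correct / n

# Two clusters of three documents each; 4 of 6 documents belong to the
# majority category of their cluster, so purity = 4/6.
score = purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
```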

Algorithms for Text Document Clustering
This section describes the traditional K-means and standard PSO algorithms for the text document clustering problem. The standard PSO performs better than several other popular standard SI algorithms, such as bat and grey wolf optimization, for text document clustering [18]. Therefore, in this study, we compare the performance of our proposed subswarm-PSO algorithm with the traditional K-means and standard PSO algorithms.

K-Means Algorithm for Text Document Clustering
K-means is a popular traditional clustering algorithm that starts by specifying the number of clusters k and randomly initializing k cluster centers. After initialization, in a loop, the algorithm assigns each document to its closest centroid based on similarity and then updates the centroid of each cluster. This loop repeats until the termination condition is met, as shown in Algorithm 1 [21].

Algorithm 1: K-means text document clustering algorithm
Result: Best position
Initialization;
• Specify the number of clusters k.
• Randomly select k points (documents) as the initial cluster centers.
End of initialization;
while the stop criteria are not met do
    • Assign each point (document) to the closest center by calculating the distance to each centroid (clustering).
    • Update the centroid of each cluster.
end
return best position
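The loop of Algorithm 1 can be sketched in Python as follows; this is an illustrative pure-Python version with squared Euclidean distance and a no-change stop criterion, not the authors' implementation:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means following Algorithm 1: random initial centers, then
    repeat assignment and centroid update until assignments stop changing."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(p, centers[j])))
            for p in points]
        if new_assignment == assignment:  # stop criterion: no change
            break
        assignment = new_assignment
        for j in range(k):
            cluster = [p for p, c in zip(points, assignment) if c == j]
            if cluster:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return assignment, centers

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
labels, centers = kmeans(points, k=2)  # two well-separated groups
```

In the document-clustering setting, each point would be a document's row in the term-document matrix rather than a 2-D coordinate.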

PSO Algorithm for Text Document Clustering
The PSO algorithm is a population-based optimization algorithm inspired by the collective behavior of bird flocks and fish schools, and it aims to move all particles toward the optimal position [21]. In this study, each particle in the PSO algorithm encodes k cluster centroids of the text document clusters. The initialization phase initializes n particles P = {p_1, p_2, . . . , p_n} and a few PSO parameters. The position of each particle p_i is initialized with k randomly chosen cluster centers. After initialization, k clusters are created by assigning each document to the closest center based on the distance to each centroid. Then, the fitness function purity (Equation (1)) evaluates the clusters, and the global best (gbest) and individual best (pbest) solutions are updated. Next, the new velocity (Equation (2)) and position (Equation (3)) are calculated for each particle. Iterations continue until the termination condition is reached, as shown in Algorithm 2. Finally, the algorithm returns the global best solution [22].
v_i^{t+1} = ω v_i^t + c_1 r_1 (pbest_i^t − x_i^t) + c_2 r_2 (gbest^t − x_i^t), (2)
x_i^{t+1} = x_i^t + v_i^{t+1}, (3)

In Equations (2) and (3), v_i^t and x_i^t are the velocity and position of particle i at iteration t; pbest_i^t is the personal (individual) best position of particle i; gbest^t is the global best position of all particles; ω is the inertia weight that balances global and local search; c_1 and c_2 are positive constants used to control the speed of the particle; and r_1 and r_2 are random parameters within [0, 1] [19,21].
Algorithm 2: PSO text document clustering algorithm
Result: Global best position
Initialization;
• Randomly choose k cluster centers to initialize each particle's position (X) (centroids), velocity (V), and other parameters.
End of initialization;
while the stop criteria are not met do
    for each particle do
        • Assign each document to the closest center (particle's position) by calculating the distance to each centroid (clustering).
        • Compute the fitness value.
        • Update personal best (pbest) and global best (gbest) positions.
        • Update the velocity and position of the particle to generate a new solution (Equations (2) and (3)).
    end
end
return global best position
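The per-particle velocity and position updates (Equations (2) and (3)) can be sketched as below; the constants match the default parameters used in the experiments (ω = 0.9, c1 = 0.5, c2 = 0.3), and positions are plain lists of floats for illustration:

```python
import random

W, C1, C2 = 0.9, 0.5, 0.3  # inertia weight and acceleration constants

def pso_update(position, velocity, pbest, gbest, rng=random):
    """One PSO step per Equations (2) and (3):
    v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x);  x <- x + v."""
    r1, r2 = rng.random(), rng.random()
    new_velocity = [W * v + C1 * r1 * (pb - x) + C2 * r2 * (gb - x)
                    for v, x, pb, gb in zip(velocity, position, pbest, gbest)]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

# When a particle already sits at both pbest and gbest, the attraction terms
# vanish and only the inertia term w*v remains.
pos, vel = pso_update([1.0, 1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0])
```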

Proposed Approach
This section describes the proposed subswarm-PSO approach. The main focus of this study is to optimize the results of the PSO algorithm for text document clustering (Algorithm 2) by applying a dynamic sub-swarm technique to the PSO algorithm, as shown in Figure 2.
The initialization section of the proposed subswarm-PSO algorithm is the same as that of the PSO algorithm: it starts with the initialization of the population (n) and the other PSO parameters. Each particle is randomly initialized with k cluster centers. The algorithm then clusters the text documents around these cluster centers and executes the fitness function.
The proposed approach intervenes at the point where the PSO algorithm moves the particles to their next positions. Here, based on the fitness values, we dynamically divide the entire swarm population into two: the top n/2 best-fitness particles are grouped into one sub-swarm (SP2), and the remaining n/2 particles with the least fitness values are grouped into another sub-swarm (SP1). As shown in Figure 2, the particles of sub-swarm SP2 move to their next positions and update the pbest and gbest solutions among them. The particle solutions in SP1 are abandoned and reinitialized randomly. The sub-swarms SP1 and SP2 are then recombined, and the algorithm proceeds to the next iteration until the termination condition is met.
In this proposed approach, the split point n/2 was chosen empirically by experimenting with different numbers of abandoned worst-fitness particles (1, n/3, n/2, n − (n/3), and n − 1), as shown in the experiment and results section.

Proposed Dynamic Sub-Swarm PSO Algorithm for Text Document Clustering
As shown in Algorithm 3, the subswarm-PSO algorithm initializes n particles P = {p_1, p_2, p_3, . . . , p_n} and the other PSO parameters. Each particle is randomly initialized with k cluster centroids. After initialization, clusters are created by assigning each document to the closest centroid based on the distance to each centroid. Next, the fitness value is calculated for each particle, and the pbest and gbest solutions are updated as shown in the algorithm.
Then, before the particles are updated and the algorithm moves to the next iteration as in Algorithm 2, our proposed steps (Equations (4) to (9)) are applied as shown in Algorithm 3. First, all particles P are sorted by fitness value in ascending order using quicksort and stored as SP = {s_1, s_2, s_3, . . . , s_n}, where i = 1, 2, 3, . . . , n (Equation (4)). The sorted swarm population is then divided by fitness into two sub-swarms, SP1 and SP2 (Equations (5) and (6)): SP1 contains the particles with the least n/2 fitness values, and SP2 contains the particles with the top n/2 fitness values. The solutions of the SP1 particles are abandoned, and these particles are randomly reinitialized (Equation (7)). The SP2 particles are updated with new velocities (Equation (2)) and positions (Equation (3)), as expressed in Equation (8). Finally, SP1 and SP2 are combined into P (Equation (9)) and assigned as the particle set for the next iteration, as shown in Algorithm 3. The final result of the algorithm is the global best solution.
Algorithm 3: Subswarm-PSO text document clustering algorithm
Result: Global best position
Initialization;
• Randomly choose k cluster centers to initialize each particle's position (X) (centroids), velocity (V), and other parameters.
End of initialization;
while the stop criteria are not met do
    for each particle do
        • Assign each document to the closest center (particle's position) by calculating the distance to each centroid (clustering).
        • Compute the fitness value.
        • Update personal best (pbest) and global best (gbest) positions.
    end
    • Sort the particles by fitness in ascending order into SP (Equation (4)).
    • Split SP into sub-swarms SP1 and SP2 (Equations (5) and (6)).
    • Randomly reinitialize the particles of SP1 (Equation (7)).
    • Update the velocity and position of the particles of SP2 (Equations (2), (3), and (8)).
    • Combine SP1 and SP2 as P for the next iteration (Equation (9)).
end
return global best position
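One iteration of the sub-swarm step (the sort, split, reinitialize, update, and combine of Equations (4) to (9)) can be sketched as follows; the `fitness`, `reinitialize`, and `update` arguments are hypothetical hooks standing in for the purity evaluation, random reinitialization, and the PSO update of Equations (2) and (3):

```python
def subswarm_step(particles, fitness, reinitialize, update):
    """Sort ascending by fitness (Eq. 4), abandon and reinitialize the worst
    n/2 particles SP1 (Eqs. 5 and 7), move the best n/2 particles SP2
    (Eqs. 6 and 8), and recombine them for the next iteration (Eq. 9)."""
    n = len(particles)
    sp = sorted(particles, key=fitness)           # least fitness first
    sp1 = [reinitialize(p) for p in sp[:n // 2]]  # abandoned worst half
    sp2 = [update(p) for p in sp[n // 2:]]        # best half moves on
    return sp1 + sp2

# Toy run: particles are plain numbers, fitness is the identity,
# "reinitialize" resets a particle to 0, and "update" nudges it by +1.
swarm = subswarm_step([3, 1, 4, 2],
                      fitness=lambda p: p,
                      reinitialize=lambda p: 0,
                      update=lambda p: p + 1)  # [0, 0, 4, 5]
```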

Experiment
This section describes the benchmark data sets and experimental conditions used in this experiment.

Benchmark Data Sets
In this study, six benchmark data sets were used, constructed from BBC sports news [9], 20 newsgroups [10], and scientific papers [11]. These data sets are commonly used for text document clustering and are publicly available for machine learning research [23,24]. Considering the computation time and our resources, we used only these six data sets for this study.
As shown in Table 2, data set 1 includes 1427 documents and 23,057 terms belonging to 2 clusters (alt.atheism and talk.religion.misc) from 20 newsgroups. Data set 2 consists of 737 documents and 4613 terms belonging to 5 clusters (athletics, cricket, football, rugby, and tennis) from BBC sports news. Data set 3 includes 40 documents and 2596 terms belonging to the same 5 clusters from BBC sports news. Data set 4 includes 200 documents and 8716 terms belonging to 4 clusters (rec.motorcycles, rec.sport.hockey, sci.electronics, and talk.religion.misc) from 20 newsgroups. Data set 5 includes 100 documents and 5549 terms belonging to 3 clusters (rec.motorcycles, rec.sport.hockey, and talk.religion.misc) from 20 newsgroups. Data set 6 includes scientific papers in the field of computer science: 675 papers and 27,416 terms under 4 classes, namely case-based reasoning (CBR), inductive logic programming (ILP), information retrieval (IR), and sonification (SON). Each paper includes the title, authors, abstract, and references.

Experimental Conditions
In this study, the population size of each SI algorithm is set to 10. For each algorithm, experiments were performed for iteration counts of 10, 20, 30, . . . , 100, with five simulation runs per iteration count to account for the stochastic nature of the algorithms. Performance is evaluated using the purity metric, and the results are compared with those of the K-means and PSO algorithms.
In this experiment, the proposed approach is applied with different numbers of abandoned worst-fitness particles (1, n/3, n/2, n − (n/3), and n − 1) to find the optimal number of abandoned particles; the comparative study is then performed with that optimal number. To determine the optimal number of particles to abandon, experiments were run with swarm sizes of 10 and 20 particles.
To compare the performance of PSO and the proposed subswarm-PSO fairly, the same default parameters from the literature, ω = 0.9, c_1 = 0.5, and c_2 = 0.3, were used in both algorithms [18,25].

Results
This section presents the experimental results for finding the optimal number of abandoned worst-fitness particles and the comparative results of the proposed subswarm-PSO algorithm using that optimal number for the text document clustering problem.

Different Number of Abandoned Particles with Least Solutions for Subswarm-PSO
This subsection presents the experimental results for different numbers of abandoned particles in the subswarm-PSO approach, used to determine the optimal number of abandoned worst-fitness particles. We experiment with 1, n/3, n/2, n − (n/3), and n − 1 abandoned particles on data set 5 with 10 and 20 particles.
This experiment includes the following.
• 1 particle with the least solution is abandoned and reinitialized randomly; the remaining n − 1 particles move to their next positions.
• n/3 particles with the least solutions are abandoned and reinitialized randomly; the remaining n − (n/3) particles move to their next positions.
• n/2 particles with the least solutions are abandoned and reinitialized randomly; the remaining n/2 particles move to their next positions.
• n − (n/3) particles with the least solutions are abandoned and reinitialized randomly; the remaining n/3 particles move to their next positions.
• n − 1 particles with the least solutions are abandoned and reinitialized randomly; the remaining 1 particle moves to its next position.
Table 3 and Figure 3 show the comparative experimental results for the different numbers of abandoned worst-fitness particles (1, n/3, n/2, n − (n/3), and n − 1) against the standard PSO algorithm for 10 particles. The table reports the maximum, mean, and standard deviation of the purity values for iterations 1, 10, 20, 30, . . . , 100 on data set 5, together with the average and rank for each number of abandoned particles. The results show that abandoning n/2 particles gives the highest performance, with purity values of 0.806, 0.812, 0.816, 0.836, 0.842, 0.848, and 0.850 for iterations 10, 30, 40, 60, 70, 90, and 100, respectively. The next-best performance is obtained with n − (n/3) and n/3, with averages of 0.797 and 0.788. Figure 3 shows that the compared methods outperform the standard PSO algorithm and that abandoning half of the particles (n/2) performs best.
Table 4 shows the corresponding comparative results against the standard PSO algorithm for 20 particles, again reporting the maximum, mean, and standard deviation of purity for iterations 1, 10, 20, 30, . . . , 100 on data set 5, with the average and rank for each number of abandoned particles. Here, abandoning n/2 particles again gives the highest performance, with purity values of 0.862, 0.852, 0.866, and 0.906 for iterations 20, 40, 70, and 90, respectively. The next-best performance is obtained with n/3, with an average of 0.832. Figure 4 shows the results for the different numbers of abandoned particles with 20 particles; the compared methods outperform the standard PSO algorithm, and the half-abandoned (n/2) variant performs best, with the highest average value of 0.843.
The results in Tables 3 and 4 show that the experiments with 10 and 20 particles exhibit similar patterns, with the half-abandoned (n/2) variant performing best.

Performance Comparison of Proposed Subswarm PSO Algorithm
This section presents the performance comparison of the proposed subswarm-PSO algorithm using the optimal number of abandoned worst-fitness particles, n/2, determined by the experiments in the previous subsection. Table 5 shows the maximum, mean, and standard deviation of the purity values for the text document clustering algorithms; these values are averaged over all iteration counts, and the highest mean purity for each data set is highlighted in bold. The subswarm-PSO algorithm achieves the highest mean purity values of 0.728, 0.820, 0.650, 0.807, and 0.888 for data sets 1, 3, 4, 5, and 6, respectively. K-means achieves the highest mean purity of 0.763 for data set 2, but the lowest mean purity values of 0.527, 0.651, 0.450, and 0.595 for data sets 1, 3, 4, and 5, respectively. Table 6 ranks the mean purity values of the algorithms over all data sets based on Table 5: K-means, PSO, and subswarm-PSO obtain overall ranks of 3, 2, and 1, respectively.
The comparative results and average running times of all algorithms for data sets 1 to 6 are shown in Figures 5-16. Since K-means takes very little time to execute, we exclude it and compare the execution times of PSO and subswarm-PSO only. Figure 5 shows the comparative results for data set 1: the proposed subswarm-PSO performs best, PSO second, and K-means worst. Figure 6 compares the average execution times of PSO and subswarm-PSO for the results in Figure 5 and Table 5; subswarm-PSO takes 5.1% less time than PSO on average. Figure 7 shows the comparative results and Figure 8 the average execution times for data set 2. From Figure 7, K-means performs best on this data set, while our proposed subswarm-PSO performs second, ahead of PSO. On average, subswarm-PSO takes 0.8% less execution time than PSO, as shown in Figure 8. Figures 9 and 10 show the comparative results and average running times, respectively, for data set 3. Figure 9 and Table 5 show that our proposed subswarm-PSO achieves the highest purity, PSO performs second, and K-means performs worst. Figure 10 shows that subswarm-PSO takes 13.7% less execution time than PSO. Figures 11 and 12 show the comparative results and average running times, respectively, for data set 4. From Figure 11 and Table 5, subswarm-PSO performs best, PSO second, and K-means worst.
Figure 12 shows a similar pattern, with subswarm-PSO taking 2.1% less execution time than PSO. Figures 13 and 14 show the comparative results and average running times, respectively, for data set 5. Figure 13 shows that subswarm-PSO, PSO, and K-means rank first, second, and third, respectively; as with the other data sets, subswarm-PSO takes 0.75% less execution time than PSO. Figures 15 and 16 show the comparative results and average running times, respectively, for data set 6. Figure 15 shows that subswarm-PSO, PSO, and K-means again rank first, second, and third, and subswarm-PSO takes 4.31% less execution time than PSO, as shown in Figure 16.

Discussion
Based on the above results, the proposed subswarm-PSO algorithm finds the best solutions for most of the data sets compared with the standard PSO and K-means algorithms in text document clustering. Its average running times are 5.1%, 0.8%, 13.7%, 2.1%, 0.75%, and 4.31% lower than those of the standard PSO algorithm for data sets 1 to 6, respectively. K-means is much faster than the other algorithms but produces the poorest solutions for all data sets except data set 2.
The standard PSO algorithm is capable of solving complex optimization problems such as text document clustering but is often trapped in local optima. Our proposed algorithm discards the worst half of the solutions and reinitializes those particles in each iteration, which increases the global search capability of PSO and enhances its performance. In each iteration, only the half of the particles holding the best solutions are moved to their next positions, while the others are reinitialized randomly. Since random initialization takes less time than updating particles to their next positions, this may explain the reduced execution time of our proposed algorithm.
In our proposed method, we abandon half (n/2) of the particles with the least fitness, based on the results in Tables 3 and 4. Compared with the standard PSO and the other numbers of abandoned solutions, this half (n/2) and the nearly-half settings (n/3 and n − (n/3)) perform better. Choosing n/2 may strike a good balance: the top half of good solutions is retained to move toward the global best, while the remaining half of the particles explores new solutions. Abandoning fewer than n/2 particles may reduce the global search capability, whereas abandoning more than n/2 may distract the swarm from moving toward the global best solution. This might be why abandoning n/2 particles performs well.

Conclusions
Compared with other standard SI algorithms and the traditional K-means algorithm, the standard PSO algorithm solves complex problems such as text document clustering more effectively. Still, PSO needs improvement to avoid problems such as premature convergence to local optima. In this study, we propose a subswarm-PSO algorithm that increases the global search capability of PSO and avoids local optima to improve the results for text document clustering. Our proposed algorithm retains promising solutions and dynamically reinitializes the remaining particles. Across a number of experimental observations, keeping the best half of the solutions in each iteration provided the best performance.
The results of this proposed approach were compared with the standard PSO and K-means algorithms using the purity metric on six benchmark data sets. The experimental results show that our proposed subswarm-PSO algorithm performs best and requires less execution time than the standard PSO algorithm. The literature shows that PSO also performs well on other problems such as clustering, feature selection, and scheduling [18]; our proposed approach may perform well on those problems as well. As future work, we aim to apply the proposed subswarm-PSO approach in different domains and evaluate its performance.