Hybrid Fruit-Fly Optimization Algorithm with K-Means for Text Document Clustering

Abstract: The fast-growing Internet produces massive amounts of text data. Due to the large volume and unstructured format of text data, extracting relevant information and analyzing it become very challenging. Text document clustering is a text-mining process that partitions a set of text-based documents into mutually exclusive clusters in such a way that documents within the same group are similar to each other, while documents from different clusters differ in content. One of the biggest challenges in text clustering is partitioning the collection of text data by measuring the relevance of the content in the documents. To address this issue, in this work a hybrid swarm intelligence algorithm combined with the K-means algorithm is proposed for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation on the unconstrained functions, as well as on the text-based documents, indicates that the proposed approach is robust and superior to other state-of-the-art methods.


Introduction
Text document clustering has become an important and fast-growing research area due to the massive amounts of text data produced by the Internet, social media, email and text messages, and other sources. Text-mining techniques are used to process and analyze text-based data, and by using different text-mining methods, relevant information can be extracted from the text data. One crucial text-mining method is clustering, which aims to automatically partition a collection of documents into a finite set of homogeneous clusters (groups). Within a specific cluster, all documents are similar to each other based on their content, while across different clusters, the similarity decreases. From the perspective of optimization, clustering can be formulated as an NP-hard optimization problem.
Metaheuristic algorithms have been shown to be very efficient at solving NP-hard optimization problems, yielding near-optimal solutions in a reasonable amount of time. Nature-inspired metaheuristic algorithms can be divided into two major categories: swarm intelligence and evolutionary algorithms. In this work, a hybrid swarm intelligence algorithm is proposed: the fruit-fly optimization algorithm [1,2] is hybridized with the firefly algorithm (FA) [3], and the opposition-based learning mechanism is additionally incorporated. Swarm intelligence algorithms have been successfully applied to many problems [51], including resource allocation [52], optimizations in different deep learning applications and neural network training [53–57], supply chain management [58], and many others. The fruit-fly optimization algorithm was recently successfully applied for clustering [59], although not for the specific problem of document clustering, and it was not hybridized with the FA.
Many researchers have applied metaheuristic-based (including swarm intelligence) algorithms to text document clustering problems. The well-known particle swarm optimization algorithm (PSO), hybridized with the K-means and OTSU algorithms, is used for text clustering in [60]. In [61], the grey wolf optimization algorithm (GWO) is hybridized with the grasshopper optimization algorithm (GOA) and fuzzy c-means (FCM) for feature selection and text clustering. The FA is used in [62] for dynamic document clustering. The krill herd algorithm (KH) is hybridized with genetic operators and applied to web text document clustering in [63]. A hybrid multi-verse optimizer (MVO) combined with the traditional K-means clustering algorithm is used for text clustering in [64]. Text document clustering represents an important machine learning task, and our intention was to examine the performance of FFO for this application, as well as to specifically evaluate the newly proposed technique in this scenario. In such tasks there is also room for improvement through other novel, improved metaheuristics, often obtained by hybridization. Notwithstanding the existing metaheuristic-based text clustering applications, there is still a need for an algorithm that is more robust and efficient than the existing ones.
The following research questions inspired this work: (i) How can an efficient text clustering method be developed that groups closely related documents and minimizes the similarity between different clusters? (ii) How can an efficient swarm intelligence-based algorithm be developed for text document clustering? The objective of this paper is to answer these research questions as follows: (i) develop an improved hybrid fruit-fly optimization (FFO) swarm intelligence method whose performance, in terms of solution quality, is superior to the original algorithm; (ii) employ the novel hybrid FFO approach for text document clustering and produce better results than those obtained by other state-of-the-art approaches for the same problem.
The rest of the paper is organized as follows: the text document clustering problem formulation is described in Section 2, and the proposed method is explained in Section 3. Section 4 presents the simulation results for the optimization of unconstrained benchmark functions, as well as the text datasets used for evaluation, the evaluation metrics, and the simulation results along with a comparative analysis. Finally, Section 5 concludes the paper.

Text Document Clustering
Text Document Clustering (TDC) is the process of partitioning a set of heterogeneous documents into distinct clusters based upon content similarity. Before a clustering technique is applied, the data should be pre-processed. The pre-processing steps for text-based documents are tokenization, stop-word removal, stemming, lemmatization and, as a last step, conversion of the text to numerical form.

Tokenization
In the tokenization process of text-based documents, each document in the corpus (collection of all documents) is converted to a list of tokens (units). In the case of a sentence, the token would be a word, while in the case of a word input, the tokens would be characters.
For example, consider the following sentence: input_sentence = An example of tokenization.
The list of tokens after the tokenization process would be as follows: [An, example, of, tokenization].

Stop Words Removal
In the next text pre-processing step, common words, such as articles, conjunctions, etc., are removed from the list of tokens. Applying stop-word removal to the above example gives the following result: [example, tokenization], where an and of are stop words and are therefore removed from the token list.
Stop words removal reduces the size of the vocabulary, which leads to faster computation.
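As an illustration, a minimal sketch of these two steps is given below, assuming the NLTK library and its punkt and stopwords resources; the paper does not prescribe any particular tool.

```python
# A minimal sketch of tokenization and stop-word removal, assuming NLTK;
# the paper does not prescribe a particular library.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

input_sentence = "An example of tokenization."
tokens = word_tokenize(input_sentence.lower())   # ['an', 'example', 'of', 'tokenization', '.']

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
print(filtered)                                  # ['example', 'tokenization']
```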

Stemming and Lemmatization
In the process of stemming and lemmatization, the words are normalized, meaning that they are reduced to their root form. The difference between stemming and lemmatization is that in stemming the last few characters are removed from the word, while in lemmatization the context is taken into account and the word is converted to its base form (lemma). For example, applying stemming to the word studies results in studi, where es is removed from the end of the word, while lemmatization yields study, which is the actual lemma of the word.
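The contrast can be illustrated with a short sketch, again assuming NLTK's PorterStemmer and WordNetLemmatizer as one possible choice of tools:

```python
# A short sketch contrasting stemming and lemmatization; NLTK is an
# assumed, illustrative choice of library.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

word = "studies"
print(PorterStemmer().stem(word))           # 'studi'  (suffix stripped)
print(WordNetLemmatizer().lemmatize(word))  # 'study'  (actual lemma)
```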

Vectorization
The last step in the text pre-processing pipeline is converting the words into numerical form and building the vocabulary that contains all tokens; this process is called vectorization. There are different vectorization techniques; Term Frequency–Inverse Document Frequency (TF-IDF) is one of the most commonly used approaches.
The i-th document is represented by the weight vector as described in Equation (1).
The weight of document terms is calculated according to the formula in Equation (2), where i represents a term and j represents a specific document. TF(i, j) denotes the frequency of the i-th term in document j, DF(i) refers to the number of documents that contain the term i, while d is the total number of documents.
The vector space model (VSM) is presented in Equation (3).
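As a concrete illustration, a minimal sketch of this weighting is shown below. The form TF(i, j) · log(d/DF(i)) is one common instantiation consistent with the description of Equation (2); the exact smoothing or normalization used in the paper may differ.

```python
# A minimal TF-IDF sketch consistent with the description of Equation (2):
# w(i, j) = TF(i, j) * log(d / DF(i)). Smoothing/normalization details are
# assumptions, not taken from the paper.
import math
from collections import Counter

corpus = [["example", "tokenization"],
          ["example", "clustering", "document"],
          ["document", "clustering"]]

d = len(corpus)                                            # total number of documents
df = Counter(term for doc in corpus for term in set(doc))  # DF(i)

def tfidf(doc):
    tf = Counter(doc)                                      # TF(i, j)
    return {term: tf[term] * math.log(d / df[term]) for term in tf}

weights = [tfidf(doc) for doc in corpus]                   # one weight vector per document (VSM rows)
print(weights[0])
```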

Proposed Method
This section first introduces the original FFO algorithm and points out its advantages and drawbacks. Subsequently, the proposed hybrid method is described, followed by the adaptation of the proposed method to the text document clustering problem.

Original Fruit-Fly Optimization Algorithm
The original fruit-fly optimization (FFO) algorithm was introduced by Pan [1,2]. The FFO is inspired by the fruit fly's foraging behavior. The original fruit-fly optimization algorithm consists of four phases:
• initialization,
• osphresis foraging,
• population evaluation, and
• vision.
Initially, the solutions (fruit flies) are generated randomly within the given lower and upper bounds, according to Equation (4), where x_{i,j} denotes the i-th solution and the subscript j indicates the element's position in the i-th solution. lb and ub refer to the lower and upper bound, respectively, and rand is a random number drawn from the uniform distribution.
After the population is initialized, the position of each solution is updated according to the osphresis foraging phase, where the solution moves a random step from its current location. This is formulated mathematically as in Equation (5), where the new position is denoted by x^{t+1}_{i,j} and rand() ∈ [−1, 1]. The iteration counter is denoted by t.
After the position update, the fitness value is calculated for each solution, and a greedy selection procedure decides whether the new position is kept or the old one is retained. This phase of the algorithm is known as the vision phase. If the new solution's fitness value is better than the old one's, the new solution replaces the old one; otherwise, the old solution remains in the population and the new one is discarded. The algorithm terminates upon reaching the stopping criterion and returns the best solution.
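A compact sketch of these phases (random initialization by Equation (4), a random osphresis step as in Equation (5), and greedy selection) is shown below; variable names and implementation details are illustrative assumptions, not the authors' reference implementation.

```python
# An illustrative sketch of the basic FFO loop: initialization, osphresis
# foraging with a random step in [-1, 1], and greedy (vision) selection.
import numpy as np

def ffo(fitness, dim, lb, ub, n_pop=50, max_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    # Equation (4): x_{i,j} = lb + rand * (ub - lb)
    pop = lb + rng.random((n_pop, dim)) * (ub - lb)
    fit = np.apply_along_axis(fitness, 1, pop)
    for _ in range(max_iter):
        # Equation (5): random step from the current location, rand() in [-1, 1]
        candidate = np.clip(pop + rng.uniform(-1.0, 1.0, pop.shape), lb, ub)
        cand_fit = np.apply_along_axis(fitness, 1, candidate)
        improved = cand_fit < fit                # greedy selection (vision phase)
        pop[improved], fit[improved] = candidate[improved], cand_fit[improved]
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# Example: minimize the sphere function in 10 dimensions
x_best, f_best = ffo(lambda x: float(np.sum(x ** 2)), dim=10, lb=-100.0, ub=100.0)
```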

Proposed Hybrid FFO
The FFO's structure is relatively simple, it has few parameters, and adapting the algorithm to a specific problem is straightforward. Besides these advantages, however, the method has some drawbacks: the algorithm can become trapped in local minima, its position update strategy is fixed, and its exploitation ability is weak. These deficiencies were observed during experiments on unconstrained functions.
To enhance exploitation, the search mechanism of the firefly algorithm (FA), which excels at intensification, is incorporated into the algorithm. Additionally, to better explore the search space, opposition-based learning (OBL) is introduced.
The hybrid algorithm is named hybrid enhanced fruit-fly firefly, in short HEFFF. Initially, the population is generated randomly according to Equation (4). In even iterations, the FFO search mechanism is used if a random number is less than 0.5, otherwise, the opposition-based learning is applied, while in every odd iteration, the firefly [3] search mechanism is employed.
The opposite number is defined as shown in Equation (6).
The firefly search mechanism updates the position by Equation (7), where the distance between the current candidate solution x_i and another solution x_j is denoted by r_{i,j} and is calculated by Equation (8). The attractiveness at distance zero toward the fittest firefly (solution) is denoted by β_0, γ and α are control parameters for randomization, and κ is generated from a Gaussian distribution.
The reader may refer to [3] for more details about the original FA algorithm. To further improve the performance of the algorithm, a dynamic step is employed for the α control parameter according to [65], where the α value decreases gradually from the initial value (α_0) to a defined minimum value (α_min) throughout the iterations. The update of the α value in each iteration is defined as in Equation (9), where the current iteration is denoted by t, MaxIter refers to the maximum number of iterations, α^(t) denotes the value of α at the current iteration, and the updated value is denoted by α^(t+1).
Algorithm 1 illustrates the main steps of the proposed HEFFF approach.

Algorithm 1 Pseudo-code of proposed HEFFF
Initialize the population randomly by Equation (4)
Initialize the FA parameters β_0, γ, α
Set the iteration counter t to 0 and define the termination criteria
Evaluate the fitness of each individual
while termination criteria is not satisfied do
    for i = 1 to N do
        if t is even then
            if rand < 0.5 then
                Update the position according to the FFO updating mechanism by Equation (5)
            else
                Generate the opposite solution by opposition-based learning, Equation (6)
            end if
        else
            Update the position according to the FA updating mechanism by Equation (7)
        end if
    end for
    Evaluate the fitness of each individual and apply greedy selection
    Update the α value by Equation (9)
    t = t + 1
end while
return the best solution
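To make the control flow of Algorithm 1 concrete, a hedged Python sketch of a single HEFFF iteration is given below, following the textual description above (FFO step or OBL in even iterations, firefly step in odd iterations, greedy selection, and a gradually decreasing α). All names, the exact α schedule, and other details are illustrative assumptions rather than the authors' implementation.

```python
# An illustrative sketch of one HEFFF iteration: even iterations use the FFO
# update (if rand < 0.5) or opposition-based learning, odd iterations use the
# firefly update; greedy selection keeps the better of old and new positions.
import numpy as np

rng = np.random.default_rng(42)

def heff_step(pop, fit, fitness, lb, ub, t, max_iter,
              alpha, alpha_min=0.0, beta0=1.0, gamma=1.0):
    n, dim = pop.shape
    new_pop = pop.copy()
    best = int(np.argmin(fit))                       # fittest (brightest) solution
    for i in range(n):
        if t % 2 == 0:                               # even iteration
            if rng.random() < 0.5:
                # FFO osphresis step, Equation (5): random step in [-1, 1]
                new_pop[i] = pop[i] + rng.uniform(-1.0, 1.0, dim)
            else:
                # Opposition-based learning, Equation (6)
                new_pop[i] = lb + ub - pop[i]
        else:                                        # odd iteration: firefly move, Equation (7)
            r = np.linalg.norm(pop[i] - pop[best])   # distance, Equation (8)
            kappa = rng.normal(size=dim)             # Gaussian perturbation
            new_pop[i] = (pop[i]
                          + beta0 * np.exp(-gamma * r ** 2) * (pop[best] - pop[i])
                          + alpha * kappa)
    new_pop = np.clip(new_pop, lb, ub)
    new_fit = np.apply_along_axis(fitness, 1, new_pop)
    improved = new_fit < fit                         # greedy selection
    pop[improved], fit[improved] = new_pop[improved], new_fit[improved]
    # dynamic alpha in the spirit of Equation (9): gradual decrease toward alpha_min
    alpha = alpha_min + (alpha - alpha_min) * (1.0 - (t + 1) / max_iter)
    return pop, fit, alpha
```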

Hybrid FFO Adapted for Text Clustering
In the proposed hybrid FFO for the text clustering problem, the K-means procedure is incorporated as an intermediate step. Within the pre-processing step of the text data, the features are normalized between 0 and 1; thus, the lower and upper bounds are also 0 and 1. Initially, the solutions are generated randomly between the lower and upper bounds.
One individual in the population represents one possible solution for clustering the text corpus data. Hence, the population represents the candidate solutions for clustering the text corpus. Each solution is encoded as a matrix X_i = (C_1, C_2, ..., C_i, ..., C_k), where C_i represents the i-th cluster centroid vector and k is the number of clusters. Based on the distance measure, each document is assigned to the closest cluster.
The distance between a document and centroid is calculated as in Equation (10), where t refers to the number of terms (features) in a document, d denotes the document, and c is the centroid.
The fitness value is based on the K-means algorithm and represents the average distance of documents to their cluster centroids. The fitness value is measured by Equation (11), where K denotes the total number of clusters, the number of documents in the i-th cluster is denoted by n_i, the j-th document in cluster i is denoted by d_{ij}, and the centroid of the i-th cluster is denoted by c_i. The dist() function calculates the distance between the document d_{ij} and the centroid c_i.
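A minimal sketch of this fitness computation is given below: a candidate solution encodes k centroid vectors, every document is assigned to its nearest centroid, and the fitness is the average document-to-centroid distance, in line with the descriptions of Equations (10) and (11). Euclidean distance over the TF-IDF features and the handling of empty clusters are assumptions made for illustration.

```python
# A hedged sketch of the clustering fitness described by Equations (10)-(11).
import numpy as np

def clustering_fitness(solution, documents, k):
    # solution: flat vector of length k * n_features, reshaped into k centroids
    centroids = solution.reshape(k, documents.shape[1])
    # distance of every document to every centroid (Equation (10), Euclidean assumed)
    dists = np.linalg.norm(documents[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)                  # nearest-centroid assignment
    total, non_empty = 0.0, 0
    for i in range(k):
        in_cluster = dists[labels == i, i]         # distances within cluster i
        if in_cluster.size:
            total += in_cluster.mean()             # average distance in cluster i
            non_empty += 1
    return total / max(non_empty, 1)               # Equation (11): mean over clusters
```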
Based on the position updating mechanisms of the proposed method, the values of the centroid vectors are adjusted in every iteration; then the distances from the documents to the centroids are recalculated and the documents are reassigned. This procedure is repeated until the termination criteria are met.

Experimental Results
The proposed method is first validated on unconstrained benchmark functions; next, it is applied to TDC. The experimental setup, problem sets, and the obtained results are presented in the following two subsections.

Unconstrained Experiment on the CEC 2019 Test Suite
The performance of the proposed method is validated on 10 modern CEC2019 functions [66], and the results are compared to the original FFO and nine other metaheuristic-based approaches (EHOI, EHO, SCA, SSA, GOA, WOA, BBO, MFO, PSO) [67]; the simulations were conducted under similar conditions and on the same problem sets.
The population size is set to 50, the maximum number of iterations is 500. The experiment is repeated in 30 independent runs. The control parameters of the unconstrained benchmark function experiment are shown in Table 2.

Parameter Description                Notation    Value
Population size                      NP          50
Maximum iteration number             MaxIter     500
Absorption coefficient               γ           1
Attractiveness parameter at r = 0    β_0         1
Randomization parameter              α           changes according to Equation (9)
Initial α value                      α_0         0.5

The obtained experimental results and the comparison of the mean value and standard deviation of the proposed method and its counterparts are presented in Table 3. The results of EHOI, EHO, SCA, SSA, GOA, WOA, BBO, MFO, and PSO are taken from [67]. The results in Table 3 indicate the superiority of HEFFF: it achieves the best mean value on 8 functions, while the original FFO achieves the best mean results on 2 functions (CEC01, CEC08).
Furthermore, the Friedman test [68,69] (two-way analysis of variance by ranks) is performed to establish whether there is a significant difference between the proposed approach and the other comparable methods. The ranking of the 11 algorithms on the 10 functions is provided in Table 4.

Table 4 shows that HEFFF is superior to the other 10 algorithms, with an average Friedman ranking of 1.2. The Friedman statistic (χ²_r = 58.9) is greater than the χ² critical value with 10 degrees of freedom (18.3) at significance level α = 0.005; thus the null hypothesis (H_0) is rejected and we can conclude that HEFFF is significantly different from the other 10 methods. As reported in [70], Iman and Davenport's test [71] may be more accurate than the chi-square approximation; for that reason, we also conducted Iman and Davenport's test. The obtained statistic of 1.29 × 10 is greater than the F-distribution critical value (F(9, 9 × 10) = 1.94). Consequently, the second test also rejects H_0. In the case of both tests, the p-value is less than the significance level. The summary of the statistical results is provided in Table 5.

As the null hypothesis is rejected by the Friedman and Iman-Davenport tests, we proceeded with a non-parametric post-hoc analysis using Holm's step-down procedure; the obtained results are reported in Table 6. Table 6 shows that the proposed method significantly outperformed the compared methods at significance level α = 0.1, and the first eight methods at significance level α = 0.05.

Figure 1 illustrates a head-to-head comparison between HEFFF and the second-best method on the 10 benchmark functions, i.e., FFO. The plots show how the fitness value decreases over the 500 iterations for each test function.

Text Document Clustering Experiment
In the second experiment, 6 benchmark text datasets are used for the HEFFF validation: (i) The Centre for Speech Technology Research (CSTR) dataset has 299 documents belonging to four categories (artificial intelligence, theory, systems, and robotics); (ii) the 20Newsgroups dataset contains 20 classes, of which 3 are used in this research, namely comp.windows.x, talk.politics.misc, and rec.autos; each of the three categories has 100 documents; (iii) Tr41, (iv) Tr12, and (v) Wap are datasets from the Karypis lab, with 10, 8, and 20 classes, respectively; (vi) the Classic4 dataset consists of four classes (CISI, CACM, MED, and CRAN), each class has 500 documents, so in total there are 2000 documents in the dataset. The characteristics of the text datasets are described in Table 7.

Evaluation Metrics
The error rate, accuracy, precision, recall, F-measure, purity, and entropy are used as evaluation measurements in text document clustering. Their expressions were taken from [64].
The error rate measures the wrong cluster assignments of documents over the total number of documents in the given dataset. The mathematical formulation is shown in Equation (12), where the total number of documents in the dataset (or test dataset) is denoted by n, n⁻_{i,j} refers to the number of documents misplaced from class i into cluster j, and k represents the number of clusters.
The accuracy measure calculates the ratio of documents assigned to the correct clusters (when class i is identified as cluster j) in the given dataset. The calculation is performed by Equation (13).
The precision metric gives information about the correct class assignments over all documents in a cluster, and it is calculated as shown in Equation (14), where n_{i,j} denotes the number of documents of class i correctly assigned to cluster j, and n_j indicates the total number of documents in cluster j.
The recall measures the ratio of correct document assignments over the total number of documents in the given class, and the value is determined by Equation (15), where n_{i,j} denotes the number of documents of class i correctly assigned to cluster j, and n_i indicates the total number of documents in class i: R(i, j) = n_{i,j} / n_i. The F-measure represents the harmonic mean of precision and recall, and it is calculated by Equation (16), where P(i, j) and R(i, j) denote the precision and recall, respectively.
Entropy measures the distribution of class labels of documents in each cluster. The value lies between 0 and 1, and values closer to 0 are better. The entropy of cluster j is determined by Equation (17), where E(j) denotes the entropy of cluster j and p(i, j) denotes the probability that a document in cluster j belongs to class i.
The entropy of clustering (all clusters) is determined by Equation (18).
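For illustration, a sketch of how several of these measures can be computed from a class-by-cluster contingency table is given below. The per-pair precision, recall, and F-measure follow the descriptions of Equations (14)-(16), and the entropy follows Equations (17) and (18); the purity formula (majority-class fraction) and the base-2 logarithm are assumptions, since those details are not reproduced above. The error rate and accuracy additionally require the class-to-cluster matching and are omitted here.

```python
# An illustrative computation of cluster-quality measures from a
# class-by-cluster contingency table; see the hedging notes above.
import numpy as np

def cluster_metrics(contingency):
    # contingency[i, j] = number of documents of class i placed in cluster j
    n = contingency.sum()
    n_j = contingency.sum(axis=0)                          # documents per cluster
    n_i = contingency.sum(axis=1)                          # documents per class

    precision = contingency / np.maximum(n_j, 1)           # P(i, j) = n_ij / n_j
    recall = contingency / np.maximum(n_i, 1)[:, None]     # R(i, j) = n_ij / n_i
    f_measure = 2 * precision * recall / np.maximum(precision + recall, 1e-12)

    purity = contingency.max(axis=0).sum() / n             # majority-class fraction (assumed)

    p = contingency / np.maximum(n_j, 1)                   # p(i, j)
    safe_p = np.where(p > 0, p, 1.0)                       # log2(1) = 0 for empty cells
    e_j = -(p * np.log2(safe_p)).sum(axis=0)               # E(j), Equation (17)
    entropy = float((n_j / n * e_j).sum())                 # weighted entropy of the clustering
    return precision, recall, f_measure, purity, entropy

# Example: 3 classes distributed over 3 clusters
table = np.array([[90, 5, 5],
                  [10, 80, 10],
                  [ 0, 15, 85]])
print(cluster_metrics(table)[3:])                          # purity and entropy
```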

Results and Discussion
The text document clustering experiment is repeated in 30 independent runs. The maximum number of iterations is set to 1000, and the population size is 20. The control parameters are summarized in Table 8.
The experimental results of text document clustering on the six text-based datasets for the proposed method, the original FFO, five non-metaheuristic methods, and nine other metaheuristic-based methods are presented in Tables 9–14.

Parameter Description                Notation    Value
Population size                      NP          20
Maximum iteration number             MaxIter     1000
Absorption coefficient               γ           1
Attractiveness parameter at r = 0    β_0         1
Randomization parameter              α           changes according to Equation (9)
Initial α value                      α_0         0.5

As presented in the tables, the hybrid FFO method is a very competitive and promising method for text clustering. HEFFF has the best error rate on all six datasets, and on five datasets the best accuracy is obtained by the proposed method. Moreover, on five datasets HEFFF has the best recall, while the F-measure is the best on four datasets. On the CSTR dataset, HEFFF improved the accuracy by 16% over the non-metaheuristic approaches and by 2% over the best metaheuristic method reported in [64]. HEFFF achieved the best error rate, accuracy, precision, recall, F-measure, and entropy on the 20Newsgroups dataset, while the best purity on this dataset is obtained by H-MVO1. The best-performing method on the tr12 dataset is HEFFF, with the best results for error rate, accuracy, recall, and F-measure, while the second-best approach is H-MVO2, which achieved the best results for precision, purity, and entropy. The accuracy improvement of HEFFF on tr41 ranges from 3% to 24% over the other methods. On tr41, HEFFF has the best result on all metrics except purity, where H-MVO2 has the best value. On the Wap dataset, the H-GA method has the best accuracy, while the proposed method achieved the best values on three metrics. On the Classic4 dataset, HEFFF improved the accuracy from 2% to 16% over the other non-metaheuristic and metaheuristic approaches.
Overall, HEFFF ranks first in all 6 benchmark cases for error rate and accuracy. In 5 out of 6 cases, the proposed method has the best recall and F-measure; the only dataset where it is not best for these two measures is CSTR, where the results are still close to the best output. The best precision and entropy are obtained in 3 out of the 6 tests, while the best purity is achieved in one benchmark case.
To give an overview of the results in Tables 9–14, we additionally illustrate a comparison of the metrics over all methods and datasets in Figures 2 and 3.

In both experiments, the comparison is made between our method, the basic FFO, and other metaheuristics. All metaheuristics are tested under similar conditions in terms of the number of iterations and the number of solutions in the population. The computational complexity of all approaches is the same in terms of the number of fitness function evaluations. We kept relatively low values for the population sizes, i.e., 50 for function optimization and 20 for clustering, with a maximum number of iterations of 500 and 1000, respectively. The running time, given these parameters, is relatively low, as the number of fitness evaluation calls is kept small. Naturally, deterministic algorithms are faster to apply, but in the current experiments they led to weaker results.
When setting adequate values for the parameters, we noticed that good results were obtained for a wide variety of values of the population size and the number of iterations. The other parameters are kept the same for both types of tasks, which makes parameter tuning easy for the user.
Based on the experimental results, we can conclude that HEFFF is a very efficient method for text document clustering. The FFO search phase and the OBL mechanism investigate the search space globally, and by incorporating the FA, the exploitation is enhanced, which leads to finding better cluster centroids than other approaches.

Conclusions
In this work, a hybrid metaheuristic algorithm is proposed for text document clustering. The method is named hybrid enhanced fruit-fly firefly, for short HEFFF. The FFO algorithm is hybridized with the FA, and the OBL mechanism is additionally incorporated in the algorithm to better explore the search space on a global scale.
The method is tested on 10 unconstrained benchmark functions from the CEC 2019 test suite, and the obtained results are compared to other metaheuristic techniques (EHOI, EHO, SCA, SSA, GOA, WOA, BBO, MFO, PSO). In the comparison, the mean fitness value and the standard deviation are compared between the different approaches, and the proposed method outperformed all the other state-of-the-art methods. Furthermore, non-parametric statistical tests, namely the Friedman test and Iman and Davenport's test, are used to establish the significance of the differences. After conducting the non-parametric statistical tests, the Holm step-down procedure is used as a post-hoc procedure. The obtained statistical results confirm the significant difference between the results obtained by HEFFF and the other competitor approaches.
In the second simulation, the algorithm is employed for the text document clustering problem, where six standard text datasets are used for evaluation purposes. The obtained results are compared to other non-metaheuristic and metaheuristic methods (K-means, K-means++, DBSCAN, Agglomerative, Spectral, KHA, HS, PSO, GA, MVO, H-PSO, H-GA, H-MVO1, H-MVO2). For evaluation purposes, the error rate, accuracy, precision, recall, F-measure, purity, and entropy are used in the text document clustering experiments. The proposed hybrid FFO demonstrates much better results than the other state-of-the-art methods and is, therefore, suitable for document clustering problems.
The proposed technique combines the benefits of FFO and FA. The additional combination with the traditional K-means algorithm leads to a framework that delivers the best performance for text-based document clustering, as observed in the obtained results.
In future work, we plan to include other text-based document datasets in the experiments, to implement other algorithms, and to incorporate additional improvement mechanisms to further enhance their efficiency in text document clustering.