Document Clustering Using K-Means with Term Weighting as Similarity-Based Constraints

In similarity-based constrained clustering, various approaches have been proposed to define the similarity between documents so as to guide the grouping of similar documents together. This paper presents an approach that uses term-distribution statistics, extracted from a small number of cue instances with known classes, as term weightings that serve as indirect distance constraints. For distribution-based term weighting, three types of term-oriented standard deviations are exploited: the distribution of a term in the collection (SD), the average distribution of a term within a class (ACSD), and the distribution of a term among classes (ICSD). These term weightings are explored with symmetry concepts in mind, by varying the magnitude of the exponents toward positive and negative values to obtain promoting and demoting effects of the three standard deviations. Following the symmetry concept, both seeded and unseeded centroid initializations of k-means are investigated and compared to centroid-based classification. Our experiments use five English text collections, i.e., Amazon, DI, WebKB1, WebKB2, and 20Newsgroup, as well as TR, a collection of Thai reform-related opinions. Compared to the conventional TFIDF, the distribution-based term weighting improves the centroid-based method, seeded k-means, and k-means with error reduction rates of 22.45%, 31.13%, and 58.96%, respectively.


Introduction
Knowledge discovery and data mining (KDD) is a vital process for understanding user (human) behavior by inferring knowledge and discovering patterns from large-scale data collections. Classification (supervised learning) and clustering (unsupervised learning) are two complementary techniques in KDD: the former utilizes a set of labeled examples to create a model for the classification of unseen objects, while the latter uses no prior information but groups similar objects based on some kind of (dis)similarity measure. Generally, the clustering technique is useful for discovering a new set of categories, where discovering groups relies on identifying interesting distributions in the data; in contrast, classification assigns data to predefined classes [1]. However, high-dimensional data often contain noise that is not part of the underlying pattern. Some researchers have tried to guide the clustering of unlabeled data by giving hints for identifying interesting distributions in the data [2][3][4] and by encoding relationships through similarity and distance metrics [5]. Since the construction of labeled data (objects) is costly, it is worth investigating unsupervised approaches or semi-supervised approaches (combinations of supervised and unsupervised approaches) [6]. In the past, researchers have characterized supervised learning (SL), semi-supervised learning (SSL), and unsupervised learning (USL) in terms of (i) the existence of predefined classes, (ii) model learning, (iii) the availability of labeled examples, and (iv) the availability of unlabeled examples. Note that classification is a method of supervised learning while clustering is a method of unsupervised learning. Most existing literature declares only the first three schemes, i.e., SL, SSL, and USL. To be more precise, a number of methods close to SUSL do exist, such as constrained clustering with pairwise constraints [11] and the integration of constraints with metric learning [12], but they are treated as USL with constraints, not under the SUSL concept.
In this paper, however, we intend to form a systematic framework by introducing a fourth scheme, semi-unsupervised learning (SUSL), in addition to SL, SSL, and USL. While the SL scheme always requires a costly training dataset with labels, SSL is an extension of SL that learns a classification model from a small set of labeled data and then extends or revises that model with unlabeled data. In summary, contrasting our proposed semi-unsupervised learning (SUSL) scheme with supervised learning (SL), semi-supervised learning (SSL), and unsupervised learning (USL), the two primary points are (i) arbitrariness in the number of classes/clusters/groups and (ii) exploitation of labeled examples to guide clustering/grouping. As an interesting application, it is possible to apply SUSL to group documents according to user preference (or intention), namely constrained document clustering. Given user preference in the form of a few constraints that indicate which documents should (or should not) be in the same clusters, a large number of unlabeled documents can be grouped into a number of clusters satisfying such constraints. Here, the advantage of constrained document clustering, i.e., SUSL, is shown via the following example. We conducted a preliminary experiment to investigate the performance of SL, SSL, and USL using the WebKB dataset of 4161 documents under two complementary labelings (WebKB1: 4 classes and WebKB2: 5 classes); the result is shown in Table 2. Regarding dataset characteristics, the WebKB1 classes are quite balanced while the WebKB2 classes are skewed toward one large class (approximately 10 times the size of another class). The evaluation is made in the classification manner.
Here, SL is the classification mode where the model is learned from a set of labeled data using the centroid-based method [20]. The evaluation was done with 3329 documents (80% of the data) as the training set and 832 documents (20% of the data) as the test set. SSL is the clustering mode where the initial cluster centroids are learned from the labeled data, i.e., 80% of the data, and the clusters are then refined using the remaining 20% of the data. This mode is similar to the seeded k-means shown in [13]. USL is the conventional k-means where the clusters are formed using 20% of the data; the initial centroids are randomly selected over 100 trials and the best performance is chosen. In this experiment, four schemes of term weighting are explored: standard term frequency (TF), normalized term frequency (nTF), term frequency with inverse document frequency (TF × IDF), and normalized term frequency with inverse document frequency (nTF × IDF). Here, the normalized term frequency used in this experiment is the norm-1. The normalized term frequency of the j-th term in the i-th document is denoted by ntf_ij = tf_ij / ∑_{k=1}^{|T|} tf_ik, where tf_ik is the frequency of the k-th term in the i-th document and |T| is the number of possible terms.
Table 2. Geo-mean of (accuracy, f-measure) in cases of supervised learning (SL), semi-supervised learning (SSL), and unsupervised learning (USL).
From Table 2, a number of observations can be made as follows. Firstly, it is not surprising that SL mostly obtains better performance than SSL, and SSL tends to gain higher performance than USL, since more information is available for model construction. Secondly, for some term weightings (i.e., nTF), SSL is worse than USL. Thirdly, WebKB2 (the unbalanced dataset) seems to achieve higher performance than WebKB1 (the balanced dataset) in the case of supervised learning, since the classifier may gain a bias from the class-size information. Here, the largest classes dominate 75% of the cases.
With less tendency, the reverse results were obtained in the cases of semi- and unsupervised learning. Fourthly, without class information (USL), the clustering is blind and the performance is low; here, the maximum is 56.00% for WebKB1 under USL with nTF × IDF. In summary, without class information, the accuracy drops dramatically, and class balancedness affects the clustering result. For further investigation, it is worth figuring out how term weighting and term distribution, as similarity-based class constraints, affect clustering quality in unsupervised learning.
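The two basic frequency factors used throughout the comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; in particular, the paper does not fix the logarithm base, so natural log is assumed here, together with the "one plus ratio" form of IDF described in Section 2.1.

```python
import math

def ntf(tf_row):
    """Norm-1 normalized term frequency: ntf_ij = tf_ij / sum_k tf_ik."""
    total = sum(tf_row)
    return [f / total if total else 0.0 for f in tf_row]

def idf(tf_matrix):
    """idf_j = log(1 + |D| / df_j), where df_j is the number of documents
    containing term j (natural log assumed)."""
    n_docs = len(tf_matrix)
    n_terms = len(tf_matrix[0])
    weights = []
    for j in range(n_terms):
        df_j = sum(1 for row in tf_matrix if row[j] > 0)  # document frequency
        weights.append(math.log(1 + n_docs / df_j) if df_j else 0.0)
    return weights
```

For example, for a two-document collection with term-frequency rows [2, 0, 1] and [0, 1, 1], the third term occurs in both documents, so its IDF is log(1 + 2/2) = log 2, the smallest of the three.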
As a more concrete elaboration, clustering is known as unsupervised learning, where there are no predefined classes, no labeled examples, and no model learning. However, the task we cope with in this paper is semi-unsupervised learning, where we partially have some predefined classes, but they are only loosely used: we do not use the class information directly but extract statistics from it and use them indirectly as term weights for clustering. That is, we assume that no classes are defined. This is the same concept as the well-known seeded k-means, but seeded k-means applies no term weighting; this point is the originality of this work. We also perform weak model learning, not rigid model learning as in supervised or semi-supervised learning. We use labeled examples but do not straightforwardly use them as prior knowledge for class definition; rather, we try to capture behavioral patterns through the statistics extracted from these examples for clustering. In other words, distribution-based term weightings are used as soft guidance to construct good clusters of unlabeled documents.

Similarity-Based Constrained Clustering
In the past, the concept of controlling unsupervised learning was reported in several works, including pairwise constraints [11], metric-learning constraints [12], and community-relationship pairwise constraints [17]. According to Davidson et al. [18], most constrained clustering techniques can be divided into two categories, namely search-based (also known as constraint-based) and similarity-based (also known as distance-based), even though some methods are hybrids of the two [23]. The search-based method modifies clustering algorithms to incorporate the prior knowledge directly into the clustering task, where the solution space is searched according to the constraints. On the other hand, the similarity-based method applies an existing clustering method after modifying the distance measure in accordance with the prior knowledge. The latter can enhance the former by transferring the original space to a new space using a sort of cluster quality measure, namely "distance metric learning" [19], and then performing clustering. Moreover, several researchers asserted that a combination of search-based and similarity-based approaches can improve cluster quality [24,25].
From another point of view, Dinler and Tural [23] classified constrained clustering into three approaches based on the type of constraints used as grouping guidelines. Three types of constraints are pointed out, i.e., (i) labeled-data constraints, (ii) instance-level constraints, and (iii) cluster-level constraints. The first approach utilizes a small set of instances with their labels for clustering in the initial round and then performs the succeeding rounds either with or without instance labels. Unlike classification, some extra clusters can be introduced with a sort of probabilistic or discriminative model [13,26]. The second approach introduces a set of pairwise constraints, i.e., MUST-LINK and CANNOT-LINK, to guide which data objects can or cannot be grouped together when we formulate clusters [11,27]. Unlike the previous two approaches, the last approach focuses on cluster-level constraints rather than labeled-data or instance-level constraints; example characteristics are the sizes of the groups/clusters and lower or upper bounds on the radii or diameters of the groups [28-30]. Zhu et al. [28] proposed a heuristic algorithm to transform size-constrained clustering problems into integer linear programming problems.
Finally, it is a challenging research topic to explore constrained clustering with labeled-data constraints. Recently, feature projection and weighting have been widely used to highlight and/or suppress features according to discriminative information at the bag level [31]. Moreover, it is worth designing a semi-unsupervised learning model that compromises between flexibility and accuracy by applying feature distribution [20,32,33], feature projection, and feature weighting schemes for partial guidance, i.e., promoting or demoting features to recognize the user intention. However, these research directions still need detailed investigation under some symmetry settings.

Constrained Document Clustering with Distribution-Based Term Weighting
This section starts with the concept of term distribution as term weighting for manipulating the clustering process. Then its application to document clustering using term weighting is discussed. Based on this background, a framework of our semi-unsupervised learning towards adaptive constrained clustering is described.

Distribution-Based Term Weighting Scheme
Most existing document classification methods apply the basic TFIDF as term weighting, since TF highlights the words/terms that occur often while IDF discounts the terms that occur in many documents. Some works [34,35] applied parametric distance metric learning with labeled information to find a regression mapping of a point in an original input space onto a point in an optimal feature space for a specific task, such as e-commerce. Moreover, a clustering method using multiple distance measures calculated under multiple objectives has been proposed to support a variety of characteristics of differently structured datasets [36]. In such work, dissimilarity between patterns in the input space is approximated by the Euclidean distance between points in the feature space [37]. Although the method works well in general, it is sensitive to noise [38]. As an alternative, the term weighting concept can be introduced to encode lexical knowledge as constraints through a term weighting scheme. This approach controls the similarity and dissimilarity among documents by adjusting the weight of a term using its variances at the in-collection, inter-class, and intra-class levels. The weighting scheme can help promote significant terms and demote trivial (general) terms [20,21,32]. Following the concepts in [20], an important term should (i) appear frequently in a certain class, (ii) appear in few documents, (iii) not distribute very differently among documents in the whole collection, (iv) not distribute very differently among documents in a class, and (v) distribute very differently among classes. The first and second items represent the conventional term frequency and inverse document frequency, respectively. The third to fifth items refer to the distributions of a term in the whole collection, within a class, and among classes, respectively.
These three distributions are defined by the standard deviation (SD), the average class standard deviation (ACSD), and the inter-class standard deviation (ICSD) as follows. Let D = {d_1, d_2, ..., d_|D|} be a set of |D| documents (the document collection), T = {t_1, t_2, ..., t_|T|} be a set of |T| possible terms, and C = {c_1, c_2, ..., c_|C|} be a set of |C| clusters. The class model M : D × C → {T, F} can partition the documents in a collection into a number of groups by assigning a Boolean value to each pair (d_i, c_k) ∈ D × C. The value T (i.e., true) is assigned to (d_i, c_k) when the document d_i is determined to belong to the cluster c_k; on the other hand, the value F (i.e., false) is assigned when the document d_i is determined not to belong to the cluster c_k. Moreover, let C_k = {d | d is a document belonging to the cluster c_k}, where ∪_k C_k = D and C_i ∩ C_j = ∅ for i ≠ j. Here, let tf_ij be the term frequency of the term t_j in the document d_i; it can be an actual frequency, a frequency normalized with respect to document/term length, or another form. The formal definitions of the two common frequency factors, TF and IDF, as well as the three standard deviations, SD, ACSD, and ICSD, are as follows:
• Term frequency (TF): tf_ij, the frequency of the term t_j in the document d_i.
• Inverse document frequency (IDF): idf_j = log(1 + |D| / df_j), where df_j is the number of documents that contain the term t_j.
• Standard deviation (SD): sd_j = sqrt( (1/|D|) ∑_{d_i ∈ D} (tf_ij − mean_D(tf_j))² ), where mean_D(tf_j) is the mean frequency of the term t_j over the collection.
• Average class standard deviation (ACSD): acsd_j = (1/|C|) ∑_{c_k ∈ C} sd_j(C_k), where sd_j(C_k) is the standard deviation of the term t_j's frequencies among the documents in C_k.
• Inter-class standard deviation (ICSD): icsd_j = sqrt( (1/|C|) ∑_{c_k ∈ C} (s_kj − mean_C(s_j))² ), where s_kj = ∑_{d_i ∈ C_k} tf_ij is the class-summation frequency of the term t_j in c_k and mean_C(s_j) is the mean of s_kj over the clusters.
Here, the factor tf_ij is the frequency of the term t_j in the document d_i, and idf_j is the inverse document frequency of the term t_j, i.e., the logarithmic value of one plus the ratio of the number of documents in the collection (|D|) to the number of documents that contain the term t_j, i.e., df_j, namely the document frequency. The factor sd_j (the in-collection standard deviation of the term t_j) represents the variation of the term t_j's frequency among the documents in the document collection.
Conceptually, a higher sd_j means that the term t_j has high occurrence variation among documents in the whole collection; that is, the term may tend to be a stopword and may not be a good representative of a class (cluster).
The factor acsd_j represents the average cluster standard deviation over all possible clusters, where the cluster standard deviation is the variation of the term t_j's frequency among the documents in a cluster. A term with a low acsd_j, i.e., low intra-class variation, could be a good representative term of a class (or cluster). As the last type, icsd_j represents the standard deviation of the term t_j's class-summation frequencies over the set of possible classes (or clusters). A term with a higher icsd_j may be considered a good representative term.
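The three distribution factors above can be illustrated with a short sketch. This is not the authors' code; it assumes population standard deviations (the paper does not state sample vs. population) and takes one term's frequency column plus the class labels of the documents.

```python
import statistics

def sd(freqs):
    """In-collection SD of one term's frequencies over all documents."""
    return statistics.pstdev(freqs)

def acsd(freqs, labels):
    """Average of the per-class SDs of one term's frequencies."""
    classes = sorted(set(labels))
    per_class = [statistics.pstdev([f for f, c in zip(freqs, labels) if c == cls])
                 for cls in classes]
    return sum(per_class) / len(per_class)

def icsd(freqs, labels):
    """SD of one term's class-summed frequencies across classes."""
    classes = sorted(set(labels))
    sums = [sum(f for f, c in zip(freqs, labels) if c == cls) for cls in classes]
    return statistics.pstdev(sums)
```

For instance, a term with frequencies [1, 3, 2, 2] in classes [a, a, b, b] has a nonzero SD and ACSD, but an ICSD of zero, since both class sums are equal (4 and 4): such a term does not discriminate between the two classes.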

Constrained Document Clustering with Term Weighting
This section presents the formalism of constrained document clustering with term weighting. In the past, term weighting was shown to be effective in improving classification performance [21]. However, there are still few works on the application of term weighting in clustering. While most previous works used instance-level constraints, such as MUST-LINK and CANNOT-LINK, to guide k-means clustering, this work proposes a method that uses term distributions, extracted from a relatively small set of labeled data, as term weighting to guide clustering. Such distributions act as clues to the user intention in the clustering process by distinguishing effective terms from non-effective ones. Let d = [tw_j] be the document d's term-weighting vector, derived from a frequency-based component and a distribution-based component, where tw_j is the weight of the term t_j defined as follows.
tw_j = tf_j^θ × idf_j^κ × sd_j^α × acsd_j^β × icsd_j^γ
Here, tf_j is the term t_j's frequency or one of its derivatives, while idf_j is the inverse document frequency. In this work, two types of term frequency (tf_j) are used: the original term frequency and the normalized term frequency, as shown in Section 2.1. The θ, κ, α, β, and γ are the parameters setting the exponents of tf_j, idf_j, sd_j, acsd_j, and icsd_j, where a positive value promotes the factor while a negative value demotes it. Initially, the documents in the collection are randomly grouped into N groups, where N is the number of groups into which we intend to partition the documents. While there have been several means of expressing distance/similarity in clustering, two major classes are k-means using Euclidean distance and k-means using cosine similarity. In this work, for the sake of simplicity and scale invariance, we apply k-means using cosine similarity [39], where the closeness between two documents is represented by the cosine similarity between them [40]; the larger the value, the closer the documents. Here, let the i-th document in a collection be represented by d_i and its document vector by **d**_i. In the same way, let the k-th cluster be represented by c_k, its cluster vector by **c**_k, and its associated document set by C_k. The number of clusters is denoted by |C| as mentioned in the previous section. The norm-2 of the vector **d**_i (= [tw_ij]) is represented by ||**d**_i||_2 and corresponds to the length of the vector, i.e., sqrt(∑_{j=1}^{|T|} tw_ij²), over all possible terms (T = {t_1, t_2, ..., t_|T|}). Based on this background, the objective of clustering is to find the best partition S* = {c*_1, c*_2, ..., c*_|C|} that maximizes the summation of cosine similarities between the documents and their associated cluster centroids.
However, as is well known, the clustering problem is NP-hard and, therefore, it is difficult to find the global optimum according to the above equations. This work applies k-means to find a near-optimal solution and introduces distribution-based weights to guide the clustering process towards the user intention.
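The multiplicative weighting with signed exponents described above can be sketched as a single function. This is an illustration under the assumption that the five factors combine multiplicatively, as in the formula for tw_j; the small epsilon guard is our own addition, so that zero-valued factors survive negative exponents.

```python
def term_weight(tf, idf, sd, acsd, icsd,
                theta=1.0, kappa=1.0, alpha=0.0, beta=0.0, gamma=0.0):
    """tw_j = tf^theta * idf^kappa * sd^alpha * acsd^beta * icsd^gamma.
    Positive exponents promote a factor, negative exponents demote it;
    all-zero distribution exponents reduce the weight to plain TF*IDF."""
    eps = 1e-12  # guard so zero-valued factors survive negative exponents (our assumption)
    return ((tf + eps) ** theta * (idf + eps) ** kappa *
            (sd + eps) ** alpha * (acsd + eps) ** beta * (icsd + eps) ** gamma)
```

With all distribution exponents at zero, the weight is plain TF × IDF; setting alpha = −1.0 divides the weight by SD, i.e., demotes terms with high in-collection variation.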

The Framework of Clustering with Term Weighting
This section describes the framework of document clustering with constraints provided in the form of term weighting. As shown in Figure 1, the framework includes three main processes: (i) statistics extraction, (ii) document encoding, and (iii) seed calculation and constrained clustering.
Figure 1. The framework of constrained clustering (semi-unsupervised learning) using distribution-based term weighting. Here, TF = term frequency, FW = frequency-based weight, DW = distribution-based weight, DV = document vector, S = set of centroids.
In the first process, the statistics extraction process extracts term-related statistics, including IDF, SD, ACSD, and ICSD, where the first is a frequency factor while the rest are distribution factors. In the second process, the term frequency (TF) and the extracted statistics (IDF, SD, ACSD, and ICSD) are used to encode each document of the labeled dataset (the upper part of process ii) and/or the unlabeled dataset (the lower part of process ii) into a vector through term weighting and term normalization. In the third process, the document vectors of the labeled dataset can be used to calculate a seed (also called an initial centroid) for each cluster. However, it is also possible to consider the unguided version, where the initial centroids of the clusters are set randomly. In this process, the unlabeled documents are clustered under the constraints in the form of initial centroids and the term weighting encoded in the document vectors. This work applies k-means clustering.
Algorithm 1 illustrates the pseudo-code of the main procedure, Clustering, of the constrained clustering (semi-unsupervised learning), which is the third process (seed calculation and constrained clustering) in Figure 1.
Algorithm 2 illustrates the pseudo-code of two sub-procedures, namely StatisticsExtraction (line 1) and DocumentEncoding (line 10). The three inputs to the main procedure are the set of labeled documents (D_L), the set of unlabeled documents (D_U), and the number of clusters (k) into which the user intends to group the documents. The output is the resulting clusters of the input documents, in both the form of sets (C = {C_1, C_2, ..., C_k}) and centroids (S = {c_1, c_2, ..., c_k}). The iterative part of Algorithm 1 reads:
12: while not satisfy convergence condition do
13:   C = ReAllocation(D_U, S) # Re-allocate documents by Equation (12)
14:   S = CentroidCalculation(C) # Re-calculate centroids by Equation (13)
15: end while

15: begin
16:   for each document dl_i in D_L do
17:     dl_i = ENC(dl_i, Σ) # Encode labeled documents by Equation (7)
18:   end for
19:   for each document du_i in D_U do
20:     du_i = ENC(du_i, Σ) # Encode unlabeled documents by Equation (7)
21:   end for
22: end
23: end procedure
Firstly, the statistics (Σ, including SD, ACSD, ICSD, and IDF) of D_L are extracted (line 8 of Algorithm 1) by the StatCal function in the sub-procedure StatisticsExtraction. Secondly, each document in D_L or D_U is encoded (line 9 of Algorithm 1) into a vector using the statistics Σ and the term frequency (tf_ij) by the ENC function in the DocumentEncoding sub-procedure of Algorithm 2 (lines 17 and 20). Each element of the vector is a weight for the corresponding term in the document. The weight can be calculated according to Equations (7)-(11). After the encoding step, the clusters are initialized by the IDC function (line 10 of Algorithm 1; IDC stands for initial document clusters) before execution of the iterative clustering. The initial document clusters (groups) are used to calculate the centroid of each cluster by the CentroidCalculation function (line 11 of Algorithm 1). After the initialization, the ReAllocation function is applied to reallocate the documents to their closest clusters (line 13 of Algorithm 1). Then the centroids of the newly allocated clusters are recalculated by the CentroidCalculation function (line 14 of Algorithm 1). The reallocation and the centroid recalculation are executed iteratively until the convergence condition is satisfied (line 12 of Algorithm 1).
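The iterative loop of Algorithm 1 can be condensed into a small runnable sketch. This is our own minimal illustration of seeded k-means with cosine similarity, not the authors' implementation; the function and variable names are ours, and the encoding step (ENC) is assumed to have already produced the weighted document vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; larger means closer."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Mean vector of a cluster's members."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def seeded_kmeans(docs, seeds, max_iter=50):
    """docs: encoded document vectors; seeds: initial centroids, e.g., the
    per-class centroids of the labeled subset (or random picks for the
    unseeded variant). Returns cluster assignments and final centroids."""
    centroids = [list(s) for s in seeds]
    assign = None
    for _ in range(max_iter):
        new_assign = [max(range(len(centroids)),
                          key=lambda k: cosine(d, centroids[k])) for d in docs]
        if new_assign == assign:  # convergence: no document changed cluster
            break
        assign = new_assign
        for k in range(len(centroids)):
            members = [d for d, a in zip(docs, assign) if a == k]
            if members:
                centroids[k] = centroid(members)  # re-calculate centroids
    return assign, centroids
```

The reallocation step corresponds to lines 12-13 of Algorithm 1 and the centroid update to line 14; convergence is declared when no document changes its cluster.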

Experiment Settings and Metrics
This section describes the datasets, the experiment settings and the performance measures for evaluating the effectiveness of the constrained clustering using our proposed distribution-based term weighting scheme.

Data Sets and Preprocessing
In this work, six text datasets from five sources are used for evaluation, as shown in Table 3. The first dataset, "Amazon", is a collection of 6000 reviews in three categories taken from the Book, DVD, and Electronics domains (2000 reviews for each domain) of the Amazon online shopping store, collected from Dredze's homepage at Johns Hopkins University (www.cs.jhu.edu/mdredze/datasets/sentiment).
The second dataset, "DI" (short for drug information), contains 4480 (640 × 7) online medical prescriptions in seven categories, provided in the form of HTML documents at www.rxlist.com, an online medical resource dedicated to offering detailed and current pharmaceutical information on brand and generic drugs. The third (WebKB1) and fourth (WebKB2) datasets consist of 4161 web documents from the same source, provided by the CMU Text Learning Group at www.cs.cmu.edu. These web documents were collected from the computer science departments of four universities, with some additional pages from other universities, in January 1997, under the World Wide Knowledge Base (WebKB) project. While WebKB1 includes web documents from the four popular classes (project (501), course (922), faculty (1118), and student (1620), out of the original seven classes), WebKB2 groups the same web documents into five classes by university, including Cornell (221) and Washington (237). The fifth dataset is the 20Newsgroups collection of newsgroup documents. The sixth dataset, "TR", is a collection of Thai reform-related opinions in selected categories, including Local government (Category 10), taken from the full set of more than 100,000 documents in twenty (20) categories. The documents are suggestions or comments written in the Thai language on how to reform Thailand in twenty areas (classes), collected by the online system at thaireform.org. While some comments are short, some are quite lengthy. The total number of terms in the DI and 20Newsgroups datasets is relatively large, while there is not much difference in the number of distinct terms among the six datasets; the 20Newsgroups has the largest number of distinct terms (i.e., 8286 terms).
Before using these datasets, we performed the following preprocessing steps. While the Amazon documents are plain texts without tags and headers, the DI and WebKB documents have some HTML tags and the 20Newsgroups documents have some news headers. Therefore, we excluded the HTML tags from the DI and WebKB documents and eliminated the headers from the 20Newsgroups documents. Moreover, for the five English datasets, the common preprocessing steps are (i) omitting stopwords, (ii) transforming all letters to lowercase, (iii) removing words shorter than 3 characters, (iv) applying the Porter stemmer to the remaining words, and (v) ignoring terms whose document frequency is lower than 0.001 percent of the number of documents in the collection. As the preprocessing for the Thai Reform dataset, the Thai comments are segmented by the LexTo word segmentation tool (www.sansarn.com/lexto/). Next, we remove non-alphanumeric characters and omit stopwords using the list provided by Jaruskulchai, C. (1998), and we also ignore terms that occur only once. As for class characteristics, the class sizes are quite uniform for all datasets, except that the Thai Reform dataset has relatively smaller classes than the others. When considering only distinct terms, the Amazon dataset has the largest number of distinct terms per class since many words are shared among all classes. The ratios of inter-similarity to intra-similarity for the DI and 20Newsgroups datasets are low, i.e., 0.3373 and 0.3548, respectively; thus we can expect high classification performance for these datasets, as they seem to have good separation between documents inside and outside each class. Such a ratio for the WebKB2 is very high (i.e., 0.9332), meaning that the separation between documents in a class and those outside the class is not good; thus we can expect low classification performance for this dataset.
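The English preprocessing steps above can be sketched as follows. This is an illustration only: the stopword list here is a tiny placeholder, and the Porter stemming and document-frequency cutoff steps are noted in comments rather than implemented (a full pipeline would add them, e.g., via NLTK's PorterStemmer).

```python
import re

# Tiny illustrative stopword list; the actual pipeline uses a full list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def preprocess(text, min_len=3):
    """Steps (i)-(iii) of the English pipeline: lowercase, tokenize on
    letters, drop stopwords and tokens shorter than min_len characters.
    Stemming (iv) and the document-frequency cutoff (v) are omitted here."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_len]
```

For example, "The Cats of War!" reduces to the tokens "cats" and "war": the stopwords and the punctuation are removed, and all surviving tokens have at least three characters.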

Experiment Settings
To evaluate our method, we have conducted five experiments using standard five-fold cross-validation. In addition, we have conducted experiments to investigate the performance of SL, SSL, and USL. The centroid-based method in [20], the seeded k-means algorithm in [13], and the k-means algorithm in [39] are used for SL, SSL, and USL, respectively. The first experiment aims to investigate the effect of a single distribution-based term weighting factor, combined with the traditional term weighting either as a multiplier (for promoting) or as a divider (for demoting). The second experiment analyzes the effect of combined distribution-based term weighting under different exponents (powers) of the term weighting factors. In total, 125 combinations are explored, i.e., 3 factors (SD, ACSD, ICSD), each with 5 different exponents (−1.0, −0.5, 0, 0.5, 1.0). The best-20, best-10, worst-20, and worst-10 combinations are characterized to explore whether each factor has a promoting or demoting effect on clustering performance. For the best-10 combinations, their performances on each dataset are also investigated. The baselines are the methods where the exponents of the factors (SD, ACSD, and ICSD) are set to zero. In the third experiment, we use distribution-based term weighting, extracted from predefined clusters, as an expression of the user intention, and evaluate the clustering performance when the user intention is varied. We use the distribution-based statistics extracted from WebKB1 (#classes = 4) as term weighting to represent the WebKB documents and then perform the conventional (unseeded) k-means to cluster the documents into 4 and 5 classes. In the same way, we also exploit the statistics extracted from WebKB2 (#classes = 5) as term weighting for 4-class and 5-class clustering. From the results, we evaluate the impact of distribution-based term weighting (user intention) on clustering performance.
The best-5 combinations for WebKB1 (or WebKB2) are selected for performance comparison. The fourth experiment surveys performance under varied training set sizes, where the ratio of the training dataset to the whole dataset is varied from 0% to 80%. Finally, the last experiment explores the performance of the best term weightings when the number of clusters is varied from two (2) to twenty (20) for each dataset, except for the 20N dataset, where it is varied from 15 to 100 (in steps of 5) due to its large number of classes.
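The 125-combination grid of the second experiment can be enumerated in a few lines; this sketch only reproduces the exponent grid described above, with our own names for the constants.

```python
from itertools import product

# The five exponent values explored for each distribution factor.
EXPONENTS = (-1.0, -0.5, 0.0, 0.5, 1.0)

def exponent_grid():
    """All (alpha, beta, gamma) settings for the SD, ACSD, and ICSD
    exponents: 5 * 5 * 5 = 125 combinations in total."""
    return list(product(EXPONENTS, repeat=3))

BASELINE = (0.0, 0.0, 0.0)  # all-zero exponents, i.e., plain nTF*IDF
```

The baseline configuration, with all three exponents at zero, is one of the 125 grid points.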

Evaluation Measures
As evaluation metrics, we apply three types of measures: class-based, cluster-based, and similarity-based measures. As the class-based measure, the geometric mean (GM) of the accuracy (A) and the macro-averaged f-measure (F) in Equation (14) is used. This measure has a strong point in its fairness when evaluating tasks with data imbalance [41].
Here, when the number of clusters is not equal to the number of classes in the training set, a greedy method is applied to map multiple clusters to a single class in order to absorb the difference between the number of actual classes and the number of predicted clusters. As the cluster-based measure, the purity represents the ratio of the number of instances with the most frequent label in each cluster to the total number of instances, as shown in Equation (15).
where C_k denotes the k-th cluster and L_m represents the m-th labeled class. As the last type, the similarity-based measure is calculated by pairwise cosine similarity within/among clusters. The similarity among instances in the same cluster, the so-called intra-similarity (Equation (16)), as well as the similarity among instances in different clusters, the so-called inter-similarity (Equation (17)), are used.
Here, d i and d j are the document vectors for the document d i and d j , respectively.
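The class-based and cluster-based measures above can be sketched briefly. This is an illustration, not the paper's exact Equations (14)-(15): the greedy cluster-to-class mapping is omitted, and the function names are our own.

```python
import math
from collections import Counter

def geometric_mean(accuracy, f_measure):
    """Class-based measure: GM of accuracy and macro-averaged f-measure."""
    return math.sqrt(accuracy * f_measure)

def purity(cluster_ids, true_labels):
    """Cluster-based measure: fraction of instances carrying the most
    frequent true label of their own cluster."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(label)
    majority = sum(Counter(labels).most_common(1)[0][1]
                   for labels in members.values())
    return majority / len(true_labels)
```

For example, a clustering that places labels [a, a, b] in one cluster and [b, b] in another has purity (2 + 2) / 5 = 0.8, since one "b" sits in a majority-"a" cluster.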

Cluster Quality of Single Factor
The first experiment investigates the effect of an individual distribution factor on clustering quality by adding one single term distribution factor (either SD, ACSD, or ICSD) to the frequency-based weighting. In this experiment, either TF × IDF or nTF × IDF is explored as the frequency-based weighting. Recall that nTF, the normalized term frequency of the j-th term in the i-th document, is defined as ntf_ij = tf_ij / ∑_{k=1}^{|T|} tf_ik, where tf_ik is the frequency of the k-th term in the i-th document and |T| is the number of possible terms, as shown in Section 2.1. These are the same schemes as shown in Table 2. The cluster quality evaluation is conducted in both classification and clustering manners. For classification, we perform the centroid-based method with five-fold cross-validation, where 80% of the data are used for centroid calculation and the remaining 20% for performance testing. For clustering, we perform the seeded k-means method [13] with the same five-fold cross-validation. Table 4 shows the results. In Table 4, SD_T, SD_N, SD_TI, and SD_NI indicate that the standard deviation (SD) is calculated from the term frequency (T), the normalized term frequency (N), the term frequency with inverse document frequency (TF × IDF: TI), and the normalized term frequency with inverse document frequency (nTF × IDF: NI), respectively. The ACSD and ICSD are explored in the same manner. The distribution factor is attached to the frequency-based component (FW) in two styles, promoter (×) and demoter (/), as shown in the first column of Table 4. Since five folds are performed, the p-value can be calculated from a one-tailed t-test over these five trials. As significance marks, †††, ††, and † are provided when the p-value is ≤ 0.01, ≤ 0.05, and ≤ 0.1, respectively. The Avg. column shows the averaged performance over the six datasets. For each distribution factor, we compare the promoter (×) performance with the demoter (/) performance and highlight the winner in bold. From Table 4, some observations can be made as follows.
Firstly, it is not surprising that the centroid-based classification obtains approximately 2-5% higher performance (GM) than seeded k-means clustering since the former is a supervised method while the latter is an unsupervised one. Secondly, nTF × IDF (the right part of the table) outperforms TF × IDF (the left part of the table). This implies that the normalization helps improve classification and clustering performance. Thirdly, in most cases, SD and ACSD perform well as demoters while ICSD works well as a promoter. This phenomenon is consistent with the result reported in [20]. Fourthly, the distribution factors (statistics) calculated from normalized term frequency with inverse document frequency (nTF × IDF: NI) seem to capture the intuitive properties of the documents and the collection well. As a result, in the following experiments, we use nTF × IDF both as the frequency-based weighting and for calculating the distribution factors.
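For illustration, the frequency-based component can be sketched as below. This is a minimal sketch that assumes nTF is the cosine-normalized term frequency and IDF is the standard log(N/df); the paper's exact definitions are given in its Section 2.1.

```python
import math

def ntf_idf(docs):
    """Sketch of the nTF x IDF frequency-based weighting (FW).

    Assumes nTF_ij = tf_ij / sqrt(sum_k tf_ik^2) (cosine-normalized
    term frequency) and IDF = log(N / df); the paper's exact
    formulas may differ in detail.
    """
    N = len(docs)
    vocab = {t for d in docs for t in d}
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    idf = {t: math.log(N / df[t]) for t in vocab}
    weights = []
    for d in docs:
        tf = {t: d.count(t) for t in set(d)}
        norm = math.sqrt(sum(v * v for v in tf.values()))
        weights.append({t: (tf[t] / norm) * idf[t] for t in tf})
    return weights
```

A term appearing in every document (df = N) receives an IDF of zero, so its weight vanishes regardless of its frequency.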

Cluster Quality of Multiple Factors
The result of the first experiment suggests the promoting/demoting role of each of the three single factors of distribution-based term weighting. The experiment in this section explores the performance of combinations of these parameters in order to find the most promising ones. The exponents of the parameters (i.e., SD = α, ACSD = β, and ICSD = γ) are varied between −1.0 and 1.0 with a step size of 0.5. By this, there are 125 (5 × 5 × 5) combinations in total. Here, the factors of SD, ACSD, and ICSD are calculated when the standard term weighting (nTF × IDF) is applied. Three algorithms, centroid-based, seeded k-means, and conventional k-means, are investigated. While the first and second algorithms set the k initial centroids by calculating them from the training set, the conventional k-means method randomly selects k points as the initial centroids; 100 trials are performed to reduce the effect of sampling variations, and the maximum value is reported. Based on the average GM over the six datasets, the best-20 (or best-10) weightings (combinations) as well as the worst-20 (or worst-10) weightings are collected and their exponents are analyzed. By investigating the best-20 and worst-20 weightings, the exponent of each parameter is characterized. Table 5 shows the numbers of the best-20 (or best-10) and the worst-20 (or worst-10) weightings grouped by the exponent of each parameter (SD, ACSD, and ICSD). For example, in the first row of Panel I, Panel A (best), the number '5(3)' means that five of the best-20 weightings (three of the best-10) have the exponent of SD = −1. Panel A shows the numbers for the best-20 (best-10 in parentheses) while Panel B displays those for the worst-20 (worst-10 in parentheses). Table 5 implies that SD and ACSD work well as demoters since most of the best-20 (and best-10) weightings have negative exponents for SD and ACSD.
On the other hand, ICSD acts as a strong promoter since most of these combinations (weightings) have positive exponents for it. Moreover, while for the centroid-based and seeded k-means algorithms most of the best weightings have zero as the exponent of ICSD, the conventional k-means algorithm prefers a positive value for the exponent of ICSD. In other words, the inter-class weight (ICSD) affects unseeded k-means while it does not influence the seeded versions, i.e., the seeded k-means and centroid-based algorithms.
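The combination explored here can be sketched as follows. The multiplicative power form w = FW × SD^α × ACSD^β × ICSD^γ is our reading of the promoter (×) / demoter (/) notation, where a positive exponent promotes and a negative one demotes; the paper's exact formulation may differ.

```python
import itertools

def combined_weight(fw, sd, acsd, icsd, alpha, beta, gamma):
    """One term's weight under the assumed power-form combination
    w = FW * SD^alpha * ACSD^beta * ICSD^gamma, where FW is the
    frequency-based component (nTF x IDF).  Negative exponents demote
    a term; positive exponents promote it."""
    return fw * (sd ** alpha) * (acsd ** beta) * (icsd ** gamma)

# the 5 x 5 x 5 = 125 exponent combinations examined in Table 5
exponents = [-1.0, -0.5, 0.0, 0.5, 1.0]
combinations = list(itertools.product(exponents, repeat=3))
```

Setting an exponent to zero removes the corresponding factor, so the single-factor schemes of the previous experiment are special cases of this grid.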
As a further analysis, the GM performances of the best-10 weightings and the baseline are investigated, as shown in Table 6. We can observe that 15 weightings are superior to the baseline for the centroid-based algorithm, 11 weightings for the seeded k-means, and 64 weightings for the conventional k-means. Averaged over the six datasets, the best weightings for the centroid-based (i.e., SC1), seeded k-means (i.e., SK1), and conventional k-means (i.e., UK1) are superior to the baseline with gaps of 2.47% (varying from −0.10% for AM to 5.21% for DI), 4.28% (varying from 1.7% for AM to 9.13% for KB2), and 28.68% (varying from 15.11% for KB1 to 53.48% for KB2), respectively. One more observation is that the rankings of the centroid-based and seeded k-means algorithms look similar while both are quite different from that of the conventional k-means. The most dominant difference is that the effect of ICSD is important for the conventional k-means but not so important for the centroid-based and seeded k-means. Table 6. Geo-mean (GM) performance of the best-10 weightings and the baseline for the six datasets: Panel I for the centroid-based, Panel II for the seeded k-means, and Panel III for the conventional k-means.
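For reference, the centroid-based method used as the supervised counterpart in these comparisons can be sketched as below. This is a minimal illustration assuming class centroids are mean vectors and assignment uses cosine similarity; the paper's variant may differ in detail.

```python
import numpy as np

def centroid_classify(train_X, train_y, test_X, k):
    """Sketch of the centroid-based method: each class is represented
    by the mean vector of its (weighted) training documents, and a
    test document is assigned to the class whose centroid is most
    similar by cosine similarity."""
    centroids = np.array([train_X[train_y == c].mean(axis=0)
                          for c in range(k)])
    # L2-normalize rows, then a dot product equals cosine similarity
    tn = test_X / np.linalg.norm(test_X, axis=1, keepdims=True)
    cn = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return (tn @ cn.T).argmax(axis=1)
```

Seeded k-means starts from exactly these class centroids and then refines them with the iterative k-means loop, which is why the two methods behave similarly in Table 6.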

Term Weighting as Expression of User Intention
In this experiment, the distribution-based term weighting is calculated from the statistics extracted from predefined clusters, as an expression of the user intention. The clustering performance is evaluated with different user intentions using the WebKB dataset. Concretely, the statistics extracted from KB1 (#classes = 4) are used as term weighting to represent the WebKB documents, and then the conventional (unseeded) k-means is executed to cluster the documents into 4 and 5 classes. The conventional k-means is run with 100 trials of randomly selected initial centroids (k points), and the maximum value is reported. Similarly, the statistics extracted from KB2 (#classes = 5) are used as term weighting instead. To evaluate the impact of distribution-based term weighting (user intention) on clustering performance, the best-5 weightings for KB1 (or KB2) are selected for performance comparison. Table 7 shows a performance comparison between the two user-defined dimensions of WebKB, i.e., KB1 and KB2. Here, the best-5 weightings are evaluated with the unseeded k-means (UK). Table 7. Geo-mean of accuracy and f-measure when the user intention is expressed by the distribution-based term weightings calculated from KB1 (Panel I) and those calculated from KB2 (Panel II). Here, the values before the parentheses are the geo-mean of accuracy and f-measure while those in the parentheses are accuracy and f-measure, respectively. The result shows that it is possible to use the distribution statistics as term weighting for guiding the clustering process. Term distribution extracted from a dimension is useful to guide clustering on that dimension, as the clustering performance on it is high.
For example, Panel I indicates that the distribution extracted from the first dimension of WebKB (KB1 with four classes) can help classify a text on the first dimension with a geo-mean between 64.55% and 67.09%. Conversely, the performance on the second dimension is relatively low, with a geo-mean of 29.80-31.14%. On the other hand, Panel II shows that the distribution extracted from the second dimension of WebKB (KB2 with five classes) is suitable for classifying a text on the second dimension with a geo-mean between 85.97% and 87.34%. In the same way, the performance on the first dimension is relatively low, with a geo-mean of 30.10-35.72%.

Investigation of Various Training Set Sizes
This section aims to explore the effect of training set size on the performance of our constrained k-means. The dataset is split into two sets: 80% for the training set and 20% for the test set. To investigate the effect of the training set size, the test set is fixed to 20% of the whole dataset while the training set size is set to 5%, and from 10% to 80% with a step size of 10%. To reduce the effect of overfitting to the training set, each experiment is performed 100 times with random splits and the performance is the average over these trials. The algorithms in comparison are the centroid-based algorithm and the seeded k-means algorithm. The results are shown in Figure 2a,b, respectively. Here, we select the best weighting of each dataset in Table 6, later called 'the best' for short. In the figure, the legends represent the number of classes in the dataset and the exponent of the weight for each distribution factor. For example, for the Thai Reform dataset (TR), the legend "TR (3, −0.5, −1, 0.5)" describes that the number of classes is 3 and the best weighting has SD = −0.5, ACSD = −1, and ICSD = 0.5. Some observations can be made as follows. Firstly, for both the centroid-based and seeded k-means algorithms, the larger the training set, the higher the performance. Secondly, when a large training set, say 80%, is provided, there is only a small difference between the centroid-based and seeded k-means algorithms. Thirdly, for the datasets with a small number of classes, such as AM, TR, and KB1, the seeded k-means algorithm tends to outperform the centroid-based algorithm when a small training set is used. However, for the datasets with a large number of classes, such as DI, KB2, and 20N, when the training set is small, the centroid-based algorithm has a tendency to obtain a higher GM than the seeded k-means algorithm.
A possible reason is that, for a small training set with a large number of classes, when the iterative clustering process is performed after the centroid initialization (that is, the seeded k-means algorithm), the clusters may become more diverse and the performance then becomes lower, compared to the pure centroid-based algorithm. Fourthly, for all datasets, the performance becomes stable when the training set size is large enough. In this experiment, the performance on most datasets becomes stable at a training set size of 40%.
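The seeded initialization compared above can be sketched as follows: the initial centroids are the class means of the labeled training documents, after which the usual assign/update loop runs. This is a minimal illustration using Euclidean distance; the paper's actual similarity measure and implementation details may differ.

```python
import numpy as np

def seeded_kmeans(X, seed_X, seed_y, k, n_iter=100):
    """Minimal seeded k-means sketch: centroids are initialized from
    the class means of the labeled seed documents (as in [13]), then
    the standard assign/update loop refines them on the data X."""
    centroids = np.array([seed_X[seed_y == c].mean(axis=0)
                          for c in range(k)])
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute centroids; an empty cluster keeps its old centroid
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centroids[c] for c in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

The unseeded (conventional) variant differs only in the first line: the k initial centroids are drawn at random, which is why it needs many trials to stabilize its results.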
To contrast with the seeded k-means, we also perform the unseeded version (the conventional algorithm), by setting k initial centroids randomly and then performing the iterative k-means process.
To alleviate the effect of the initial clusters, 100 trials are made and their average and maximum are calculated. Unlike the centroid-based and seeded k-means algorithms, where the best weighting is selected based on the average over the six datasets, in this experiment we select the best weighting for each dataset, later called 'the best' for short. That is, the best weightings for different datasets may differ. The results are shown in terms of the maximum GM (Figure 3a) and the average GM (Figure 3b). The following observations can be made. Firstly, the maximum performance is naturally higher than the average performance. Secondly, the performance in terms of maximum and average GM has the same tendency for DI, AM, TR, KB1, and KB2, but not for 20N. Thirdly, the 'maximum' and 'average' performances on the 20N dataset are quite different, as shown in Figure 3. One possible cause is the effect of the number of classes: the performance on a dataset with a large number of classes (for example, 20N in this experiment) tends to have high variance. Fourthly, the unseeded k-means method can obtain a high GM for the DI dataset, compared to the other datasets. Referring to Table 3, the ratio of inter-similarity to intra-similarity of DI is small (i.e., 0.3373), implying that the seven classes of DI are quite clearly distinct in nature. Fifthly, the results for DI, AM, and KB1 are quite stable even when the training set size is varied. In the cases of TR and KB2, the performance increases when the training set becomes larger. TR and KB2 have a high ratio of inter-similarity to intra-similarity, 0.7363 and 0.9332, respectively; for these two datasets, the larger the training set, the higher the performance obtained. Sixthly, KB1 has a high ratio of inter-similarity to intra-similarity, i.e., 0.7547, and its performance is low and stable.
It is relatively hard to classify/cluster the KB1 documents, as shown in Table 7. Therefore, the performance on this dataset is low, even when the weighting is applied.

Effect of Cluster Number on Cluster Quality
This section presents an investigation of how the number of clusters affects the cluster quality. To this end, we vary the number of groups (clusters) in the clustering process and then explore the performance. As mentioned in Section 4.3, the performance measures are of three types: class-based, cluster-based, and similarity-based metrics.
In this experiment, the baseline is the conventional k-means algorithm with the weighting NI = nTF × IDF, where term distribution is not applied but only term frequency and inverse document frequency are used; 100 trials with random initial clusters are made and the performance is averaged. The investigation is performed on the six datasets using our proposed method, which is the conventional k-means with the best distribution-based term weighting of each dataset (as in Figure 3). Figures 4-6 show the average cluster performance in terms of class-based, cluster-based, and similarity-based measures, respectively. Each figure shows the performance of the best term weightings (later called 'the best' for short) when the number of clusters is varied from two (2) to twenty (20) for each dataset, except the 20N dataset, for which it is varied from 15 to 100 (in steps of 5) due to its large number of classes. The large circle marks in each graph indicate the performance when the original number of clusters is used. From these figures, some observations can be made as follows. Firstly, for all datasets, incorporating term distribution into term weighting as the constraint helps improve the performance over the baseline for all metrics: GM, purity, and the ratio of inter- to intra-similarity. Moreover, this advantage remains even if the number of clusters is set higher. Secondly, for the AM and TR datasets, the highest GM and purity are achieved when the number of clusters is set to four (or five), even though the original number of clusters is three, as shown in Figure 4. When the number of clusters becomes higher than five, the GM and purity of the resultant clusters decrease. Referring to Table 3, the AM and TR datasets contain relatively short documents, i.e., texts with fewer than 70 terms (on average 64.58 words for AM and 43.91 words for TR). Therefore, grouping these short documents seems difficult since they include less information for clustering.
The highest GMs of the best and the baseline for AM are 69.73% and 61.69%, respectively, when the number of clusters is five. The highest GMs of the best and the baseline for TR are 90.62% and 85.84%, respectively, when the number of clusters is six. The highest purity of the best for AM is 70.86%, when the number of clusters is four. The highest purities of the best and the baseline for TR are 89.70% (#clusters = 6) and 87.77% (#clusters = 4), respectively. For both AM and TR, when the number of clusters becomes higher than five, the performance of both the best and the baseline drops. Thirdly, as shown in the bottom-most section of Table 3, the ratios of inter- to intra-similarity (by cosine similarity) of the DI and 20N datasets are lower (0.3373 for DI and 0.3548 for 20N) than those of the other datasets (0.6784 for AM, 0.7547 for KB1, 0.9332 for KB2, and 0.7363 for TR). This means that clustering or classification on the DI and 20N datasets is easier than on the other datasets. Accordingly, the GM and purity gaps between the best and the baseline on DI are quite large, and the gap remains obvious when the number of clusters increases, as shown in Figures 4 and 5. However, for the 20N dataset, since the number of classes (clusters) is large (20 groups), the classification task becomes complicated and the performance is relatively low, i.e., approximately 50% for both GM and purity. There is only a trivial gap between the best and the baseline in terms of both the GM and purity indices.
Fourthly, for the WebKB dataset (KB1 and KB2), there is a medium gap (in GM and purity) between the best and the baseline. However, for KB2, the purity of the baseline is quite stable and there is only a small gap when the number of clusters increases from two to twenty, as shown in Figure 5. For KB2, where one large class exists (docs per class: 221/237/249/304/3150, see Table 3), the distribution-based term weighting seems more effective at preserving cluster quality than the traditional term weighting. Although the GM and purity results on the KB2 dataset show that the best is superior to the baseline, when the number of clusters is small, the baseline performs better. One explanation of this outcome is that the KB2 dataset has one large class (3150 documents) and the performance on this dataset drops when the number of clusters is smaller than the original. Lastly, Figure 6 presents the performance of the best distribution-based weighting, which also achieves a lower ratio of inter- to intra-similarity (by average cosine similarity) than the baseline. Our proposed method achieves better improvement in the resultant clusters, compared to the baseline. When we increase the number of clusters, the average ratio of inter- to intra-similarity over 100 trials is also improved. The ratio of the baseline to the best (the black square) indicates their relative performance: for all datasets, this ratio is higher than 1.0, that is, the distribution-based term weighting improves the quality of clustering. Unlike classification, where the number of classes is fixed, in this work the number of clusters can be varied, and the preservation of the GM, purity, and the average ratio of inter- to intra-similarity is observed.
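The class-based and cluster-based measures used above can be sketched as below. Purity is the fraction of documents falling into the majority true class of their cluster, and GM is taken here as the geometric mean of accuracy and f-measure, which is our reading of the paper's 'geo-mean'; the exact definitions are in Section 4.3.

```python
import math
from collections import Counter

def purity(true_labels, cluster_labels):
    """Cluster purity: the fraction of documents that belong to the
    majority true class of the cluster they were assigned to."""
    total = 0
    for c in set(cluster_labels):
        members = [t for t, g in zip(true_labels, cluster_labels) if g == c]
        total += Counter(members).most_common(1)[0][1]
    return total / len(true_labels)

def geo_mean(accuracy, f_measure):
    """GM as assumed here: geometric mean of accuracy and f-measure."""
    return math.sqrt(accuracy * f_measure)
```

Note that purity trivially increases as the number of clusters grows (each singleton cluster is pure), which is why the figures report it together with GM and the similarity ratio rather than on its own.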

Discussion and Related Works
In this section, the constrained clustering using distribution-based term weighting is discussed, along with related works. Most constrained clustering methods use either labeled data or a set of MUST-LINK and CANNOT-LINK pairwise constraints to guide blind clustering. However, to the best of our knowledge, there has been no investigation of term weighting as a constraint for clustering. In the past, term weighting was used as a means to improve the classification process in several studies. Early works straightforwardly applied frequency-based term weighting (FW), in the form of TFIDF, such as [42][43][44]. However, FW may not be sufficient to reflect the importance level of a term with respect to the characteristics of a class, since these statistics come from the whole collection, regardless of class membership. To address this drawback, some works [20,21,45,46] exploited class-based statistics to reflect the class information during classification, e.g., chi-square, information gain, gain ratio, and inverse class frequency. In contrast to term frequency, term distribution can be used to express the importance of a term by assigning different scores to a term with high distribution and a term with low distribution, in the form of distribution-based weighting (DW). Our DW uses class information to promote and demote a term. Using only the frequency-based term weighting (FW), the centroid-based method (B-SC), seeded k-means (B-SK), and k-means (B-UK) obtain 89.00%, 86.25%, and 51.36%, respectively, as shown in Table 6. On the other hand, when the distribution-based term weighting (DW) is also used, the centroid-based method (SC1), seeded k-means (SK1), and k-means (UK1) obtain 91.47%, 90.53%, and 80.04%, respectively. These are gaps of 2.47%, 4.28%, and 28.68%, corresponding to approximately 2.78%, 4.96%, and 55.84% improvement rates over the FW performances. From the error reduction viewpoint, they are 22.45%, 31.13%, and 58.96% reduction rates over the FW performances.
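As a quick check of these figures, the error reduction rates can be reproduced from the reported GM values. The sketch below assumes the rate is defined as the removed share of the baseline's remaining error, (GM_DW − GM_FW) / (100 − GM_FW); this assumed definition reproduces the reported 22.45%, 31.13%, and 58.96%.

```python
def error_reduction_rate(baseline_gm, improved_gm):
    """Error reduction rate (%): the fraction of the baseline's
    remaining error (100 - GM) that the improved weighting removes.
    Assumed definition; it reproduces the paper's reported numbers."""
    return 100.0 * (improved_gm - baseline_gm) / (100.0 - baseline_gm)

# FW vs. DW geo-means from Table 6 (B-SC/SC1, B-SK/SK1, B-UK/UK1)
pairs = [(89.00, 91.47), (86.25, 90.53), (51.36, 80.04)]
rates = [round(error_reduction_rate(b, i), 2) for b, i in pairs]
```

This framing explains why k-means shows by far the largest reduction: its baseline leaves the most error (48.64 points) for the distribution-based weighting to remove.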
The improvement triggered by the distribution-based term weighting (DW) is quite significant. The class information affects the clustering process, as shown in Table 7. Figures 4-6 show the performance when the number of clusters is varied. The figures also indicate that the DW can help enhance the performance of FW.

Conclusions
In this paper, three types of distribution-based term weightings are used as a distance constraint to improve document clustering, i.e., the distribution of a term in the collection (SD), the average distribution of a term in a class (ACSD), and the average distribution of a term among classes (ICSD). Weighting terms helps guide the clustering process by promoting or demoting terms based on their importance in the context. This weighting is calculated from statistics that capture the class characteristics in terms of term distribution. The experiments showed that SD and ACSD should be used as demoters, but ICSD as a promoter. Compared to the conventional TFIDF, the distribution-based term weighting improves the centroid-based method, seeded k-means, and k-means with error reduction rates of 22.45%, 31.13%, and 58.96%, respectively. This characteristic also holds when we vary the size of the training set. One main advantage of our approach is that we can cluster data or objects (in this work, documents) into any k clusters by exploiting the statistics (or knowledge) from a set of classified examples. In the future, we plan to apply this approach to other types of clustering, such as affinity propagation, agglomerative clustering, BIRCH, DBSCAN, mean shift, OPTICS, spectral clustering, Gaussian mixture models, the family of k-means and k-medoids, and fuzzy-based clustering. Additionally, we plan to explore the efficiency of our method with other machine learning algorithms, including those of active learning, classification, and regression. Another interesting topic is to investigate the effectiveness of our proposed method on a number of standard tabular datasets with spherical or non-spherical expected clusters.
Moreover, it is worth exploring the efficiency and effectiveness of this approach when dimensionality reduction is applied, such as latent semantic analysis (LSA), factor analysis (FA), random projection (RP), independent component analysis (ICA), linear discriminant analysis (LDA), principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), isometric mapping (ISOMAP), and uniform manifold approximation and projection (UMAP).