Method of Feature Reduction in Short Text Classification Based on Feature Clustering

Abstract: One decisive problem in short text classification is the severe dimensional disaster that arises when a statistics-based approach is used to construct vector spaces. Here, a feature reduction method based on two-stage feature clustering (TSFC) is proposed and applied to short text classification. Features are semi-loosely clustered by combining spectral clustering with a graph traversal algorithm. Next, intra-cluster feature screening rules are designed to remove outlier feature words, which improves the quality of the similar feature clusters. We classify short texts with the corresponding similar feature clusters instead of the original feature words. Similar feature clusters replace feature words, and the dimension of the vector space is significantly reduced. Several classifiers are utilized to evaluate the effectiveness of this method. The results show that the method largely resolves the dimensional disaster and can significantly improve the accuracy of short text classification.


Introduction
Communication on the internet is increasingly frequent, producing a considerable amount of textual data. Most of these data are short texts, such as microblogs and BBS (Bulletin Board System) posts. Short text classification faces new challenges due to the limited length of these texts and the prevalence of internet vocabulary, abbreviations, and synonyms [1,2]. Therefore, one research hotspot in natural language processing is precisely classifying these short texts and effectively analyzing their meanings.
The early work of short text classification involves converting texts into a data representation that computers can process, which is crucial in text classification and directly affects classification performance [3]. A basic representation of a text is the bag of words, in which each text is represented as a vector of words. In traditional short text classification tasks, a VSM (vector space model) or a word embedding model usually represents the text vectors. However, the following problems exist when applying these methods to short text classification: (1) The spatial distance between words is not considered, and dimensional disasters are easily triggered when constructing text vectors with the VSM and TF-IDF (term frequency-inverse document frequency). The VSM only pays attention to statistical information, such as word frequency. Moreover, it ignores the influence of other factors, such as synonyms, on text similarity, thus reducing the accuracy of short text classification [4][5][6]; (2) By modeling the context and semantic relations of words, the word embedding model maps words to an abstract low-dimensional real space and generates a corresponding word vector, which is an effective way of representing text. The main contributions of this paper are as follows:

• A two-stage feature clustering (TSFC) method is proposed, which takes into account both the efficiency and the accuracy of feature clustering.

• The original features are replaced by similar feature clusters to express feature vectorization.

• The method presented in this paper can effectively reduce the feature dimension.

The rest of this paper is organized as follows: Section 2 introduces the related works on text semantic extension. Section 3 presents the proposed method, including the TSFC algorithm and the intra-cluster feature selection rules. Section 4 verifies the validity of the method with experiments on classification of short Chinese texts. We draw conclusions in Section 5.

Related Works
Feature reduction can reduce computation and improve the accuracy of text classification. A document analysis method, called the discriminant coefficient, was proposed by Xu et al. to reduce the features [12]. In view of the shortcomings of traditional serial and parallel feature fusion methods, Yu et al. proposed a dimensionality reduction method that applies PCA (principal component analysis) to feature vectors before fusing them [13]. Li et al. proposed a general importance-weighted feature selection strategy for text classification, in which the importance of a feature in a document is determined by its relative frequency in that document [14].
Researchers mainly carry out the semantic extension of short texts in two ways: constructing an external knowledge base and optimizing feature representation. Many have attempted to use an external corpus to construct a semantic ontology to optimize classification. Xu et al. [27] took a pre-built ontology as the knowledge base and introduced semantic relationships between the features. However, constructing the ontology itself requires a great deal of extra work. Desai et al. [28] utilized currently available packages, such as WordNet, a lexical database for English from Princeton University, to help build the ontology. Although this reduces some of the effort of building the ontology, the size and coverage of WordNet limit its semantic understanding. WordNet is also English-specific, so this resource can no longer be used when the corpus is in another language. Pak et al. [29] proposed a contextual advertising approach based on Wikipedia matching, so as to embed candidate ads into related pages. Ren et al. [30] combined background knowledge with a neural network to classify texts and expanded the dimension of feature expression. Wu et al. [18] proposed an effective semantic matching method for Wikipedia-based text document classification. This method uses a Wikipedia corpus to construct a concept space, uses heuristic rules to select concepts, and then maps texts to concepts one by one to improve text classification accuracy.
The optimization of feature representation is another effective method for semantic expansion in short texts, which optimizes within a text collection. Jiang et al. [31] combined a neural network language model and Word2Vec to increase the accuracy of emotional analysis. Lilleberg et al. [32] utilized word embedding to generate document vectors and connected these vectors as additional features to the term frequency-inverse document frequency (TF-IDF) vector to improve classification accuracy in various classifiers. Song et al. [33] employed the minimum spanning tree (MST) clustering method to select features, and they proved that the method can improve the accuracy of text classification while ensuring efficiency. Cao et al. [26] applied a feature reduction strategy and used a k-means algorithm to cluster feature words and then classify them. However, this simple clustering method is unable to achieve an appropriate number of feature clusters or control the size of clusters. Ged et al. [15] used a loose clustering strategy based on graph breadth-first traversal to perform feature clustering. However, this method is inefficient, and the internal correlation of the obtained feature clusters cannot be guaranteed. Vlachostergiou et al. [34] designed a deep model to learn robust, resizable representations of features from untagged data. This method is mainly applicable to the supervised setting.

The Proposed Approach
We put forward an effective TSFC method to overcome the contradiction between the efficiency and accuracy of feature clustering. Figure 1 shows the framework of the TSFC approach, which consists of the following four steps. First, the feature word bags are constructed: after text preprocessing, the feature word bags are built by selecting all of the appropriate feature words in the whole corpus. Second, similar feature clusters are constructed by TSFC: all feature words are clustered into multiple preliminary sub-clusters by a spectral clustering algorithm in the first stage; the similar features within each feature sub-cluster are then paired, and these feature pairs are loosely clustered by a graph traversal algorithm in the second stage, yielding several similar feature clusters. Third, the feature clusters are optimized: in order to weaken the influence of polysemous or irrelevant words on classification accuracy, feature selection rules are designed to ensure the size and quality of each feature sub-cluster; classification accuracy can be significantly improved by eliminating such words from the final clusters of similar features. Finally, several classifiers based on the feature vectors of similar feature clusters classify the short text corpus. In Table 1, we define the symbols used in this paper.

Construct the Feature Word Bags
The experimental corpus of this paper consists of the Fudan Chinese text classification dataset and the China Mobile SMS dataset. The steps of text preprocessing are as follows:

• Collection of an internet dictionary. Social network texts are more informal and colloquial than formal texts. They contain many new internet buzzwords, such as "laotie" and "lanshouxianggu". The internet dictionary is a user-defined dictionary. It includes internet terms that contribute to the precision of word segmentation.

• Stopword removal. Stopwords are words that carry no concrete meaning (e.g., prepositions, pronouns, and conjunctions). Therefore, stopwords need to be removed from each document to avert a negative impact on the method. A Chinese stopword list with 2929 entries is applied here.

• Word segmentation. Chinese word segmentation refers to cutting a Chinese character sequence into individual words, regrouping consecutive character sequences into compound word sequences. The NLTK (Natural Language Toolkit) [35] segmentation kit is used in this method.

• Construct feature word bags. After word segmentation of the whole corpus, feature words whose frequency is higher than the set threshold are selected, without repetition, to construct the feature word bags.
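The word-bag construction step above can be sketched as follows. This is a minimal sketch, not the authors' exact implementation; the function name, the frequency threshold, and the toy segmented documents are illustrative assumptions.

```python
from collections import Counter

def build_word_bag(segmented_docs, stopwords, min_freq=2):
    """Build the feature word bag: after stopword removal, keep each
    unique word whose corpus-wide frequency reaches the threshold."""
    counts = Counter()
    for doc in segmented_docs:
        counts.update(w for w in doc if w not in stopwords)
    # Unique feature words above the frequency threshold
    return sorted(w for w, c in counts.items() if c >= min_freq)

# Toy pre-segmented corpus (illustrative only)
docs = [["short", "text", "the", "text"], ["text", "feature", "the"]]
print(build_word_bag(docs, stopwords={"the"}, min_freq=2))  # ['text']
```

In practice, the segmented documents would come from the segmentation step, with the internet dictionary and the 2929-entry stopword list applied beforehand.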

Two-Stage Feature Clustering
The TSFC method contains two stages. The sub-clusters obtained in the first stage are called initial feature clusters, and those obtained in the second stage are called similar feature clusters. In the first stage, the word bags are cut into multiple initial feature clusters. The purpose of this preliminary clustering is to find semantically similar features within a smaller feature set, namely the sub-cluster obtained in this step. Its advantage is that it significantly reduces the overall computation of feature clustering, because a large number of pairwise comparisons are needed in the clustering process. In addition, the size of the final similar feature clusters can be controlled to some extent. In the second stage, a graph traversal algorithm is used to directly connect semantically similar features and merge them into clusters. This method can cover more feature points than a traditional clustering algorithm and avoid the loss of effective information. However, it imposes no distance or density constraint of the kind emphasized in traditional clustering methods, so we call it loose clustering. Combined with the strict clustering of the first stage, TSFC is a semi-loose clustering strategy. This is a compromise, because the accuracy of traditional clustering methods in constructing similar feature clusters is not satisfactory, while the computation required by a purely graph-based search is enormous.

Adaptive Spectral Clustering
At the first stage, adaptive spectral clustering is utilized. Spectral clustering works by first transforming the data from Cartesian space into similarity space and then clustering in similarity space. The original data is projected into the new coordinate space, which encodes information regarding how nearby data points are. The similarity transformation reduces the dimensionality of space and, loosely speaking, pre-clusters the data into orthogonal dimensions. This pre-clustering is non-linear and it allows for arbitrarily connected non-convex sample space, which is an advantage of spectral clustering. Another advantage of spectral clustering is that it can improve the time efficiency of constructing similar feature clusters and ensure the effectiveness of each similar feature cluster.
We use an adaptive multipath spectral clustering algorithm in this approach. According to the adaptive strategy proposed by Liu et al. [36], different values can be obtained for different local densities, so a more accurate number of clusters can be obtained for the dataset. It is worth mentioning that the TSFC algorithm is not sensitive to the initial clustering value K, and the initial K value can be adjusted to a roughly appropriate value according to the size of the dataset. The focus of our work is the feature clustering in the second stage; the purpose of the first stage is only to divide the feature word bag into multiple initial feature clusters. Initial clustering only needs to divide the text feature word space into several small blocks and ensure that the clusters are neither too large nor too small, while the second-stage graph search can connect the individual feature words. In other words, cluster partition errors from the first stage can, to a certain extent, be corrected in the second stage.
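The first stage could be sketched with scikit-learn's standard SpectralClustering in place of the adaptive variant of Liu et al. [36]; the fixed K, the precomputed cosine affinity, and all names and toy vectors below are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def initial_feature_clusters(words, vectors, k=3):
    """Split the word bag into k initial feature clusters by spectral
    clustering on a cosine-similarity affinity matrix."""
    norm = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    affinity = (norm @ norm.T + 1.0) / 2.0  # cosine shifted into [0, 1]
    labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    clusters = {}
    for word, label in zip(words, labels):
        clusters.setdefault(int(label), []).append(word)
    return list(clusters.values())

# Toy 3-dimensional word vectors with three obvious groups
words = ["cat", "dog", "car", "truck", "red", "blue"]
vectors = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1],
                    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0],
                    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]])
print(initial_feature_clusters(words, vectors, k=3))
```

Each returned list is one initial feature cluster, inside which the second-stage pairing and graph traversal are then run.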

Loose Clustering Based on Graph Depth-First Traversal
The clustering in the second stage is carried out within each initial feature cluster obtained in the first stage. This phase consists of the following steps:

Step 1. Pairing. In our approach, pairing is the process of connecting any two words whose cosine similarity is higher than a threshold value (e.g., 0.35, 0.4, 0.45, 0.5, or 0.6). The similarity is calculated using each word's 300-dimension vector derived from the Word2Vec model, which is trained on a 5 GB Wikipedia Chinese corpus. For example, the words "strawberry" and "grapes" have a similarity of 0.46, as calculated from Word2Vec, so "strawberry" and "grapes" are paired when the similarity threshold is 0.45. The similarity between two feature words is estimated by calculating the cosine similarity between their vectors V and E; that is, sim(V, E) = cosine(V, E), defined as follows:

sim(V, E) = (Σᵢ vᵢeᵢ) / (√(Σᵢ vᵢ²) · √(Σᵢ eᵢ²)),

where V and E are vectors of length n. Therefore, if the cosine similarity of feature words V and E exceeds the threshold, then V and E are paired. Semantically related words are combined by direct links based on a loose clustering strategy. This strategy can intuitively select suitable words to form a similar feature cluster, rather than clustering all of the words together. Obviously, not all words are suitable for entering a similar feature cluster, because some do not have a high semantic similarity to any other word in the feature word bag. At the same time, this avoids the influence of an improper number of clusters on the clustering results. In addition, this method can cover more effective keywords, because a complex and strict clustering algorithm is not used.
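The pairing step can be sketched as follows; the toy 2-dimensional vectors stand in for the 300-dimension Word2Vec vectors, and all names and numbers are illustrative.

```python
import numpy as np

def cosine(v, e):
    """Cosine similarity between two word vectors."""
    return float(np.dot(v, e) / (np.linalg.norm(v) * np.linalg.norm(e)))

def pair_features(words, vecs, threshold=0.45):
    """Pair any two feature words whose cosine similarity exceeds
    the threshold (vectors would come from the Word2Vec model)."""
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if cosine(vecs[words[i]], vecs[words[j]]) > threshold:
                pairs.append((words[i], words[j]))
    return pairs

# Toy vectors: the fruit words point in similar directions
vecs = {"strawberry": np.array([1.0, 0.9]),
        "grapes": np.array([0.9, 1.0]),
        "car": np.array([-1.0, 0.1])}
print(pair_features(["strawberry", "grapes", "car"], vecs, 0.45))
# [('strawberry', 'grapes')]
```

The resulting pairs feed the adjacency list built in Step 2.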
Step 2. Building the adjacency list. The feature adjacency list is composed of pairs of feature words obtained at similarity thresholds of 0.4, 0.45, 0.5, 0.55, and 0.6, respectively. This structure can also be represented as an undirected linkage graph, which is the basis for determining the final clusters of similar features. Within each initial feature cluster, the feature adjacency list is composed of the paired features. Figure 2 shows the graph of feature pairs with undirected connections.
The points in the figure are feature words, the lines between two points represent the pairing relationship, and the numbers represent the cosine similarity. Since Chinese word vectors are used in this method, the elements in the graph are Chinese words, even though they are shown in English. The optimal similarity threshold is determined by the word vector model and will differ between word vector models. The similar feature clusters produced by different thresholds are further described in the following experiments.
Step 3. Clustering. Depth-first traversal is performed on the adjacency list obtained in Step 2. Each traversal iteration forms a similar feature cluster, so multiple similar feature clusters are obtained. Note that our traversal does not jump directly to a new unvisited node. During each traversal, if all reachable nodes have been visited and no new feature pairs are found, the visited nodes are clustered into one cluster and the next iteration begins. Within each initial feature cluster, similar feature clusters are thus constructed by traversing the feature adjacency list. Once these clusters are formed, they are used in text vectorization. This method of feature reduction is driven by the idea that it is not necessary to keep all semantically similar features separate. Instead, those similar features can be reduced to a new feature that is treated as a single element in the process of text vectorization. For example, the features "Monster", "Devil", and "Ghost" have high cosine similarity, so they are clustered into a similar feature cluster represented by "M-D-G". In the classification process, "Monster", "Devil", and "Ghost" are replaced with "M-D-G", so the total feature dimension is reduced by 2. Specific examples of text substitution are described in the next section.
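Step 3 amounts to grouping each connected component of the adjacency list into one similar feature cluster via depth-first traversal. A minimal sketch with the paper's "Monster"/"Devil"/"Ghost" example (the "car"/"truck" pair is our illustrative addition):

```python
def similar_feature_clusters(adjacency):
    """Depth-first traversal over the feature adjacency list: each
    connected component of paired words becomes one similar feature
    cluster; a new iteration starts when no unvisited pair remains."""
    visited, clusters = set(), []
    for start in adjacency:
        if start in visited:
            continue
        stack, cluster = [start], []
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            cluster.append(node)
            stack.extend(adjacency[node])
        clusters.append(cluster)
    return clusters

adj = {"Monster": ["Devil"], "Devil": ["Monster", "Ghost"],
       "Ghost": ["Devil"], "car": ["truck"], "truck": ["car"]}
print(similar_feature_clusters(adj))
# [['Monster', 'Devil', 'Ghost'], ['car', 'truck']]
```

Here "Monster", "Devil", and "Ghost" form the cluster that would be replaced by the single feature "M-D-G" during vectorization.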
The overall TSFC procedure can be summarized as follows:

1. for each doc ∈ corpus do:
       for each term ∈ doc do:
           terms ← pre-process term with NLP methods
       end
       word bags ← terms
   end
2. adaptive spectral clustering on the word bags:
       initial feature clusters C_i ← word bags
3. for each C_i do:
       for each pair of words ∈ C_i do:
           if cosine similarity of the two words > similarity threshold then:
               adjacency list ← paired words
       end
       similar feature clusters C_s ← depth-first traversal of the adjacency list
   end

Sub-Cluster Feature Selection
Obviously, in our method, not all feature words should be clustered into a similar feature cluster; doing so indiscriminately is a negative effect of feature clustering. In some texts, replacing a word with a whole cluster may attach richer semantics, including meanings unrelated to the subject, so in some cases the influence of ambiguous or irrelevant words is magnified. In addition, due to the loose pairing strategy, some related feature pairs may be less compact within the entire feature cluster, and some points of these feature pairs are actually noise points in the cluster. These outliers have a bad effect on the classification results. In order to improve the quality of each similar feature cluster, we designed an intra-cluster feature selection rule. Feature screening is carried out in each cluster after the similar feature sub-clusters are obtained. The principle of screening is to calculate the average similarity and the maximum similarity of each feature term with respect to the whole cluster. The average similarity is defined as follows:

Average similarity: the average cosine similarity of a feature to all features in its cluster.

The main point of our rule can then be summarized as follows: words with an average similarity less than the threshold value are deleted. Here, the threshold is equal to the similarity threshold used for feature word pairing in Step 1 of the previous section.
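The screening rule might be sketched as below, computing each word's average cosine similarity to the other members of its cluster and dropping words under the pairing threshold. The exact averaging convention (we exclude a word's similarity to itself) and the toy data are our assumptions.

```python
import numpy as np

def screen_cluster(cluster, vecs, threshold=0.45):
    """Intra-cluster feature selection: remove outlier words whose
    average cosine similarity to the other cluster members falls
    below the pairing threshold."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    kept = []
    for w in cluster:
        others = [u for u in cluster if u != w]
        avg = sum(cos(vecs[w], vecs[u]) for u in others) / len(others)
        if avg >= threshold:
            kept.append(w)
    return kept

# Toy vectors: "c" points away from "a" and "b", so it is an outlier
vecs = {"a": np.array([1.0, 0.0]),
        "b": np.array([0.9, 0.1]),
        "c": np.array([0.0, 1.0])}
print(screen_cluster(["a", "b", "c"], vecs, 0.45))  # ['a', 'b']
```

After screening, the surviving words form the final similar feature cluster used for vectorization.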

Vectorization
The purpose of the TSFC method is to obtain similar feature clusters, that is, high similarity feature sets. The next step of text classification is to vectorize the feature words. TF-IDF and word embedding are used to verify the effectiveness of our method. It is an intuitive way to demonstrate the effect of our method. TF-IDF and word embedding are the most common text vectorization methods. At the same time, they are regarded as the basis of other improved vectorization methods.
A corpus D consists of a set of texts, D = {d_1, d_2, ..., d_n}, and the word bag is T = {w_1, w_2, ..., w_m, w_r1, w_r2, ..., w_rk}, where k is the number of all elements in the similar feature clusters, m is the number of other words in the feature bag, and together they determine the dimension of the vector space.
When TF-IDF is used, the rule for feature vectorization is as follows: w_r is replaced by the corresponding similar feature cluster C_s. That is, in the process of calculating TF and IDF, if a feature word belongs to a similar feature cluster, it is replaced by the corresponding similar feature cluster. For a similar feature cluster with a capacity of 20, all 20 feature words are replaced by it, so the total feature dimension is reduced by 19.
When word embedding is used, the rule for feature vectorization is as follows: the vector of each similar feature cluster is the mean of all the feature word vectors in the cluster. In the process of text vectorization, similarly, if a feature word belongs to a similar feature cluster, the corresponding similar feature cluster replaces it. The essence of this method is to bring semantically close feature vectors closer to each other in Cartesian coordinates; that is, they are replaced by a central point (the similar feature cluster). Table 2 shows a practical example of feature substitution.
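Both substitution rules can be sketched together. The cluster token "M-D-G" follows the paper's example; the toy vectors and function names are our illustrative assumptions.

```python
import numpy as np

def replace_with_clusters(tokens, cluster_of):
    """For TF-IDF counting: replace each word that belongs to a similar
    feature cluster with its cluster token (e.g. 'M-D-G')."""
    return [cluster_of.get(t, t) for t in tokens]

def cluster_vector(cluster, vecs):
    """For word embeddings: a cluster's vector is the mean of its
    member word vectors (the cluster's central point)."""
    return np.mean([vecs[w] for w in cluster], axis=0)

cluster_of = {"Monster": "M-D-G", "Devil": "M-D-G", "Ghost": "M-D-G"}
print(replace_with_clusters(["a", "Monster", "saw", "Ghost"], cluster_of))
# ['a', 'M-D-G', 'saw', 'M-D-G']
```

The replaced token stream is then fed to the TF-IDF vectorizer, while `cluster_vector` supplies the embedding for each cluster token.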

Dataset and Data Preprocessing
Two datasets are used in our experiment, covering different categories and sizes, as shown in Table 3. The first is the China Mobile SMS dataset, which was obtained from a project and includes normal SMS and five types of spam SMS. The other is the Fudan News dataset, which contains six single-label categories, with a balanced data size for each category after our processing. These datasets were chosen for their considerable feature sizes, which make them appropriate candidates for evaluating the outcome of our approach. The SMS dataset includes five categories: front, advertising, credit card, loan, and other. The Fudan News dataset includes six types of data: sports, politics, economics, art, history, and computers.

Performance of the TSFC Algorithm
In this paper, four aspects of the TSFC algorithm are evaluated: the effect of spectral clustering, the scale of similar feature clusters, the effect of feature reduction, and the classification accuracy.

Results of Spectral Clustering
The first step of constructing similar feature clusters is to cut the feature set into several sub-clusters using a spectral clustering algorithm, and the number of clustering centers K is the only parameter to be adjusted. In Figure 3, the broken line shows the relationship between the number of clustering centers and the time to construct similar feature clusters, and it also shows the final classification F1 score of the TSFC method under different K values. If K is too large or too small, the size of the similar feature clusters will not be optimal; the best results are obtained when K is moderate. As K increases, the time to obtain similar feature clusters shortens, because the larger an initial feature cluster is, the more computation is needed to construct similar feature clusters from it. In our method and dataset, the optimal K fluctuates around 10, depending on the size of the dataset.



Size of Similar Feature Clusters
Within each sub-cluster of spectral clustering, multiple similarity thresholds are used to connect the features. The word vector model determines the cosine similarity between two features, so the optimal similarity threshold differs when different word vector models are utilized. In order to verify the universal applicability of this method, word vectors trained on Wikipedia's corpus are utilized instead of domain-specific word vectors, which demonstrates that our method has good portability. Figure 4 shows the results of similar feature clusters. Similarity thresholds of 0.6, 0.55, 0.5, 0.45, and 0.4 are used. Two important factors are plotted in the figure to reveal the effect of the clustering: the cluster size distribution and the total number of similar feature clusters. As the similarity threshold increases, the size of the similar feature clusters decreases, but this reduction is not particularly obvious in the three intervals of 0.45, 0.5, and 0.55. Clusters with a capacity exceeding 20 are the fewest, and clusters with a capacity of 2-5 account for more than half of the total. However, when the threshold is further increased, the size of the feature clusters begins to decrease. The influence of feature clusters that are too large or too small on the classification results is also limited, which will be discussed later. When the threshold value is 0.4, the features in the similar feature clusters cover about 50% of the feature word bags, which implies a very desirable feature reduction result. The essence of similar feature clusters is feature reduction and semantic extension. Determining the optimal similarity threshold depends on the actual task. Usually, the threshold is acceptable when similar feature clusters can cover 40-50% of the total feature words.
Table 4 shows the F1 scores of several classifiers, such as naive Bayes (NB), the support vector machine (SVM), and logistic regression (LR). The ratio of the training set to the test set is 4:1 in the classification experiment, and five-fold cross validation is employed. A significant trend is that, if the similarity threshold is too high or too low, the improvement in classification results is limited. This is because, when the similarity threshold is too high, the size of the similar feature clusters is very small and only a few features are affected. When the similarity threshold is too low, the paired features are not close enough within the similar feature cluster, and the cluster becomes too large, which results in inappropriate feature replacement.
When TF-IDF is utilized, the optimal similarity threshold is 0.5 when the classifier is SVM or NB, while the optimal threshold for LR is 0.45, with a maximum F1 score increase of 3.04%. However, a high similarity threshold results in an F1 score lower than the baseline. It is worth mentioning that the feature reduction behaves as we expect: the lower the similarity threshold, the stronger the feature reduction, which reaches a maximum of 27.54%. When word embedding is used for vectorization, the maximum F1 score increases by 6.7%, but when the threshold is 0.4 or 0.6, the F1 score is lower than that of the baseline.

Classification Effectiveness Evaluation
In summary, our method achieves a maximum feature reduction of 27% and a maximum F1 score improvement of 3% with TF-IDF.

Conclusions
A two-stage feature clustering (TSFC) approach based on a semi-loose strategy is proposed. This method employs a word embedding model to build the feature sample space, and an improved feature clustering strategy and sub-cluster optimization rules are used to build appropriate similar feature clusters, which replace the original feature vectors for short text classification. The evaluation experiments demonstrate that the efficiency and accuracy of TSFC are improved and that the method has good extensibility. At the same time, this method significantly reduces the feature dimension. For corpora in different fields, the cosine similarity of features can be calculated with a Word2Vec model trained on the corresponding corpus, which allows for better short text classification results. Future work may focus on improving the semi-loose clustering method, so as to improve the feature reduction effect of similar feature clustering. The feature clustering method can also be regarded as an effective complement to other fields, including feature selection, information extraction, and association rule mining.

Patents
The method in this paper is involved in a patent under review, named "An optimized short text classification method".