Open Access Article

Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming

by Hyunki Lim 1 and Dae-Won Kim 2,*
1 Image and Media Research Center, Korea Institute of Science and Technology, 5 Hwarang-Ro 14-gil, Seongbuk-Gu, Seoul 02792, Korea
2 School of Computer Science and Engineering, Chung-Ang University, 221 Heukseok-Dong, Dongjak-Gu, Seoul 06974, Korea
* Author to whom correspondence should be addressed.
Entropy 2020, 22(4), 395; https://doi.org/10.3390/e22040395
Received: 15 February 2020 / Revised: 8 March 2020 / Accepted: 26 March 2020 / Published: 30 March 2020
(This article belongs to the Special Issue Information Theoretic Feature Selection Methods for Big Data)
The rapid growth of Internet technologies has led to an enormous increase in the number of electronic documents used worldwide. To organize and manage this large volume of unstructured documents effectively and efficiently, text categorization has been employed in recent decades. For text categorization tasks, documents are usually represented using the bag-of-words model, owing to its simplicity. Under this representation, feature selection becomes essential because the full vocabulary induces an enormous feature space for the documents. In this paper, we propose a new feature selection method that considers term similarity to avoid selecting redundant terms. Term similarity is measured with a general criterion such as mutual information and serves as a second measure in feature selection alongside term ranking. To balance term ranking and term similarity during feature selection, we use a quadratic programming-based numerical optimization approach. Experimental results demonstrate that considering term similarity is effective and yields higher accuracy than conventional methods.
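The balance between term ranking and pairwise term similarity described in the abstract fits naturally into a quadratic program. The sketch below is an illustrative formulation only, not the paper's exact objective: it assumes a trade-off of the form (1 - alpha) * w^T S w - alpha * r^T w over a simplex, where r holds per-term relevance scores (e.g., information gain) and S holds term-term similarity (e.g., mutual information between terms). The function name qp_feature_weights, the alpha parameter, and the toy data are hypothetical.

import numpy as np
from scipy.optimize import minimize

def qp_feature_weights(relevance, similarity, alpha=0.5):
    # Illustrative QP-style feature weighting (assumed formulation,
    # not necessarily the authors' exact objective).
    # relevance : length-n vector of per-term ranking scores.
    # similarity: symmetric n x n matrix of term-term similarity.
    # alpha     : trade-off between relevance and redundancy.
    n = len(relevance)

    def objective(w):
        # Penalize weight placed on mutually similar (redundant) terms,
        # reward weight placed on individually relevant terms.
        return (1.0 - alpha) * w @ similarity @ w - alpha * relevance @ w

    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * n
    w0 = np.full(n, 1.0 / n)                     # uniform starting point
    res = minimize(objective, w0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x

# Toy example: terms 0 and 1 are strongly relevant but nearly redundant.
rel = np.array([0.90, 0.85, 0.60, 0.30])
sim = np.array([[1.0, 0.9, 0.1, 0.0],
                [0.9, 1.0, 0.1, 0.0],
                [0.1, 0.1, 1.0, 0.1],
                [0.0, 0.0, 0.1, 1.0]])
w = qp_feature_weights(rel, sim, alpha=0.5)
selected = np.argsort(w)[::-1][:2]               # pick the top-2 terms
print(w.round(3), selected)

Selecting by the resulting weights tends to keep one of the two near-duplicate terms while down-weighting the other, which is the redundancy-avoidance behavior the abstract describes.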
Keywords: text categorization; information gain; mutual information; chi-square statistic; quadratic programming
MDPI and ACS Style

Lim, H.; Kim, D.-W. Generalized Term Similarity for Feature Selection in Text Classification Using Quadratic Programming. Entropy 2020, 22, 395.

