Open Access Article
Appl. Sci. 2019, 9(4), 743; https://doi.org/10.3390/app9040743

The Influence of Feature Representation of Text on the Performance of Document Classification

1 Department of Informatics, University of Rijeka, Radmile Matejčić 2, Rijeka 51000, Croatia
2 Faculty of Public Administration, University of Ljubljana, Gosarjeva ulica 5, Ljubljana 1000, Slovenia
3 Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, Ljubljana 1000, Slovenia
* Author to whom correspondence should be addressed.
All authors contributed equally to this work.
Received: 11 December 2018 / Revised: 15 February 2019 / Accepted: 18 February 2019 / Published: 20 February 2019
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In this paper, we perform a comparative analysis of three models for the feature representation of text documents in the context of document classification. In particular, we consider the most commonly used family of bag-of-words models, the recently proposed continuous-space models word2vec and doc2vec, and a model based on the representation of text documents as language networks. While bag-of-words models have been used extensively for document classification, the performance of the other two models on the same task has not been well understood. This is especially true for network-based models, which have rarely been considered for representing text documents for classification. In this study, we measure the performance of document classifiers trained with the random forests method on features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. The results of the empirical comparison show that the commonly used bag-of-words model performs comparably to the emerging continuous-space model doc2vec. In particular, the low-dimensional variants of doc2vec, generating up to 75 features, are among the top-performing document representation models. Finally, the results indicate that doc2vec shows superior performance in classifying large documents.
Keywords: document classification; bag-of-words; word2vec; doc2vec; graph-of-words; complex networks
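
To make the bag-of-words representation mentioned in the abstract concrete, the following is a minimal sketch of how a document can be turned into a term-frequency feature vector. It is an illustration only and not the authors' pipeline: the whitespace tokenizer, the function names `build_vocabulary` and `bag_of_words`, and the two toy documents are all assumptions introduced here. In a setup like the one the abstract describes, vectors of this kind would serve as the input features of a classifier such as a random forest.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase token in the corpus to a column index.

    Hypothetical helper: tokens are obtained by simple whitespace splitting.
    """
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(document, vocabulary):
    """Return the term-frequency vector of one document over the vocabulary."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens outside the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


# Toy corpus (assumed for illustration only).
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
features = [bag_of_words(d, vocab) for d in docs]
```

Each document becomes a vector with one entry per vocabulary term, so the feature dimension grows with the vocabulary size. This contrasts with the low-dimensional doc2vec variants discussed in the abstract, which generate up to 75 features.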
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Martinčić-Ipšić, S.; Miličić, T.; Todorovski, L. The Influence of Feature Representation of Text on the Performance of Document Classification. Appl. Sci. 2019, 9, 743.

Appl. Sci. EISSN 2076-3417. Published by MDPI AG, Basel, Switzerland.