The Influence of Feature Representation of Text on the Performance of Document Classification

In this paper we perform a comparative analysis of three models for the feature representation of text documents in the context of document classification. In particular, we consider the most often used family of bag-of-words models, the recently proposed continuous space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-words models have been extensively used for the document classification task, the performance of the other two models for the same task has not been well understood. This is especially true for the network-based model, which has rarely been considered for the representation of text documents for classification. In this study, we measure the performance of document classifiers trained using the method of random forests on features generated by the three models and their variants. The results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to the one obtained by the emerging continuous-space model doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. The results finally point out that doc2vec shows superior performance in the tasks of classifying large documents.


Introduction
The growth of the use of electronic documents has propelled the development of solutions aiming at the automatic organization of those documents into appropriate categories. The related task of automatic classification of text documents has become an important tool for the relevant applications of news filtering and organization, information retrieval, opinion mining, spam filtering and e-mail classification (Aggarwal & Zhai, 2012). In general, document classification is the task of assigning a label from a predefined set of candidate class labels to a text document of interest. More formally, the task of single-label document classification can be defined as follows (Sebastiani, 2002): given a set of documents D and a predefined set of class labels C, learn a classifier that assigns exactly one label from C to each document in D. In this work, we compare three families of document representation models: the bag-of-words model, the continuous space models word2vec and doc2vec, and the network-based model that represents each document as a language network, with nodes corresponding to words and edges denoting the co-occurrence of words in sentences.
In this paper, we empirically analyze the influence of the document representation models on the performance of document classification. The empirical analysis is performed on seven benchmark tasks of document classification stemming from four standard data sets used in numerous studies (Craven et al., 1998; Francis & Kucera, 1979; Lang, 1995; Lewis et al., 2004). Our primary focus is on identifying the variant of the three document representation models, introduced above, which leads to the best classification performance. Hence, in all the experiments, we use a strong and versatile classification model, the random forest (Breiman, 2001), and a single dimension-reduction method, principal component analysis (Jolliffe, 2014), where necessary. We analyze the document classification performance from different perspectives corresponding to the standard measures of classification accuracy, recall, precision, F1-score and the area under the receiver operating characteristic curve.
The paper represents an important contribution to the existing work on the comparative analysis of document classification performance. While the performance of different variants of the bag-of-words model is well studied (Aggarwal & Zhai, 2012; Forman, 2003; Sebastiani, 2002; Yang, 1999), a systematic comparative study of the performance of the other two document representation models is missing. Namely, the existing comparative studies focus on identifying the best performing classification model and/or subset of the bag-of-words features. Moreover, while recent studies of document representation models widely consider the continuous space models word2vec and doc2vec, network-based models have not been considered in the machine learning literature and have been applied in the context of document classification only sporadically. Therefore, this paper provides the first systematic comparative analysis that includes the widely used bag-of-words models, the emerging vector space models, and the hitherto neglected network-based models. To sum up, this comparative study contributes a novel and relevant guide for deciding upon the appropriate document representation model for a given document classification task.
The rest of the paper is organized as follows. In Section 2, we introduce the three document representation models and their variants as well as provide an overview of related studies for each of them. Section 3 introduces the setup used to conduct the empirical comparison of the performance of the document representation models for document classification. Section 4 presents and discusses the experimental results. Finally, Section 5 summarizes the contributions of the paper and outlines the directions for further research.

Document Representation Models
We can cluster the document representation models into two large groups. The models in the first group lead to features defined at the level of words, which must subsequently be aggregated into features at the level of documents, while the models in the second group produce document-level features directly. In the continuation of this section, we provide a detailed introduction of the three document representation models compared in this study.

Bag-of-Words Model
The bag-of-words (BOW) model represents each document as an unordered set (bag) of features that correspond to the terms in the vocabulary of a given document collection. The vocabulary can include words, sequences of words (token n-grams) or sequences of letters of length n (character n-grams) (Manning et al., 2008; Papadakis et al., 2016; Zhang et al., 2015). Each vocabulary term is represented by one numerical value in the feature vector of a document; the feature value can be calculated in different ways. The simplest is to measure the frequency of a term (tf) in a given document. A commonly used measure is also term frequency-inverse document frequency (tf-idf), where the term frequency (tf) is multiplied by the reciprocal frequency of the term in the entire document collection (idf). In this way, tf-idf reduces the importance of terms that appear in many documents and increases the importance of rare terms.
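To make the two weighting schemes concrete, the following minimal sketch computes tf and tf-idf weights for a toy corpus using the plain textbook formulas (actual implementations, such as the scikit-learn one used later in this paper, apply additional smoothing and normalization):

```python
import math

# Toy corpus: each document is a list of tokens.
docs = [
    ["network", "models", "of", "text"],
    ["bag", "of", "words", "models"],
    ["doc2vec", "models"],
]

def tf(term, doc):
    # Raw term frequency: the number of occurrences of `term` in `doc`.
    return doc.count(term)

def tf_idf(term, doc, docs):
    # tf-idf: term frequency weighted by the log of the reciprocal
    # fraction of documents containing the term.
    df = sum(1 for d in docs if term in d)
    return tf(term, doc) * math.log(len(docs) / df)

# "models" appears in every document, so its tf-idf weight drops to 0,
# while the rarer "network" receives a higher weight.
print(tf_idf("models", docs[0], docs))   # 0.0
print(tf_idf("network", docs[0], docs))  # log(3) ~ 1.0986
```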
The major characteristic of the BOW model is the high dimensionality of the feature space: the size of the vocabulary can be tens or hundreds of thousands of terms for an average-sized document collection. Usually, to reduce the vocabulary size, the documents are first preprocessed by removing noninformative terms (stop words). Furthermore, document frequency thresholding (Yang & Pedersen, 1997) removes terms with document frequency below some predetermined threshold. Finally, the standard methods for feature selection or dimensionality reduction, such as principal component analysis (Jolliffe, 2014), are applied.
Traditionally, the BOW model has served as the state-of-the-art document representation model in many natural language processing applications. Its success emerges from its implementation simplicity and the fact that it often leads to highly accurate document representation. Still, it is well known that BOW is characterized by many drawbacks, such as high dimensionality, sparsity, and the inability to capture semantics or any dependencies between words, such as simple word order. Therefore, new representation models in the form of distributed word embeddings (word2vec and doc2vec) and graph-of-words (GOW) have been proposed and tested to address the open issues in document classification.
Recent studies of the continuous space models for document classification have primarily focused on the exploration of one isolated aspect of these models. For instance, reported results show that the combination of the bag-of-words and the continuous space models leads to only a marginal performance improvement over the alternatives in the sentiment analysis task.
In this study, we use the variants of the word2vec and doc2vec models that correspond to alternative sizes of the feature vectors extracted by the continuous space transformation. We consider each variant as a separate document representation model and conduct a systematic comparison thereof with the bag-of-words and network-based models.
Note that the GOW model does not come with a standardized language network representation and set of features (network properties). The diversity of the network-based models is related to the variety of networks, ranging from directed and undirected through unweighted and weighted to bipartite graphs.
Moreover, it seems that there is no unique strategy for utilizing micro, mezzo and macro level structural properties, which contributes even more to the diversification of network-based models. In this study, we employ a variety of network properties at all three levels simultaneously. For the properties at the micro (node) level, we consider different methods for their aggregation into document features.

Experimental Setup
In this section, we present the details of the setup of the empirical comparison of the different document representation models for document classification. First, we describe the data sets used in the experiments and the data preprocessing steps. Furthermore, we elaborate upon the implementations used and the particular parameter values of the document representation and classification models.
Finally, we introduce the performance metrics and the methods used for the evaluation and ranking of the models. Table 1 provides an overview of the properties of the four data sets used in the experiments. They represent a standard set of benchmarks for various natural language processing and text mining tasks and have been used in numerous other studies (Hassan et al., 2007; Malliaros & Skianis, 2015; Nguyen et al., 2016; Papadakis et al., 2016; Ren & Sohrab, 2013; Rossi et al., 2016; Rousseau et al., 2015; Uysal, 2016; Yogatama & Smith, 2014; Yoshikawa et al., 2014).

Data and Preprocessing
The Brown corpus consists of 500 documents of over 2,000 tokens each, written in a wide range of styles and varieties of prose (Francis & Kucera, 1979). There are 15 document classes structured in a taxonomy consisting of four levels with 2, 4, 10, and 15 class labels, respectively. Therefore, in the experiments we consider four different document classification tasks related to the Brown corpus, referred to as Brownn, where n represents the number of class labels (2, 4, 10 or 15). We use the version of the Brown corpus included in the Python Natural Language Toolkit (Bird et al., 2009).

The 20News corpus (Lang, 1995) consists of approximately twenty thousand newsgroup posts on twenty topics. In the experiments, we consider each topic to represent a document class. The corpus was taken from the Python scikit-learn library for machine learning (Buitinck et al., 2013).
Reuters8 is a subset of the Reuters-21578 collection of news articles that includes the articles from the eight most frequent classes (acq, crude, earn, grain, interest, money-fx, ship, trade) (Lewis et al., 2004).

WebKD (Craven et al., 1998) is a corpus of Web pages collected from the computer science departments of four universities in January 1997. The class labels are faculty, staff, department, course, project, student and other. The Web pages are included in the corpus as HTML documents, so we have employed the Python library Beautiful Soup to extract the text from the HTML pages.
The first step in natural language processing, also necessary when performing document classification, is the preprocessing of the text in documents. The preprocessing typically includes document tokenization, the removal of stop words and normalization. During tokenization, the document is broken down into lexical tokens; in our case, we use words. Removing stop words is the process of removing the most common, short function words, which do not carry strong semantic properties but are needed for the syntax of language (for example, pronouns, prepositions, conjunctions, abbreviations and interjections). We use the list of English stop words from the Python Natural Language Toolkit (NLTK). In the last phase of document normalization, we perform the reduction of different inflectional word forms to a single base word form. More specifically, we use stemming, a simple heuristic process of shortening the different word forms to a common root referred to as a stem. To this end, we employ the implementation of the Porter stemming heuristics (Porter, 1980) from NLTK.

Table 2: Dimensionality of the feature spaces for the variants of the three document representation models (bag-of-words, continuous space and network-based) for the four benchmark data sets.
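A minimal sketch of this preprocessing pipeline with NLTK might look as follows (the exact token filtering rules, such as discarding non-alphabetic tokens, are an assumption, since the paper does not specify them):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(document):
    """Tokenize, remove English stop words and stem the remaining tokens."""
    tokens = word_tokenize(document.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]

print(preprocess("The networks are constructed from the preprocessed documents."))
# ['network', 'construct', 'preprocess', 'document']
```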

Bag of words
Bag-of-words features are calculated with the scikit-learn library in Python (Buitinck et al., 2013) using the TfidfVectorizer function. For a bag-of-words representation of a given document d, we use two weighting schemes, tf and tf-idf (Manning et al., 2008):

• Term Frequency (tf): the weight of the term t in d equals tf(t, d), the number of occurrences of t in d.

• Term Frequency, Inverse Document Frequency (tf-idf): the weight of the term t in d equals tf(t, d) × log(N / df(t)), where df(t) denotes the number of documents in the data set that contain t, and N denotes the total number of documents in the data set.
For calculating the features, we also use document frequency thresholding (Ren & Sohrab, 2013; Yang & Pedersen, 1997) to remove terms with a document frequency less than 5. To further reduce the dimensionality of the feature space, we apply principal component analysis (PCA) to the obtained feature vectors (Jolliffe, 2014), as implemented in the scikit-learn library. We selected the first p principal components that explain at least 80% of the total data variance (parameter value n_components=0.80) as features for document classification.
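Under the settings just described, the bag-of-words feature extraction might be sketched as follows, where train_texts stands for a hypothetical list of preprocessed document strings (note that PCA in scikit-learn requires a dense matrix, so the sparse tf-idf matrix is densified first):

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Document frequency thresholding: keep only terms appearing in >= 5 documents.
vectorizer = TfidfVectorizer(min_df=5)
X = vectorizer.fit_transform(train_texts)

# Keep the smallest number of principal components that together
# explain at least 80% of the total data variance.
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X.toarray())
```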

Continuous space
For the continuous space document representation models, we use the word2vec and doc2vec methods as implemented in the gensim library (Řehůřek & Sojka, 2010). The word2vec implementation takes as input a list of documents, each of them represented as a list of words, to train a neural network model, which can then be used to calculate a vector representation for each word. We used the following parameter settings. The parameter min_count sets a lower bound on the word frequency; since we preprocessed the data set, we set this threshold to 1. The parameter size denotes the dimensionality of the feature vectors; we use four values of 25, 50, 75 and 100. Hence, we have four variants of the word2vec model: word2vec25, word2vec50, word2vec75 and word2vec100.
To get the representation of a whole document, we calculate the average of the feature vectors of the words occurring in the document (Jiang et al., 2016). For the other parameters, we retain the default settings.
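A sketch of this procedure with gensim is given below; note that recent gensim versions (4.x) name the dimensionality parameter vector_size, whereas the size parameter mentioned above belongs to the older API. Here, tokenized_docs stands for a hypothetical list of token lists, one per document:

```python
import numpy as np
from gensim.models import Word2Vec

# Train word vectors on the preprocessed corpus.
model = Word2Vec(sentences=tokenized_docs, vector_size=100, min_count=1)

def document_vector(tokens, model):
    """Average the vectors of the document's words found in the vocabulary."""
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0)

doc_features = np.vstack([document_vector(d, model) for d in tokenized_docs])
```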
The doc2vec implementation takes as input a list of documents, their unique identifiers and a list of words in each document. The trained neural network can then be used to calculate a vector representation of a given document. Since the doc2vec implementation extends the word2vec class, we used the same settings for the shared parameters. In addition, we set the number of iterations over the training documents to 20, where in each iteration a random sequence of training documents is fed into the neural network. Again, we vary the dimensionality of the resulting document vectors from 25 to 1,000, leading to seven variants of the doc2vec model: doc2vec25, doc2vec50, doc2vec75, doc2vec100, doc2vec200, doc2vec500 and doc2vec1000.
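The corresponding doc2vec sketch, under the same assumptions as above (gensim 4.x API, hypothetical tokenized_docs), might be:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is paired with a unique integer identifier (tag).
tagged = [TaggedDocument(words=tokens, tags=[i])
          for i, tokens in enumerate(tokenized_docs)]

# 20 passes over the training documents; vector_size is varied
# between 25 and 1,000 in the experiments.
model = Doc2Vec(tagged, vector_size=100, min_count=1, epochs=20)

# Vectors of the training documents, and an inferred vector for a new one.
train_vectors = [model.dv[i] for i in range(len(tagged))]
new_vector = model.infer_vector(["some", "unseen", "tokens"])
```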

Network based
We construct language networks with nodes representing words and links connecting adjacent words within the same sentence. The links are directed and weighted: the weight of a link between two nodes represents the overall co-occurrence frequency of the corresponding words, while the direction represents the ordering of linguistic units in a co-occurrence pair (Martinčić-Ipšić et al., 2016a). Although language networks are very often constructed from raw (not preprocessed) text, here we apply the network construction methods after tokenization, the removal of stop words and stemming. Network construction and analysis is implemented using NetworkX, a Python software package developed for the creation, manipulation and study of the structure, dynamics and functions of complex networks (Schult & Swart, 2008).
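A minimal sketch of this construction with NetworkX is given below, where doc_sentences stands for a hypothetical list of tokenized (preprocessed) sentences of a single document:

```python
from collections import Counter
import networkx as nx

def build_cooccurrence_network(sentences):
    """Directed, weighted co-occurrence network: one node per word and a
    link w1 -> w2 for adjacent words within a sentence, weighted by the
    overall co-occurrence frequency."""
    weights = Counter()
    for sentence in sentences:
        for w1, w2 in zip(sentence, sentence[1:]):
            weights[(w1, w2)] += 1
    network = nx.DiGraph()
    network.add_weighted_edges_from(
        (w1, w2, count) for (w1, w2), count in weights.items())
    return network

network = build_cooccurrence_network(doc_sentences)
```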

Learning and Evaluating Classification Models
Once we have documents represented with features, we can use an arbitrary machine learning method for the supervised learning of the document classification task. In the experiments performed here, we use the random forest (Breiman, 2001), a strong and robust classification model that is also versatile: it is reported to work well in a variety of contexts, domains and data sets (Bosch et al., 2007; Dubath et al., 2011; Ellis et al., 2014; Onan et al., 2016).
To obtain an unbiased, out-of-sample estimate of the classification performance, we use a single split of each data set into training and test data using createDataPartition from the caret package in R (Kuhn, 2012). Two of the experimental data sets (20News, Reuters8) already cluster their documents into training and test sets, while for the other two, we take a random, stratified 80% sample of documents without repetition as a training set and the remaining 20% of documents as a test set, as presented in Table 3. Note that the samples are stratified with respect to the distribution of the document class labels. Another reason for selecting the random forest classifier is its robustness to different parameter settings. Following other applications of random forests, we only tune the value of the parameter mtry, that is, the number of feature candidates considered for selecting a tree split in each iteration of the tree building procedure (James et al., 2014). The value of the mtry parameter is tuned on the training set only, using the tuneRF function from the R package randomForest (Liaw & Wiener, 2002), which also provides the implementation of the random forest classifier used in the experiments.
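The experiments themselves are run in R; purely as an illustrative analogue in the language of the earlier examples, the following scikit-learn sketch mirrors the stratified split and the tuning of the number of feature candidates, where max_features plays the role of mtry (the grid of candidate values and the number of trees are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stratified 80/20 split, mirroring caret::createDataPartition;
# X_reduced and y are the features and labels from the earlier sketches.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, stratify=y, random_state=42)

# max_features corresponds to mtry in the R randomForest package;
# it is tuned on the training set only.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=500),
    param_grid={"max_features": ["sqrt", "log2", 0.1, 0.3]},
    cv=5)
search.fit(X_train, y_train)
y_pred = search.best_estimator_.predict(X_test)
```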
The most commonly used measure of classification performance is accuracy, so we are going to use it for the evaluation of the models. Note, however, that in document classification we often encounter tasks where the distribution of class labels is highly imbalanced, so accuracy alone does not provide sufficient insight into classification performance. To this end, we also employ the commonly used area under the receiver operating characteristic curve (AUROC). In addition, we use three per-class measures of recall_i = TP_i / (TP_i + FN_i), precision_i = TP_i / (TP_i + FP_i), and F1-score_i = 2 × precision_i × recall_i / (precision_i + recall_i), where i denotes the class label, while TP_i, FP_i and FN_i denote the number of true positives, false positives and false negatives for the class label i, respectively.
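Continuing the sketch above, the per-class measures and a per-class (one-vs-rest) AUROC for a multi-class task could be obtained as follows:

```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score
from sklearn.preprocessing import label_binarize

# average=None returns one precision/recall/F1 value per class label.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average=None, labels=search.best_estimator_.classes_)

# One-vs-rest AUROC per class, from the predicted class probabilities.
y_bin = label_binarize(y_test, classes=search.best_estimator_.classes_)
y_proba = search.best_estimator_.predict_proba(X_test)
auroc = [roc_auc_score(y_bin[:, i], y_proba[:, i])
         for i in range(y_bin.shape[1])]
```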

Ranking Classification Models
Ranking classification models with regard to a single overall performance measure, such as accuracy, is trivial: the larger the value of the performance metric, the better the classification model. In contrast, when we rank models with regard to a single per-class measure (recall, precision, F1-score or AUROC), we have to compare their performance along multiple dimensions, that is, for each class separately. In that case, the ranking for one class label can differ from the rankings for the other class labels. Thus, the issue of the overall ranking with regard to a single measure (for example, recall) becomes non-trivial.
To obtain a ranking along multiple dimensions, we employ a method from multi-objective decision theory (Srinivas & Deb, 1994). First, we embed the classification models in the multidimensional space, where each dimension corresponds to a per-class performance measure for a single class: each classification model represents a single point in that space. To identify top-ranked models, we search for a set of non-dominated points in the space. These points correspond to models that are best performers according to at least one per-class dimension (in other words, we identify the Pareto front of non-dominated points). After we assign the top ranks to these models, we remove the corresponding points from the multidimensional space and recursively continue with ranking until all the models are ranked.
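A compact sketch of this recursive non-dominated (Pareto) ranking, assuming that higher values are better in every dimension, is given below:

```python
import numpy as np

def dominates(a, b):
    # a dominates b if it is at least as good in every dimension
    # and strictly better in at least one.
    return np.all(a >= b) and np.any(a > b)

def non_dominated_ranks(points):
    """Recursively peel off Pareto fronts: rank 1 holds the non-dominated
    points, rank 2 the front remaining after their removal, and so on."""
    points = np.asarray(points, dtype=float)
    remaining = list(range(len(points)))
    ranks, rank = {}, 1
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining = [i for i in remaining if i not in front]
        rank += 1
    return ranks

# Per-class recall of three hypothetical models on a three-class task:
# the first two models are mutually non-dominated and share rank 1.
print(non_dominated_ranks([[0.9, 0.7, 0.8], [0.8, 0.9, 0.7], [0.7, 0.6, 0.6]]))
# {0: 1, 1: 1, 2: 2}
```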

Experimental Results
This section provides an overview of the experimental results, presented in the form of rankings of the classification models (and the corresponding document representation models) according to the non-dominated sorting procedure presented above.
Besides the rankings on each task separately, we also present the average rankings achieved over all seven document classification tasks. We rank the 16 classification models corresponding to the 16 variants of the three document representation models: two bag-of-words variants (tf and tf-idf), four word2vec variants (with the dimensionality of the feature vectors increasing from 25 to 100), seven doc2vec variants (with the dimensionality of the feature vectors increasing from 25 to 1,000) and three graph-of-words variants (averages, quartiles and histograms).
The rankings according to accuracy (Table 4), AUROC (Table 5) and F1-score (Table 6) are reported below. In addition to the analysis of the overall rankings, we also analyze the rankings on groups of data sets (or tasks), clustered according to additional criteria: the average document size, the vocabulary size and the number of class labels. The average document size is considered long in Brown, medium in 20News and WebKD, and short in the Reuters8 data set. According to the vocabulary size, we can group the data sets into those with a smaller vocabulary (Brown and Reuters8, ca. 30K terms) and those with a larger vocabulary (20News and WebKD, more than 100K terms). We also consider three groups of tasks with small (2-4 for Brown2 and Brown4), moderate (7-10 for Brown10, WebKD and Reuters8) and large (15-20 for Brown15 and 20News) numbers of class labels.

Table 5 summarizes the rankings of the document representation models according to the AUROC performance measure.

Table 6: Rankings of the document representation models by document classification task according to F1-score. The last column reports the average rankings over all tasks. Top-ranked models in each column are in bold.

F1-Score Rankings
The precision and recall rankings, reported in Tables 7 and 8, mostly confirm the regularities observed here, but provide some further insight: the word2vec model leads to high-precision classification models.

Discussion
Taken together, the presented results identify two top-performing document representation models: the traditionally and commonly used bag-of-words model and the emerging doc2vec model. The finding is consistent regardless of the performance evaluation metric. The standard variant of the bag-of-words model has often been used in text mining studies (Sebastiani, 2002; Yang, 1999), while document representations based on averaged word2vec vectors have been explored in (Jiang et al., 2016). On the other hand, (an extension of) doc2vec has already been shown to perform well on the document classification task (Jawahar et al., 2016).
The network-based model is systematically underperforming regardless of the task or the evaluation metric. Still, some additional remarks should be noted.
The numbers of features used in the three network-based model variants with averaging, quartiles and histograms are 19, 68 and 128, respectively, which is lower than in the other models (see Table 2). Finally, when it comes to selecting between word2vec and doc2vec, the latter seems to be a consistently better choice, except when observing accuracy on data sets with larger vocabularies and on tasks with a moderate number of class labels.

Conclusion and Further Work
In this study we conduct a comparative analysis of document representation models for the document classification task. In particular, we consider the most often used family of bag-of-words models, the continuous space models word2vec and doc2vec, and the model based on the representation of text documents as language networks.

Network Measures

The average shortest path is defined as

L = (1 / (N(N − 1))) Σ_{i≠j} d_ij,

where d_ij is the shortest path between nodes i and j, and N is the number of nodes.
An efficiency measure was first defined by Latora & Marchiori (2001, 2003), who introduced it as a property that quantifies how efficiently information is exchanged over the network. The global efficiency is defined as

E = (1 / (N(N − 1))) Σ_{i≠j} 1 / d_ij.

Local efficiency is defined as the average efficiency of the local subgraphs:

E_loc = (1 / N) Σ_i E(G_i),

where G_i is the subgraph of the neighbors of node i.
Next, we calculate measures at the local (node) level. The in-degree k_i^in and the out-degree k_i^out of node i are defined as the numbers of incoming and outgoing links of node i, respectively. The transitivity of a network is defined as

T = 3 × (number of triangles) / (number of triads),

where triads are two edges with a shared node.
The clustering coefficient is a measure which quantifies the presence of loops of order three and is defined as

c_i = 2 e_i / (k_i (k_i − 1)),

where e_i represents the number of pairs of neighbours of i that are connected, and k_i is the degree of node i.
Betweenness centrality (c_B) and closeness centrality (c_C) (Brandes, 2001) are defined as

c_B(v) = Σ_{s≠v≠t} σ_st(v) / σ_st and c_C(v) = (N − 1) / Σ_{t∈V} d_G(v, t),

where σ_st = σ_ts denotes the number of shortest paths from s ∈ V to t ∈ V, σ_st(v) denotes the number of shortest paths from s to t that some v ∈ V lies on, and d_G(s, t) is the distance between nodes s and t.
PageRank (Page et al., 1999) of a node is based on the eigenvector centrality measure and implements the concept of 'voting'. The PageRank score of a node v is initialized to a default value and computed iteratively until convergence using the following equation:

PR(v) = (1 − d) / N + d Σ_{u ∈ B_v} PR(u) / k_u^out,

where B_v is the set of nodes that link to v, k_u^out is the out-degree of node u, and d is the damping factor set between 0 and 1 (usually 0.85).
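The measures defined above are all available in NetworkX; the following sketch computes them on the co-occurrence network built earlier (some measures are implemented for undirected graphs only, so the network is symmetrized where needed, and average_shortest_path_length additionally assumes a connected graph):

```python
import networkx as nx

# `network` is the directed, weighted co-occurrence network built earlier.
undirected = network.to_undirected()

# Global (macro) level properties.
avg_path = nx.average_shortest_path_length(network)
efficiency = nx.global_efficiency(undirected)
local_eff = nx.local_efficiency(undirected)
transitivity = nx.transitivity(undirected)

# Node (micro) level properties, later aggregated into document
# features via averages, quartiles or histograms.
in_degree = dict(network.in_degree())
out_degree = dict(network.out_degree())
clustering = nx.clustering(undirected)
betweenness = nx.betweenness_centrality(network)
closeness = nx.closeness_centrality(network)
pagerank = nx.pagerank(network, alpha=0.85)  # damping factor d = 0.85
```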

Precision Rankings
Table 7 summarizes the rankings of the document representation models according to precision.