Article

Automatic Product Classification Using Supervised Machine Learning Algorithms in Price Statistics

Department of Applied Economics and Quantitative Analysis, University of Bucharest, 030018 Bucharest, Romania
Mathematics 2023, 11(7), 1588; https://doi.org/10.3390/math11071588
Submission received: 16 January 2023 / Revised: 1 March 2023 / Accepted: 22 March 2023 / Published: 24 March 2023

Abstract

Modern approaches to computing consumer price indices include the use of various data sources, such as web-scraped data or scanner data, which are very large in volume and require special processing techniques. In this paper, we address one of the main problems in consumer price index calculation, namely product classification, which cannot be performed manually when using large data sources. We therefore conducted an experiment on automatic product classification according to an international classification scheme. We combined 9 different word-embedding techniques with 13 classification methods with the aim of identifying the best combination in terms of the quality of the resulting classification. Because the dataset used in this experiment was significantly imbalanced, we compared these methods not only using the accuracy, F1-score, and AUC, but also using a weighted F1-score that better reflected the overall classification quality. Our experiment showed that logistic regression, support vector machines, and random forests, combined with the FastText skip-gram embedding technique, provided the best classification results, with performance metric values superior to those of other similar studies. An execution time analysis showed that, among the three mentioned methods, logistic regression was the fastest while random forest recorded the longest execution time. We also provide per-class performance metrics and an error analysis that enabled us to identify methods that could be excluded from the range of choices because they provided less reliable classifications for our purposes.

1. Introduction

With the advent of the technological revolution, big data has been targeted as having immense potential for obtaining more timely and relevant statistics at a lower cost. One of the areas where big data has been adopted is the computation of consumer price indices (CPIs). Several authors ([1,2,3,4,5,6,7]) have reported the potential to integrate new data sources, such as web-scraped data and scanner data, into the computation of CPIs in order to augment the traditional data used for calculation. The main advantages of using such data sources for CPI calculations consist of increasing their timeliness and relevance while reducing the costs of data collection, objectives that are in agreement with the requirements for the modernization of official statistics.
Following this trend, in [8], we described a set of tools we developed to collect data from major national e-commerce sites, and since their development, we have collected around 50,000 records on a weekly basis, thus building a very large dataset.
Since CPIs are computed as a weighted average of prices for a basket of goods and services that are representative of aggregated consumer spending, the first step after data collection is to group the products according to the classes of goods and services that make up the basket. In a classical approach, when the number of products from each group (or category) is limited, products are labeled manually by human experts, but when using big data sources, manually labeling these records is impossible due to the high volume of data; therefore, an automatic classification process should be used. In this paper, we describe a process of automatic product classification in a multi-class setting, using a series of machine-learning techniques, in order to transform the price data for CPI computation. The information collected from e-commerce sites included the name and a textual description of each product, together with the corresponding price. These records were grouped according to the product classes that formed the basket, and we first transformed the product names into numeric vectors and then applied an automatic classification method to classify the vectorized names. As a result, we could select both the vectorization method that produced the best separation between product classes and the classification algorithm with the best performance. Once the products were grouped, the computation of the CPI could be conducted for each product group separately and then aggregated, based on the weights of each group.
We streamlined the product classification process by reducing it to a text document classification problem, an active topic in current research. Features such as the text length and the purpose of the text can significantly influence the performance of a classifier. For example, classifying a set of newspaper articles can be very different from classifying a set of products by their descriptions. Therefore, there is no standard method for performing such a classification, and the decisions are made on a per-case basis.
While automatic classification using machine learning methods has been reported by several authors in different fields, ranging from sentiment analysis (see, for example, [9] and the references therein) to scientific literature classification [10], medicine [11], object recognition [12], and the computation of diverse price indices [13], in this paper, we describe our approach for classifying a set of products by their names according to an international classification used for CPI computation. We began by collecting a relatively small sample of records (2853 products) from the price database that we built and manually labeled each product with its corresponding class, using the European Classification of Individual Consumption according to Purpose (ECOICOP) international product classification [14], with five-digit classes. A description of the datasets is presented in Section 3. Machine learning methods handle numerical data, but our product names were text data; therefore, before using any machine learning methods to classify the products, we needed to transform the text data into numerical data. For this task, we used a series of techniques called word embedding. Therefore, we built several numerical vector representations for each product name in our dataset, one for each embedding technique. The concept that we followed was to be able to choose not only the best machine-learning-classification method but also the word-embedding technique that was best suited for our needs (i.e., produced the best differentiation between product classes). In an exploratory data analysis, we built 2D visualizations for each set of numerical vectors corresponding to the product names, to determine which embedding technique produced the best separation between the classes. Then, we proceeded to apply a set of machine-learning-classification methods and computed the performance metrics for each. All the embedding techniques and classification methods used in this experiment are presented in Section 4; in Section 5, we present the results; and in Section 6, we compare our results with those of other similar studies. The performance of a classification model greatly depends on the values of its parameters. Since this was a pilot study, we wanted to ensure the operational time remained within acceptable limits; therefore, we used a grid search to select the optimum values of the hyper-parameters for only a few selected methods, and the operational time was approximately 24 h. This paper also contains a section dedicated to an error analysis and ends with the final conclusions and directions for future work.

2. Related Work

Following recent technological developments, official statistics bureaus, which are typically in charge of the CPI computation in every country, have adopted machine learning methods for product classification. While these methods are still in the experimental phase, there have been some notable results in this area. Roberson reported in [15,16] the results of a study regarding product classification using the North American Product Classification System based on a description of each product, showing that the automatic classification procedure achieved an accuracy of over 90%. Martindale et al. described in [17] the process of using web-scraped data records regarding clothes for CPI computations, using the COICOP5 classification. The authors started by manually labeling a small subset of products to build a training dataset, then enlarged this dataset using fuzzy matching techniques based on the Levenshtein distance, partial ratio, and the Jaccard distance. They also used semi-supervised label propagation and label spreading techniques to label the products. Having a large labeled set of products, three machine learning methods were used to build an automatic classifier, namely support vector machines with a non-linear kernel, decision trees, and random forest. The results showed good performance, with a precision between 0.86 and 0.90 and an F1-score between 0.80 and 0.87, depending on the classification method and the word-embedding technique used. The authors concluded their work with a discussion on the performance metrics of the classifiers, stressing the impact of incorrectly excluded and included products on the price index.
Another study on product classification for price statistics was described in [18]. Here, the author showed how different datasets from several sources were combined in a training dataset that could later be used for classification models. Only two classification algorithms were used, random forest and logistic regression, building the word embedding with count vectorization and term-frequency–inverse-document-frequency methods. On the test set, the best precision was obtained with random forest (0.87), and this method also had a better F1-score than logistic regression (0.86 versus 0.81).
Myklatun [19] presented the results of another study developed at Statistics Norway, where data for food and non-alcoholic beverages were automatically classified using a regularized logistic regression model, a naïve Bayes classifier, and a support-vector-machine model. The best accuracy of the classification was obtained with support vector machines at 90.2%, followed by regularized logistic regression at 89.3% and naïve Bayes at 87%. The author reported that using the automatic classification significantly reduced the time consumed by the CPI calculation.
Automatic product classification has also been used in commercial applications, as presented in [20], where the authors described a process of using the naïve Bayes method to classify two sets of products presented on a commercial website. The authors described their vectorization method that used the bag-of-words technique and analyzed how different pre-processing techniques, such as stemming, stop-word removal, number removal, etc., influenced the accuracy of the predictions. The authors reported an accuracy of 79.6% for naïve Bayes on one of the two datasets involved in their study. They also experimented with kNN and a tree classifier that provided an accuracy of 69.5% and 86%, respectively, but they argued that the trade-off between the accuracy and the operational time indicated naïve Bayes was a better method for their purposes.
Other works [21,22,23] also discussed the automatic classification of text data, and several authors emphasized that when using such methods, the costs of data processing and the time required for this task were reduced.
However, the current studies addressing the problem of product classification for CPI computations have limitations:
  • Most considered only simple embedding techniques, such as count vectorization or term frequency–inverse document frequency, which have a significant drawback: they cannot handle words not in a standard dictionary, so when the classifier is presented with a new product not used in the training set, the embedding process has to be repeated. The only study that extended the vectorization techniques to a method capable of handling words not in a standard dictionary was [17].
  • The number of machine-learning-classification methods used in these studies was rather limited. Most existing studies were limited to only two or three classification methods with logistic regression, random forest, and support vector machines being widely used, though a few authors reported that they used methods such as naïve Bayes or kNN.
  • The metrics used to compare the classification performances of different methods were limited to the classical accuracy and F1-scores, even when the datasets were imbalanced, which requires special attention to the classification results.
  • An error analysis was also absent in most of the existing studies.
Based on these limitations, we attempted to extend and improve these previous studies by:
  • Using a wide range of the existing embedding techniques: Count vectorization; term frequency–inverse document frequency; Word2Vec (both CBOW and skip-gram) with two variants for computing the vectorization of a product name; FastText with both CBOW and skip-gram variants; and GloVe, a method that was not tested at all in the previous studies.
  • Using a wide range of classification methods: We used a total of 13 methods, including 7 variants of decision-tree-based methods, neural networks, support vector machines with different kernels, multinomial naïve Bayes, multinomial logistic regression, and kNN.
  • Comparing the performances of the classifiers: We considered not only the accuracy, the F1-score, and the AUC but also a weighted F1-score that better reflected the classification quality in the presence of a highly imbalanced dataset.
  • Providing a per-case analysis of errors: This enables statisticians to make a more informed decision on the methods to use and exclude.
  • Providing an operational time analysis for the methods with the best performances: This allows statisticians to select the most efficient methods.
  • Providing an analysis of the classification performances: We also included the number of features generated by the embedding process.
We were aware that other embedding techniques (such as BERT [24]) and classification methods (such as LSTM [25], PIQN [26], and W2NER [27]) existed that would not be covered in our study. Some we had already tested, but more experimentation was necessary to obtain better results, while others were published only a short time before submitting this paper and could not, therefore, be considered in the present study. However, as far as we know, this was the most comprehensive study in the area of product classification for CPI computation, to date.

3. Data

The datasets were collected using a web-scraping technique of the main national e-commerce sites, and each record contained a product code provided by the retailer, the product name (which included a short description of the product), the price per unit, timestamps, and the retailer ID. We processed the data collection scripts on a weekly basis, with approximately 50,000 records collected each week. We used only the product names in our study in order to classify the products while ignoring the rest of the attributes. The samples used in our experiment included 2853 products from 15 classes corresponding to food and home appliance categories, and we manually labeled each product with its corresponding class.
The dataset was divided into a training (70%) and a testing set (30%). The records were randomly selected from the entire database, and they generally followed the same distribution of products as the initial dataset. In Table 1 and Figure 1, we present the distribution of the total number of products among the 15 selected classes, as well as the training and testing subsets.
Our dataset showed an important imbalance among the classes: 3 classes (05.3.1.1, 05.3.1.2, 05.3.1.3) contained 82.6% of the total number of products, while the remaining 12 classes contained only 17.4%. This imbalance between class sizes had consequences for the performance metrics of the classification methods. Although accuracy is generally accepted as a good performance indicator, in this case a high accuracy did not necessarily mean that the resulting classification was satisfactory, since a high accuracy could be obtained by correctly classifying only items from the larger classes. Therefore, in addition to the accuracy and the F1-score, we also used a weighted F1-score to report the performances of our classifiers. All the details of the metrics used are provided in the next section.
Table 1. The distribution of the selected products among the 15 ECOICOP classes.
ECOICOP Class Code | ECOICOP Class Name | Total No. of Products | No. of Products in the Training Set | No. of Products in the Testing Set
01.1.1.2 | Flours and other cereals | 74 | 52 | 22
01.1.1.3 | Bread | 14 | 10 | 4
01.1.4.1 | Fresh whole milk | 45 | 31 | 14
01.1.4.2 | Fresh low fat milk | 36 | 25 | 11
01.1.4.7 | Eggs | 59 | 41 | 18
01.1.5.1 | Butter | 45 | 31 | 14
01.1.5.3 | Olive oil | 88 | 62 | 26
01.1.5.4 | Other edible oils | 42 | 29 | 13
01.1.6.1 | Fresh or chilled fruit | 17 | 12 | 5
01.1.7.3 | Dried vegetables, other preserved or processed vegetables | 21 | 15 | 6
01.1.7.4 | Potatoes | 21 | 15 | 6
01.1.8.1 | Sugar | 33 | 23 | 10
05.3.1.1 | Refrigerators, freezers and fridge-freezers | 931 | 652 | 279
05.3.1.2 | Clothes washing machines, clothes drying machines and dish washing machines | 767 | 537 | 230
05.3.1.3 | Cookers | 660 | 462 | 198
Figure 1. Distribution of the products among ECOICOP classes.
In Table 2 and Table 3, we present a short statistical description of the initial dataset and the corresponding training and testing datasets.

4. Methods

Classification modeling approximates a mapping function f from input variables $X = (x_1, x_2, \ldots, x_n)$, also called either predictors, features, or attributes, to a discrete output variable y, called the target or output variable. A classification model could be simply written as:

$$y = f(X; \theta),$$

where $X = (x_1, x_2, \ldots, x_n)$ are the predictors; y is a categorical variable with two values (0/1, for example) for binary classification problems or a set of values in the case of multi-class problems; and $\theta$ stands for a set of parameters. We used only supervised classification methods in our study. A supervised classification method started with a dataset consisting of pairs $(y_i, X_i)$, where for each observation i we knew the actual class (the value of $y_i$), fit a model using these data, and then could predict the values of the output variable for unseen observations.
Therefore, in our case, y was the class of a product (with 15 different possible values for our particular dataset) while X was the name of the product. Because machine-learning-classification methods use numerical vectors as inputs, the first step was to transform the actual inputs (text data, i.e., the names of the products) into numeric values. Before applying the word-embedding techniques, we pre-processed our data by the following steps (a minimal code sketch is given after the list):
  • Tokenizing the product names;
  • Transforming all characters into lowercase characters;
  • Eliminating leading and trailing white spaces;
  • Trimming any unnecessary white spaces between words;
  • Eliminating punctuation marks, such as commas, semi-colons, and colons.
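A minimal sketch of these pre-processing steps in base R (illustrative only; the function and object names are ours, not the exact code used in the study):

```r
# Illustrative pre-processing of product names (not the exact study code).
preprocess_names <- function(product_names) {
  x <- tolower(product_names)        # transform all characters to lowercase
  x <- gsub("[,;:]", " ", x)         # eliminate commas, semi-colons, and colons
  x <- gsub("\\s+", " ", x)          # trim unnecessary white spaces between words
  x <- trimws(x)                     # eliminate leading and trailing white spaces
  strsplit(x, " ")                   # tokenize each product name into words
}

tokens <- preprocess_names(c("White flour 000,  for sponge cakes ",
                             "Superior white flour from wheat 000"))
```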
Transforming text data into a numerical representation meant building a vector $X = (x_1, x_2, \ldots, x_n)$ with certain properties for each word. One important property of such a method would be to obtain similar embeddings for similar words. Currently, there are several word-embedding techniques with different properties and characteristics. We selected the count vectorization [28], term frequency–inverse document frequency [29], Word2Vec [30], FastText [31], and GloVe [32] methods for use in our study.
In the following, we briefly describe each embedding method used in our study. For more details, the interested reader can consult the above-cited references.
Count vectorization is very simple, and it involves counting the appearances of each word in a document. Suppose we have two products with the following names: P1, “white flour 000 for sponge cakes”, and P2, “superior white flour from wheat 000”. Count vectorization builds a vector representation of these two names by first making a set of unique words (the vocabulary of the problem) and then assigning the number of appearances of each word in every document. An example with these product names is presented in Table 4.
Therefore, the first product name, “white flour 000 for sponge cakes”, has the vectorized form $v_1 = (1, 1, 1, 1, 0, 1, 0, 0, 1)$, and “superior white flour from wheat 000” has the form $v_2 = (1, 0, 1, 0, 1, 0, 1, 1, 1)$.
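For illustration, a minimal base-R sketch that builds such count vectors (the column order depends on how the vocabulary is sorted, so it may differ from Table 4; the study itself used the superml package described below):

```r
# Toy count vectorization of the two product names (illustrative only).
products <- c(p1 = "white flour 000 for sponge cakes",
              p2 = "superior white flour from wheat 000")

tokens     <- strsplit(tolower(products), "\\s+")   # tokenize on whitespace
vocabulary <- sort(unique(unlist(tokens)))          # vocabulary of the problem

count_vectorize <- function(words, vocabulary) {
  # count how many times each vocabulary term appears in one product name
  vapply(vocabulary, function(v) sum(words == v), integer(1))
}

X <- t(vapply(tokens, count_vectorize, integer(length(vocabulary)),
              vocabulary = vocabulary))
colnames(X) <- vocabulary
X   # each row is the count-vectorized representation of a product name
```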
This is a very simple and fast method of word vectorization, but it has some disadvantages. Firstly, different product names can have exactly the same vectorized representations since this method does not account for the order of the words. Secondly, there is no way to encode the context of the words. Thirdly, it cannot handle out-of-vocabulary words. Out-of-vocabulary words could appear in this context if a new product was presented to the classifier but the product name contained a word that was not in the training set. To mitigate this problem, one possible solution would be to use a very large vocabulary when building a dataset via word embedding in order to exclude the chance of encountering new words. While for general text classification problems this could be a satisfactory solution, in our case, we had to handle words that might not exist in a general vocabulary, since product names could contain words from other languages (especially English) or highly technical words. Rebuilding the vocabulary each time we presented the classifier with new product sets appeared to be the only acceptable solution in this specific case. For the first two problems, to ensure awareness of the order and context of words, the count vectorization method would have to consider not only single words when building vectorized representations, but also sequences of consecutive words, called n-grams, where n is the number of words.
We implemented this vectorization method with the superml R package [33], using n-grams ranging from one to three words and removing the stop-words. The resulting vectors had more than 32,000 elements, which would be a serious issue for some of the machine-learning methods used for classification. Therefore, we limited the dimension of the embedding by considering only the first 3000 terms (single words and n-grams) ordered by their frequency. This value could be considered a parameter, and a search operation for the optimum value could be performed.
Term frequency–inverse document frequency (TF-IDF) was the second embedding method used in our study. TF-IDF builds upon the count vectorization method by attributing more importance to certain words. Frequently used words in a text are considered less important since they are typically stop-words, and less common words are considered more important since they can carry useful information. The score of word i in document j, denoted by $w_{i,j}$, was given by:
$$w_{i,j} = tf_{i,j} \times idf_i,$$

where $tf_{i,j}$ is the frequency of word i in document j, and $idf_i$ is determined by:

$$idf_i = \log\frac{n}{df_i + 1},$$

where n is the total number of documents and $df_i$ is the number of documents containing word i. Therefore, the embedding for word i is given by $(w_{i,1}, w_{i,2}, \ldots, w_{i,N})$, where N is the number of dimensions (we used 3000, as in the previous case).
This method had the same limitations as the previous one: unawareness of the context when vectorizing a word and the inability to build vectorizations for unknown words. The solution was the same as mentioned for the count vectorization, i.e., using n-grams in addition to the individual words.
We implemented this method with the same superml R package, using single words, bi-grams, and tri-grams, and limiting the dimensions of the vectors to 3000.
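Continuing the toy example above, a short sketch of the TF-IDF weighting applied to the count matrix X (the idf form follows the formula given earlier; the exact smoothing variant implemented by superml may differ):

```r
# TF-IDF weights computed from the toy count matrix X built above
# (illustrative; superml's exact idf smoothing may differ).
tf    <- X                              # term frequencies tf_{i,j}
df    <- colSums(X > 0)                 # df_i: number of documents containing term i
n     <- nrow(X)                        # total number of documents
idf   <- log(n / (df + 1))              # idf_i as defined above
tfidf <- sweep(tf, 2, idf, `*`)         # w_{i,j} = tf_{i,j} * idf_i
round(tfidf, 3)
```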
Word2Vec is an algorithm that uses a set of words (a vocabulary, or a corpus) as input and produces a vectorized representation of each word as output, using a shallow neural network. There were two versions of this method: continuous bag of words (CBOW) and skip-gram.
The CBOW version of the algorithm attempted to guess a word $w_i$ starting from the surrounding words $w_{i-m}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+m}$, while the skip-gram version started from a word $w_i$ and attempted to predict the surrounding words $w_{i-m}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+m}$. Here, m is a parameter of the algorithm called the window size. The structures of the neural networks for both variants are depicted in Figure 2.
Consider, for example, the left side of the picture, showing the skip-gram version. The algorithm started by building the vocabulary (or the corpus) of the problem and then encoding each word as a vector with the same dimension as the number of words in the vocabulary. The elements of this vector were all 0, except for the element at the position where the corresponding word appeared in the vocabulary, which had a value of 1. This simple method was called one-hot encoding. This vector was the input of the neural network. From the input to the hidden layer, the word vector was multiplied by a weight matrix $W_1$. The number of columns of this matrix, which was also the number of neurons in the hidden layer, would be the number of features (the dimension) of the output. This was a hyper-parameter, and the performance of the algorithm could be tuned by testing different values for it. A second weight matrix $W_2$ was used to compute a score for each word, and using the softmax function, the final output would be a vector with the posterior distribution of the words. The network was trained using a back-propagation algorithm.
Figure 2. A schematic view of the neural networks in the Word2Vec method.
The Word2Vec algorithm provided the vectorized representation of each word, but for our problem, we needed a vectorized representation of the product name, which could be composed of several words. We used two methods to build these vectors: firstly, by adding the vectors of each word in the product name (ADD), and secondly by averaging these vectors (MEAN). We tested the classification methods with both versions. Therefore, for the Word2Vec method, we have four vectorizations for each product name: CBOW + ADD, CBOW + MEAN, skip-gram + ADD, and skip-gram + MEAN.
We implemented the Word2Vec vectorization using the word2vec R package [34], and we built vectors with 50 features. The number of features was limited to a small value in order to ensure that the operational time was acceptable for our experiment.
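The sketch below shows how the ADD and MEAN product-name vectors can be assembled from per-word embeddings; the word2vec() training call reflects the word2vec package interface as we recall it, and names such as names_txt are assumptions:

```r
# Sketch of the ADD and MEAN aggregations of Word2Vec word embeddings
# (illustrative; names_txt is an assumed vector of pre-processed product names).
library(word2vec)

model <- word2vec(x = names_txt, type = "skip-gram", dim = 50,
                  window = 5, min_count = 1)
emb   <- as.matrix(model)                         # one row per vocabulary word

name_vector <- function(name, emb, how = c("add", "mean")) {
  how   <- match.arg(how)
  words <- unlist(strsplit(name, "\\s+"))
  vecs  <- emb[intersect(words, rownames(emb)), , drop = FALSE]
  if (nrow(vecs) == 0) return(rep(0, ncol(emb)))  # no known word: zero vector
  if (how == "add") colSums(vecs) else colMeans(vecs)
}

X_add  <- t(sapply(names_txt, name_vector, emb = emb, how = "add"))
X_mean <- t(sapply(names_txt, name_vector, emb = emb, how = "mean"))
```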
FastText builds on Word2Vec by involving not only the words but also the character n-grams (the sequences of n characters from a word). Therefore, this method could handle words not included in its vocabulary by attempting to build their embedding from the character n-grams used in the training process. It had the same two versions, CBOW and skip-gram, as Word2Vec. After obtaining the vectorization of each word, we proceeded to build the vectorization of the product names by following the original description of the algorithm: we divided each word embedding by its L2 norm and computed the average value of the word vectors in a product name for only those vectors with a non-zero L2 norm.
We implemented this method using the fastText R package [35], and we set the dimension of the vectors at 50. We used word n-grams with up to three words, and character n-grams with n up to three to train the network.
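A sketch of the product-name aggregation described above (normalizing each word vector by its L2 norm and averaging only the vectors with a non-zero norm); word_emb stands for an assumed matrix of per-word FastText vectors:

```r
# FastText-style product-name embedding (illustrative aggregation only).
fasttext_name_vector <- function(words, word_emb) {
  vecs  <- word_emb[intersect(words, rownames(word_emb)), , drop = FALSE]
  norms <- sqrt(rowSums(vecs^2))                   # L2 norm of each word vector
  keep  <- norms > 0                               # keep only non-zero-norm vectors
  if (!any(keep)) return(rep(0, ncol(word_emb)))
  colMeans(sweep(vecs[keep, , drop = FALSE], 1, norms[keep], "/"))
}
```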
GloVe goes a step further: in addition to considering only local words for contextual information, it uses word co-occurrence to integrate global information into the computation of the word embedding. An element $m_{i,j}$ of the co-occurrence matrix indicates how many times a word $w_i$ has co-occurred with word $w_j$. Given two words $w_i$ and $w_j$ and a third word, also called the probe word, $w_k$, GloVe uses the ratio $P_{ik}/P_{jk}$ to compute the word embedding, where $P_{ik}$ is the probability of seeing word $w_i$ together with word $w_k$, which is simply computed by dividing the number of times words $w_i$ and $w_k$ appear together by the total number of times word $w_i$ appears in the vocabulary. $P_{jk}$ is computed in a similar way. Building the word embedding was performed with a neural network, using a weighted least-squares method with a log-bilinear cost function.
We implemented the GloVe method of vectorization using the text2vec R package [36], and we set the dimension of the vectors to 50.
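A rough sketch of GloVe training with text2vec is given below; the argument names (e.g., rank, x_max) follow the package interface as we recall it and may differ between text2vec versions:

```r
# Sketch of 50-dimensional GloVe embeddings with text2vec (illustrative only).
library(text2vec)

it         <- itoken(names_txt, tokenizer = word_tokenizer)  # assumed product-name vector
vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
tcm        <- create_tcm(it, vectorizer, skip_grams_window = 5)

glove    <- GlobalVectors$new(rank = 50, x_max = 10)
w_main   <- glove$fit_transform(tcm, n_iter = 20)
word_emb <- w_main + t(glove$components)                     # combined word embeddings
```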
Therefore, we built nine different vectorized representations for each product name, given by the following methods:
  • Count vectorization;
  • TF-IDF;
  • Word2Vec CBOW, with the product name embedding computed by adding each word embedding (Word2Vec CBOW ADD);
  • Word2Vec CBOW, with the product name embedding computed by averaging each word embedding (Word2Vec CBOW MEAN);
  • Word2Vec skip-gram, with the product name embedding computed by adding each word embedding (Word2Vec skip-gram ADD);
  • Word2Vec skip-gram, with the product name embedding computed by averaging each word embedding (Word2Vec skip-gram MEAN);
  • FastText CBOW;
  • FastText skip-gram;
  • GloVe.
Having the vectorized representations of the product names, we proceeded to apply several machine-learning-classification methods. We used a series of supervised classification methods, which are presented in Table 5, along with the implementation details.
We used classical methods, such as logistic regression, kNN, and multinomial naïve Bayes, all of which had surprisingly good results; basic decision trees (CART) and their more sophisticated variations (Bagged CART, C4.5, C50, random forests); and more modern methods such as support vector machines, artificial neural networks, and XGBoost. Most of the classification methods were used with all nine vectorizations of the product names with two exceptions. For the multinomial naïve Bayes, we used only the count vectorization and TF-IDF because it required only positive values for the features, and we excluded these two methods when using artificial neural networks because the software implementation did not support features with such a high dimensionality as produced by count vectorization and TF-IDF. For the tree-based methods, we included a repeated 10-fold cross-validation procedure because it was known that their results would have high variances.
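As an illustration of how one of these classifiers can be fit with the repeated 10-fold cross-validation mentioned above, the sketch below uses the caret interface to ranger; the data frame train_df with a class column holding the ECOICOP label, and the number of repeats, are assumptions:

```r
# Sketch: random forest (ranger) with repeated 10-fold cross-validation via caret.
library(caret)

ctrl   <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
rf_fit <- train(class ~ ., data = train_df,
                method    = "ranger",
                trControl = ctrl)

pred <- predict(rf_fit, newdata = test_df)
confusionMatrix(pred, test_df$class)     # per-class confusion matrix on the test set
```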
A general data-flow diagram of the classification pipeline is presented in Figure 3.
Table 5. Machine learning classification methods.
Method | Software Implementation | Details
Multinomial Logistic Regression [37] | glmnetUtils package [38] | Applied for all 9 vectorization methods
Multinomial Naïve Bayes [39] | naivebayes package [40] | Applied only for count vectorization and TF-IDF vectorization
Classification and Regression Trees (CART) [41] | rpart package [42] | Applied for all 9 vectorization methods; applied with Gini and information gain criteria to split the nodes
Bagged CART [43] | e1071 [44] and caret [45] packages | Applied for all 9 vectorization methods; repeated 10-fold cross-validation to further reduce the variance
C4.5 [46] | RWeka package [47] | Applied for all 9 vectorization methods; repeated 10-fold cross-validation to further reduce the variance
C50 [48] | C50 package [49] | Applied for all 9 vectorization methods; repeated 10-fold cross-validation to further reduce the variance
Random Forest [50] | ranger package [51] | Applied for all 9 vectorization methods; repeated 10-fold cross-validation to further reduce the variance
Support Vector Machines [52] | e1071 [44] and caret [45] packages | Applied with radial and sigmoid kernels; applied for all 9 vectorization methods
Artificial Neural Networks [53] | nnet package [54] | One hidden layer; applied for Word2Vec, FastText, and GloVe vectorizations
kNN [55] | caret package [45] | Applied for all 9 vectorization methods
XGBoost [56] | xgboost package [57] | Applied for all 9 vectorization methods

5. Results

Our software was developed in R, and the scripts are available at https://github.com/bogdanoancea/autoencoder. We executed the data processing scripts on a desktop computer with an Intel Core i7-8559U processor at 4.5 GHz, 32 GB of DDR4 RAM, and the Windows 11 operating system. The processing time for all classification methods was around 24 h.
We started with an exploratory data analysis to determine how well our classes were separated (or interleaved). Therefore, we built bi-dimensional visualizations of the dataset for all nine vectorization methods. We used the t-distributed stochastic neighbor embedding (t-SNE) method [58] to reduce the dimensionality of the vectors, from 3000 for count vectorization and TF-IDF and from 50 for Word2Vec, FastText, and GloVe, to only 2. For the implementation, we used the Rtsne R package [59].
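A sketch of this projection (X_emb stands for an assumed matrix of vectorized product names and labels for their ECOICOP classes):

```r
# 2D t-SNE projection of the vectorized product names (illustrative only).
library(Rtsne)

set.seed(123)
tsne_out <- Rtsne(X_emb, dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne_out$Y, col = as.integer(as.factor(labels)), pch = 19,
     xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2")
```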
The bi-dimensional visualizations for all nine datasets are presented in Figure 4. We observed that the count vectorization, TF-IDF, FastText skip-gram, and even GloVe produced much better separations between product classes than Word2Vec, where there was significant interleaving, especially among smaller classes. Therefore, we expected to observe similar results when we applied the classification models and computed their individual performance metrics.
The performance metrics for the classification problems were derived from the well-known confusion matrix. In a two-class problem, we used the terms “positive” and “negative” for the two classes: We denoted the number of “positive” data points predicted correctly as TP (true-positive); the number of “negative” data points predicted correctly as TN (true-negative); the number of data points predicted in the “positive” class but belonging to the “negative” class as FP (false-positive); and the number of data points predicted in the “negative” class but belonging to the “positive” class as FN (false-negative). The performance metrics are shown in Table 6.
In our case, using only the accuracy could be misleading because we could obtain very high accuracy if we predicted only the larger classes correctly; therefore, we used the F1-score in addition to the accuracy. In the case of a multi-class classification problem, we usually compute a per-class F1-score and then report an aggregated form of these scores as the simple mean of the per-class F1-scores, called macro-F1. We also computed a weighted macro-F1-score by defining the weight of class i as $w_i = N_i/N$, where $N_i$ is the number of observations in class i and N is the total number of observations. To give more importance to small classes, we used the inverse of the weights, defined as $v_i = 1/w_i$, normalized as $u_i = v_i / \sum_j v_j$.
Then, we defined the weighted macro-F1 as:

$$F1_w = \sum_{i=1}^{n} u_i \times F1_i,$$

where n is the number of classes and $F1_i$ is the F1-score for class i.
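A small sketch of this metric, computed from per-class F1-scores and class sizes:

```r
# Weighted macro-F1 as defined above: inverse class-share weights, normalized
# to sum to one, so that small classes count more (illustrative helper).
weighted_macro_f1 <- function(f1_per_class, class_sizes) {
  w <- class_sizes / sum(class_sizes)   # w_i = N_i / N
  v <- 1 / w                            # inverse weights
  u <- v / sum(v)                       # normalized inverse weights u_i
  sum(u * f1_per_class)                 # F1_w = sum_i u_i * F1_i
}
```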
In addition to the accuracy and (weighted) F1-score, we also computed the multi-class AUC, as defined by [60], which was a mean of several individual AUCs and, therefore, could not be plotted.
The performance metrics for all the automatic classification and embedding methods are presented in Figure 5. In Table 7 and Table 8, we list all classification methods along with the embedding technique that provided the highest weighted F1-score and accuracy values, respectively. A ranking according to the AUC would show similar results, but this metric was less sensitive, as six classification methods (XGBoost, C50, C4.5, random forest, Bagged CART, and support vector machines with a radial kernel) shared the same maximum value. In Appendix A, in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9, Table A10, Table A11, Table A12 and Table A13, we present the performance metrics for all the classification methods and all the embedding techniques. As shown in Table 7 and Table 8, both the weighted F1-scores and accuracy values indicated that the best performing classification methods were logistic regression, support vector machines with a radial kernel, and random forest, all combined with the FastText skip-gram embedding technique, followed by XGBoost combined with TF-IDF.
Table 7. Classification methods and embedding techniques with the highest weighted F1-scores.
Classification Method | Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Logistic Regression | FastText Skip-Gram | 0.995 | 0.963 | 0.963 | 0.993
Support Vector Machines with Radial kernel | FastText Skip-Gram | 0.994 | 0.972 | 0.957 | 0.983
Random Forest | FastText Skip-Gram | 0.993 | 0.962 | 0.942 | 0.982
XGBoost | TF-IDF | 0.992 | 0.966 | 0.934 | 0.998
kNN | FastText CBOW | 0.989 | 0.943 | 0.931 | 0.969
C50 | TF-IDF | 0.992 | 0.963 | 0.929 | 0.997
C4.5 | Count Vectorization | 0.991 | 0.963 | 0.929 | 0.997
Bagged CART | Count Vectorization | 0.992 | 0.958 | 0.919 | 0.997
Multinomial Naïve Bayes | Count Vectorization | 0.991 | 0.955 | 0.915 | 0.996
Support Vector Machines with Sigmoid kernel | FastText Skip-Gram | 0.980 | 0.930 | 0.799 | 0.977
CART-Gini Index | Count Vectorization | 0.970 | 0.969 | 0.766 | 0.975
Artificial Neural Networks | GloVe | 0.961 | 0.840 | 0.869 | 0.906
CART-Information Gain | Count Vectorization | 0.959 | 0.948 | 0.580 | 0.946
Table 8. Classification method—embedding technique with the highest accuracy.
Classification Method | Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Logistic Regression | FastText Skip-Gram | 0.995 | 0.963 | 0.963 | 0.993
Support Vector Machines with Radial kernel | FastText Skip-Gram | 0.994 | 0.972 | 0.957 | 0.983
Random Forest | FastText Skip-Gram | 0.993 | 0.962 | 0.942 | 0.982
XGBoost | TF-IDF | 0.992 | 0.966 | 0.934 | 0.998
C50 | TF-IDF | 0.992 | 0.963 | 0.929 | 0.997
Bagged CART | Count Vectorization | 0.992 | 0.958 | 0.919 | 0.997
C4.5 | Count Vectorization | 0.991 | 0.963 | 0.929 | 0.997
Multinomial Naïve Bayes | Count Vectorization | 0.991 | 0.955 | 0.915 | 0.996
kNN | FastText CBOW | 0.989 | 0.943 | 0.931 | 0.969
Support Vector Machines with Sigmoid kernel | FastText Skip-Gram | 0.980 | 0.930 | 0.799 | 0.977
CART-Gini Index | Count Vectorization | 0.970 | 0.969 | 0.766 | 0.975
Artificial Neural Networks | GloVe | 0.961 | 0.840 | 0.869 | 0.906
CART-Information Gain | Count Vectorization | 0.959 | 0.948 | 0.580 | 0.946
Figure 5. Performance metrics for automatic classification methods.

6. Discussion

Automatic product classification is a mandatory task when using big data sources to complement classical data sources for consumer price statistics, as manual classification can be prohibitive in terms of the time needed for this task and the costs involved.
Our results showed very good classification performance with accuracy ranging from 0.326 to 0.995 and the weighted F1-score ranging from 0.095 to 0.963 for different word-embedding and classification combinations.
The best results in terms of the accuracy of the predictions were obtained for logistic regression at 0.995, support vector machines with a radial kernel at 0.994, and RF at 0.993 (all three classification methods combined with the FastText skip-gram word-embedding technique).
In terms of the weighted F1-score, the best classification methods were the same: logistic regression, support vector machines, and random forest, combined with the FastText skip-gram, with values of 0.963, 0.957, and 0.942, respectively. The AUC values also confirmed that these three methods, combined with the FastText skip-gram embedding, had a very high power of distinction between the classes. At the same time, the lower values of the weighted F1 and AUC for the Word2Vec embedding (see Figure 5) showed that when using this embedding technique, the separation between classes was more difficult to obtain, regardless of the classification method.
Therefore, support vector machines using a radial kernel, logistic regression, and random forest, combined with the FastText skip-gram embedding technique, appeared to have the best results for our classification problem, regardless of the performance metrics used. They were followed by the XGBoost method combined with TF-IDF, which showed good results, as well, for both accuracy and weighted-F1 metrics. The kNN was the only classification method that provided a high accuracy and weighted F1-score, when combined with the FastText CBOW embedding, while almost all classification methods performed poorly with Word2Vec embedding. All tree-based methods showed the best results for count vectorization and TF-IDF embedding.
As a general conclusion of the results, we found the following:
  • The FastText skip-gram, as well as the simple embedding methods, such as count vectorization and TF-IDF, yielded good results with the majority of classification methods, which was in line with the first visual inspection of the classes performed with t-SNE (see Figure 4). FastText had the advantage of being able to handle words not in the vocabulary, as well;
  • The Word2Vec embedding had poor results for almost all classification methods. This was confirmed by both the t-SNE transformation and the performance metrics values;
  • When analyzing how different classification methods performed using the same embedding techniques, we noted that the weighted F1 showed a much higher variability than the other metrics, and combined with a per-class error analysis, this confirmed our hypothesis that for highly imbalanced classes, the weighted F1 was a much better performance indicator than the accuracy or the simple macro-F1. The same conclusion held when analyzing how different embedding techniques performed for the same classification method. Very low values of the weighted F1-scores (for example, review the results of the CART with Gini or the information gain criteria for node splitting, support vector machines with a sigmoid kernel, and the artificial neural networks) were obtained even when the accuracy was high.
  • Logistic regression, support vector machines, and random forest had good classification performances when they were combined with FastText skip-gram, count vectorization, and TF-IDF embedding techniques, while the same methods had weaker performances when combined with Word2Vec embedding;
  • Surprisingly, even simple and older methods, such as logistic regression and naïve Bayes, had good classification performances, with logistic regression showing the best values for the performance metrics on the dataset considered in our case;
  • Predictably, more elaborate decision tree-based methods (Bagged CART, C4.5, C50, random forest) performed better than the simple decision-tree-classification methods, with random forest being one of the best classifiers according to our results;
  • The decision tree methods, with one exception (random forest), had the best results when combined with the count vectorization or TF-IDF embedding methods, potentially due to the higher dimensionality of the resulting embedding.
The results obtained in this experiment surpassed other recent approaches [16,17,18,19], which were already presented in a previous section.
Regarding the combinations between the classification and word embedding methods, in [17], the same combination (support vector machines + FastText, as in our study) was found to yield the best results, in terms of the F1-score.
For a more in-depth analysis of the performances of the classification methods, we selected the first three that showed the best performance metrics, namely logistic regression, support vector machines with a radial kernel, and random forest, all combined with the FastText skip-gram embedding, and computed the performance metrics for a varying number of features generated during the text-vectorization process. Therefore, for logistic regression (LR), random forest (RF), and support vector machines (SVMs) with a radial kernel, we computed the accuracy, F1-score, and weighted F1-score, changing the number of features during the vectorization from 25 to 250, with a step-size of 5. The results are presented in Figure 6.
All the performance metrics had an oscillating evolution with a general increasing trend, up to a maximum value, followed by an approximately constant value or even a slight decrease if we further increased the number of features. Table 9 shows the maximum values for the accuracy, F1-scores, and weighted F1-scores, along with the number of features.
As shown, while the maximum values of the performance metrics had almost the same values for all three methods, the support vector machines with a radial kernel achieved the maximum classification performance with a lower number of features (135) than logistic regression (235) and random forest (145). To further analyze the performance of these three methods, we measured the execution time of each versus the number of features, and we presented the results in Figure 7.
The time for the vectorization and the training time for all three classification methods showed a linear increasing trend but with different slopes. Fitting a simple linear regression model for the training time, we found the values of the slopes presented in Table 10.
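A sketch of this slope estimation (timing_df, with columns n_features and train_time, is an assumed data frame holding the measured running times):

```r
# Linear trend of training time versus the number of embedding features.
fit <- lm(train_time ~ n_features, data = timing_df)
coef(fit)["n_features"]   # estimated slope: extra seconds per additional feature
```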
This indicated that the processing time for random forest and support vector machines increased rapidly with the number of features, though this increase was very small for logistic regression. Comparing also the absolute values of the running times of the embedding and the classification, we noted that the total processing time for logistic regression was almost constant, as compared to random forest and support vector machines. In Table 11, we present the values of these processing times for the number of features that generated the best performance metrics for each classification method. For logistic regression, the total time was dominated by the time needed by the embedding process, while for the other two methods, the total processing time was dominated by the training process.
Considering the total processing time, logistic regression provided the best performance with a total processing time two-fold less than support vector machines and almost seven-fold less than random forest, for the number of features that had the highest accuracy for each classification method. Most of the total time for logistic regression was spent on the vectorization of the product names (>90%), while for random forest and support vector machines, the situation was the opposite, where the training process was much longer than the vectorization. For random forest, even if the classification performances were very good, the processing time for larger datasets could be prohibitive for normal computing resources. This processing time analysis, along with the values for the accuracy and the F1-scores, which were almost the same for these three methods, recommended logistic regression as the most efficient classification method, followed by the support vector machines and random forest.
We conclude this section with a general process-flow diagram of the classification process. This example used the Word2Vec embedding method, and it is shown in Figure 8. The text pre-processing operations are shown in the upper part of the figure while the model fitting with the training set and the predictions with the testing set are shown in the lower part.

7. Error Analysis

Despite these very good results, there were a number of factors to be considered further. The size of the dataset used in this study was rather small, and it is widely accepted that using larger datasets generally provides better classification performance. However, the time needed to execute the classifiers on larger datasets drastically increases, and special programming techniques should be used.
The distribution of the products among the classes was highly imbalanced, and this could have a negative impact on the quality of the classification results. Even if the accuracy of the classification was very high, a few errors in the smaller classes could have a significant impact on the final results of the price index. Therefore, the classification method should be chosen based on a metric that gives importance to smaller classes as well.
While the results could be considered good at first glance, an error analysis could provide deeper insights into the performance and error sources. One of the first aspects that influenced the classification results was the composition of the training and testing sets. A simple verification showed that all 15 classes were present in both the training and testing datasets. We already presented the number of products in each class in Table 1. Therefore, the generalization of the prediction models was not influenced by missing observations in the training set, and the values of the performance metrics on the testing set were not artificially raised by classes missing from the testing set.
Next, the imbalance between the classes could have impacted the performance of the classifiers, and this was the reason we provided the weighted F1-scores alongside the F1-score and accuracy. Indeed, as shown in Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9, Table A10, Table A11, Table A12 and Table A13 and Figure 5, the values of the weighted F1-scores were less than the F1-scores and the accuracy.
To perform a more detailed error analysis, in Figure 9, Figure 10 and Figure 11, we plotted the confusion matrices for the support vector machines with a radial kernel, random forest, and logistic regression classification models. Combined with the FastText skip-gram vectorization method, these three methods provided the best performance metrics, and we analyzed the errors on a per-case basis. We also presented the detailed performance metrics for each class separately for the combinations among the classification method and embedding techniques listed in Table 7, in Appendix B, and Table A14, Table A15, Table A16, Table A17, Table A18, Table A19, Table A20, Table A21, Table A22, Table A23, Table A24, Table A25 and Table A26.
One general remark was that the predictions for smaller classes performed well for support vector machines, random forest, and logistic regression, when combined with FastText skip-gram vectorization. On the test dataset, only five observations were incorrectly predicted by the support vector machines, four observations by logistic regression, and seven by random forest; of these, four were from smaller classes for the support vector machines, four for logistic regression, and five for random forest. Three of the observations belonging to the smaller classes that were incorrectly predicted by the support-vector-machine model belonged to the class 01.1.4.2 (low-fat milk) but were included in class 01.1.4.1 (whole milk); one belonged to the class 01.1.1.3 (bread) but was included in the class 01.1.1.2 (flours and other cereals). This latter observation was predicted incorrectly by logistic regression as well, which also incorrectly predicted another observation from the same class. An explanation could be that these incorrectly predicted observations had names very similar to the observations in the class where they had been predicted, and thus the distance in the feature space could have been very small, resulting in them being predicted in the wrong class.
The other observation predicted incorrectly by the support-vector-machine model (from the larger classes) belonged to class 05.3.1.3 (cookers), but it was predicted as class 05.3.1.1 (refrigerators). In this case, the record had a brand name that was present with several records in the class where it had been incorrectly attributed, which could explain the error. Such particularities of the product names were found in random forest and logistic regression, as well. All observations from the larger classes were correctly predicted by logistic regression, but one observation from the 01.1.7.4 class was incorrectly included in the 05.3.1.2 class, one of the largest classes in our set. The random forest classifier predicted four observations incorrectly from low-fat milk (they were incorrectly included in the whole-milk class), and one observation from 01.1.1.3 (bread) class was incorrectly predicted as the class 01.1.1.2 (flours and other cereals), a situation similar to the support-vector-machine classifier. The same explanation could be applied here as well, as the classes were related one to each other (low-fat milk and whole milk, bread and cereals) with similar names that most likely produced vectorizations very close together in the feature space.
To conclude, we can state that these three classifiers had very good performances at the class level, even for the smaller classes, where incorrect classifications could affect the quality of the final results.
Further inspection of the per-class performance metrics revealed that both simple decision-tree methods (CART with the Gini index and CART with information gain used for node splitting) had several classes missing from the predicted values, which indicated they were less reliable for our purpose. This deficiency was resolved by the more sophisticated tree methods (C4.5, C50, Bagged CART), but there was still one class (01.1.4.2) for which all these tree-based methods had a low prediction accuracy. A low accuracy for the same class was also noted for the XGBoost method.
The support vector machines with a sigmoid kernel were also less reliable than the other methods, having one class of products missing from the predicted values, while kNN, neural networks, and multinomial naïve Bayes performed reasonably well at the class level.
When analyzing the per-class metrics, we also noticed that support vector machines, logistic regression, and random forest showed relatively good performances for all classes, with the balanced accuracy for individual classes varying from 1 to 0.863 for the first method, from 1 to 0.75 for the second, and from 1 to 0.888 for the third. In contrast, the decision-tree-based methods (CART, Bagged CART, C50, C4.5) showed a larger variation in the balanced accuracy between classes, in addition to the classes missing from the predictions. For all these methods, the lowest accuracy was recorded for the 01.1.4.2 class. However, these classifiers were not entirely incorrect when working with classes 01.1.4.1 and 01.1.4.2, as both were related to “milk”. A better separation of these two classes in the feature space could improve the accuracy of the predictions.
The per-class error analysis confirmed that the support vector machines with a radial kernel, logistic regression, and random forest were the methods with the best results in our case and also assisted in identifying methods (simple decision trees, support vector machines with a sigmoid kernel) that could be excluded because they had produced predictions that made the CPI computations almost impossible.

8. Conclusions and Future Work

Currently, with the advent of the digital revolution, new data sources are being used to increase the timeliness and decrease the costs of the calculation process for several economic indicators used by policymakers throughout the world. One of these statistical indicators is the well-known CPI, computed in every country by the official statistics bureaus and used to fine-tune public policies. In addition to the classical methods for CPI computations, new data sources such as scanner data or web-scraped data have been used either to augment the way the CPI is computed or to compute entirely new price indices. Nevertheless, using such data sources has introduced a problem: their large volume makes it almost impossible to manually classify the products according to the statistical methodologies in place. To solve this problem, automatic classification procedures that use machine-learning methods can be employed. In this paper, we presented the results obtained after experimenting with several automatic classification procedures: logistic regression, multinomial naïve Bayes, decision trees, bagged decision trees, C4.5, C50, random forest, support vector machines, artificial neural networks, kNN, and XGBoost. To our knowledge, this was one of the most comprehensive experiments in the area of product classification, combining 9 different word-embedding techniques with 13 classification models.
We started by transforming the product names into numerical vectors and then applied a series of machine-learning classification methods; a minimal sketch of this two-step pipeline is given below. The results obtained were encouraging, as the methods tested showed very good performance.
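The following R sketch is illustrative only: it uses the word2vec package (skip-gram, 50 dimensions) as a stand-in for the embedding step, since the FastText call differs, and a multinomial logistic regression from glmnet for the classification step. The data frame products, with columns name and coicop, is a hypothetical layout, not the original dataset.

```r
# Illustrative two-step pipeline: word embedding, then supervised classification.
library(word2vec)
library(glmnet)

# 1. Train a 50-dimensional skip-gram embedding on the product names
emb_model <- word2vec(x = tolower(products$name), type = "skip-gram",
                      dim = 50, iter = 10)

# 2. Represent each product name as the average of its word vectors
name_to_vector <- function(name, model, dim = 50) {
  words <- unlist(strsplit(tolower(name), "\\s+"))
  vecs  <- predict(model, newdata = words, type = "embedding")
  vecs  <- vecs[stats::complete.cases(vecs), , drop = FALSE]  # drop unknown words
  if (nrow(vecs) == 0) return(rep(0, dim))
  colMeans(vecs)
}
X <- t(vapply(products$name, name_to_vector, numeric(50), model = emb_model))
y <- factor(products$coicop)

# 3. Fit a multinomial logistic regression and predict the COICOP class
fit  <- cv.glmnet(X, y, family = "multinomial")
pred <- predict(fit, newx = X, s = "lambda.min", type = "class")
```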
The best results, both in terms of the accuracy and the weighted F1-scores, were obtained by logistic regression and support vector machines, followed by random forest, all combined with the FastText skip-gram embedding technique. This embedding technique also has the advantage of being able to handle words not already in the vocabulary. Regarding the embedding techniques, we noticed that all decision tree-based methods obtained good results with either count vectorization or TF-IDF, which could be associated with their much higher number of features (3000), as compared to Word2Vec, FastText, and GloVe, for which we generated only 50 features for each product name. However, the per-class error analysis showed that the simple decision-tree methods still performed poorly, having several classes entirely absent from the predictions on the test set. Only one method, the artificial neural networks, performed better with the GloVe embedding technique, while kNN performed better with FastText CBOW. Surprisingly, the neural networks showed relatively poor results compared to the other methods, but we only used them with the default parameters. Choosing the optimum values for their parameters could greatly improve the classification results, but that would be a computationally intensive task requiring special programming techniques.
Nevertheless, these good results could also be explained by the structure of the product names, which did not vary significantly from one retailer to another, at least for the categories involved in this study. The “unseen” data used to test the performance of each method, i.e., the test subset, largely followed the same product-naming conventions as the training set, and the classification consequently showed good results.
However, there are some issues to be considered in future work. Firstly, we ran the classification algorithms on a relatively small dataset, yet the processing time was very high (approximately one day) when we used the repeated cross-validation procedure. The problem of computational complexity and long processing times will be even more acute when working with larger datasets. We envisage two solutions here: using parallel programming techniques within the R software environment, or, if the processing time remains too high, switching to another language that could perform better, such as Python or even C++. With a faster execution, we could also use embeddings with more dimensions than those used in this experiment.
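As an illustration of the first option, the repeated cross-validation could be parallelised along the lines of the following sketch; the core count, fold settings, and the ranger model are placeholders, not the exact configuration used in this study.

```r
# Possible parallel setup for caret-based training in R.
library(caret)
library(doParallel)

cl <- parallel::makeCluster(parallel::detectCores() - 1)
registerDoParallel(cl)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     allowParallel = TRUE)          # folds run on the cluster
fit  <- train(x = X, y = y, method = "ranger", trControl = ctrl)

stopCluster(cl)
```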
Secondly, the machine-learning classification methods have hyper-parameters that can greatly influence the results. To identify their optimum values, we intend to use a grid-search procedure, but only after we adopt another software environment. An experiment involving a grid search to choose the optimum value of the cost parameter of the support vector machines (with both kernels) and of γ for the support vector machines with the radial kernel resulted in a processing time longer than 2 days, which we considered unacceptable for a pilot study.
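For reference, such a grid search can be expressed compactly with caret; the grid below is purely illustrative and is not the configuration that produced the two-day run mentioned above.

```r
# Illustrative grid search over the SVM cost (C) and RBF width (sigma).
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
grid <- expand.grid(C = 10^(-1:3), sigma = 10^(-3:0))

svm_fit <- train(x = X, y = y, method = "svmRadial",
                 trControl = ctrl, tuneGrid = grid)
svm_fit$bestTune   # cross-validated optimum of (sigma, C)
```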
Thirdly, there is the problem of words not already in the vocabulary, i.e., words not present in the training set. While FastText can handle such words, the other embedding methods cannot. One possible solution would be to rebuild the vocabulary every time a new dataset needs to be classified, but this would likely result in a longer execution time.
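Another simple fallback, sketched below purely for illustration (it is not a solution adopted in this paper), is to replace the vector of an unseen word with the mean vector of the training vocabulary; embedding_matrix is a hypothetical matrix with one row per known word.

```r
# Naive out-of-vocabulary fallback: back off to the mean training vector.
mean_vector <- colMeans(embedding_matrix)

lookup_vector <- function(word, embedding_matrix, fallback = mean_vector) {
  if (word %in% rownames(embedding_matrix)) {
    embedding_matrix[word, ]
  } else {
    fallback            # word unseen at training time
  }
}
```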
Another direction for future research would be to use more sophisticated embedding techniques, such as BERT or one of its several variants, or other classification methods, such as LSTM or W2NER. Finally, the implementation of an automatic procedure to rank the results and choose the best classification method should also be considered in future research.

Funding

This research received no external funding.

Data Availability Statement

The R scripts and the data used in this work are available at: https://github.com/bogdanoancea/autoencoder.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Performance Metrics for All Classification and Embedding Methods

Table A1. Performance metrics for logistic regression.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.981 | 0.943 | 0.893 | 0.969
TF-IDF | 0.979 | 0.904 | 0.878 | 0.909
Word2Vec CBOW ADD | 0.939 | 0.836 | 0.767 | 0.876
Word2Vec CBOW MEAN | 0.963 | 0.886 | 0.881 | 0.891
Word2Vec SKIP ADD | 0.897 | 0.748 | 0.574 | 0.720
Word2Vec SKIP MEAN | 0.924 | 0.764 | 0.656 | 0.825
FastText CBOW | 0.991 | 0.952 | 0.961 | 0.979
FastText skip-gram | 0.995 | 0.963 | 0.962 | 0.994
GLOVE | 0.982 | 0.946 | 0.929 | 0.965
Table A2. Performance metrics for the Multinomial naïve Bayes.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.991 | 0.955 | 0.915 | 0.996
TF-IDF | 0.982 | 0.901 | 0.877 | 0.931
Table A3. Performance metrics for CART with Gini index for node splitting.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.969 | 0.968 | 0.766 | 0.974
TF-IDF | 0.969 | 0.968 | 0.766 | 0.974
Word2Vec CBOW ADD | 0.805 | 0.709 | 0.2581 | 0.788
Word2Vec CBOW MEAN | 0.778 | 0.684 | 0.255 | 0.726
Word2Vec SKIP ADD | 0.834 | 0.641 | 0.369 | 0.716
Word2Vec SKIP MEAN | 0.825 | 0.665 | 0.291 | 0.710
FastText CBOW | 0.827 | 0.788 | 0.291 | 0.830
FastText skip-gram | 0.882 | 0.789 | 0.387 | 0.856
GLOVE | 0.874 | 0.755 | 0.354 | 0.838
Table A4. Performance metrics for CART with Information Gain for node splitting.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.959 | 0.948 | 0.579 | 0.946
TF-IDF | 0.959 | 0.948 | 0.579 | 0.946
Word2Vec CBOW ADD | 0.787 | 0.687 | 0.257 | 0.751
Word2Vec CBOW MEAN | 0.782 | 0.682 | 0.175 | 0.743
Word2Vec SKIP ADD | 0.822 | 0.633 | 0.331 | 0.671
Word2Vec SKIP MEAN | 0.805 | 0.689 | 0.204 | 0.736
FastText CBOW | 0.850 | 0.815 | 0.385 | 0.872
FastText skip-gram | 0.901 | 0.817 | 0.463 | 0.895
GLOVE | 0.893 | 0.713 | 0.392 | 0.835
Table A5. Performance metrics for Bagged CART.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.992 | 0.958 | 0.919 | 0.997
TF-IDF | 0.986 | 0.949 | 0.903 | 0.996
Word2Vec CBOW ADD | 0.928 | 0.836 | 0.707 | 0.865
Word2Vec CBOW MEAN | 0.924 | 0.774 | 0.725 | 0.888
Word2Vec SKIP ADD | 0.899 | 0.668 | 0.596 | 0.816
Word2Vec SKIP MEAN | 0.914 | 0.757 | 0.666 | 0.821
FastText CBOW | 0.973 | 0.899 | 0.888 | 0.902
FastText skip-gram | 0.968 | 0.869 | 0.844 | 0.926
GLOVE | 0.980 | 0.924 | 0.899 | 0.965
Table A6. Performance metrics for C4.5.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.991 | 0.963 | 0.929 | 0.997
TF-IDF | 0.989 | 0.958 | 0.919 | 0.997
Word2Vec CBOW ADD | 0.861 | 0.662 | 0.575 | 0.759
Word2Vec CBOW MEAN | 0.881 | 0.747 | 0.636 | 0.841
Word2Vec SKIP ADD | 0.857 | 0.581 | 0.487 | 0.775
Word2Vec SKIP MEAN | 0.870 | 0.645 | 0.628 | 0.794
FastText CBOW | 0.895 | 0.717 | 0.597 | 0.876
FastText skip-gram | 0.952 | 0.831 | 0.797 | 0.946
GLOVE | 0.942 | 0.783 | 0.707 | 0.879
Table A7. Performance metrics for C50.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.991 | 0.963 | 0.929 | 0.997
TF-IDF | 0.992 | 0.963 | 0.929 | 0.997
Word2Vec CBOW ADD | 0.859 | 0.685 | 0.594 | 0.807
Word2Vec CBOW MEAN | 0.868 | 0.698 | 0.643 | 0.823
Word2Vec SKIP ADD | 0.871 | 0.605 | 0.487 | 0.786
Word2Vec SKIP MEAN | 0.859 | 0.633 | 0.561 | 0.769
FastText CBOW | 0.901 | 0.739 | 0.688 | 0.864
FastText skip-gram | 0.960 | 0.895 | 0.873 | 0.946
GLOVE | 0.943 | 0.774 | 0.696 | 0.878
Table A8. Performance metrics for random forest.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.992 | 0.958 | 0.919 | 0.997
TF-IDF | 0.992 | 0.962 | 0.924 | 0.997
Word2Vec CBOW ADD | 0.951 | 0.869 | 0.799 | 0.9042
Word2Vec CBOW MEAN | 0.946 | 0.887 | 0.792 | 0.9288
Word2Vec SKIP ADD | 0.929 | 0.743 | 0.691 | 0.853
Word2Vec SKIP MEAN | 0.925 | 0.796 | 0.719 | 0.825
FastText CBOW | 0.987 | 0.933 | 0.935 | 0.935
FastText skip-gram | 0.993 | 0.962 | 0.942 | 0.982
GLOVE | 0.991 | 0.968 | 0.934 | 0.993
Table A9. Performance metrics for support vector machines with Sigmoid kernel.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.649 | 0.682 | 0.095 | 0.579
TF-IDF | 0.326 | 0.492 | 0.021 | 0.500
Word2Vec CBOW ADD | 0.798 | 0.779 | 0.318 | 0.769
Word2Vec CBOW MEAN | 0.876 | 0.675 | 0.633 | 0.781
Word2Vec SKIP ADD | 0.773 | 0.488 | 0.175 | 0.739
Word2Vec SKIP MEAN | 0.822 | 0.562 | 0.326 | 0.784
FastText CBOW | 0.958 | 0.861 | 0.662 | 0.927
FastText skip-gram | 0.980 | 0.930 | 0.799 | 0.977
GLOVE | 0.9217 | 0.746 | 0.512 | 0.905
Table A10. Performance metrics for support vector machines with Radial kernel.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.991 | 0.949 | 0.911 | 0.991
TF-IDF | 0.992 | 0.958 | 0.919 | 0.997
Word2Vec CBOW ADD | 0.953 | 0.906 | 0.743 | 0.891
Word2Vec CBOW MEAN | 0.966 | 0.902 | 0.843 | 0.939
Word2Vec SKIP ADD | 0.909 | 0.748 | 0.484 | 0.849
Word2Vec SKIP MEAN | 0.919 | 0.682 | 0.608 | 0.816
FastText CBOW | 0.991 | 0.941 | 0.930 | 0.950
FastText skip-gram | 0.994 | 0.972 | 0.957 | 0.983
GLOVE | 0.976 | 0.901 | 0.857 | 0.965
Table A11. Performance metrics for Neural networks.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Word2Vec CBOW ADD | 0.874 | 0.665 | 0.383 | 0.758
Word2Vec CBOW MEAN | 0.947 | 0.780 | 0.727 | 0.850
Word2Vec SKIP ADD | 0.919 | 0.717 | 0.503 | 0.904
Word2Vec SKIP MEAN | 0.945 | 0.738 | 0.718 | 0.872
FastText CBOW | 0.619 | 0.642 | 0.159 | 0.734
FastText skip-gram | 0.803 | 0.777 | 0.117 | 0.759
GLOVE | 0.961 | 0.839 | 0.869 | 0.906
Table A12. Performance metrics for XGBoost.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.991 | 0.954 | 0.909 | 0.996
TF-IDF | 0.992 | 0.966 | 0.934 | 0.997
Word2Vec CBOW ADD | 0.911 | 0.769 | 0.689 | 0.833
Word2Vec CBOW MEAN | 0.910 | 0.839 | 0.631 | 0.857
Word2Vec SKIP ADD | 0.910 | 0.725 | 0.523 | 0.787
Word2Vec SKIP MEAN | 0.917 | 0.728 | 0.611 | 0.812
FastText CBOW | 0.963 | 0.846 | 0.819 | 0.901
FastText skip-gram | 0.963 | 0.860 | 0.824 | 0.902
GLOVE | 0.966 | 0.864 | 0.826 | 0.918
Table A13. Performance metrics for kNN.

Embedding Technique | Accuracy | F1 | Weighted F1 | AUC
Count Vectorization | 0.987 | 0.933 | 0.902 | 0.981
TF-IDF | 0.986 | 0.933 | 0.876 | 0.994
Word2Vec CBOW ADD | 0.931 | 0.806 | 0.731 | 0.867
Word2Vec CBOW MEAN | 0.952 | 0.898 | 0.841 | 0.937
Word2Vec SKIP ADD | 0.902 | 0.689 | 0.629 | 0.899
Word2Vec SKIP MEAN | 0.938 | 0.784 | 0.747 | 0.849
FastText CBOW | 0.989 | 0.943 | 0.931 | 0.969
FastText skip-gram | 0.987 | 0.936 | 0.895 | 0.979
GLOVE | 0.984 | 0.929 | 0.914 | 0.973

Appendix B. Per Class Performance Metrics for Classification and Embedding Methods

For the performance metrics listed below, we used the standard definitions; see, for example, [61].
Table A14. Performance metrics for support vector machines with Radial kernel with FastText SKIP GRAM vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 0.998 | 0.956 | 1.000 | 0.956 | 1.000 | 0.977 | 0.025 | 0.025 | 0.026 | 0.999
01.1.1.3 | 0.750 | 1.000 | 1.000 | 0.998 | 1.000 | 0.750 | 0.857 | 0.004 | 0.003 | 0.003 | 0.875
01.1.4.1 | 1.000 | 0.996 | 0.823 | 1.000 | 0.823 | 1.000 | 0.903 | 0.016 | 0.016 | 0.019 | 0.998
01.1.4.2 | 0.727 | 1.000 | 1.000 | 0.996 | 1.000 | 0.727 | 0.842 | 0.012 | 0.009 | 0.009 | 0.863
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 0.998 | 0.996 | 1.000 | 0.996 | 1.000 | 0.998 | 0.325 | 0.325 | 0.327 | 0.999
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 0.994 | 1.000 | 1.000 | 0.998 | 1.000 | 0.994 | 0.997 | 0.231 | 0.230 | 0.230 | 0.997
Table A15. Performance metrics for Logistic regression with FastText SKIP GRAM vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 0.997 | 0.916 | 1.000 | 0.916 | 1.000 | 0.956 | 0.025 | 0.025 | 0.028 | 0.998
01.1.1.3 | 0.500 | 1.000 | 1.000 | 0.997 | 1.000 | 0.500 | 0.666 | 0.004 | 0.002 | 0.002 | 0.750
01.1.4.1 | 1.000 | 0.998 | 0.933 | 1.000 | 0.933 | 1.000 | 0.965 | 0.016 | 0.016 | 0.017 | 0.999
01.1.4.2 | 0.909 | 1.000 | 1.000 | 0.998 | 1.000 | 0.909 | 0.952 | 0.012 | 0.011 | 0.011 | 0.954
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 0.833 | 1.000 | 1.000 | 0.998 | 1.000 | 0.833 | 0.909 | 0.007 | 0.005 | 0.005 | 0.916
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.325 | 0.325 | 0.325 | 1.000
05.3.1.2 | 1.000 | 0.998 | 0.995 | 1.000 | 0.995 | 1.000 | 0.997 | 0.268 | 0.268 | 0.269 | 0.999
05.3.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.231 | 0.231 | 0.231 | 1.000
Table A16. Performance metrics for random forest with FastText SKIP GRAM vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 0.956 | 1.000 | 1.000 | 0.998 | 1.000 | 0.956 | 0.977 | 0.026 | 0.025 | 0.025 | 0.978
01.1.1.3 | 1.000 | 0.998 | 0.750 | 1.000 | 0.750 | 1.000 | 0.857 | 0.003 | 0.003 | 0.004 | 0.999
01.1.4.1 | 0.777 | 1.000 | 1.000 | 0.995 | 1.000 | 0.777 | 0.875 | 0.021 | 0.016 | 0.016 | 0.888
01.1.4.2 | 1.000 | 0.995 | 0.636 | 1.000 | 0.636 | 1.000 | 0.777 | 0.008 | 0.008 | 0.012 | 0.997
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 0.998 | 0.900 | 1.000 | 0.900 | 1.000 | 0.947 | 0.010 | 0.010 | 0.011 | 0.999
05.3.1.1 | 0.992 | 1.000 | 1.000 | 0.996 | 1.000 | 0.992 | 0.996 | 0.328 | 0.325 | 0.325 | 0.996
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 1.000 | 0.998 | 0.994 | 1.000 | 0.994 | 1.000 | 0.997 | 0.230 | 0.230 | 0.231 | 0.999
Table A17. Performance metrics for kNN with FastText-CBOW vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 0.954 | 0.998 | 0.954 | 0.998 | 0.954 | 0.954 | 0.954 | 0.025 | 0.024 | 0.025 | 0.970
01.1.1.3 | 0.750 | 1.000 | 1.000 | 0.998 | 1.000 | 0.750 | 0.857 | 0.004 | 0.003 | 0.003 | 0.875
01.1.4.1 | 0.928 | 0.996 | 0.812 | 0.998 | 0.812 | 0.928 | 0.866 | 0.016 | 0.015 | 0.018 | 0.962
01.1.4.2 | 0.727 | 0.998 | 0.888 | 0.996 | 0.888 | 0.727 | 0.800 | 0.012 | 0.009 | 0.010 | 0.863
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 0.998 | 0.833 | 1.000 | 0.833 | 1.000 | 0.909 | 0.005 | 0.005 | 0.007 | 0.999
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 0.833 | 1.000 | 1.000 | 0.998 | 1.000 | 0.833 | 0.909 | 0.007 | 0.005 | 0.005 | 0.916
01.1.8.1 | 0.900 | 0.997 | 0.818 | 0.998 | 0.818 | 0.900 | 0.857 | 0.011 | 0.010 | 0.012 | 0.948
05.3.1.1 | 1.000 | 0.998 | 0.996 | 1.000 | 0.996 | 1.000 | 0.998 | 0.325 | 0.325 | 0.327 | 0.999
05.3.1.2 | 0.995 | 1.000 | 1.000 | 0.998 | 1.000 | 0.995 | 0.997 | 0.268 | 0.267 | 0.267 | 0.999
05.3.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.231 | 0.231 | 0.231 | 1.000
Table A18. Performance metrics for C50 with TF-IDF vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.025 | 0.025 | 0.025 | 1.000
01.1.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 0.004 | 1.000
01.1.4.1 | 1.000 | 0.992 | 0.700 | 1.000 | 0.700 | 1.000 | 0.823 | 0.016 | 0.016 | 0.023 | 0.996
01.1.4.2 | 0.454 | 1.000 | 1.000 | 0.992 | 1.000 | 0.454 | 0.625 | 0.012 | 0.005 | 0.005 | 0.727
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 0.998 | 0.996 | 1.000 | 0.996 | 1.000 | 0.998 | 0.325 | 0.325 | 0.327 | 0.999
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 0.994 | 1.000 | 1.000 | 0.998 | 1.000 | 0.994 | 0.997 | 0.231 | 0.230 | 0.230 | 0.997
Table A19. Performance metrics for Bagged CART with CV vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.025 | 0.025 | 0.025 | 1.000
01.1.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 0.004 | 1.000
01.1.4.1 | 0.928 | 0.992 | 0.684 | 0.998 | 0.684 | 0.928 | 0.787 | 0.016 | 0.015 | 0.022 | 0.960
01.1.4.2 | 0.454 | 0.998 | 0.833 | 0.992 | 0.833 | 0.454 | 0.588 | 0.012 | 0.005 | 0.007 | 0.726
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.325 | 0.325 | 0.325 | 1.000
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.231 | 0.231 | 0.231 | 1.000
Table A20. Performance metrics for C4.5 with CV vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.025 | 0.025 | 0.025 | 1.000
01.1.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 0.004 | 1.000
01.1.4.1 | 1.000 | 0.992 | 0.700 | 1.000 | 0.700 | 1.000 | 0.823 | 0.016 | 0.016 | 0.023 | 0.996
01.1.4.2 | 0.454 | 1.000 | 1.000 | 0.992 | 1.000 | 0.454 | 0.625 | 0.012 | 0.005 | 0.005 | 0.727
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 0.996 | 0.992 | 1.000 | 0.992 | 1.000 | 0.996 | 0.325 | 0.325 | 0.328 | 0.998
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 0.989 | 1.000 | 1.000 | 0.996 | 1.000 | 0.989 | 0.994 | 0.231 | 0.228 | 0.228 | 0.994
Table A21. Performance metrics for Multinomial naïve Bayes with CV vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.025 | 0.025 | 0.025 | 1.000
01.1.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 0.004 | 1.000
01.1.4.1 | 0.928 | 0.992 | 0.684 | 0.998 | 0.684 | 0.928 | 0.787 | 0.016 | 0.015 | 0.022 | 0.960
01.1.4.2 | 0.454 | 0.998 | 0.833 | 0.992 | 0.833 | 0.454 | 0.588 | 0.012 | 0.005 | 0.007 | 0.726
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 0.900 | 1.000 | 1.000 | 0.998 | 1.000 | 0.900 | 0.947 | 0.011 | 0.010 | 0.010 | 0.950
05.3.1.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.325 | 0.325 | 0.325 | 1.000
05.3.1.2 | 1.000 | 0.998 | 0.995 | 1.000 | 0.995 | 1.000 | 0.997 | 0.268 | 0.268 | 0.269 | 0.999
05.3.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.231 | 0.231 | 0.231 | 1.000
Table A22. Performance metrics for XGBoost with TF-IDF vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.025 | 0.025 | 0.025 | 1.000
01.1.1.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.004 | 0.004 | 0.004 | 1.000
01.1.4.1 | 0.857 | 0.995 | 0.750 | 0.997 | 0.750 | 0.857 | 0.800 | 0.016 | 0.014 | 0.021 | 0.926
01.1.4.2 | 0.636 | 0.997 | 0.777 | 0.995 | 0.777 | 0.636 | 0.700 | 0.012 | 0.005 | 0.008 | 0.817
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 0.998 | 0.996 | 1.000 | 0.996 | 1.000 | 0.998 | 0.325 | 0.325 | 0.325 | 0.999
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 0.995 | 1.000 | 1.000 | 0.998 | 1.000 | 0.995 | 0.997 | 0.231 | 0.231 | 0.231 | 0.997
Table A23. Performance metrics for support vector machines with Sigmoid kernel with FastText skip-gram vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 0.995 | 0.846 | 1.000 | 0.846 | 1.000 | 0.916 | 0.025 | 0.025 | 0.030 | 0.997
01.1.1.3 | 0.250 | 1.000 | 1.000 | 0.996 | 1.000 | 0.250 | 0.400 | 0.004 | 0.001 | 0.001 | 0.625
01.1.4.1 | 1.000 | 0.986 | 0.560 | 1.000 | 0.560 | 1.000 | 0.717 | 0.016 | 0.016 | 0.029 | 0.993
01.1.4.2 | 0.000 | 1.000 | NaN | 0.987 | NA | 0.000 | NA | 0.012 | 0.000 | 0.000 | 0.500
01.1.4.7 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.021 | 0.021 | 0.021 | 1.000
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.005 | 0.005 | 0.005 | 1.000
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 0.992 | 1.000 | 1.000 | 0.996 | 1.000 | 0.992 | 0.996 | 0.325 | 0.323 | 0.323 | 0.996
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 0.994 | 0.996 | 0.989 | 0.998 | 0.989 | 0.994 | 0.992 | 0.231 | 0.230 | 0.232 | 0.995
Table A24. Performance metrics for CART with Information Gain with CV vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 0.863 | 0.992 | 0.760 | 0.996 | 0.760 | 0.863 | 0.808 | 0.025 | 0.022 | 0.029 | 0.928
01.1.1.3 | 0.000 | 1.000 | NaN | 0.995 | NA | 0.000 | NA | 0.004 | 0.000 | 0.000 | 0.500
01.1.4.1 | 1.000 | 0.986 | 0.560 | 1.000 | 0.560 | 1.000 | 0.717 | 0.016 | 0.016 | 0.029 | 0.993
01.1.4.2 | 0.000 | 1.000 | NaN | 0.987 | NA | 0.000 | NA | 0.012 | 0.000 | 0.000 | 0.500
01.1.4.7 | 0.888 | 1.000 | 1.000 | 0.997 | 1.000 | 0.888 | 0.941 | 0.021 | 0.018 | 0.018 | 0.944
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 0.000 | 1.000 | NaN | 0.994 | NA | 0.000 | NA | 0.005 | 0.000 | 0.000 | 0.500
01.1.7.3 | 0.000 | 1.000 | NaN | 0.992 | NA | 0.000 | NA | 0.007 | 0.000 | 0.000 | 0.500
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 0.968 | 0.939 | 1.000 | 0.939 | 1.000 | 0.968 | 0.325 | 0.325 | 0.346 | 0.984
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 0.979 | 1.000 | 1.000 | 0.993 | 1.000 | 0.979 | 0.989 | 0.231 | 0.226 | 0.226 | 0.989
Table A25. Performance metrics for CART with Gini index with CV vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.025 | 0.025 | 0.025 | 1.000
01.1.1.3 | 0.000 | 1.000 | NaN | 0.995 | NA | 0.000 | NA | 0.004 | 0.000 | 0.000 | 0.500
01.1.4.1 | 1.000 | 0.986 | 0.560 | 1.000 | 0.560 | 1.000 | 0.717 | 0.016 | 0.016 | 0.029 | 0.993
01.1.4.2 | 0.000 | 1.000 | NaN | 0.987 | NA | 0.000 | NA | 0.012 | 0.000 | 0.000 | 0.500
01.1.4.7 | 0.888 | 1.000 | 1.000 | 0.997 | 1.000 | 0.888 | 0.941 | 0.021 | 0.018 | 0.018 | 0.944
01.1.5.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.016 | 0.016 | 0.016 | 1.000
01.1.5.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.030 | 0.030 | 0.030 | 1.000
01.1.5.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.015 | 0.015 | 0.015 | 1.000
01.1.6.1 | 0.000 | 1.000 | NaN | 0.994 | NA | 0.000 | NA | 0.005 | 0.000 | 0.000 | 0.500
01.1.7.3 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.7.4 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.007 | 0.007 | 0.007 | 1.000
01.1.8.1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.011 | 0.011 | 0.011 | 1.000
05.3.1.1 | 1.000 | 0.974 | 0.948 | 1.000 | 0.948 | 1.000 | 0.973 | 0.325 | 0.325 | 0.343 | 0.987
05.3.1.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.268 | 0.268 | 0.268 | 1.000
05.3.1.3 | 0.979 | 1.000 | 1.000 | 0.993 | 1.000 | 0.979 | 0.989 | 0.231 | 0.226 | 0.226 | 0.989
Table A26. Performance metrics for Neural Networks with GLOVE vectorization at class level.

Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy
01.1.1.2 | 0.818 | 0.994 | 0.782 | 0.995 | 0.782 | 0.818 | 0.800 | 0.025 | 0.021 | 0.026 | 0.906
01.1.1.3 | 0.500 | 0.996 | 0.400 | 0.997 | 0.400 | 0.500 | 0.444 | 0.004 | 0.002 | 0.005 | 0.748
01.1.4.1 | 1.000 | 0.998 | 0.933 | 1.000 | 0.933 | 1.000 | 0.965 | 0.016 | 0.016 | 0.017 | 0.999
01.1.4.2 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.012 | 0.012 | 0.012 | 1.000
01.1.4.7 | 0.944 | 0.998 | 0.944 | 0.998 | 0.944 | 0.944 | 0.944 | 0.021 | 0.019 | 0.021 | 0.971
01.1.5.1 | 0.857 | 1.000 | 1.000 | 0.997 | 1.000 | 0.857 | 0.923 | 0.016 | 0.014 | 0.014 | 0.928
01.1.5.3 | 1.000 | 0.997 | 0.928 | 1.000 | 0.928 | 1.000 | 0.962 | 0.030 | 0.030 | 0.032 | 0.998
01.1.5.4 | 0.769 | 0.998 | 0.909 | 0.996 | 0.909 | 0.769 | 0.833 | 0.015 | 0.011 | 0.012 | 0.884
01.1.6.1 | 0.400 | 0.997 | 0.500 | 0.996 | 0.500 | 0.400 | 0.444 | 0.005 | 0.002 | 0.004 | 0.698
01.1.7.3 | 0.833 | 0.997 | 0.714 | 0.998 | 0.714 | 0.833 | 0.769 | 0.007 | 0.005 | 0.008 | 0.915
01.1.7.4 | 0.666 | 0.998 | 0.800 | 0.997 | 0.800 | 0.666 | 0.727 | 0.007 | 0.004 | 0.005 | 0.832
01.1.8.1 | 0.800 | 0.998 | 0.888 | 0.997 | 0.888 | 0.800 | 0.842 | 0.011 | 0.009 | 0.010 | 0.899
05.3.1.1 | 0.993 | 0.996 | 0.985 | 0.996 | 0.985 | 0.992 | 0.989 | 0.325 | 0.323 | 0.328 | 0.992
05.3.1.2 | 0.969 | 0.993 | 0.982 | 0.988 | 0.982 | 0.969 | 0.975 | 0.268 | 0.260 | 0.265 | 0.981
05.3.1.3 | 0.979 | 0.991 | 0.970 | 0.993 | 0.970 | 0.979 | 0.974 | 0.231 | 0.226 | 0.233 | 0.985

References

  1. Harchaoui, T.M.; Janssen, R.V. How can big data enhance the timeliness of official statistics?: The case of the U.S. consumer price index. Int. J. Forecast. 2018, 4392, 225–234. [Google Scholar] [CrossRef]
  2. Ivancic, L.; Erwin Diewert, W.; Fox, K.J. Scanner data, time aggregation and the construction of price indexes. J. Econom. 2011, 161, 24–35. [Google Scholar] [CrossRef]
  3. Macias, P.; Stelmasiak, D.; Szafranek, K. Nowcasting food inflation with a massive amount of online prices. Int. J. Forecast. 2022, 39, 809–826. [Google Scholar] [CrossRef]
  4. Yim, S.T.; Son, J.C.; Lee, J. Spread of E-commerce, prices and inflation dynamics: Evidence from online price big data in Korea. J. Asian Econ. 2022, 80, 101475. [Google Scholar] [CrossRef]
  5. De Haan, J.; van der Grient, H.A. Eliminating chain drift in price indexes based on scanner data. J. Econom. 2011, 161, 36–46. [Google Scholar] [CrossRef]
  6. Cavallo, A.; Rigobon, R. The Billion Prices Project: Using Online Prices for Inflation Measurement and Research. J. Econ. Perspect. 2016, 30, 151–178. [Google Scholar] [CrossRef] [Green Version]
  7. Abe, N.; Shinozaki, K. Compilation of Experimental Price Indices Using big data and Machine Learning: A Comparative Analysis and Validity Verification of Quality Adjustments; Bank of Japan Working Paper Series, 18-E-13; Bank of Japan: Tokyo, Japan, 2018. [Google Scholar]
  8. Oancea, B.; Necula, M. Web Scraping Techniques for Price Statistics—The Romanian Experience. J. IAOS 2019, 35, 657–667. [Google Scholar] [CrossRef]
  9. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780. [Google Scholar] [CrossRef]
  10. Van den Bulk, L.M.; Bouzembrak, Y.; Gavai, A.; Liu, N.; van den Heuvel, L.J.; Marvin, H.J.P. Automatic classification of literature in systematic reviews on food safety using machine learning. Curr. Res. Food Sci. 2022, 5, 84–95. [Google Scholar] [CrossRef]
  11. Santos, T.; Tariq, A.; Gichoya, J.W.; Trivedi, H.; Banerjee, I. Automatic Classification of Cancer Pathology Reports: A Systematic Review. J. Pathol. Inform. 2022, 13, 100003. [Google Scholar] [CrossRef]
  12. Blanz, V.; Schölkopf, B.; Bülthoff, H.; Burges, C.; Vapnik, V.N.; Vetter, V. Comparison of view-based object recognition algorithms using realistic 3D models. In Proceedings of the International Conference on Artificial Neural Networks—ICNN96, Berlin, Germany, 16–19 July 1996. [Google Scholar]
  13. Calainho, F.D.; van de Minne, A.M.; Francke, M.K. A Machine Learning Approach to Price Indices: Applications in Commercial Real Estate. J. Real Estate Financ. Econ. 2022. [Google Scholar] [CrossRef]
  14. RAMON—Reference and Management of Nomenclatures. Available online: https://ec.europa.eu/eurostat/ramon/nomenclatures/index.cfm?TargetUrl=\LST_NOM_DTL&StrNom=COICOP_2018&StrLanguageCode=EN&IntPcKey=&StrLayoutCode=HIERARCHIC (accessed on 10 August 2022).
  15. Roberson, A. Automatic Product Categorization for Official Statistics. In Proceedings of the 2019 Workshop on Widening NLP, Florence, Italy, 28 July 2019; pp. 68–72. [Google Scholar]
  16. Roberson, A. Applying Machine Learning for Automatic Product Categorization. J. Off. Stat. 2021, 37, 395–410. [Google Scholar] [CrossRef]
  17. Martindale, H.; Rowland, E.; Flower, T.; Clews, G. Semi-supervised machine learning with word embedding for classification in price statistics. Data Policy 2020, 2, e12. [Google Scholar] [CrossRef]
  18. Muller, D.M. Classification of Consumer Goods into 5-Digit COICOP 2018 Codes. Master’s Thesis, Norwegian University of Life Sciences, As, Norway, December 2021. [Google Scholar]
  19. Myklatun, K.H. Using Machine Learning in the Consumer Price Index. In Proceedings of the Nordic Statistical Meeting, Helsinki, Finland, 26–28 August 2019. [Google Scholar]
  20. Shankar, S.; Irving, L. Applying Machine Learning to Product Classification. 2011. Available online: https://cs229.stanford.edu/proj2011/LinShankar-Applying%20Machine\%20Learning%20to%20Product%20Categorization.pdf (accessed on 10 August 2022).
  21. Haynes, C.; Palomino, M.A.; Stuart, L.; Viira, D.; Hannon, F.; Crossingham, G.; Tantam, K. Automatic Classification of National Health Service Feedback. Mathematics 2022, 10, 983. [Google Scholar] [CrossRef]
  22. Ghahroodi, R.Z.; Ranji, H.; Rezaei, A. Using Machine Learning Classification Algorithms in Official Statistics. J. Stat. Sci. 2021, 15, 119–146. [Google Scholar] [CrossRef]
  23. Gweon, H.; Schonlau, M.; Kaczmirek, L.; Blohm, M.; Steiner, S. Three Methods for Occupation Coding Based on Statistical Learning. J. Off. Stat. 2017, 33, 101–122. [Google Scholar] [CrossRef] [Green Version]
  24. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  25. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  26. Shen, Y.; Wang, X.; Tan, Z.; Xu, G.; Xie, P.; Huang, F.; Lu, W.; Zhuang, Y. Parallel Instance Query Network for Named Entity Recognition, In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022.
  27. Fei, L.J.; Liu, H.; Wu, J.; Zhang, S.; Teng, M.; Ji, C.; Li, F. Unified Named Entity Recognition as Word-Word Relation Classification. Proc. AAAI Conf. Artif. Intell. 2022, 36, 10965–10973. [Google Scholar]
  28. Sparck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Doc. 1972, 28, 11–21. [Google Scholar] [CrossRef]
  29. Rajaraman, A.; Ullman, J. Data Mining. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2011; pp. 1–17. [Google Scholar]
  30. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  31. Joulin, A.; Grave, E.; Bojanovski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. arXiv 2016, arXiv:1607.01759. [Google Scholar]
  32. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1523–1543. [Google Scholar]
  33. Saraswat, M. superml: Build Machine Learning Models Like Using Python’s Scikit-Learn Library in R. R Package Version 0.5.3. 2020. Available online: url=https://CRAN.R-project.org/package=superml (accessed on 10 August 2022).
  34. Wijffels, J. word2vec: Distributed Representations of Words. R Package Version 0.3.4. 2021. Available online: https://CRAN.R-project.org/package=word2vec (accessed on 10 August 2022).
  35. Mouselimis, L. fastText: Efficient Learning of Word Representations and Sentence Classification using R. R Package Version 1.0.1. 2021. Available online: https://CRAN.R-project.org/package=fastText (accessed on 10 August 2022).
  36. Selivanov, D.; Bickel, M.; Wang, Q. text2vec: Modern Text Mining Framework for R. R package version 0.6. 2020. Available online: https://CRAN.R-project.org/package=text2vec (accessed on 10 August 2022).
  37. Mertler, C.; Vannatta, R. Advanced and Multivariate Statistical Methods, 2nd ed.; Pyrczak Publishing: Los Angeles, CA, USA, 2002. [Google Scholar]
  38. Ooi, H. glmnetUtils: Utilities for ’Glmnet’. R package version 1.1.8. 2021. Available online: https://CRAN.R-project.org/package=glmnetUtils (accessed on 10 August 2022).
  39. Xu, S. Bayesian Naïve Bayes classifiers to text classification. J. Inf. Sci. 2018, 44, 48–59. [Google Scholar] [CrossRef]
  40. Majka, M. naivebayes: High Performance Implementation of the naïve Bayes Algorithm in R. R Package Version 0.9.7. 2019. Available online: https://CRAN.R-project.org/package=naivebayes (accessed on 10 August 2022).
  41. Wu, X.; Kumar, V.; Quinlan, J.R.; Grosch, J.; Yang, Q.; Motoda, H. Top 10 algorithms in data mining. Knowl. Inf. Syst. 2008, 14, 1–37. [Google Scholar] [CrossRef] [Green Version]
  42. Therneau, T.; Atkinson, B. rpart: Recursive Partitioning and Regression Trees. R Package Version 4.1-15. 2019. Available online: https://CRAN.R-project.org/package=rpart (accessed on 10 August 2022).
  43. Kotsiani, S.B.; Tsekouras, G.E.; Pintelas, P.E. Bagging Model Tress for classification Problems. In Advances in Informatics. PCI 2005; Bozanis, P., Houstis, E.N., Eds.; Springer: Berlin/Heildeberg, Germany, 2005; Volume 3746. [Google Scholar]
  44. Meyer, D.; Dimitriadou, E.; Hornik, K.; Weingessel, A.; Leisch, F. e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R Package Version 1.7-9. 2021. Available online: https://CRAN.R-project.org/package=e1071 (accessed on 10 August 2022).
  45. Kuhn, M. caret: Classification and Regression Training. R Package Version 6.0-91. 2022. Available online: https://CRAN.R-project.org/package=caret (accessed on 10 August 2022).
  46. Quinlan, J. C4.5: Programs for Machine Learning, 1st ed.; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
  47. Hornik, K.; Buchta, C.; Zeileis, A. Open-Source Machine Learning: R Meets Weka. Comput. Stat. 2009, 24, 225–232. [Google Scholar] [CrossRef] [Green Version]
  48. Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  49. Kuhn, M.; Quinlan, R. C50: C5.0 Decision Trees and Rule-Based Models. R Package Version 0.1.6. 2022. Available online: https://CRAN.R-project.org/package=C50 (accessed on 10 August 2022).
  50. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  51. Wright, M.N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J. Stat. Softw. 2017, 77, 1–17. [Google Scholar] [CrossRef] [Green Version]
  52. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  53. Haykin, S. Neural Networks and Learning Machines; Pearson Education: New York, NY, USA, 2009. [Google Scholar]
  54. Venables, W.N.; Ripley, B.D. Modern Applied Statistics with S, 4th ed.; Springer: New York, NY, USA, 2002. [Google Scholar]
  55. Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef] [Green Version]
  56. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016. [Google Scholar]
  57. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. xgboost: Extreme Gradient Boosting. R Package Version 1.5.2.1. 2022. Available online: https://CRAN.R-project.org/package=xgboost (accessed on 10 August 2022).
  58. Van der Maaten, L.J.P.; Hinton, G.E. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  59. Krijthe, J.H. Rtsne: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation. 2015. Available online: https://github.com/jkrijthe/Rtsne (accessed on 10 August 2022).
  60. Hand, D.J.; Till, R.J. A simple generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach. Learn. 2001, 45, 171–186. [Google Scholar] [CrossRef]
  61. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756. [Google Scholar]
Figure 3. A schematic view of the classification pipeline.
Figure 4. 2D visualization for the product name-embedding.
Figure 6. The performance of the classification versus the number of features.
Figure 7. The execution time for the embedding process and the training time for logistic regression (LR), random forest (RF), and support vector machines (SVMs).
Figure 8. The classification process using the Word2Vec embedding method.
Figure 9. The confusion matrix for SVM with a radial kernel using the FastText skip-gram vectorization.
Figure 10. The confusion matrix for random forest using the FastText skip-gram vectorization.
Figure 11. The confusion matrix for logistic regression using the FastText skip-gram vectorization.
Table 2. Descriptive statistics for the words in the dataset.

Dataset | Number of Words | Number of Unique Words | Average Word Length in Chars (Std. Dev.) | Min. Word Length (in Chars) | Max. Word Length (in Chars)
Entire dataset | 41,985 | 4536 | 4.86 (2.98) | 1 | 24
Training dataset | 29,395 | 3669 | 4.86 (2.98) | 1 | 24
Testing dataset | 12,590 | 2129 | 4.85 (2.99) | 1 | 21
Table 3. Descriptive statistics for the product names.

Dataset | Average Number of Words (Std. Dev.) | Minimum Number of Words | Maximum Number of Words
Entire dataset | 14.71 (6.34) | 2 | 38
Training dataset | 14.72 (6.42) | 2 | 38
Testing dataset | 14.71 (6.17) | 2 | 34
Table 4. Count vectorization.

Product Name | 000 | Cakes | Flour | For | From | Sponge | Superior | Wheat | White
P1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1
P2 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1
Table 6. Performance metrics for classification problems.

Metric | Formula
Accuracy | (TP + TN) / (P + N)
Recall | TP / (TP + FN)
Precision | TP / (TP + FP)
F1-score | TP / (TP + (FP + FN) / 2)
Where P = TP + FN and N = FP + TN.
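The table above translates directly into a few lines of R for the multi-class case. The sketch below is illustrative (hypothetical factors pred and truth), and the weighted F1 shown here uses the common convention of weighting each per-class F1 by the class prevalence, which may differ slightly from the exact weighting used in the study.

```r
# Compute accuracy, per-class F1, macro F1 and prevalence-weighted F1
# from a multi-class confusion matrix.
cm <- table(Predicted = pred, Truth = truth)

tp <- diag(cm)
fp <- rowSums(cm) - tp
fn <- colSums(cm) - tp

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- tp / (tp + 0.5 * (fp + fn))

accuracy    <- sum(tp) / sum(cm)
macro_f1    <- mean(f1, na.rm = TRUE)
weighted_f1 <- sum(f1 * colSums(cm) / sum(cm), na.rm = TRUE)
```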
Table 9. The maximum values of the performance metrics (all classification methods combined with FastText skip-gram).

Classification Method | Number of Features | Accuracy | F1 | Weighted F1
LR | 235 | 0.998 | 0.999 | 0.999
RF | 145 | 0.998 | 0.999 | 0.999
SVMs with a Radial kernel | 135 | 1 | 1 | 1
Table 10. The slope of the execution time versus the number of features.

Classification Method | Slope Value | Std. Dev.
logistic regression | 0.022 | 0.0011
random forest | 3.84 | 0.0737
SVM with Radial kernel | 0.79 | 0.0100
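The slopes reported in Table 10 can be obtained by a simple linear regression of the training time on the number of embedding features; the sketch below assumes a hypothetical data frame timings with columns features and seconds collected for one classifier.

```r
# Estimate the execution-time slope (seconds per additional feature).
fit <- lm(seconds ~ features, data = timings)
coef(summary(fit))["features", c("Estimate", "Std. Error")]
```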
Table 11. The processing time for embedding and training (all classification methods combined with FastText skip-gram).

Classification Method | No. Features | Embedding Time (s) | Embedding Time (% of Total) | Training Time (s) | Training Time (% of Total) | Total Time (s)
logistic regression | 235 | 83.25 | 91.20% | 8.04 | 8.8% | 98.29
random forest | 145 | 52.68 | 7.88% | 616.15 | 92.12% | 668.83
SVM with Radial kernel | 135 | 46.51 | 24.25% | 145.28 | 75.75% | 191.77
