Applied Sciences
  • Article
  • Open Access

11 March 2023

Multilabel Text Classification with Label-Dependent Representation

1 Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Valparaiso 2362807, Chile
2 Departamento de Ingeniería Informática, Universidad Técnica Federico Santa María, Valparaiso 2390123, Chile
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
This article belongs to the Special Issue New Techniques of Machine Learning and Deep Learning in Text Classification

Abstract

Assigning predefined classes to natural language texts, based on their content, is a necessary component of many tasks in organizations. This task is carried out by classifying documents within a set of predefined categories using models and computational methods. Text representation for classification purposes has traditionally been performed using a vector space model, owing to its good performance and simplicity. Moreover, multilabel text classification has typically been approached either with single-label classification methods, which require transforming the problem so that binary techniques can be applied, or by adapting binary algorithms. Over the previous decade, text classification has been extended using deep learning models. Compared to traditional machine learning methods, deep learning avoids rule design and feature selection by humans, and automatically provides semantically meaningful representations for text analysis. However, deep learning-based text classification is data-intensive and computationally complex. Interest in deep learning models does not rule out techniques and models based on shallow learning, particularly when the set of training cases and the set of features are small. White box approaches have advantages over black box approaches, notably the feasibility of working with relatively small data sets and the interpretability of the results. This research evaluates a weighting function of the words in texts to modify the representation of the texts during multilabel classification, using a combination of two approaches: problem transformation and model adaptation. This weighting function was tested on 10 reference textual data sets and compared with alternative techniques based on three performance measures: Hamming Loss, Accuracy, and macro-F1. The greatest improvement occurs in macro-F1 when the data sets have fewer labels, fewer documents, and smaller vocabulary sizes. In addition, performance improves on data sets with higher cardinality, density, and diversity of labels. This demonstrates the usefulness of the function on smaller data sets. The results show improvements of more than 10% in terms of macro-F1 for classifiers based on our method in almost all of the cases analyzed.

1. Introduction

In the age of information explosion, processing and classifying enormous amounts of text data manually is time-consuming, and automating this task with computational methods is a huge challenge. Furthermore, the performance of manual text classification can be easily influenced by human factors such as experience and fatigue. This motivates the use of machine learning methods to speed up text classification and to obtain less subjective and more reliable results. In addition, this can also aid in improving efficiency in information retrieval and in alleviating the problem of information overload when locating required information.
Problems related to multilabel text classification exist in different domains. Even though basic models normally assume the existence of two classes, they have been extended to problems with more than two classes and to multilabel problems, which are closer to real applications.
Although there has been an exponential increase in the number of publications based on deep learning models in recent years, it is not possible to completely rule out techniques and models based on shallow learning. Today there is a debate between the "white box" and "black box" approaches, with advantages and disadvantages on both sides. Shallow learning highlights the feasibility of working with relatively small data sets, and interpretability; deep learning models stand out for their robustness and good performance.
The aim of this research is to evaluate a new term weighting function called relevance frequency for a label (rfl), initially introduced in [1] for tf-rfl, and extended and deepened in [2] as bin-rfl. This paper presents greater maturity than the previous versions, which were presented at conferences and in journals; it has a better foundation, a better description of the proposed modification, and an improved analysis of the results. Likewise, the analysis of relevant characteristics in the data sets is deepened for a better understanding of the method and the choice of representation. The impact of the representation is shown by considering different performance measures for multilabel classification problems. For this, two types of linear classifiers commonly used in this type of problem were employed: Linear Support Vector Machines (SVM) and one-layer artificial neural networks (ANN). The use of linear classifiers allows the improvements in the performance of the algorithms to be evaluated just by modifying the input space by means of the new representation.
The contribution of this research lies in the proposal of a simple and interpretable representation that combines ensemble machine learning and shallow classification models. The aim of this new representation is improved classifier performance. Testing of the proposed representation was carried out on ten multilabel text data sets that are widely referenced in the literature, obtaining several alternative performance measurements.
This document is structured as follows. In Section 1, the subject is introduced; in Section 2, the state of the art is presented. In Section 3, the proposal is presented, and in Section 4, the applied methodological framework is described. The experimental results are described in Section 5, where the results are discussed and the performance of the proposal is compared with other models. Section 6 presents the final conclusions and future work.

3. Label-Dependent Representation

Although in recent years there has been growing interest of the scientific community in deep learning models, there is a debate between the "white box" and "black box" approaches, with advantages and disadvantages on both sides. Shallow learning highlights the feasibility of working with relatively small data sets, and interpretability; deep learning models stand out for their good performance and robustness. Therefore, it is not possible to completely rule out techniques and models based on shallow learning, especially when the set of training cases does not have a large volume of data and the set of features is not very extensive.
As already mentioned, this approach uses classification methods based on shallow learning, combining representation modification with problem transformation. Although this research does not use classification methods based on deep learning, such as BERT and its variants, this approach could also be used to modify the characteristics of the representation of the original texts and to generate new layers or inputs for deep learning methods, especially when working with small data sets. This would also give better interpretability to the deep learning models.
On the other hand, in the case of the problem of multilabeled texts, the classification models must deal with data sets with a high cardinality, density, and diversity of labels. Cardinality measures the average number of labels associated with each document, density is the cardinality divided by the number of labels, and diversity represents the percentage of label sets present in the corpus divided by the number of possible label sets. In this work, we use data sets with a cardinality of between 1.18 and 3.28, a density of between 0.014 and 0.098, and a diversity of between 0.041 and 0.442.
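These three statistics can be computed directly from a binary label-indicator matrix. The following minimal Python sketch is our own illustration, not code from the paper; it assumes the convention that the number of possible label sets is $2^{|L|}$, and all names are hypothetical.

```python
import numpy as np

def label_statistics(Y: np.ndarray) -> dict:
    """Y[i, j] = 1 if document i carries label j."""
    n_docs, n_labels = Y.shape
    cardinality = Y.sum(axis=1).mean()           # average labels per document
    density = cardinality / n_labels             # cardinality / number of labels
    distinct = {tuple(row) for row in Y}         # label sets seen in the corpus
    diversity = len(distinct) / (2 ** n_labels)  # seen sets / possible sets
    return {"cardinality": cardinality, "density": density, "diversity": diversity}

# Toy example: 4 documents, 3 labels.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 1]])
print(label_statistics(Y))  # cardinality 1.75, density ~0.583, diversity 0.375
```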
In this section, the well-known tf-idf [39] representation is explained, and our proposed function for the weighting of terms, rfl, is presented. Based on the latter, we propose two new representations: one based on the Multivariate Bernoulli Model, called bin-rfl, and another based on the Multinomial Model, called tf-rfl. In the latter type of representation, the indicator $f_{t,d}$ can take values between zero and one ([0, 1]); it is called the Multinomial Model by [36] and differs from the Multivariate Bernoulli Model, where the indicator is $bin_{t,d}$, which takes the value one when the term $t$ occurs at least once in the document $d$; that is, it can only take the value zero or one ({0, 1}). The factor based on the Multivariate Bernoulli Model is called a Binary Representation or Boolean Model. Many problems, either by their nature or by the measurements that can be obtained from them, use the representation model based on the Multivariate Bernoulli Model.
This raises the hypothesis that a supervised modification of the text representation that considers frequency or binary representations, together with a function for the supervised weighting of the terms based on the known examples and their labels, could significantly improve the performance of the classifiers. By supervised modification we mean a modification of the representation based on the analysis of the labeled examples of the training set, which is why it is supervised. For the term weighting method for multilabel problems, we will use as variables those described in Table 4: $a_{t,\lambda_j}$, the number of documents in the category $\lambda_j$ containing the term $t$, and $d_{t,\lambda_j}$, the number of documents in the category $\lambda_j$ that do not contain the term $t$.
Table 4. Variables used for weighting in a multilabel problem, given a term t and 4 categories.

3.1. Term Frequency-Inverse Document Frequency (tf-idf) Representation

According to [20], the most widely used text representation for text classification is tf-idf from [39], where each component of the vector is calculated according to Equation (1):

$$tf\text{-}idf_{t,d} = f_{t,d} \times \log_{10}\left(\frac{N}{N_t}\right), \qquad (1)$$

where $f_{t,d}$ is the frequency of the term $t$ in the document $d$. For the two-category problem, $N = a_{t,\lambda_1} + d_{t,\lambda_1} + a_{t,\lambda_2} + d_{t,\lambda_2}$ is the number of documents, and $N_t = a_{t,\lambda_1} + a_{t,\lambda_2}$ is the number of documents that contain the term $t$.
The main contribution of this representation is that it assigns less weight to the terms that are very frequent in the collection of documents, through the factor $N/N_t$.
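As an illustration of Equation (1), the following short Python sketch computes the tf-idf matrix from a small document-term count matrix; it is a minimal example of ours, not a production implementation.

```python
import numpy as np

def tf_idf(F: np.ndarray) -> np.ndarray:
    """F[d, t] = raw count of term t in document d."""
    N = F.shape[0]                          # number of documents
    N_t = (F > 0).sum(axis=0)               # documents containing each term t
    idf = np.log10(N / np.maximum(N_t, 1))  # guard against terms that never occur
    return F * idf                          # f_{t,d} * log10(N / N_t)

F = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 0]])
print(tf_idf(F))  # the rarer third term receives the largest idf weight
```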

3.2. Term Frequency-Relevance Frequency for a Label (tf-rfl) Representation

In the research carried out in [1], preliminary results for the representation relevance frequency for a label, tf-rfl, were presented as a new representation for multilabel problems. It is described by Equation (2):

$$tf\text{-}rfl_{t,d,l} = f_{t,d} \times \log_2\left(2 + \frac{a_{t,l}}{\max(1, \mathrm{mean}(a_{t,\lambda_{j \neq l}}))}\right), \qquad (2)$$

where $f_{t,d}$ is the frequency of the term $t$ in the document $d$, $a_{t,l}$ is the number of documents under the category under evaluation $l$ that contain the term $t$, and $\mathrm{mean}(a_{t,\lambda_{j \neq l}})$ is the average number of documents containing the term $t$ among the sets of documents labeled with a category other than $l$, i.e., the average over $\{a_{t,\lambda_1}, \ldots, a_{t,\lambda_{l-1}}, a_{t,\lambda_{l+1}}, \ldots, a_{t,\lambda_{|L|}}\}$.
The constant 2 inside the logarithm is assigned because the base of the logarithm is 2: without it, the logarithm could evaluate to zero and cancel the other factor. Other bases could be used for the logarithm function, which would also imply modifying the value of this constant.
The main contribution of this representation is that it assigns less weight to terms that are equally frequent across the different categories, and greater weight to terms that are more frequent in the category under evaluation.
It is also possible to use bin-idf, based on the Multivariate Bernoulli Model, instead of tf-idf, which is based on the Multinomial Model. In this case, $bin_{t,d}$ is used instead of $f_{t,d}$.
In order to evaluate the performance improvement due to the use of rfl weighting, this paper presents a new representation based on the occurrence of terms in each document; that is, a Binary or Boolean Representation. This representation, based on the Multivariate Bernoulli Model, uses less information than the one based on the Multinomial Model, since only the presence or absence of a word in the text is used, and not its frequency of appearance.
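To make the rfl weighting concrete, the sketch below (ours, with hypothetical names, not the authors' code) computes the weight inside Equation (2) from a matrix of per-label document counts; the same weight is reused by bin-rfl in the next subsection.

```python
import numpy as np

def rfl_weight(A: np.ndarray, l: int) -> np.ndarray:
    """A[t, j] = a_{t, lambda_j}: documents of label j containing term t."""
    a_tl = A[:, l]                                # a_{t,l}, label under evaluation
    others = np.delete(A, l, axis=1)              # a_{t, lambda_j} for j != l
    denom = np.maximum(1.0, others.mean(axis=1))  # max(1, mean over other labels)
    return np.log2(2.0 + a_tl / denom)            # one weight per term

# Example: 4 terms, 3 labels; weights of each term for label 0.
A = np.array([[10, 1, 0],
              [ 3, 3, 3],
              [ 0, 8, 7],
              [ 5, 0, 1]])
print(rfl_weight(A, l=0))  # terms concentrated in label 0 get the largest weights
```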

3.3. Multivariate Bernoulli Model—Label-Dependent (bin-rfl) Representation

A new representation for the multilabel problem, proposed in this work and called bin-rfl, is based on the Multivariate Bernoulli Model weighted using the relevance frequency for a label, and is calculated as in Equation (3):

$$bin\text{-}rfl_{t,d,l} = bin_{t,d} \times \log_2\left(2 + \frac{a_{t,l}}{\max(1, \mathrm{mean}(a_{t,\lambda_{j \neq l}}))}\right), \qquad (3)$$

where $bin_{t,d}$ takes the value 1 if the term $t$ is present in the document $d$ and 0 otherwise; $a_{t,l}$ is the number of documents in the category under evaluation that contain the term $t$, and $\mathrm{mean}(a_{t,\lambda_{j \neq l}})$ is the average number of documents containing the term $t$ in each set of documents labeled with a category other than $l$. This new representation helps to better distinguish the terms, which is reflected in better classification performance, as will be seen in Section 5.
The term weighting method considers the frequency of occurrence of each term within each group of documents whose labels differ from those of the document under evaluation. The rfl occurrence measurement, $\mathrm{mean}(a_{t,\lambda_{j \neq l}})$, will be larger if the term $t$ appears with higher frequency in documents with label $l$ than in documents with other labels $\lambda_{j \neq l}$. Moreover, the occurrence measurement will be lower if the term $t$ appears with higher frequency in documents with labels other than $l$. Therefore, the rfl weighting results in a better discriminator among categories.
The modification of the representation based on this method gives the binary classifiers that evaluate whether each text should be assigned label $l$ better information with which to recognize patterns, and allows each classifier to specialize in its label $l$.
This research proposes a representation method based on bin-rfl and tf-rfl, together with binary classifiers based on the Binary Relevance and Label Powerset problem transformations. The method transforms the multilabel problem into binary problems, then generates bin-rfl representations for each label in each document $d$ and classifies them. Each document is represented with a different vector when evaluating each label, owing to the label dependency of the weighting factor.
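A minimal sketch of this per-label scheme, under the Binary Relevance transformation, is shown below; it reuses the hypothetical rfl_weight helper from the previous sketch and is our illustration under stated assumptions, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_bin_rfl_ensemble(B, Y, A):
    """B: binary document-term matrix; Y: label-indicator matrix;
    A[t, j]: training documents of label j containing term t."""
    models = []
    for l in range(Y.shape[1]):
        X_l = B * rfl_weight(A, l)                    # label-dependent bin-rfl input
        models.append(LinearSVC().fit(X_l, Y[:, l]))  # one binary SVM per label
    return models

def predict_bin_rfl_ensemble(models, B, A):
    # Each document is re-represented for every label it is evaluated against.
    cols = [m.predict(B * rfl_weight(A, l)) for l, m in enumerate(models)]
    return np.column_stack(cols)
```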

3.4. Probabilistic Interpretation

A probabilistic interpretation of this representation is that $f_{t,d}$ is an estimate of $P(t_i \mid d_j)$, that is, of the probability that the term $i$ appears in the document $j$. Likewise, the weighting $idf = \log_{10}(N/N_t)$ is a function of $1/P(t_i \mid N)$, that is, of the probability $P(t_i)$ that the term $i$ appears in the document set. So, the tf-idf function is given by:

$$tf\text{-}idf_{t,d} = P(t_i \mid d_j)\,\log\left(\frac{1}{P(t_i \mid N)}\right). \qquad (4)$$
Note that the idf weighting factor does not take into account that documents may have multiple categories.
For the weighting function rfl, it can be considered an estimate of $P(t_i \mid N_l)/P(t_i \mid N_{j \neq l})$, that is, of the probability that the term $i$ appears in the set of documents labeled with the label $l$, over the probability that it appears in the set of documents labeled with a label other than $l$. So, the function tf-rfl can be represented as:

$$tf\text{-}rfl_{t,d,l} = P(t_i \mid d_j)\,\log\left(\frac{P(t_i \mid N_l)}{P(t_i \mid N_{j \neq l})}\right). \qquad (5)$$

With this, a term weighting function rfl is proposed that addresses the classification problem with multiple labels, something that idf does not consider.

3.5. Ensemble Interpretation

Following the taxonomy proposed by [42], it can be argued that our proposal introduces greater diversity through two routes. First, through the manipulation of the training set: information from the domain of the labels is incorporated for each member of the ensemble by having each member process different, previously manipulated inputs. Second, through the specialization of each member of the committee of classifiers in one of the labels $l$ of the training set.
The rfl representation modifies the training sets by incorporating information about the features that differentiate the instance sets of different labels. In turn, each classifier uses these examples through binary classifications of each label l: belongs or does not belong.
The following scheme represents our proposal as an ensemble:
Figure 1 outlines how each text is modified according to the label that will be submitted for evaluation. In this way, before each text is input to a classifier, it will be submitted to a supervised modification and adjusted to the label under evaluation.
Figure 1. View as an Ensemble.

3.6. Geometric Interpretation

Based on the Vector Space Model, we can interpret that each feature of a document is represented as a dimension of the feature vector.
The rfl modification considers the relationship between the features of the documents belonging to the label under classification and the features of the documents belonging to other labels, different from the l label under evaluation.
By applying the weight $a_{t,l}/\mu(a_{t,\lambda_{j \neq l}})$, the value corresponding to the feature $t$ in the vector is increased when that feature occurs more often in the set of documents labeled $l$ than in the rest of the labels, and decreased when $t$ occurs less often in $l$ than in the rest of the labels.
The geometric interpretation is that the value in dimension $t$ of the vector increases; that is, the vector moves away from the vectors corresponding to the other labels. This is exemplified in the following graph.
Figure 2 shows how, under the logic of the Vector Space Model, every time a representation is modified according to the label against which it will be evaluated, the texts "move away" from each other, facilitating the search for the separating hyperplane.
Figure 2. Geometric interpretation.

4. Experimental Method

This section presents the classification method, the data sets, and the performance measurements of the evaluation.

4.1. Classification Method

The computational experiments considered 10 widely available data sets. Firstly, each multilabel data set was preprocessed and converted into single-label data sets using the Binary Relevance and Label Powerset transformations, and new representations were obtained for each transformed data set. Secondly, the newly generated data sets were classified by means of binary machine learning techniques. Thirdly, the classification performance was analyzed. The whole procedure is presented graphically in Figure 3.
Figure 3. Text processing flow.
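To complement the Binary Relevance sketch given earlier, the following minimal Python example (ours, with hypothetical names) shows the other transformation in this flow, Label Powerset, which maps each distinct label set to one class of a single multiclass problem.

```python
import numpy as np
from sklearn.svm import LinearSVC

def label_powerset_fit(X, Y):
    """X: feature matrix; Y: binary label-indicator matrix."""
    keys = [tuple(row) for row in Y]
    classes = sorted(set(keys))                   # one class per distinct label set
    to_class = {k: i for i, k in enumerate(classes)}
    y_lp = np.array([to_class[k] for k in keys])  # multiclass targets
    return LinearSVC().fit(X, y_lp), classes      # a single multiclass linear SVM

def label_powerset_predict(clf, classes, X):
    # Map predicted classes back to their original label sets.
    return np.array([classes[c] for c in clf.predict(X)])
```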

4.2. Data Sets

There are many standardized data sets for testing models; the ten multilabel textual data sets used here are: REUTERS-21578, OHSUMED, ENRON, SLASHDOT, LANGLOG, BIBTEX, TMC, Yahoo Education, Yahoo Science, and MEDICAL. For REUTERS-21578, which is a set of news texts, a modified subset proposed in [30] was considered in order to obtain comparable performance measures. The OHSUMED data set is a partition of the MEDLINE database, a library of scientific articles published in medical journals; the OHSUMED collection has also been reduced from 50,216 to 13,929 texts, and this subset contains the 10 most representative categories of the original 23. The Enron data set is a collection of texts created by the CALO (Cognitive Assistant that Learns and Organizes) project, containing 1702 email messages and 52 categories. The Medical data set was created by the Computational Medicine Center for its 2007 Language Processing Challenge; it contains 978 clinical texts of radiology reports and considers 45 categories of medical codes. TMC2007 is a subset of the Aviation Safety Reporting System data set. Finally, we use real web pages linked from the "yahoo.com" domain, specifically the "Science" and "Education" sets. Table 5 presents the characteristics of the preprocessed data sets.
Table 5. Characteristics of the Preprocessed Data Set. Cardinality (Card) measures the average number of labels associated with each document. Density (Dens) is defined as the cardinality divided by the number of labels. The Diversity (Div) represents the percentage of label sets present in the set divided by the number of possible label sets. Vocabulary Size considers the volume of distinct words.

4.3. Performance Measures

Traditional evaluation measures such as the F measure, Hamming Loss, and Accuracy are useful in the case of multilabeled sets.
To describe the performance measures, the following notation is used: let $Y_i \in \{0,1\}^{|L|}$, for $i = 1, \ldots, d$, be the vector of true labels of document $i$, where label $j$ is relevant if $y_{i,j} = 1$, and let $\hat{y}_{i,j}$ denote the prediction of the classifier, where $d$ is the number of documents and $|L|$ is the number of possible labels.
Based on the notation above, Hamming Loss is defined as in Equation (6):
$$\mathrm{Hamming\text{-}Loss}(Y, \hat{Y}) = \frac{1}{d}\,\frac{1}{|L|} \sum_{i=1}^{d} \sum_{j=1}^{|L|} y_{i,j}\,\Delta\,\hat{y}_{i,j}, \qquad (6)$$

where $y_{i,j}\,\Delta\,\hat{y}_{i,j}$ represents the symmetric difference between the label assigned by the classifier and the actual label. This measure quantifies the difference between the labels that the texts actually have and the labels that the classifier assigned to them. The lower the value obtained, the better the performance.
Another multilabel measure is the label set accuracy (Accuracy), defined as in Equation (7):
$$\mathrm{Accuracy}(Y, \hat{Y}) = \frac{1}{d} \sum_{i=1}^{d} \mathbf{1}(y_i = \hat{y}_i). \qquad (7)$$
This measure averages, over all texts, whether the labels were correctly assigned. In multilabel classification, the function returns the subset accuracy: if the entire set of predicted labels for a sample strictly matches the actual set of labels, the score for that sample is 1; otherwise, it is 0. The higher the returned value, the better the performance.
The F measure, commonly used in information retrieval, is very popular in multilabel text classification. The F measure is the harmonic mean of precision and recall, and for each label, $F_1$ is calculated as shown in Equation (8):

$$F_1(Y_i, \hat{Y}_i) = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad (8)$$

where precision is the fraction of the predicted labels that are actually relevant, and recall is the fraction of the actually relevant labels that are predicted. The higher the value of F, the better the performance.
For the multilabel case, it is necessary to combine the $F_1$ values obtained for each label. For that we use macro-$F_1$, the average of $F_1$ over all labels.
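All three measures are available for binary indicator matrices in scikit-learn; the snippet below is a small sanity check of ours (placeholder matrices, not the paper's data), where accuracy_score computes exactly the subset (exact-match) accuracy of Equation (7).

```python
import numpy as np
from sklearn.metrics import hamming_loss, accuracy_score, f1_score

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

print(hamming_loss(Y_true, Y_pred))               # fraction of wrong labels, ~0.222
print(accuracy_score(Y_true, Y_pred))             # exact-match ratio, ~0.333
print(f1_score(Y_true, Y_pred, average="macro"))  # mean F1 over labels, ~0.556
```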

5. Results and Discussion

In order to compare the effects of using the rfl function to modify the representation, we carried out different classification experiments using 10 data sets widely used in the literature (Reuters, Ohsumed, Enron, Slashdot, Langlog, Bibtex, Medical, TMC2007, Science, and Education), using bin-rfl and bin-idf representations, both with the Binary Relevance and Label Powerset transformations, and with two different linear classifiers (SVM and ANN). The objective of using these linear classifiers was to evaluate the modification independently of the classification model.
The impact of modifying the representation can be assessed using shallow learning models, which work well under conditions of limited computational complexity, as they do not require prior domain knowledge or experience to extract features from the original text. In turn, smaller data sets of the multilabel classification problem are used, which are widely known in the literature. These sets had already been preprocessed in the standard way.
Table 6, Table 7, Table 8, Table 9, Table 10 and Table 11 show the different methods and their performances in terms of the different performance measures described previously.
Table 6. Experimental results of different transformations of the problem (PT: BR and LP), and Representations with SVM in terms of Accuracy.
Table 7. Experimental results of different transformations of the problem (PT: BR and LP), and Representations with ANN in terms of Accuracy.
Table 8. Experimental results of different transformations of the problem (PT: BR and LP), and Representations with SVM in terms of Hamming Loss.
Table 9. Experimental results of different transformations of the problem (PT: BR and LP), and Representations with ANN in terms of Hamming Loss.
Table 10. Experimental results of different transformations of the problem (PT: BR and LP), and Representations with SVM in terms of macro-F1.
Table 11. Experimental results of different transformations of the problem (PT: BR and LP), and Representations with ANN in terms of macro-F1.
Regarding the classifiers, the results of the SVMs contrast with those of the ANNs. Binary Relevance in general performs better than Label Powerset, except when the evaluation is in terms of Accuracy, where some data sets perform better with LP than with BR.
Regarding the representation, in almost all cases the bin-rfl representation shows improvements over bin-idf. As shown in Table 6 and Table 7, an average improvement of over 15% (with SVM) and 40% (with ANN) is obtained in terms of Accuracy. Similarly, improvements of 12% in terms of Hamming Loss are obtained with ANN, as shown in Table 8 and Table 9.
In [24], the Classification Neural Networks (CNN) model was used and tested on the Enron, Medical, and Science data sets, among others, allowing a comparison in terms of Hamming Loss. Values of 0.046 for Enron, 0.013 for Medical, and 0.031 for Science are reported there. In this comparison, our proposal obtains Hamming Loss values of 0.039, 0.013, and 0.027, respectively; it outperforms CNN on Enron and Science and matches it on Medical.
Finally, as can be seen in Table 10 and Table 11, the performance improvements in terms of macro-F1 average 40% (with SVM) and 50% (with ANN) when using the bin-rfl representation instead of bin-idf with the Binary Relevance transformation.
Here, it can also be mentioned that [29] reports performance measures for their PNML proposal on data sets also used in this research: Science, Education, Enron, and Bibtex, achieving macro-F1 values of 0.298, 0.31, 0.262, and 0.418, respectively. In this comparison, our proposal achieves macro-F1 values of 0.461, 0.285, 0.319, and 0.423, respectively. From the above, it can be seen that shallow learning models can deliver better results than some deep learning models, in three of the four data sets compared.
To present the impact of the rfl function on the experimental results, Figure 4 graphically shows that, in almost all cases, the bin-rfl representation yields significant improvements relative to bin-idf. The improvement percentage is calculated as the difference between the metric obtained with the new representation and that obtained with the old one, divided by the latter. It can be seen from the figure that the improvements, in many cases, are greater than 20% in terms of macro-F1.
Figure 4. Percentage performance improvement in terms of macro-F1.
In order to analyze the relationship between the performance improvements introduced by the rfl function in the different evaluated metrics (Hamming Loss, Accuracy, and macro-F1) and the different characteristics of the document sets analyzed (number of labels, number of documents, number of terms in the vocabulary, cardinality, density, and diversity), a correlation analysis was carried out, treating the output of each classifier and the input characteristics as variables; the relationships identified are explained below. A correlation coefficient close to +1 or −1 indicates a positive (+1) or negative (−1) correlation between variables: a positive correlation means that if the values of one variable increase, the values of the other also increase, whereas a coefficient close to 0 indicates no correlation or a weak one.
Recall that the cardinality metric is calculated as the average number of labels per document, density as the cardinality divided by the total number of labels, and diversity as the percentage of label sets present in the document set divided by the number of possible label sets.
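The correlation analysis itself reduces to computing Pearson coefficients between the per-data-set improvement and each characteristic; the sketch below illustrates the computation with placeholder values (random numbers of ours, not the paper's measurements).

```python
import numpy as np

rng = np.random.default_rng(0)
improvement = rng.normal(size=10)   # e.g., macro-F1 gain of bin-rfl over bin-idf
traits = rng.normal(size=(10, 6))   # labels, docs, vocab, card, dens, div per set

for name, col in zip(["labels", "docs", "vocab", "card", "dens", "div"], traits.T):
    r = np.corrcoef(improvement, col)[0, 1]  # Pearson r in [-1, +1]
    print(f"{name}: r = {r:+.2f}")
```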
First, the relationship with the Hamming Loss metric was analyzed, as shown in Figure 5. In this analysis, it was possible to identify an inverse correlation between the use of SVM with the Label Powerset transformation and the number of labels and the diversity of labels. In addition, a direct correlation exists between this transformation and the number of documents, vocabulary size, label density, and label diversity. Likewise, there is an inverse correlation between the use of ANN with the Label Powerset transformation and the cardinality and diversity of the document set.
Figure 5. Correlation between performance improvements in terms of Hamming Loss and the different characteristics of the data set (number of labels, number of documents, number of vocabulary terms, cardinality, density, and diversity).
Secondly, the relationship with the Accuracy metric was analyzed; as shown in Figure 6, it presents an inverse correlation with the number of documents and a direct correlation with the diversity of labels. It can also be seen that SVM with the Label Powerset transformation obtains better performance with fewer documents, a smaller vocabulary, and lower values of label cardinality and density.
Figure 6. Correlation between performance improvements in terms of Accuracy and the different characteristics of the data set (number of labels, number of documents, number of vocabulary terms, cardinality, density, and diversity).
Third, as shown in Figure 7, the relationship between the macro-F1 metric and the different document sets was analyzed. For this performance measure, it is possible to identify a negative correlation of the two classifiers (SVM and ANN) with the two problem transformations (BR and LP) with respect to the number of labels, the number of documents, and the size of the vocabulary. Likewise, a direct correlation with the cardinality, density, and diversity of labels is present. This can be interpreted as follows: the smaller the number of documents or the vocabulary, the greater the improvement introduced by the rfl function. Additionally, the greater the label cardinality, the label diversity and, to a lesser extent, the label density, the greater the improvement introduced by the rfl function in the macro-F1 measure.
Figure 7. Correlation between the performance improvements in terms of macro-F1 and the different characteristics of the data set (number of labels, number of documents, number of vocabulary terms, cardinality, density, and diversity).
To evaluate the results, as in [1], a two-tailed paired t-test at the 5% significance level was applied. According to these results, the Binary Relevance transformation with ANN and bin-rfl is better than Binary Relevance with ANN and bin-idf in all measures (p = 0.0103 for Accuracy, p = 0.0491 for Hamming Loss, and p = 0.0078 for F1). The p values shown in parentheses provide additional quantification of the significance level.
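The form of this test is illustrated below with SciPy; the scores are placeholders of ours, not the values from the experiments.

```python
from scipy.stats import ttest_rel

# Per-data-set macro-F1 of the two configurations (placeholder values).
scores_rfl = [0.46, 0.29, 0.32, 0.42, 0.51, 0.38, 0.44, 0.35, 0.40, 0.48]
scores_idf = [0.40, 0.27, 0.28, 0.41, 0.45, 0.33, 0.39, 0.30, 0.36, 0.42]

t_stat, p_value = ttest_rel(scores_rfl, scores_idf)  # two-tailed paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")        # significant at 5% if p < 0.05
```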

6. Conclusions and Future Scope

6.1. Conclusions

The growth of interest in deep learning models does not rule out techniques and models based on shallow learning, especially when the set of training cases is smaller and the set of features is not very extensive. The "white box" approaches have some advantages over the "black box" approaches, especially the feasibility of working with relatively small data sets and the interpretability of the results; in some fields of application, these issues are fundamental.
Classification with multiple labels is an important topic in information retrieval and machine learning, which has become more relevant in recent years. Text representation and classification have traditionally been handled using tf-idf, due to its simplicity and good performance. However, the tf-idf representation does not take into account that the examples may have different labels. The latter is very relevant in data sets with high cardinality and label diversity.
Changes in the input representation of classifiers can use knowledge about the problem, its domain, a particular label, or the category to which the document belongs. The rfl function can be applied to a particular problem directly, without complex problem transformations, using the information from the examples and their different labels.
In this work, we have introduced the rfl function to build new text representations for the multilabel classification approach. This function allows for discriminating the terms that best describe a category, in contrast to other categories, thus taking advantage of the characteristics of the domain of the documents that make up the corpus.
This proposal was evaluated using two different linear classifiers, Artificial Neural Networks (ANN) and Support Vector Machines (SVM), with the aim of evaluating the impact of the function on simple classifiers. In turn, the impact was evaluated on 10 different sets of texts, corresponding to medical scientific articles, journalistic documents, medical diagnostic reports, email messages, and web pages. A comparison with bin-idf was made, and two transformations of the multilabel problem were used (Binary Relevance and Label Powerset).
The function improves performance in almost all cases when using the Binary Relevance transformation and Support Vector Machines. Only for the Hamming Loss measure was it better to use Label Powerset with Support Vector Machines.
The greatest impact of using the rfl function occurs on the macro-F1 performance metric when the data sets have fewer labels, fewer documents, and smaller vocabulary sizes. In addition, this measure improves on data sets with higher cardinality, density, and diversity of labels. This reflects the utility of the function on smaller data sets.
We believe that the contribution of the rfl function, when used as a weighting factor to modify the multilabel representation, is due to a better resolution of the considered problem, since it is capable of better identifying the terms in the documents, which is reflected in better performance of the classification models. From the perspective of machine learning applications and the increasing rate of their adoption in industry, one must consider the need to develop computationally lightweight models that can be implemented under technological conditions affordable for companies of different sizes.

6.2. Future Scope

In future studies, we plan to use the rfl function for feature selection, i.e., for identifying the most discriminative attributes. In addition, other representations, e.g., Part of Speech or N-grams, or representations based on other probability distributions, could be used to construct a label-dependent representation.
We will also take an in-depth look at the impact of the rfl function on the performance of non-linear classifiers, such as Random Forest and Decision Tree. Previous results show important improvements with these non-linear classifiers, and the challenge is to understand how they exploit the changes introduced by the rfl function to improve their performance.
We will also use the rfl function to process the outputs of more complex learning models—for example, word2vec—in order to improve performance, starting from the incorporation of label information to weight the synthesized concepts.
Another line of work is to incorporate weights into the rfl function that make it possible to address the imbalance problem, which is very common in multilabel classification. This can be achieved by adding, as a parameter of the rfl function, the number of documents for each label in relation to the total number of documents and labels.
Finally, we will use the representation to perform sentiment analysis, email classification, and other pattern recognition applications.

Author Contributions

Conceptualization, R.A.; methodology, R.A., H.A.-C. and H.A.; software, R.A.; validation, R.A., H.A.-C. and H.A.; formal analysis, R.A. and H.A.; investigation, R.A. and H.A.; resources, R.A. and H.A.; data curation, R.A.; writing — original draft preparation, R.A., H.A.-C. and H.A.; writing, R.A., H.A.-C. and H.A.; visualization, R.A.; supervision, H.A.-C. and H.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by ANID Fondef Idea I+D ID21I10206 (2021–2023) and PUCV Grants 039.406/2021 and 039.344/2022, as well as by the Applied Natural Language Processing Nucleo (NIPLNA, www.niplna.com, accessed on 8 March 2023).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Alfaro, R.; Allende, H. Text Representation in Multi-label Classification: Two New Input Representations. In Adaptive and Natural Computing Algorithms; Dobnikar, A., Lotric, U., Ster, B., Eds.; ICANNGA: Ljubljana, Slovenia, 2011. [Google Scholar]
  2. Alfaro, R.; Allende, H. Clasificación de Textos Multi-etiquetados con Modelo Bernoulli Multi-variado y Representación Dependiente de la Etiqueta. Rev. Signos 2020, 53, 549–567. [Google Scholar] [CrossRef]
  3. Ñanculef, R.; Concha, C.; Allende, H.; Candell, D.; Moraga, C. AD-SVMs: A Light Extension of SVMs for Multicategory Classification. Int. J. Hybrid Intell. Syst. 2009, 6, 69–79. [Google Scholar] [CrossRef]
  4. Yang, L.; Su, H.; Zhong, C.; Meng, Z.; Luo, H.; Li, X.; Tang, Y.Y.; Lu, Y. Hyperspectral image classification using wavelet transform-based smooth ordering. Int. J. Wavelets Multiresolution Inf. Process. 2019, 17, 1950050. [Google Scholar] [CrossRef]
  5. Guariglia, E.; Silvestrov, S. Fractional-Wavelet Analysis of Positive definite Distributions and Wavelets on D′ (ℂ). In Proceedings of the Engineering Mathematics II; Silvestrov, S., Rančić, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 337–353. [Google Scholar]
  6. Mallat, S. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693. [Google Scholar] [CrossRef]
  7. Yu, B.; Li, B. Fractal-like tree networks reducing the thermal conductivity. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 2006, 18, 066302. [Google Scholar] [CrossRef]
  8. Guariglia, E. Entropy and Fractal Antennas. Entropy 2016, 18, 84. [Google Scholar] [CrossRef]
  9. Berry, M.V.; Lewis, Z.V.; Nye, J.F. On the Weierstrass-Mandelbrot fractal function. Proc. R. Soc. Lond. A Math. Phys. Sci. 1980, 370, 459–484. [Google Scholar]
  10. Viswanathan, P.; Chand, A. Fractal rational functions and their approximation properties. J. Approx. Theory 2014, 185, 31–50. [Google Scholar] [CrossRef]
  11. Li, Q.; Peng, H.; Li, J.; Xia, C.; Yang, R.; Sun, L.; Yu, P.S.; He, L. A Survey on Text Classification: From Traditional to Deep Learning. ACM Trans. Intell. Syst. Technol. 2020, 13, 31. [Google Scholar] [CrossRef]
  12. Maron, M.E. Automatic Indexing: An Experimental Inquiry. J. ACM 1961, 8, 404–417. [Google Scholar] [CrossRef]
  13. Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
  14. Joachims, T. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms; Kluwer Academic: Dordrecht, The Netherlands, 2002. [Google Scholar]
  15. Anthes, G. Deep learning comes of age. Commun. ACM 2013, 56, 13–15. [Google Scholar] [CrossRef]
  16. Severyn, A.; Moschitti, A. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 373–382. [Google Scholar]
  17. Samir, K.; Takehisa, Y. A review on the application of deep learning in system health management. Mech. Syst. Signal Process. 2018, 107, 241–265. [Google Scholar]
  18. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215. [Google Scholar] [CrossRef] [PubMed]
  19. Zeng, J.; Ustun, B.; Rudin, C. Interpretable classification models for recidivism prediction. J. R. Stat. Soc. Ser. (Stat. Soc.) 2016, 180, 689–722. [Google Scholar] [CrossRef]
  20. Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. 2002, 34, 1–47. [Google Scholar] [CrossRef]
  21. Tsoumakas, G.; Katakis, I. Multi label classification: An overview. Int. J. Data Wareh. Min. 2007, 3, 1–13. [Google Scholar] [CrossRef]
  22. Lee, S.; Jiang, J. Multilabel text categorization based on fuzzy relevance clustering. Fuzzy Syst. IEEE Trans. 2014, 22, 1457–1471. [Google Scholar] [CrossRef]
  23. Nam, J.; Kim, J.; Mencía, E.; Gurevych, I.; Fürnkranz, J. Large-scale multi-label text classification -Revisiting Neural Networks. In Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2014; pp. 437–452. [Google Scholar]
  24. Giunchiglia, E.; Lukasiewicz, T. Multi-Label Classification Neural Networks with Hard Logical Constraints. arXiv 2021, arXiv:2103.13427v1. [Google Scholar] [CrossRef]
  25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  26. Pal, A.; Selvakumar, M.; Sankarasubbu, M. MAGNET: Multi-Label Text Classification using Attention-based Graph Neural Network. arXiv 2020, arXiv:2003.11644. [Google Scholar]
  27. Murawaki, Y. Global model for hierarchical multi-label text classification. In Proceedings of the International Joint Conference on Natural Language Processing, Guangzhou, China, 24–26 March 2013; pp. 46–54. [Google Scholar]
  28. Liu, J.; Chang, W.C.; Wu, Y.; Yang, Y. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’17), Tokyo, Japan, 7–11 August 2017; pp. 115–124. [Google Scholar]
  29. Yang, Z.; Han, Y.; Yu, G.; Yang, Q.; Zhang, X. Prototypical Networks for Multi-Label Learning. arXiv 2020, arXiv:1911.07203. [Google Scholar]
  30. Read, J.; Pfahringer, B.; Holmes, G.; Frank, E. Classifier chains for multi label classification. Mach. Learn. 2011, 85, 333–359. [Google Scholar] [CrossRef]
  31. Fink, E. Automatic evaluation and selection of problem-solving methods: Theory and experiments. J. Exp. Theor. Artif. Intell. 2004, 16, 73–105. [Google Scholar] [CrossRef]
  32. Kadhim, A.I. Term weighting for feature extraction on Twitter: A comparison between BM25 and TF-IDF. In Proceedings of the International Conference on Advanced Science and Engineering (ICOASE), Duhok, Iraq, 2–4 April 2019; pp. 124–128. [Google Scholar]
  33. Chatterjee, A.; Gupta, U.; Chinnakotla, M.K.; Srikanth, R.; Galley, M.; Agrawal, P. Understanding emotions in text using deep learning and Big Data. Comput. Hum. Behav. 2019, 93, 309–317. [Google Scholar] [CrossRef]
  34. Keikha, M.; Razavian, N.; Oroumchian, F.; Razi, H.S. Document Representation and Quality of Text: An Analysis. En Survey of Text Mining II: Clustering, Classification, and Retrieval; Springer: Berlin/Heidelberg, Germany, 2008; pp. 135–168. [Google Scholar]
  35. Manning, C.; Schütze, H. Foundations of statistical natural language Processing; The MIT Press: Cambridge, MA, USA, 1999. [Google Scholar]
  36. McCallum, A.; Nigam, K. A comparison of event models for naive bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, WI, USA, 26–27 July 1998; pp. 41–48. [Google Scholar]
  37. Leopold, E.; Kindermann, J. Text categorization with support vector machines. How to represent texts in input space? Mach. Learn. 2002, 46, 423–444. [Google Scholar] [CrossRef]
  38. Lan, M.; Tan, C.L.; Su, J.; Lu, Y. Supervised and traditional term weighting methods for automatic text categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 721–735. [Google Scholar] [CrossRef]
  39. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. Int. J. 1988, 24, 513–523. [Google Scholar] [CrossRef]
  40. Kowsari, K.; Jafari, M.K.; Heidarysafa, M.; Mendu, S.; Barnes, L.; Brown, D. Text classification algorithms: A survey. Information 2019, 10, 150. [Google Scholar] [CrossRef]
  41. Valle, C. Ensemble Learning with Locally Coupled Learners. Ph.D. Thesis, Universidad Técnica Federico Santa Maria, Valparaiso, Chile, 2014. [Google Scholar]
  42. Rokach, L. Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography. Comput. Stat. Data Anal. 2009, 53, 4046–4072. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
