Article

A Feature Selection Method for Multi-Label Text Based on Feature Importance

College of Information and Electrical Engineering, China Agricultural University, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(4), 665; https://doi.org/10.3390/app9040665
Submission received: 14 December 2018 / Revised: 25 January 2019 / Accepted: 12 February 2019 / Published: 15 February 2019

Abstract

Multi-label text classification assigns a text to multiple categories simultaneously, which corresponds to a text being associated with multiple topics in the real world. The feature space generated by text data is high-dimensional and sparse. Feature selection is an efficient technique that removes useless and redundant features, reduces the dimension of the feature space, and avoids the curse of dimensionality. A feature selection method for multi-label text based on feature importance is proposed in this paper. Firstly, multi-label texts are transformed into single-label texts using a label assignment method. Secondly, the importance of each feature is calculated with a method based on Category Contribution (CC). Finally, the features with the highest importance are selected to construct the feature space. In the proposed method, feature importance is calculated from the perspective of the category, which ensures that the selected features have strong category discrimination ability. Specifically, the contribution of each feature to each category is calculated from the two aspects of inter-category and intra-category, and the importance of the feature is obtained by combining them. The proposed method is tested on six public data sets, and the experimental results demonstrate its effectiveness.

1. Introduction

With the rapid development of information technology, all walks of life have been generating large amounts of data. It is therefore of great significance to mine useful knowledge and information from text data and apply them to information retrieval and data analysis. Text classification is a technology that can effectively organize and manage texts and help users quickly obtain useful information, and has thus become an important research direction in the field of information processing.
Text classification refers to assigning a text to one or more predefined categories according to its content, and includes single-label text classification and multi-label text classification. In single-label text classification, a text is associated with one predefined topic and assigned to only one category [1,2,3]. In multi-label text classification, a text is associated with multiple predefined topics and assigned to multiple categories [4,5,6]. For example, a news report about a movie may be associated with the topics of “movie”, “music”, and others, and an article about computers may be associated with the topics of “computers”, “software engineering”, and others. In the real world, it is very common for a text to be associated with multiple topics, so studying multi-label text classification is of greater practical value.
Feature selection is an important part of multi-label text classification. Text data is unstructured data consisting of words. The vector space model is often used to represent text data so that computers can process it. In this model, all the terms in all the texts are used as features to construct what is called the original feature space. Since a text may contain thousands of different terms, the original feature space is high-dimensional and sparse. If it is used directly for text classification, training will be time-consuming and prone to over-fitting. Therefore, reducing the dimension of the original feature space is of significant importance. Feature selection is a commonly used dimensionality reduction technique. It removes useless and redundant features and keeps only the features with strong category discrimination ability to construct the feature subset, reducing the dimension of the feature space and improving classification performance [7,8]. We study a feature selection method for multi-label text in this paper.
Currently, there is some research on feature selection for multi-label data, which can be divided into three categories: wrapper methods, embedded methods, and filter methods. In wrapper methods, several different feature subsets are constructed in advance, the pros and cons of the feature subsets are evaluated by the predictive precision of the classification algorithm, and the final feature subset is determined based on this evaluation [9,10,11,12,13]. In embedded methods, the feature selection process is integrated into the training of the classification model, that is, features with high contribution to model training are selected to construct the feature subset during model construction [14,15,16]. In filter methods, the classification contribution of each feature is calculated as its feature importance, and the features with the highest importance are selected to construct the feature subset [17,18,19,20]. The classification model is then trained on this selected feature subset. Filter methods therefore have low computational complexity and high operational efficiency, and are very suitable for text data [21].
In filter methods, the core question is how to calculate feature importance. A commonly used approach is to count the frequency of each feature in the whole training set. However, features selected this way do not necessarily have strong category discrimination ability. For example, the feature “people” appears frequently in the whole training set, so its importance is very high when calculated by counting its frequency. However, such a feature may appear a similar number of times in each category of the training set and thus have no real category discrimination ability. The category discrimination ability of a feature is its ability to distinguish one category from the others; accordingly, feature importance should be calculated from this perspective. Features with strong category discrimination ability have high correlation with one category and low correlation with the others. To the best of our knowledge, there is some research on calculating feature importance based on these two aspects [22,23]. These works consider one or both of the aspects above, design a feature importance calculation formula, and apply it to single-label text classification. In this paper, on the basis of these works, the contributions of features to category discrimination are redefined from the two aspects above, and feature importance is obtained by combining them.
A filter feature selection method for multi-label text is proposed in the paper. Firstly, multi-label texts are transformed into single-label texts using the label assignment method. Secondly, the importance of each feature is calculated using the proposed method of Category Contribution (CC). Finally, features with higher importance are selected to construct the feature space. Thus, the contributions of this paper can be summarized as follows.
(1)
The importance of features for classification is analyzed from the two aspects of inter-category and intra-category, and formulas for calculating the inter-category contribution and intra-category contribution are proposed.
(2)
A formula for calculating feature importance based on CC is proposed.
(3)
The proposed feature selection method is combined with Binary Relevance k-Nearest Neighbor (BRKNN) [24] and Multi-label k-Nearest Neighbor (MLKNN) [25] algorithms, achieving a good classification performance.
The rest of this paper is organized as follows. Section 2 describes some related works. Section 3 introduces the details of the proposed method. Experimental results are shown in Section 4. Finally, Section 5 concludes the findings shown in this paper.

2. Related Works

The purpose of feature selection is to reduce the dimension of the feature space and improve the efficiency and performance of classification by removing irrelevant and redundant features. Filter feature selection methods are viewed as a pure pre-processing tool and have low computational complexity; as a result, many scholars focus on this type of method. For multi-label data, there are two main research directions in filter feature selection.
The first is feature selection based on the idea of algorithm adaptation. In this type of method, feature importance calculation methods commonly used in single-label feature selection are adapted to multi-label data, and feature selection is then performed on the multi-label data directly [18,19,20,26,27]. Lee et al. [18] proposed a filter multi-label feature selection method by adapting mutual information; the method was evaluated on multi-label data sets of various fields, including text. Lin et al. [20] proposed a method named max-dependency and min-redundancy, which considers two factors of multi-label features: feature dependency and feature redundancy. Three text data sets, Artificial, Health, and Recreation, were used to evaluate the method. Lastra et al. [26] extended the fast correlation-based filter technique [28] to deal with multi-label data directly; text data sets were also used in their experiments. In this type of method, multi-label data is processed directly for feature selection, so the computational complexity is very high when the data set has many categories.
The other is feature selection based on the idea of problem transformation. In this type of method, multi-label data is transformed into single-label data, and feature selection is then performed on the single-label data [17,29,30,31,32,33,34]. Since a piece of multi-label data belongs to multiple categories, single-label feature importance calculation methods cannot deal with it directly. A simple transformation technique converts a piece of multi-label data into a single-label one by selecting just one label for each sample from its label set. This label can be the most frequent label in the data set (select-max), the least frequent label (select-min), or a random label (select-random). Xu et al. [17] designed a transformation method based on the definition of ranking loss for multi-label feature selection and tested it on four text data sets. Chen et al. [29] proposed a transformation method based on entropy and applied traditional feature selection techniques to the multi-label text classification problem. Spolaôr et al. [31] proposed four multi-label feature selection methods by combining two transformation methods and two feature importance calculation methods, and verified them on data sets of various fields, including text. Lin et al. [34] focused on a feature importance calculation method based on mutual information and presented a multi-label feature selection method evaluated on five text data sets. In this type of method, feature selection is performed on single-label data, which reduces the complexity of feature importance calculation and is very suitable for text feature spaces. The transformation technique and the feature importance calculation method are the two key technologies in this type of multi-label feature selection.
As a result, the label assignment methods, which are a type of commonly used transformation techniques, and some feature importance calculation methods are briefly introduced in the following.

2.1. Label Assignment Methods

Label assignment methods are used to transform multi-label data into single-label data. The commonly used label assignment methods include All Label Assignment (ALA), No Label Assignment (NLA), Largest Label Assignment (LLA), and Smallest Label Assignment (SLA) [29]. In order to describe these methods conveniently, we define the following variables: d denotes a piece of data in the multi-label data set and {C1,C2,...,Cn} denotes the set of categories to which d belongs.

2.1.1. All Label Assignment (ALA)

In the ALA method, a piece of multi-label data is assigned to all the categories to which it belongs, that is, a copy of the multi-label data exists in each of those categories. ALA aims to keep as much category information as possible in each category by generating multiple copies. Meanwhile, it may introduce multi-label noise, which could affect classification performance. The result of transforming d into n pieces of single-label data using ALA is as follows.
d_1 \to \{C_1\};\quad d_2 \to \{C_2\};\quad \ldots;\quad d_n \to \{C_n\}.
where d1 is a copy of d existing in category C1.

2.1.2. No Label Assignment (NLA)

In the NLA method, all multi-label data is regarded as noisy data, and only the single-label data in the original data set is kept. NLA gets rid of noise by introducing only single-label data, but it may lose useful information because the multi-label data is discarded. Thus, it is suitable for data sets with more single-label data and less multi-label data. The transformation of d using NLA can be described as follows.
\text{If } n = 1,\ d \to \{C_1\};\ \text{else delete } d.
where n is the number of categories to which d belongs.

2.1.3. Largest Label Assignment (LLA)

In the LLA method, the multi-label data is assigned to the category with the largest size. Assuming that |Ck| is the number of samples in category Ck, |Ck| is called the size of category Ck. LLA is based on the assumption that data in larger categories has higher anti-noise ability than data in smaller categories. Let Cmax denote the category with the largest size, defined as follows.
C_{max} = \arg\max_{k=1,2,\ldots,n} |C_k|
The result of d transformed into single-label data using LLA is as follows.
d \to \{C_{max}\}

2.1.4. Smallest Label Assignment (SLA)

In the SLA method, the multi-label data is assigned to the category with the smallest size. SLA assumes that the categories with smaller sizes need more training data in order to make the data as balanced as possible. Let Cmin denote the category with the smallest size, defined as follows.
C_{min} = \arg\min_{k=1,2,\ldots,n} |C_k|
The result of d transformed into single-label data using SLA is as follows.
d \to \{C_{min}\}
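The four label assignment methods above can be sketched in a few lines of Python. The function name and data layout are illustrative choices of ours, not part of the original methods:

```python
from collections import Counter

def label_assignment(data, method="ALA"):
    """Transform multi-label samples into single-label (sample, label) pairs.

    `data` is a list of (sample, labels) pairs, where `labels` is the set of
    categories the sample belongs to. Category sizes |Ck| are counted over
    the original multi-label data.
    """
    size = Counter(label for _, labels in data for label in labels)
    result = []
    for sample, labels in data:
        if method == "ALA":    # one copy of the sample per category
            result.extend((sample, label) for label in labels)
        elif method == "NLA":  # keep only the originally single-label samples
            if len(labels) == 1:
                result.append((sample, next(iter(labels))))
        elif method == "LLA":  # assign to the largest of its categories
            result.append((sample, max(labels, key=lambda l: size[l])))
        elif method == "SLA":  # assign to the smallest of its categories
            result.append((sample, min(labels, key=lambda l: size[l])))
    return result
```

For example, a sample with labels {C1, C2} yields two copies under ALA, is dropped under NLA, and keeps exactly one label under LLA or SLA.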

2.2. Feature Importance Calculation Methods

Feature importance, also called feature category discrimination ability, refers to the contribution of a feature to classification. Adopting an appropriate method to accurately calculate the contribution of each feature to classification and selecting the features with the highest importance to construct the feature space are very important to classification performance. The commonly used methods for calculating feature importance are document frequency (DF) [35], mutual information (MI) [36], and information gain (IG) [35].
In order to describe these methods conveniently, we define the following variables: Fj denotes a feature, N denotes the total number of samples in the training set, A denotes the number of samples with Fj in category k, B denotes the number of samples with Fj in categories other than category k, C denotes the number of samples without Fj in category k, D denotes the number of samples without Fj in categories other than category k, and q is the number of categories.

2.2.1. Importance Calculation Method Based on DF

DF refers to the number of samples in which the feature appears in the training set. In this method, features with higher DF will be selected to construct the feature space. The DF of Fj is calculated as follows.
\mathrm{DF}(F_j) = \frac{A + B}{N}
The method based on DF is the simplest feature importance calculation method. It is very suitable for feature selection on large-scale text data sets because of its good time performance. However, the correlation between features and categories is not considered, so low-frequency features with high category discrimination ability cannot be selected.

2.2.2. Importance Calculation Method Based on MI

MI is an extended concept of information entropy, which measures the correlation between two random events. It measures the importance of the feature to a category according to whether the feature appears or not. The MI of Fj for the category k is as follows.
\mathrm{MI}(F_j, k) = \log \frac{A \times N}{(A + C)(A + B)}
The MI of Fj for the whole training set is as follows.
\mathrm{MI}(F_j) = \sum_{k=1}^{q} \frac{A + C}{N} \, \mathrm{MI}(F_j, k)
The method based on MI considers the correlation between features and categories, but its formula tends to assign higher importance to low-frequency features and is therefore biased toward them.

2.2.3. Importance Calculation Method Based on IG

IG is a feature importance calculation method based on information entropy. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a feature in a document. The IG of Fj is calculated as follows.
\mathrm{IG}(F_j) = -\sum_{k=1}^{q} \frac{A + C}{N} \log \frac{A + C}{N} + \frac{A + B}{N} \sum_{k=1}^{q} \frac{A}{A + B} \log \frac{A}{A + B} + \frac{C + D}{N} \sum_{k=1}^{q} \frac{C}{C + D} \log \frac{C}{C + D}
The method based on IG considers the case where a feature does not occur. When the distribution of the categories and features in the data set is not uniform, the classification effect may be affected.
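As a reference point for the proposed method, the DF and MI scores above can be computed directly from the paper's A/B/C counts. This is a minimal sketch; the function names and the per-category tuple layout are our own:

```python
import math

def df_importance(A, B, N):
    """DF(Fj) = (A + B) / N: the fraction of training samples containing Fj."""
    return (A + B) / N

def mi_importance(per_category, N):
    """MI(Fj): category-weighted sum of log(A*N / ((A+C)(A+B))).

    `per_category` holds one (A, B, C) tuple per category, in the paper's
    notation; categories where Fj never occurs (A = 0) contribute nothing.
    """
    total = 0.0
    for A, B, C in per_category:
        if A > 0:
            total += (A + C) / N * math.log(A * N / ((A + C) * (A + B)))
    return total
```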
In this paper, we design a feature importance calculation method from two aspects of inter-category and intra-category, which can effectively select features with strong category discrimination ability and is beneficial to improve the performance of multi-label text classification.

3. Proposed Method

In this paper, a feature importance calculation method based on CC is proposed, and on this basis a feature selection method for multi-label text is developed. The process of the proposed feature selection method is shown in Figure 1.
The purpose of feature selection is to select features with strong category discrimination ability. Calculating the importance of each feature accurately and selecting features with higher importance to construct the feature space are very important for the classification performance. In this paper, the contribution of features to the classification is considered from two aspects of inter-category and intra-category, and a method of feature importance based on CC is proposed. The main steps of the feature importance method are as follows.
(1)
The inter-category contribution and intra-category contribution of each feature to each category are calculated respectively.
(2)
For each feature, the inter-category contribution variance and intra-category contribution variance are calculated respectively based on the inter-category contribution and intra-category contribution to each category calculated in (1).
(3)
The importance of each feature is calculated by fusing the intra-category contribution variance and the inter-category contribution variance.

3.1. Category Contribution

Category contribution refers to the role that the features play in distinguishing one category from others, including inter-category contribution and intra-category contribution.

3.1.1. Inter-Category Contribution

If a feature appears many times in category a but rarely in other categories, the feature is likely closely associated with category a and contributes a lot to distinguishing category a. Therefore, we calculate the contribution of a feature to classification based on the number of samples with the feature in one category and the average number of samples with the feature in the other categories. Because this contribution is calculated from the occurrence of the feature across different categories, we call it the inter-category contribution.
The information entropy measures the uncertainty between random variables in a quantified form [37]. In text processing, information entropy can be used to describe the distribution of features in different categories. The distribution of the feature in different categories reflects the contribution of the feature to the classification. Therefore, we introduce the information entropy of the feature into the calculation of inter-category contribution in this paper. The information entropy of the feature Fj is as follows.
H(F_j) = -\sum_{k=1}^{q} \frac{Tf(F_j, L_k)}{TF(F_j)} \log \frac{Tf(F_j, L_k)}{TF(F_j)}
where TF (Fj) is the number of samples with Fj in the training set, Tf (Fj,Lk) is the number of samples with Fj in category k, and q is the number of categories.
The more uniformly the feature is distributed in different categories, the greater the information entropy of the feature is, and vice versa. Therefore, the formula for calculating the inter-category contribution of the feature Fj based on information entropy is as follows.
e_{jk} = \left( Tf(F_j, L_k) - \frac{\sum_{t=1}^{q} Tf(F_j, L_t) - Tf(F_j, L_k)}{q - 1} \right)^2 \lg \left( \frac{1}{H(F_j) + 0.0001} + 1 \right)
where ejk is the inter-category contribution of Fj to category k, Tf (Fj,Lk) is the number of samples with Fj in category k, H(Fj) is the information entropy of Fj and q is the number of categories.
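Under the definitions above, the entropy H(Fj) and the inter-category contribution e_jk can be sketched as follows, taking tf[k] = Tf(Fj, Lk) as input (the function names are illustrative):

```python
import math

def feature_entropy(tf):
    """H(Fj) over categories; tf[k] is the number of samples in category k
    that contain the feature Fj."""
    TF = sum(tf)
    return -sum(t / TF * math.log(t / TF) for t in tf if t > 0)

def inter_category_contribution(tf, k):
    """e_jk: the squared gap between Fj's count in category k and its average
    count in the other categories, scaled by a decreasing function of H(Fj)."""
    q = len(tf)
    others_avg = (sum(tf) - tf[k]) / (q - 1)
    scale = math.log10(1 / (feature_entropy(tf) + 0.0001) + 1)
    return (tf[k] - others_avg) ** 2 * scale
```

A feature concentrated in one category (low entropy, large count gap) scores much higher than one spread evenly over categories.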

3.1.2. Intra-Category Contribution

If a feature appears in only one sample of category a but in all samples of category b, the feature is probably a word that occurs occasionally in category a but is highly relevant to category b. Therefore, we calculate the contribution of the feature to classification as the proportion of samples in the category that contain the feature. This proportion reflects the degree of correlation between the feature and the category. Because the contribution is calculated from the occurrence of the feature within one category, we call it the intra-category contribution.
r_{jk} = \frac{Tf(F_j, L_k)}{N_k}
where rjk is the intra-category contribution of Fj to category k, Tf (Fj,Lk) is the number of samples with Fj in category k, and Nk is the total number of samples in category k.
The intra-category contribution of a feature is a value in the interval [0, 1]. The closer the value is to 1, the greater the contribution of the feature to the classification of the category.
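A minimal sketch of r_jk across all categories, using the example from the text (a feature in 1 of 50 samples of category a but in all 40 samples of category b is far more relevant to b):

```python
def intra_category_contributions(tf, sizes):
    """r_jk for one feature across all categories.

    tf[k] = number of samples in category k containing the feature;
    sizes[k] = Nk, the total number of samples in category k.
    Each returned value lies in [0, 1].
    """
    return [t / n for t, n in zip(tf, sizes)]
```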

3.2. Feature Importance Calculation Method Based on Category Contribution

Calculating the importance of each feature accurately and selecting features with higher importance to construct the feature space directly determine the performance of classification. In this paper, the contribution of the feature to classification is considered from two aspects of inter-category and intra-category, and a method of feature importance based on CC is proposed.
Let Ej = {e_{j1}, e_{j2}, …, e_{jq}} denote the set of inter-category contributions of the feature Fj over the q categories, and Rj = {r_{j1}, r_{j2}, …, r_{jq}} the set of its intra-category contributions. When the values in Ej differ greatly, the category discrimination ability of Fj is strong, and likewise when the values in Rj differ greatly. In this paper, the variance is used to measure the spread of the values in each set. The variances of the values in Ej and Rj are called the inter-category contribution variance and the intra-category contribution variance, respectively. The formulas are as follows.
V_E(F_j) = \frac{1}{q} \sum_{k=1}^{q} (e_{jk} - \bar{e}_j)^2
V_R(F_j) = \frac{1}{q} \sum_{k=1}^{q} (r_{jk} - \bar{r}_j)^2
where VE(Fj) is the inter-category contribution variance of Fj, VR(Fj) is the intra-category contribution variance of Fj, ejk is the inter-category contribution of Fj to category k, rjk is the intra-category contribution of Fj to category k, \bar{e}_j is the mean inter-category contribution of Fj, \bar{r}_j is the mean intra-category contribution of Fj, and q is the number of categories.
The greater the inter-category contribution variance and intra-category contribution variance are, the stronger the category discrimination ability of the feature is. Therefore, it is necessary to consider from the perspective of making the inter-category contribution variance and intra-category contribution variance both greater when defining the formula of feature importance calculation. In the field of information retrieval, F-measure is the harmonic mean of precision and recall, which ensures that both precision and recall get a greater value [38]. Based on this idea, the formula for feature importance calculation is defined as follows.
f(F_j) = \frac{2 \, V_E(F_j) \, V_R(F_j)}{V_E(F_j) + V_R(F_j)}
where f (Fj) is the feature importance of Fj.
After the importance of each feature is calculated, the features with higher importance are selected based on the predefined dimension, to construct the feature space.
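Putting the pieces together, the whole CC-based scoring and selection step can be sketched as follows. `select_features` and the (tf, sizes) data layout are our own illustrative choices, not part of the paper:

```python
import math

def cc_importance(tf, sizes):
    """f(Fj): harmonic mean of the variances of the inter-category (e_jk)
    and intra-category (r_jk) contributions.

    tf[k] = Tf(Fj, Lk), the number of samples in category k containing Fj;
    sizes[k] = Nk, the total number of samples in category k.
    """
    q, TF = len(tf), sum(tf)
    H = -sum(t / TF * math.log(t / TF) for t in tf if t > 0)
    scale = math.log10(1 / (H + 0.0001) + 1)
    e = [(tf[k] - (TF - tf[k]) / (q - 1)) ** 2 * scale for k in range(q)]
    r = [tf[k] / sizes[k] for k in range(q)]
    VE = sum((x - sum(e) / q) ** 2 for x in e) / q
    VR = sum((x - sum(r) / q) ** 2 for x in r) / q
    return 2 * VE * VR / (VE + VR) if VE + VR > 0 else 0.0

def select_features(stats, t):
    """Keep the top fraction t of features by CC importance.

    `stats` maps a feature name to its (tf, sizes) pair.
    """
    ranked = sorted(stats, key=lambda f: cc_importance(*stats[f]), reverse=True)
    return ranked[:max(1, int(t * len(ranked)))]
```

Note how the paper's motivating example falls out: a feature spread evenly over categories has zero variance in both contribution sets and thus zero importance, while a concentrated feature scores high.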

4. Experiment and Results

4.1. Data Sets

In order to demonstrate the effectiveness of the proposed method, we collected six public multi-label text data sets, including the fields of medical, business, computers, entertainment, health, and social [39,40], from the Mulan website (http://mulan.sourceforge.net/datasets.html) for experiments.
For the data set S, we describe it from five aspects: the number of samples |S|, the number of features dim(S), the number of categories L(S), label cardinality LCard(S), and label density LDen(S). The data set S is defined as follows.
S = \{ (x_i, Y_i) \mid 1 \le i \le p \}
where xi is a sample, Yi is the set of categories of xi and p is the number of samples in S.
Label cardinality measures the average number of categories per sample.
LCard(S) = \frac{1}{p} \sum_{i=1}^{p} |Y_i|
Label density is the label cardinality normalized by the number of categories.
LDen(S) = \frac{LCard(S)}{L(S)}
Details of the experimental data sets are described in Table 1.
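LCard(S) and LDen(S) follow directly from the definitions above; a minimal sketch, with `Y` holding each sample's label set:

```python
def label_cardinality(Y):
    """LCard(S): the average number of labels per sample.

    `Y` is a list of label sets, one per sample.
    """
    return sum(len(y) for y in Y) / len(Y)

def label_density(Y, L):
    """LDen(S): label cardinality normalized by the number of categories L(S)."""
    return label_cardinality(Y) / L
```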

4.2. Evaluation Metrics

The results of the multi-label classification experiments were evaluated by the following five evaluation metrics [41].
In order to describe these formulas conveniently, we define the following variables: S = \{ (x_i, Y_i) \mid 1 \le i \le p \} denotes a multi-label test set, \gamma = \{L_1, L_2, \ldots, L_q\} denotes the category set, h(x_i) denotes the multi-label classifier, f(x_i, L_s) denotes the prediction function, and rank_f(x_i, L_s) denotes the ranking function, where x_i is a sample, Y_i \subseteq \gamma is the set of categories of x_i, and p is the number of samples in S. If f(x_i, L_s) > f(x_i, L_t), then rank_f(x_i, L_s) < rank_f(x_i, L_t).
Average precision (AP), which evaluates the average fraction of categories ranked above a particular category Ls. The higher the value is, the better the performance is.
AP = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i|} \sum_{L \in Y_i} \frac{\left| \{ L_s \mid rank_f(x_i, L_s) \le rank_f(x_i, L),\ L_s \in Y_i \} \right|}{rank_f(x_i, L)}
Hamming loss (HL), which evaluates how many times a sample-label pair is misclassified. The smaller the value is, the better the performance is.
HL = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{q} \left| h(x_i) \, \Delta \, Y_i \right|
where Δ denotes the symmetric difference between two sets.
One error (OE), which evaluates how many times the top-ranked category is not in the set of proper categories of the sample. The smaller the value is, the better the performance is.
OE = \frac{1}{p} \sum_{i=1}^{p} \left[ \arg\max_{L \in \gamma} f(x_i, L) \notin Y_i \right]
where for any predicate π, [π] equals 1 if π holds and 0 otherwise.
Coverage (CV), which evaluates how many steps are needed, on average, to move down the category list in order to cover all the proper categories of the sample. The smaller the value is, the better the performance is.
CV = \frac{1}{p} \sum_{i=1}^{p} \max_{L \in Y_i} rank_f(x_i, L) - 1
Ranking loss (RL), which evaluates the average fraction of category pairs that are not correctly ordered. The smaller the value is, the better the performance is.
RL = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{|Y_i| \, |\bar{Y}_i|} \left| \{ (L_s, L_t) \mid f(x_i, L_s) \le f(x_i, L_t),\ (L_s, L_t) \in Y_i \times \bar{Y}_i \} \right|
where \bar{Y}_i denotes the complement of Y_i in \gamma.
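Two of the metrics, hamming loss and one error, are simple enough to sketch directly from their definitions (the data layout is illustrative):

```python
def hamming_loss(pred, true, q):
    """HL: the average fraction of mispredicted sample-label pairs.

    `pred` and `true` are lists of label sets; `q` is the number of
    categories. The symmetric difference p ^ t counts labels predicted
    but not true plus labels true but not predicted.
    """
    return sum(len(p ^ t) for p, t in zip(pred, true)) / (len(true) * q)

def one_error(scores, true):
    """OE: the fraction of samples whose top-scored label is not a true label.

    `scores[i]` maps each label to f(x_i, L); `true[i]` is the true label set.
    """
    misses = sum(max(s, key=s.get) not in t for s, t in zip(scores, true))
    return misses / len(true)
```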

4.3. Experimental Settings

In order to demonstrate the effectiveness of the proposed method, we designed two parts of experiments.
(1) Proposed algorithm validation experiment. This part includes two groups of experiments. In the first group, under different feature space dimensions, the proposed feature selection method is compared with the baseline method, which keeps all features, to demonstrate its effectiveness. ALA is used to transform multi-label texts into single-label texts, and BRKNN and MLKNN are used as the classifiers. For MLKNN, the number of nearest neighbors and the smoothing value are set to 10 and 1, respectively [42]. For BRKNN, the number of hidden neurons is set to 20% of the number of features in the feature space, the learning rate is set to 0.05, and the number of training iterations is set to 100 [42]. Let t denote the proportion of the dimension of the feature space to the dimension of the original feature space, that is, the proportion of the number of selected features to the number of all features. We run this group of experiments with t ranging from 10% to 90% in intervals of 10%. In the second group, the CC-based method proposed in this paper is combined with different label assignment methods to further demonstrate its effectiveness. BRKNN is used as the classifier, and t ranges from 10% to 50% in intervals of 10%.
(2) Performance comparison experiment. In this part, we compare the performance of the proposed feature selection method with that of the commonly used feature selection methods to demonstrate the effectiveness of the proposed method. The feature selection method based on DF and the feature selection method based on MI are selected as the comparison methods. ALA is used to transform multi-label texts into single-label texts, and BRKNN is used as the classifier. The value of t ranges from 10% to 50%, with 10% as an interval.
All the code in this paper is implemented in MyEclipse 2014 on Windows 10, using a 3.30 GHz Intel(R) CPU with 8 GB of RAM. The Term Frequency-Inverse Document Frequency (TF-IDF) method [43] is used to calculate the weight of each feature in each text. The BRKNN and MLKNN multi-label classification algorithms are implemented based on the MULAN software package [44]. Cross validation is used in the experiments, and all the experimental results shown in this paper are the average of ten-fold cross validation.
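The paper cites [43] for TF-IDF without spelling out the exact variant used; a common form, given here only as an assumption, is:

```python
import math

def tf_idf(tf, df, n_docs):
    """A common TF-IDF variant: term frequency times the log of the inverse
    document frequency, with add-one smoothing in the denominator.

    `tf` is the term's frequency in one text, `df` the number of texts
    containing the term, and `n_docs` the number of texts in the corpus.
    This exact weighting is our assumption, not the paper's stated formula.
    """
    return tf * math.log(n_docs / (1 + df))
```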

4.4. Experimental Results and Analysis

4.4.1. Proposed Algorithm Validation Experiment

The classification results on the six public data sets in different dimensions of average precision, hamming loss, one error, coverage and ranking loss are shown in Figure 2, Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7. In these figures, the horizontal axis denotes the proportion of the selected features, that is, the horizontal axis denotes the value of t, and the vertical axis denotes the value of the evaluation metric; CC+BRKNN denotes that the multi-label classification is performed on the feature space constructed by the proposed method, and BRKNN is used as the classifier; CC+MLKNN denotes that the multi-label classification is performed on the feature space constructed by the proposed method, and MLKNN is used as the classifier; BaseLine+BRKNN denotes that the multi-label classification is performed on the original feature space, and BRKNN is used as the classifier; and BaseLine+MLKNN denotes that the multi-label classification is performed on the original feature space, and MLKNN is used as the classifier.
It can be seen from Figure 2 to Figure 7 that, on all six data sets, classification performed on the feature spaces constructed by the proposed feature selection method consistently outperforms the baseline on every metric. In classification experiments, average precision is the most intuitive and most closely watched evaluation metric, so we take the best average precision as the best classification performance when analyzing the results. Compared with the classification results in the original feature space, the increase (decrease) percentage of each evaluation metric is shown in Table 2.
From Table 2, it can be seen that on the six data sets, most of the classification evaluation metrics obtained in the feature space constructed by the proposed method are better than those obtained in the original feature space. Only on the Computers data set, when t is 20% and BRKNN is the classifier, are the coverage and ranking loss worse than those obtained in the original feature space. Notably, on the Health data set, when t is 20% and BRKNN is the classifier, the average precision increases by 67.79%, the largest improvement observed.
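The percentages in Table 2 are ordinary relative changes against the baseline, with the sign flipped for loss-type metrics so that a positive number always means an improvement. A sketch of that arithmetic, using the Health/BRKNN values reported in Table 7 (the helper function and its name are ours):

```python
def percent_change(baseline, value, higher_is_better=True):
    """Relative improvement of `value` over `baseline`, in percent.

    For loss-type metrics (lower is better) the sign is flipped so that
    a positive result always denotes an improvement.
    """
    change = (value - baseline) / baseline * 100.0
    return change if higher_is_better else -change

# Health data set, BRKNN, t = 20%: average precision 0.3977 -> 0.6673
ap_gain = round(percent_change(0.3977, 0.6673), 2)           # 67.79, as in Table 2
# Same setting, ranking loss 0.1981 -> 0.1114 (lower is better)
rl_gain = round(percent_change(0.1981, 0.1114, higher_is_better=False), 2)
```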
To further demonstrate the effectiveness of the proposed CC-based feature importance method, we evaluated it in combination with ALA, NLA, LLA, and SLA.
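The label assignment strategies themselves are defined earlier in the paper. As a hedged illustration only, the sketch below assumes an ALA-style copy transformation, in which each multi-label text is duplicated once per label so that every copy is a single-label example; the function name, data layout, and toy corpus are ours, and the other strategies (NLA, LLA, SLA) would differ only in which label(s) each copy keeps.

```python
def all_label_assignment(texts):
    """Transform multi-label texts into single-label texts.

    Assumed ALA-style sketch: each (words, labels) pair is copied once
    for every label it carries, producing one single-label example per copy.
    """
    single = []
    for words, labels in texts:
        for label in labels:
            single.append((words, label))
    return single

# Toy corpus: 2 texts with 2 and 1 labels -> 3 single-label examples
corpus = [(["wheat", "price"], {"Business", "Agriculture"}),
          (["flu", "vaccine"], {"Health"})]
single = all_label_assignment(corpus)
```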
The experimental results of the six public data sets are shown in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8. In these tables, BaseLine denotes the multi-label classifications performed on the original feature space; 10%, 20%, 30%, 40%, and 50% denote the multi-label classifications performed on the feature spaces in different dimensions, respectively; and Average denotes the average of the classification results in five different dimensions.
From Table 3 to Table 8, it can be seen that the CC-based feature importance method, whether applied with ALA, NLA, LLA, or SLA, effectively selects features with strong category discrimination ability. Classification performance on the feature spaces constructed by the proposed method is consistently superior to that on the original feature space, demonstrating that the proposed feature importance method is both effective and general.
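The final selection step is straightforward once every feature has an importance score. A minimal sketch, assuming features have already been scored by the CC-based importance measure (the scores below are invented for illustration):

```python
def select_top_features(importance, t):
    """Keep the top fraction t of features, ranked by importance score.

    `importance` maps feature -> CC-based importance score; computing the
    score itself is described in the paper and not reproduced here.
    """
    k = max(1, int(len(importance) * t))          # at least one feature survives
    ranked = sorted(importance, key=importance.get, reverse=True)
    return set(ranked[:k])

# Invented scores: content words rank above stop words
scores = {"grain": 0.91, "price": 0.74, "the": 0.02, "of": 0.01, "export": 0.55}
selected = select_top_features(scores, 0.4)       # keeps the 2 highest-scoring features
```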

4.4.2. Performance Comparison Experiment

In this section, we demonstrate the effectiveness of the proposed method by comparing its performance with that of the feature selection method based on DF and the feature selection method based on MI. The classification results on the six data sets in different dimensions using different feature selection methods are shown in Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14.
In these tables, CC denotes the proposed feature selection method, DF denotes the feature selection method based on DF, and MI denotes the feature selection method based on MI. CC+BRKNN denotes multi-label classification performed on the feature space constructed by the proposed method, with BRKNN as the classifier; the naming rule for the other symbols is the same. Bold font denotes the best performance within one dimension, and underline denotes the best performance across all dimensions.
From Table 9 to Table 14, it can be seen that there are 150 comparison results for the three feature selection algorithms on the six data sets with five evaluation metrics. Among them, the feature selection method based on CC wins 126 times, a winning percentage of 84.0%.
Across all dimensions, the five evaluation metrics have 30 best values on the six data sets, 29 of which are obtained with the CC-based feature selection method proposed in this paper. In addition, with the proposed method, most of the best metric values are obtained when the dimension is 10% or 20%, only a few when it is 30% or 40%, and none when it is 50%.
From the perspective of average precision, compared with the method based on DF, the largest increase is on the Entertainment data set, at 8.22%; compared with the method based on MI, the largest increase is on the Medical data set, at 91.65%.
Therefore, the multi-label text feature selection method based on CC outperforms the commonly used feature selection methods based on DF and MI. Moreover, the best values of the evaluation metrics are all obtained in smaller dimensions, which greatly reduces the feature space dimension while improving classification performance.
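The headline figures in this comparison follow from simple arithmetic on the tables above; a quick sanity check (the AP values are taken from Table 9, Medical data set, t = 10%):

```python
# 6 data sets x 5 evaluation metrics x 5 dimensions = 150 head-to-head comparisons
comparisons = 6 * 5 * 5
win_rate = 126 / comparisons * 100          # CC wins 126 of them, about 84%

# Largest AP gain over MI (Medical data set, t = 10%, values from Table 9)
cc_ap, mi_ap = 0.7900, 0.4122
ap_gain = (cc_ap - mi_ap) / mi_ap * 100     # about 91.65%
```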
In summary, it can be seen from the experimental results and analysis on the six data sets that
(1) Compared with the baseline method, the classification performance of the proposed feature selection method is generally superior in all dimensions, which demonstrates the effectiveness of the proposed method.
(2) Good classification performance is achieved when the proposed feature importance method is combined with different label assignment methods, demonstrating that the CC-based feature importance method is effective and general.
(3) Compared with the commonly used feature selection methods, the proposed feature selection method obtains the best values of the evaluation metrics in 84.0% of the comparisons, demonstrating its good performance.
(4) The best values of the evaluation metrics are all obtained by the proposed multi-label feature selection method in smaller dimensions, which yields an obvious dimension-reduction effect and makes the method well suited to high-dimensional text data.

5. Conclusions

Aiming at the high dimensionality and sparsity of the text feature space, a multi-label text feature selection method was proposed in this paper. Firstly, a label assignment method was used to transform multi-label texts into single-label texts. Then, on this basis, an importance method based on CC was proposed to calculate the importance of each feature. Finally, features with higher importance were selected to construct the feature space. In the proposed method, the multi-label feature selection problem is transformed into a single-label one, so the feature selection process is simple and fast and the dimension-reduction effect is obvious, making the method well suited to high-dimensional text data. Compared with the baseline method and the commonly used feature selection methods, the proposed method achieved better performance on all six public data sets, which demonstrates its effectiveness.
In this paper, a CC-based feature importance method was proposed from the perspective of the category. The contribution of each feature to the classification of different categories was calculated from both the inter-category and intra-category aspects, clarifying the importance of features to different categories and selecting features with strong category discrimination ability. The proposed method achieves good performance.
However, the proposed algorithm does not consider the correlation between categories, which should be studied in the future.

Author Contributions

L.Z. and Q.D. conceived the algorithm and designed the experiments. L.Z. implemented the experiments, analyzed the results and wrote the paper. Q.D. revised the clarity of the work as well as helping to write and organize the paper. All authors read and approved the final manuscript.

Funding

This research was supported by the Agricultural Finance Project under Grant 051821301112421014, the Provincial-School Cooperation Project under Grant 201704070, and the National High Technology Research and Development Program of China (863 Program) under Grant 2013AA102306.

Acknowledgments

We are deeply grateful to the reviewers and the editors for their valuable comments and suggestions, which improved the technical content and presentation of the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, F.; Duan, Q.; Xiao, X.; Zhang, L. Classification technique of Chinese agricultural text information based on SVM. Trans. Chin. Soc. Agric. Mach. 2015, 46, 174–179. [Google Scholar]
  2. Ren, F.; Deng, J. Background Knowledge Based Multi-Stream Neural Network for Text Classification. Appl. Sci. 2018, 8, 2472. [Google Scholar] [CrossRef]
  3. Al-Anzi, F.S.; AbuZeina, D. Toward an enhanced Arabic text classification using cosine similarity and Latent Semantic Indexing. J. King Saud Univ. Comput. Inf. Sci. 2017, 29, 189–195. [Google Scholar] [CrossRef] [Green Version]
  4. Li, X.; Ouyang, J.; Zhou, X. Labelset topic model for multi-label document classification. J. Intell. Inf. Syst. 2016, 46, 83–97. [Google Scholar] [CrossRef]
  5. Liu, J.; Chang, W.; Wu, Y.; Yang, Y. Deep Learning for Extreme Multi-label Text Classification. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 115–124. [Google Scholar]
  6. Liu, P.; Qiu, X.; Huang, X. Adversarial Multi-task Learning for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics 2017, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1–10. [Google Scholar]
  7. Guo, Y.; Chung, F.; Li, G. An ensemble embedded feature selection method for multi-label clinical text classification. In Proceedings of the 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China, 15–18 December 2016; pp. 823–826. [Google Scholar]
  8. Glinka, K.; Wozniak, R.; Zakrzewska, D. Improving Multi-label Medical Text Classification by Feature Selection. In Proceedings of the 2017 IEEE 26th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), Poznan, Poland, 21–23 June 2017; pp. 176–181. [Google Scholar]
  9. Zhang, M.; Peña, J.M.; Robles, V. Feature selection for multi-label naive Bayes classification. Inf. Sci. 2009, 179, 3218–3229. [Google Scholar] [CrossRef]
  10. Shao, H.; Li, G.; Liu, G.; Wang, Y. Symptom selection for multi-label data of inquiry diagnosis in traditional Chinese medicine. Sci. China Inf. Sci. 2013, 56, 1–13. [Google Scholar] [CrossRef]
  11. Yu, Y.; Wang, Y. Feature selection for multi-label learning using mutual information and GA. In International Conference on Rough Sets and Knowledge Technology; Springer: Cham, Switzerland, 2014; pp. 454–463. [Google Scholar]
  12. Gharroudi, Q.; Elghazel, H.; Aussem, A. A Comparison of Multi-Label Feature Selection Methods Using the Random Forest Paradigm. In Advances in Artificial Intelligence; Springer: Cham, Switzerland, 2014. [Google Scholar]
  13. Lee, J.; Kim, D.W. Memetic feature selection algorithm for multi-label classification. Inf. Sci. 2015, 293, 80–96. [Google Scholar] [CrossRef]
  14. Gu, Q.; Li, Z.; Han, J. Correlated multi-label feature selection. In Proceedings of the ACM International Conference on Information and Knowledge Management, Glasgow, UK, 24–28 October 2011; pp. 1087–1096. [Google Scholar]
  15. You, M.; Liu, J.; Li, G.; Chen, Y. Embedded Feature Selection for Multi-label Classification of Music Emotions. Int. J. Comput. Intell. Syst. 2012, 5, 668–678. [Google Scholar] [CrossRef] [Green Version]
  16. Cai, Z.; Zhu, W. Multi-label feature selection via feature manifold learning and sparsity regularization. Int. J. Mach. Learn. Cybern. 2017, 9, 1321–1334. [Google Scholar] [CrossRef]
  17. Xu, H.; Xu, L. Multi-label feature selection algorithm based on label pairwise ranking comparison transformation. In Proceedings of the International Joint Conference on Neural Networks, Anchorage, AK, USA, 14–19 May 2017; pp. 1210–1217. [Google Scholar]
  18. Lee, J.; Kim, D.W. Feature selection for multi-label classification using multivariate mutual information. Pattern Recognit. Lett. 2013, 34, 349–357. [Google Scholar] [CrossRef]
  19. Doquire, G.; Verleysen, M. Mutual information-based feature selection for multilabel classification. Neurocomputing 2013, 122, 148–155. [Google Scholar] [CrossRef]
  20. Lin, Y.; Hu, Q.; Liu, J.; Duan, J. Multi-label feature selection based on max-dependency and min-redundancy. Neurocomputing 2015, 168, 92–103. [Google Scholar] [CrossRef]
  21. Deng, X.; Li, Y.; Weng, J.; Zhang, J. Feature selection for text classification: A review. Multimed. Tools Appl. 2018, 78, 3797–3816. [Google Scholar] [CrossRef]
  22. Largeron, C.; Moulin, C.; Géry, M. Entropy based feature selection for text categorization. In Proceedings of the 2011 ACM Symposium on Applied Computing, TaiChung, Taiwan, 21–24 March 2011; pp. 924–928. [Google Scholar]
  23. Zhou, H.; Guo, J.; Wang, Y.; Zhao, M. A Feature Selection Approach Based on Interclass and Intraclass Relative Contributions of Terms. Comput. Intell. Neurosci. 2016, 2016, 1715780. [Google Scholar] [CrossRef] [PubMed]
  24. Spyromitros, E.; Tsoumakas, G.; Vlahavas, I. An Empirical Study of Lazy Multilabel Classification Algorithms; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5138, pp. 401–406. [Google Scholar]
  25. Zhang, M.; Zhou, Z. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit. 2007, 40, 2038–2048. [Google Scholar] [CrossRef] [Green Version]
  26. Lastra, G.; Luaces, O.; Quevedo, J.R.; Bahamonde, A. Graphical Feature Selection for Multilabel Classification Tasks. In Proceedings of the Advances in Intelligent Data Analysis X-international Symposium, Porto, Portugal, 29–31 October 2011. [Google Scholar]
  27. Li, F.; Miao, D.; Pedrycz, W. Granular multi-label feature selection based on mutual information. Pattern Recognit. 2017, 67, 410–423. [Google Scholar] [CrossRef]
  28. Yu, L.; Liu, H. Efficient Feature Selection via Analysis of Relevance and Redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224. [Google Scholar]
  29. Chen, W.; Yan, J.; Zhang, B.; Chen, Z.; Yang, Q. Document Transformation for Multi-label Feature Selection in Text Categorization. In Proceedings of the IEEE International Conference on Data Mining, Omaha, NE, USA, 28–31 October 2007; pp. 451–456. [Google Scholar]
  30. Trohidis, K.; Tsoumakas, G.; Kalliris, G.; Vlahavas, I. Multi-label classification of music by emotion. EURASIP J. Audio Speech Music Process. 2011, 2011, 1–9. [Google Scholar] [CrossRef]
  31. Spolaôr, N.; Cherman, E.A.; Monard, M.C.; Lee, H.D. A Comparison of Multi-label Feature Selection Methods using the Problem Transformation Approach. Electron. Notes Theor. Comput. Sci. 2013, 292, 135–151. [Google Scholar] [CrossRef] [Green Version]
  32. Spolaôr, N.; Monard, M.C.; Tsoumakas, G.; Lee, H.D. A systematic review of multi-label feature selection and a new method based on label construction. Neurocomputing 2016, 180, 3–15. [Google Scholar]
  33. Doquire, G.; Verleysen, M. Feature Selection for Multi-label Classification Problems; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6691, pp. 9–16. [Google Scholar]
  34. Lin, Y.; Hu, Q.; Liu, J.; Chen, J.; Duan, J. Multi-label feature selection based on neighborhood mutual information. Appl. Soft Comput. 2016, 38, 244–256. [Google Scholar] [CrossRef]
  35. Yang, Y.; Pedersen, J.O. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the International Conference on Machine Learning (ICML), Nashville, TN, USA, 8–12 July 1997; pp. 412–420. [Google Scholar]
  36. Church, K.W.; Hanks, P. Word association norms, mutual information, and lexicography. Comput. Linguist. 1990, 16, 76–83. [Google Scholar]
  37. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  38. Van Rijsbergen, C. Information Retrieval; Butterworth-Heinemann: London, UK, 1979; pp. 119–135. [Google Scholar]
  39. Pestian, J.P.; Brew, C.; Matykiewicz, P.; Hovermale, D.J.; Johnson, N.; Cohen, K.B.; Duch, W. A shared task involving multi-label classification of clinical free text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, Prague, Czech Republic, 29 June 2007; pp. 97–104. [Google Scholar]
  40. Ueda, N.; Saito, K. Parametric mixture models for multi-labeled text. In International Conference on Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2002; pp. 737–744. [Google Scholar]
  41. Schapire, R.E.; Singer, Y. BoosTexter: A boosting-based system for text categorization. Mach. Learn. 2000, 39, 135–168. [Google Scholar] [CrossRef]
  42. He, Z.; Yang, M.; Liu, H. Joint learning of multi-label classification and label correlations. J. Softw. 2014, 25, 1967–1981. [Google Scholar]
  43. Salton, G.; Buckley, C. Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 1988, 24, 513–523. [Google Scholar] [CrossRef]
  44. Tsoumakas, G.; Spyromitros-Xioufis, E.; Vilcek, J.; Vlahavas, I. MULAN: A Java library for multi-label learning. J. Mach. Learn. Res. 2011, 12, 2411–2414. [Google Scholar]
Figure 1. Multi-label text feature selection method based on feature importance.
Figure 2. Experimental results on Medical data set. (a) Average Precision; (b) Hamming Loss; (c) One Error; (d) Coverage; (e) Ranking Loss.
Figure 3. Experimental results on Business data set. (a) Average Precision; (b) Hamming Loss; (c) One Error; (d) Coverage; (e) Ranking Loss.
Figure 4. Experimental results on Computers data set. (a) Average Precision; (b) Hamming Loss; (c) One Error; (d) Coverage; (e) Ranking Loss.
Figure 5. Experimental results on Entertainment data set. (a) Average Precision; (b) Hamming Loss; (c) One Error; (d) Coverage; (e) Ranking Loss.
Figure 6. Experimental results on Health data set. (a) Average Precision; (b) Hamming Loss; (c) One Error; (d) Coverage; (e) Ranking Loss.
Figure 7. Experimental results on Social data set. (a) Average Precision; (b) Hamming Loss; (c) One Error; (d) Coverage; (e) Ranking Loss.
Table 1. Data sets description.
NO. | Data set | |S| | dim(S) | L(S) | LCard(S) | LDen(S)
1 | Medical | 978 | 1449 | 45 | 1.2454 | 0.0277
2 | Business | 11,214 | 438 | 30 | 1.5990 | 0.0533
3 | Computers | 12,444 | 681 | 33 | 1.5072 | 0.0457
4 | Entertainment | 12,730 | 640 | 21 | 1.4137 | 0.0673
5 | Health | 9205 | 612 | 32 | 1.6441 | 0.0514
6 | Social | 12,111 | 1047 | 39 | 1.2793 | 0.0328
Table 2. Increase (decrease) percentage of each evaluation metrics on six data sets.
Data Set | Algorithm | t | AP(%)↑ | HL(%)↓ | OE(%)↓ | CV(%)↓ | RL(%)↓
Medical | BRKNN | 10% | 23.92 | 22.77 | 40.81 | 49.16 | 53.21
Medical | MLKNN | 10% | 11.91 | 19.82 | 29.39 | 27.52 | 30.10
Business | BRKNN | 40% | 2.69 | 8.63 | 18.82 | 29.29 | 30.20
Business | MLKNN | 40% | 1.65 | 7.14 | 12.49 | 12.29 | 18.91
Computers | BRKNN | 20% | 6.88 | 12.99 | 9.33 | -3.59* | -1.07*
Computers | MLKNN | 20% | 5.88 | 11.59 | 8.07 | 14.17 | 16.42
Entertainment | BRKNN | 20% | 36.39 | 31.88 | 28.82 | 25.56 | 31.58
Entertainment | MLKNN | 20% | 16.09 | 12.67 | 23.39 | 14.88 | 18.25
Health | BRKNN | 20% | 67.79 | 54.09 | 44.32 | 38.30 | 43.77
Health | MLKNN | 30% | 9.96 | 22.47 | 20.99 | 15.38 | 21.59
Social | BRKNN | 10% | 7.74 | 16.48 | 19.76 | 7.03 | 10.98
Social | MLKNN | 10% | 6.61 | 12.89 | 16.56 | 15.62 | 18.35
Table 3. Experimental results on Medical data set.
Algorithm | Evaluation | BaseLine | 10% | 20% | 30% | 40% | 50% | Average
ALA | AP | 0.6375 | 0.7900 | 0.7786 | 0.7413 | 0.7010 | 0.6770 | 0.7376
ALA | HL | 0.0224 | 0.0173 | 0.0186 | 0.0203 | 0.0214 | 0.0223 | 0.0200
ALA | OE | 0.4734 | 0.2802 | 0.3007 | 0.3446 | 0.4099 | 0.4386 | 0.3548
ALA | CV | 5.2680 | 2.6780 | 2.6681 | 3.2413 | 3.7864 | 4.3058 | 3.3359
ALA | RL | 0.0966 | 0.0452 | 0.0446 | 0.0538 | 0.0658 | 0.0775 | 0.0574
NLA | AP | 0.4264 | 0.6053 | 0.4808 | 0.4587 | 0.4424 | 0.4346 | 0.4844
NLA | HL | 0.0277 | 0.0224 | 0.0262 | 0.0268 | 0.0272 | 0.0274 | 0.0260
NLA | OE | 0.6769 | 0.4929 | 0.6248 | 0.6452 | 0.6626 | 0.6697 | 0.6190
NLA | CV | 8.5582 | 5.3865 | 7.2309 | 7.8667 | 8.1998 | 8.3575 | 7.4083
NLA | RL | 0.1724 | 0.1003 | 0.1422 | 0.1566 | 0.1642 | 0.1678 | 0.1462
LLA | AP | 0.6431 | 0.7939 | 0.7736 | 0.7403 | 0.7093 | 0.6847 | 0.7404
LLA | HL | 0.0229 | 0.0164 | 0.0196 | 0.0206 | 0.0221 | 0.0234 | 0.0204
LLA | OE | 0.4673 | 0.2761 | 0.3098 | 0.3548 | 0.4038 | 0.4355 | 0.3560
LLA | CV | 5.1517 | 3.0082 | 2.8911 | 3.2850 | 3.6169 | 3.9974 | 3.3597
LLA | RL | 0.0942 | 0.0487 | 0.0475 | 0.0550 | 0.0633 | 0.0709 | 0.0571
SLA | AP | 0.6431 | 0.7977 | 0.7786 | 0.7423 | 0.7015 | 0.6764 | 0.7393
SLA | HL | 0.0229 | 0.0166 | 0.0188 | 0.0205 | 0.0223 | 0.0232 | 0.0203
SLA | OE | 0.4673 | 0.2638 | 0.3048 | 0.3549 | 0.4120 | 0.4477 | 0.3566
SLA | CV | 5.1517 | 2.6885 | 2.5881 | 3.1583 | 3.8354 | 4.1709 | 3.2882
SLA | RL | 0.0942 | 0.0454 | 0.0432 | 0.0523 | 0.0672 | 0.0751 | 0.0566
Table 4. Experimental results on Business data set.
Algorithm | Evaluation | BaseLine | 10% | 20% | 30% | 40% | 50% | Average
ALA | AP | 0.8500 | 0.8482 | 0.8611 | 0.8692 | 0.8729 | 0.8696 | 0.8642
ALA | HL | 0.0278 | 0.0287 | 0.0268 | 0.0255 | 0.0254 | 0.0260 | 0.0265
ALA | OE | 0.1233 | 0.1298 | 0.1156 | 0.1055 | 0.1001 | 0.1073 | 0.1117
ALA | CV | 4.8337 | 4.1161 | 3.9341 | 3.6375 | 3.4179 | 3.6064 | 3.7424
ALA | RL | 0.0831 | 0.0756 | 0.0696 | 0.0627 | 0.0580 | 0.0632 | 0.0658
NLA | AP | 0.8617 | 0.8668 | 0.8713 | 0.8696 | 0.8684 | 0.8662 | 0.8685
NLA | HL | 0.0413 | 0.0259 | 0.0252 | 0.0310 | 0.0353 | 0.0378 | 0.0310
NLA | OE | 0.1323 | 0.1328 | 0.1277 | 0.1279 | 0.1271 | 0.1283 | 0.1288
NLA | CV | 2.6461 | 2.5978 | 2.5058 | 2.5241 | 2.5443 | 2.5792 | 2.5502
NLA | RL | 0.0510 | 0.0492 | 0.0462 | 0.0468 | 0.0475 | 0.0487 | 0.0477
LLA | AP | 0.8459 | 0.8518 | 0.8626 | 0.8685 | 0.8644 | 0.8585 | 0.8612
LLA | HL | 0.0283 | 0.0289 | 0.0269 | 0.0260 | 0.0260 | 0.0267 | 0.0269
LLA | OE | 0.1286 | 0.1315 | 0.1177 | 0.1104 | 0.1083 | 0.1144 | 0.1165
LLA | CV | 4.9435 | 3.9726 | 3.8237 | 3.4919 | 4.0517 | 4.2886 | 3.9257
LLA | RL | 0.0853 | 0.0731 | 0.0672 | 0.0596 | 0.0691 | 0.0734 | 0.0685
SLA | AP | 0.8459 | 0.8460 | 0.8648 | 0.8697 | 0.8687 | 0.8650 | 0.8628
SLA | HL | 0.0283 | 0.0287 | 0.0267 | 0.0256 | 0.0260 | 0.0266 | 0.0267
SLA | OE | 0.1286 | 0.1305 | 0.1141 | 0.1077 | 0.1071 | 0.1128 | 0.1144
SLA | CV | 4.9435 | 4.2223 | 3.7887 | 3.5617 | 3.4172 | 3.6875 | 3.7355
SLA | RL | 0.0853 | 0.0784 | 0.0664 | 0.0612 | 0.0587 | 0.0651 | 0.0660
Table 5. Experimental results on Computers data set.
Algorithm | Evaluation | BaseLine | 10% | 20% | 30% | 40% | 50% | Average
ALA | AP | 0.6206 | 0.6290 | 0.6633 | 0.6558 | 0.6448 | 0.6354 | 0.6457
ALA | HL | 0.0408 | 0.0373 | 0.0355 | 0.0374 | 0.0389 | 0.0396 | 0.0377
ALA | OE | 0.4319 | 0.4327 | 0.3916 | 0.4005 | 0.4159 | 0.4197 | 0.4121
ALA | CV | 5.4837 | 6.6824 | 5.6804 | 5.0189 | 4.8361 | 4.9038 | 5.4243
ALA | RL | 0.1218 | 0.1472 | 0.1231 | 0.1080 | 0.1039 | 0.1066 | 0.1178
NLA | AP | 0.6106 | 0.6357 | 0.6468 | 0.6306 | 0.6185 | 0.6134 | 0.6290
NLA | HL | 0.0421 | 0.0380 | 0.0386 | 0.0405 | 0.0413 | 0.0417 | 0.0400
NLA | OE | 0.4449 | 0.4211 | 0.4081 | 0.4265 | 0.4368 | 0.4427 | 0.4270
NLA | CV | 5.2835 | 5.5427 | 5.0874 | 5.0331 | 5.1437 | 5.1691 | 5.1952
NLA | RL | 0.1199 | 0.1286 | 0.1142 | 0.1124 | 0.1155 | 0.1163 | 0.1174
LLA | AP | 0.6202 | 0.6308 | 0.6527 | 0.6503 | 0.6393 | 0.6344 | 0.6415
LLA | HL | 0.0408 | 0.0371 | 0.0366 | 0.0381 | 0.0394 | 0.0401 | 0.0383
LLA | OE | 0.4341 | 0.4226 | 0.4024 | 0.4099 | 0.4196 | 0.4262 | 0.4161
LLA | CV | 5.4456 | 6.7910 | 5.8355 | 5.1061 | 4.9505 | 4.9109 | 5.5188
LLA | RL | 0.1204 | 0.1508 | 0.1261 | 0.1082 | 0.1069 | 0.1067 | 0.1197
SLA | AP | 0.6202 | 0.6322 | 0.6659 | 0.6576 | 0.6405 | 0.6345 | 0.6461
SLA | HL | 0.0408 | 0.0372 | 0.0354 | 0.0373 | 0.0390 | 0.0395 | 0.0377
SLA | OE | 0.4341 | 0.4301 | 0.3905 | 0.4012 | 0.4151 | 0.4219 | 0.4118
SLA | CV | 5.4456 | 6.5895 | 5.5776 | 5.0150 | 5.0003 | 4.9911 | 5.4347
SLA | RL | 0.1204 | 0.1453 | 0.1206 | 0.1081 | 0.1090 | 0.1089 | 0.1184
Table 6. Experimental results on Entertainment data set.
Algorithm | Evaluation | BaseLine | 10% | 20% | 30% | 40% | 50% | Average
ALA | AP | 0.4468 | 0.5939 | 0.6094 | 0.5797 | 0.5408 | 0.5084 | 0.5664
ALA | HL | 0.0825 | 0.0550 | 0.0562 | 0.0650 | 0.0720 | 0.0768 | 0.0650
ALA | OE | 0.7031 | 0.5068 | 0.5005 | 0.5585 | 0.6177 | 0.6607 | 0.5688
ALA | CV | 5.8738 | 4.7235 | 4.3727 | 3.9918 | 4.0888 | 4.2952 | 4.2944
ALA | RL | 0.2394 | 0.1804 | 0.1638 | 0.1499 | 0.1561 | 0.1681 | 0.1637
NLA | AP | 0.4487 | 0.6285 | 0.6279 | 0.5593 | 0.5039 | 0.4775 | 0.5594
NLA | HL | 0.0856 | 0.0587 | 0.0608 | 0.0707 | 0.0781 | 0.0816 | 0.0700
NLA | OE | 0.7047 | 0.4820 | 0.4806 | 0.5629 | 0.6335 | 0.6673 | 0.5653
NLA | CV | 4.2258 | 3.9436 | 3.7231 | 3.8558 | 4.0325 | 4.1280 | 3.9366
NLA | RL | 0.1687 | 0.1553 | 0.1438 | 0.1502 | 0.1591 | 0.1638 | 0.1544
LLA | AP | 0.4448 | 0.5814 | 0.6000 | 0.5779 | 0.5311 | 0.5123 | 0.5605
LLA | HL | 0.0829 | 0.0569 | 0.0546 | 0.0658 | 0.0738 | 0.0780 | 0.0658
LLA | OE | 0.7049 | 0.5264 | 0.5112 | 0.5533 | 0.6316 | 0.6647 | 0.5774
LLA | CV | 5.9031 | 4.8710 | 4.4250 | 4.1108 | 4.0846 | 4.3590 | 4.3701
LLA | RL | 0.2407 | 0.1860 | 0.1653 | 0.1535 | 0.1553 | 0.1673 | 0.1655
SLA | AP | 0.4448 | 0.5914 | 0.6100 | 0.5786 | 0.5335 | 0.5087 | 0.5644
SLA | HL | 0.0829 | 0.0554 | 0.0565 | 0.0659 | 0.0737 | 0.0778 | 0.0659
SLA | OE | 0.7049 | 0.5133 | 0.4936 | 0.5588 | 0.6303 | 0.6682 | 0.5728
SLA | CV | 5.9031 | 4.7438 | 4.3654 | 4.0314 | 4.1107 | 4.2302 | 4.2963
SLA | RL | 0.2407 | 0.1812 | 0.1631 | 0.1501 | 0.1574 | 0.1646 | 0.1633
Table 7. Experimental results on Health data set.
Algorithm | Evaluation | BaseLine | 10% | 20% | 30% | 40% | 50% | Average
ALA | AP | 0.3977 | 0.6490 | 0.6673 | 0.6288 | 0.4893 | 0.4527 | 0.5774
ALA | HL | 0.0893 | 0.0434 | 0.0410 | 0.0418 | 0.0482 | 0.0824 | 0.0514
ALA | OE | 0.7045 | 0.4311 | 0.3923 | 0.4598 | 0.6324 | 0.6667 | 0.5165
ALA | CV | 9.1015 | 6.1071 | 5.6158 | 5.0217 | 5.1979 | 5.8342 | 5.5553
ALA | RL | 0.1981 | 0.1227 | 0.1114 | 0.1004 | 0.1106 | 0.1257 | 0.1142
NLA | AP | 0.4868 | 0.6743 | 0.6294 | 0.5599 | 0.5164 | 0.5015 | 0.5763
NLA | HL | 0.0657 | 0.0450 | 0.0504 | 0.0584 | 0.0628 | 0.0643 | 0.0562
NLA | OE | 0.7300 | 0.4115 | 0.4890 | 0.6118 | 0.6837 | 0.7083 | 0.5809
NLA | CV | 4.3185 | 4.2340 | 4.1030 | 4.1283 | 4.2314 | 4.2801 | 4.1954
NLA | RL | 0.0922 | 0.0893 | 0.0851 | 0.0861 | 0.0894 | 0.0910 | 0.0882
LLA | AP | 0.3986 | 0.6479 | 0.6639 | 0.5821 | 0.5018 | 0.4612 | 0.5714
LLA | HL | 0.0893 | 0.0431 | 0.0417 | 0.0434 | 0.0570 | 0.0785 | 0.0527
LLA | OE | 0.7047 | 0.4295 | 0.3987 | 0.5340 | 0.6084 | 0.6633 | 0.5268
LLA | CV | 9.0197 | 6.0337 | 5.6436 | 5.3229 | 5.6505 | 5.6923 | 5.6686
LLA | RL | 0.1962 | 0.1207 | 0.1111 | 0.1061 | 0.1164 | 0.1200 | 0.1149
SLA | AP | 0.3986 | 0.6496 | 0.6680 | 0.5571 | 0.4717 | 0.4545 | 0.5602
SLA | HL | 0.0893 | 0.0433 | 0.0410 | 0.0435 | 0.0497 | 0.0815 | 0.0518
SLA | OE | 0.7047 | 0.4306 | 0.3892 | 0.5662 | 0.6544 | 0.6656 | 0.5412
SLA | CV | 9.0197 | 6.0943 | 5.6392 | 5.2713 | 5.6608 | 5.8244 | 5.6980
SLA | RL | 0.1962 | 0.1223 | 0.1122 | 0.1085 | 0.1227 | 0.1256 | 0.1183
Table 8. Experimental results on Social data set.
Algorithm | Evaluation | BaseLine | 10% | 20% | 30% | 40% | 50% | Average
ALA | AP | 0.6716 | 0.7236 | 0.7008 | 0.6610 | 0.6944 | 0.6923 | 0.6944
ALA | HL | 0.0267 | 0.0223 | 0.0237 | 0.0275 | 0.0237 | 0.0242 | 0.0243
ALA | OE | 0.4255 | 0.3414 | 0.3796 | 0.4398 | 0.3937 | 0.4068 | 0.3923
ALA | CV | 5.6477 | 5.2504 | 4.8211 | 5.0733 | 5.0137 | 4.8636 | 5.0044
ALA | RL | 0.1120 | 0.0997 | 0.0926 | 0.0993 | 0.0965 | 0.0939 | 0.0964
NLA | AP | 0.6083 | 0.6711 | 0.6504 | 0.6231 | 0.6125 | 0.6117 | 0.6338
NLA | HL | 0.0315 | 0.0267 | 0.0279 | 0.0301 | 0.0311 | 0.0313 | 0.0294
NLA | OE | 0.5214 | 0.4270 | 0.4580 | 0.4983 | 0.5161 | 0.5170 | 0.4833
NLA | CV | 4.5495 | 4.7592 | 4.4514 | 4.4851 | 4.5952 | 4.5599 | 4.5702
NLA | RL | 0.0933 | 0.0987 | 0.0907 | 0.0916 | 0.0945 | 0.0935 | 0.0938
LLA | AP | 0.6684 | 0.7089 | 0.6912 | 0.5948 | 0.7012 | 0.6986 | 0.6789
LLA | HL | 0.0264 | 0.0231 | 0.0251 | 0.0283 | 0.0234 | 0.0237 | 0.0247
LLA | OE | 0.4308 | 0.3563 | 0.3908 | 0.5909 | 0.3768 | 0.3806 | 0.4191
LLA | CV | 5.6894 | 5.6646 | 5.2306 | 5.3454 | 5.7251 | 5.7900 | 5.5511
LLA | RL | 0.1127 | 0.1095 | 0.1007 | 0.1060 | 0.1127 | 0.1134 | 0.1085
SLA | AP | 0.6685 | 0.7154 | 0.6977 | 0.6583 | 0.6892 | 0.6955 | 0.6912
SLA | HL | 0.0264 | 0.0230 | 0.0245 | 0.0268 | 0.0241 | 0.0241 | 0.0245
SLA | OE | 0.4308 | 0.3535 | 0.3858 | 0.4448 | 0.4017 | 0.3887 | 0.3949
SLA | CV | 5.6838 | 5.4147 | 4.8384 | 5.1892 | 5.0926 | 5.7444 | 5.2559
SLA | RL | 0.1126 | 0.1032 | 0.0931 | 0.1024 | 0.0983 | 0.1131 | 0.1020
Table 9. Comparison of classification performance on Medical data set.
Evaluation | Algorithm | 10% | 20% | 30% | 40% | 50%
AP↑ | CC+BRKNN | 0.7900 | 0.7786 | 0.7413 | 0.7010 | 0.6770
AP↑ | DF+BRKNN | 0.7357 | 0.7577 | 0.7438 | 0.7175 | 0.6821
AP↑ | MI+BRKNN | 0.4122 | 0.3979 | 0.3776 | 0.3521 | 0.3368
HL↓ | CC+BRKNN | 0.0173 | 0.0186 | 0.0203 | 0.0214 | 0.0223
HL↓ | DF+BRKNN | 0.0198 | 0.0199 | 0.0200 | 0.0207 | 0.0210
HL↓ | MI+BRKNN | 0.0274 | 0.0275 | 0.0275 | 0.0273 | 0.0275
OE↓ | CC+BRKNN | 0.2802 | 0.3007 | 0.3446 | 0.4099 | 0.4386
OE↓ | DF+BRKNN | 0.3395 | 0.3252 | 0.3395 | 0.3793 | 0.4191
OE↓ | MI+BRKNN | 0.7086 | 0.7516 | 0.8221 | 0.7976 | 0.8078
CV↓ | CC+BRKNN | 2.6780 | 2.6681 | 3.2413 | 3.7864 | 4.3058
CV↓ | DF+BRKNN | 3.8861 | 3.0769 | 3.4942 | 3.8064 | 4.7442
CV↓ | MI+BRKNN | 7.2564 | 7.3207 | 7.3809 | 8.5993 | 8.6850
RL↓ | CC+BRKNN | 0.0452 | 0.0446 | 0.0538 | 0.0658 | 0.0775
RL↓ | DF+BRKNN | 0.0693 | 0.0499 | 0.0586 | 0.0660 | 0.0851
RL↓ | MI+BRKNN | 0.1372 | 0.1381 | 0.1398 | 0.1706 | 0.1732
Table 10. Comparison of classification performance on Business data set.
Evaluation | Algorithm | 10% | 20% | 30% | 40% | 50%
AP↑ | CC+BRKNN | 0.8482 | 0.8611 | 0.8692 | 0.8729 | 0.8696
AP↑ | DF+BRKNN | 0.8460 | 0.8541 | 0.8567 | 0.8606 | 0.8561
AP↑ | MI+BRKNN | 0.8330 | 0.8362 | 0.8424 | 0.8453 | 0.8433
HL↓ | CC+BRKNN | 0.0287 | 0.0268 | 0.0255 | 0.0254 | 0.0260
HL↓ | DF+BRKNN | 0.0283 | 0.0276 | 0.0275 | 0.0266 | 0.0273
HL↓ | MI+BRKNN | 0.0293 | 0.0289 | 0.0288 | 0.0285 | 0.0284
OE↓ | CC+BRKNN | 0.1298 | 0.1156 | 0.1055 | 0.1001 | 0.1073
OE↓ | DF+BRKNN | 0.1309 | 0.1253 | 0.1218 | 0.1130 | 0.1188
OE↓ | MI+BRKNN | 0.1342 | 0.1317 | 0.1314 | 0.1309 | 0.1308
CV↓ | CC+BRKNN | 4.1161 | 3.9341 | 3.6375 | 3.4179 | 3.6064
CV↓ | DF+BRKNN | 4.2224 | 3.9910 | 3.9218 | 4.1443 | 4.3527
CV↓ | MI+BRKNN | 4.7304 | 4.6595 | 4.1774 | 3.9514 | 4.3903
RL↓ | CC+BRKNN | 0.0756 | 0.0696 | 0.0627 | 0.0580 | 0.0632
RL↓ | DF+BRKNN | 0.0781 | 0.0719 | 0.0700 | 0.0718 | 0.0755
RL↓ | MI+BRKNN | 0.0912 | 0.0882 | 0.0782 | 0.0719 | 0.0810
Table 11. Comparison of classification performance on Computers data set.
Evaluation | Algorithm | 10% | 20% | 30% | 40% | 50%
AP↑ | CC+BRKNN | 0.6290 | 0.6633 | 0.6558 | 0.6448 | 0.6354
AP↑ | DF+BRKNN | 0.6352 | 0.6532 | 0.6480 | 0.6391 | 0.6314
AP↑ | MI+BRKNN | 0.5900 | 0.6093 | 0.6148 | 0.6135 | 0.6171
HL↓ | CC+BRKNN | 0.0373 | 0.0355 | 0.0374 | 0.0389 | 0.0396
HL↓ | DF+BRKNN | 0.0360 | 0.0363 | 0.0382 | 0.0395 | 0.0399
HL↓ | MI+BRKNN | 0.0407 | 0.0401 | 0.0409 | 0.0416 | 0.0415
OE↓ | CC+BRKNN | 0.4327 | 0.3916 | 0.4005 | 0.4159 | 0.4197
OE↓ | DF+BRKNN | 0.4149 | 0.3991 | 0.4093 | 0.4203 | 0.4253
OE↓ | MI+BRKNN | 0.4699 | 0.4491 | 0.4446 | 0.4483 | 0.4435
CV↓ | CC+BRKNN | 6.6824 | 5.6804 | 5.0189 | 4.8361 | 4.9038
CV↓ | DF+BRKNN | 6.7307 | 5.7193 | 5.0377 | 4.8693 | 4.9880
CV↓ | MI+BRKNN | 7.5324 | 6.5456 | 5.8639 | 5.5300 | 5.4504
RL↓ | CC+BRKNN | 0.1472 | 0.1231 | 0.1080 | 0.1039 | 0.1066
RL↓ | DF+BRKNN | 0.1476 | 0.1230 | 0.1080 | 0.1042 | 0.1081
RL↓ | MI+BRKNN | 0.1682 | 0.1454 | 0.1283 | 0.1204 | 0.1184
Table 12. Comparison of classification performance on Entertainment data set.
Evaluation | Algorithm | 10% | 20% | 30% | 40% | 50%
AP↑ | CC+BRKNN | 0.5939 | 0.6094 | 0.5797 | 0.5408 | 0.5084
AP↑ | DF+BRKNN | 0.5631 | 0.5611 | 0.5219 | 0.5070 | 0.4894
AP↑ | MI+BRKNN | 0.4702 | 0.4764 | 0.4687 | 0.4544 | 0.4409
HL↓ | CC+BRKNN | 0.0550 | 0.0562 | 0.0650 | 0.0720 | 0.0768
HL↓ | DF+BRKNN | 0.0574 | 0.0631 | 0.0724 | 0.0768 | 0.0789
HL↓ | MI+BRKNN | 0.0648 | 0.0714 | 0.0803 | 0.0844 | 0.0867
OE↓ | CC+BRKNN | 0.5068 | 0.5005 | 0.5585 | 0.6177 | 0.6607
OE↓ | DF+BRKNN | 0.5482 | 0.5628 | 0.6287 | 0.6580 | 0.6742
OE↓ | MI+BRKNN | 0.6799 | 0.6914 | 0.7279 | 0.7443 | 0.7555
CV↓ | CC+BRKNN | 4.7235 | 4.3727 | 3.9918 | 4.0888 | 4.2952
CV↓ | DF+BRKNN | 5.1703 | 4.6991 | 4.4675 | 4.4405 | 4.7396
CV↓ | MI+BRKNN | 5.7741 | 5.2273 | 4.6481 | 4.5380 | 4.6361
RL↓ | CC+BRKNN | 0.1804 | 0.1638 | 0.1499 | 0.1561 | 0.1681
RL↓ | DF+BRKNN | 0.2004 | 0.1787 | 0.1723 | 0.1730 | 0.1884
RL↓ | MI+BRKNN | 0.2312 | 0.2082 | 0.1857 | 0.1803 | 0.1837
Table 13. Comparison of classification performance on Health data set.
Evaluation | Algorithm | 10% | 20% | 30% | 40% | 50%
AP↑ | CC+BRKNN | 0.6490 | 0.6673 | 0.6288 | 0.4893 | 0.4527
AP↑ | DF+BRKNN | 0.6668 | 0.6669 | 0.4925 | 0.4645 | 0.4424
AP↑ | MI+BRKNN | 0.5605 | 0.5774 | 0.5420 | 0.4385 | 0.4195
HL↓ | CC+BRKNN | 0.0434 | 0.0410 | 0.0418 | 0.0482 | 0.0824
HL↓ | DF+BRKNN | 0.0411 | 0.0416 | 0.0524 | 0.0807 | 0.0842
HL↓ | MI+BRKNN | 0.0503 | 0.0499 | 0.0499 | 0.0507 | 0.0816
OE↓ | CC+BRKNN | 0.4311 | 0.3923 | 0.4598 | 0.6324 | 0.6667
OE↓ | DF+BRKNN | 0.3986 | 0.3897 | 0.6242 | 0.6435 | 0.6684
OE↓ | MI+BRKNN | 0.5345 | 0.5145 | 0.5808 | 0.6853 | 0.7023
CV↓ | CC+BRKNN | 6.1071 | 5.6158 | 5.0217 | 5.1979 | 5.8342
CV↓ | DF+BRKNN | 6.0885 | 5.5838 | 5.8472 | 6.9259 | 7.4004
CV↓ | MI+BRKNN | 7.5864 | 6.7204 | 5.9721 | 6.1377 | 6.0252
RL↓ | CC+BRKNN | 0.1227 | 0.1114 | 0.1004 | 0.1106 | 0.1257
RL↓ | DF+BRKNN | 0.1208 | 0.1107 | 0.1239 | 0.1467 | 0.1586
RL↓ | MI+BRKNN | 0.1638 | 0.1432 | 0.1276 | 0.1352 | 0.1331
Table 14. Comparison of classification performance on Social data set.
Evaluation | Algorithm | 10% | 20% | 30% | 40% | 50%
AP↑ | CC+BRKNN | 0.7236 | 0.7008 | 0.6610 | 0.6944 | 0.6923
AP↑ | DF+BRKNN | 0.7081 | 0.6724 | 0.6097 | 0.6863 | 0.6912
AP↑ | MI+BRKNN | 0.5817 | 0.5924 | 0.5906 | 0.5421 | 0.6016
HL↓ | CC+BRKNN | 0.0223 | 0.0237 | 0.0275 | 0.0237 | 0.0242
HL↓ | DF+BRKNN | 0.0226 | 0.0271 | 0.0276 | 0.0238 | 0.0232
HL↓ | MI+BRKNN | 0.0310 | 0.0312 | 0.0313 | 0.0321 | 0.0308
OE↓ | CC+BRKNN | 0.3414 | 0.3796 | 0.4398 | 0.3937 | 0.4068
OE↓ | DF+BRKNN | 0.3525 | 0.4203 | 0.5522 | 0.4043 | 0.3916
OE↓ | MI+BRKNN | 0.5355 | 0.5464 | 0.5385 | 0.6141 | 0.5044
CV↓ | CC+BRKNN | 5.2504 | 4.8211 | 5.0733 | 5.0137 | 4.8636
CV↓ | DF+BRKNN | 5.8012 | 4.9990 | 5.8213 | 5.4352 | 5.9549
CV↓ | MI+BRKNN | 7.4690 | 6.2813 | 6.4551 | 8.1137 | 8.3188
RL↓ | CC+BRKNN | 0.0997 | 0.0926 | 0.0993 | 0.0965 | 0.0939
RL↓ | DF+BRKNN | 0.1118 | 0.0969 | 0.1154 | 0.1063 | 0.1180
RL↓ | MI+BRKNN | 0.1549 | 0.1286 | 0.1287 | 0.1675 | 0.1732
