Unified Graph-Based Missing Label Propagation Method for Multilabel Text Classification

Abstract: In multilabel classification, each sample can be assigned multiple class labels at the same time. However, one of the prominent problems of multilabel classification is missing (incomplete) labels in multilabel text, and classification performance drops significantly in their presence. To address the incomplete or missing label problem, this study proposes two methods: an aggregated feature and label graph-based missing label handling method (GB-AS), and a unified graph-based missing label propagation method (UG-MLP). GB-AS obtains an initial label matrix from two document-level similarities: a feature-based weighting representation and a label-based weighting representation. UG-MLP then constructs a mixed graph that combines GB-AS and label correlations into a single framework. A high-order label correlation is learned from the incomplete training data and applied to supplement the missing label matrix, which guides the creation of multilabel classification models. Combining the graphs in UG-MLP is intended to obtain the benefits of both and thereby increase classification performance. To evaluate UG-MLP, precision, recall and F-measure were measured on three benchmark datasets, namely Reuters-21578, Bibtex and Enron. The experimental results show that UG-MLP outperformed GB-AS as well as other state-of-the-art approaches. We therefore infer that building a unified graph that joins the aggregated feature and label weightings with the label correlation can improve the performance of multilabel classification.


Introduction
In multilabel learning, each instance is associated with one or more labels simultaneously. The key difference between multilabel and single-label learning is that the labels in multilabel learning are correlated, which makes the multilabel learning task slightly more difficult to solve. In machine learning and data mining, multilabel learning also suffers from the curse of high dimensionality. Real-world multilabel datasets have various intricacies that reduce classifiers' performance, and the following remain open problems: high dimensionality, feature and label correlations and missing labels [1]. This paper focuses on multilabel learning with missing or incomplete labels. Given training cases that carry only a partial collection of their labels (i.e., some of their labels are missing), the proposed approach seeks to assign multiple labels to each test item. Handling high dimensionality and feature correlations in multilabel learning may not work effectively if the missing label problem (an incomplete and noisy label space) is not considered. Most contemporary approaches treat this problem as a supervised weak-label learning problem, assuming that enough partially labeled examples are available [2][3][4]. Collecting or annotating such instances, however, is costly and time consuming. In multilabel learning, the label sets of objects sharing the same cluster are usually strongly connected, whereas label sets of other clusters are loosely correlated [5].
Most of the existing multilabel learning methods assume that a full dataset is given and that each instance in it is attached to its complete label set. It is unrealistic to assume that a multilabel dataset is full (i.e., that all possible class combinations have training instances), particularly when the number of base classes is large. It is equally unrealistic to assume that every example is assigned a complete and correct label set. Several factors contribute to the difficulty of gathering high-quality data. Firstly, the missing label problem occurs because of the existence of synonyms and ambiguities among distinct classes, causing annotators to select only a portion of the labels with similar meanings [3,6]. Secondly, manual labeling by tag providers is subjective, and the large size of the category vocabulary in some datasets makes it difficult for labelers to annotate each instance [3]. Furthermore, the huge number of instances and possible labels makes complete labeling costly in labor and time. As a result, the performance of multilabel learning algorithms is influenced by label incompleteness, and an effective multilabel text classification method should handle both missing labels and label correlations. The key contributions of this paper are as follows.
1. Proposing the graph-based aggregated similarity weighting of features and labels method (GB-AS) to predict missing labels in multilabel text classification;
2. Further improving GB-AS by combining it with label correlations to create a unified graph method called UG-MLP.
This paper is organized into eight sections. Section 1 is this introduction. Section 2 introduces related work on multilabel learning with missing labels. Section 3 presents GB-AS, and Section 4 presents UG-MLP. Section 5 describes the implementation of GB-AS and UG-MLP in multilabel text classification. The experiments are reported in Section 6 and discussed in Section 7. Finally, Section 8 concludes the paper.

Related Work
Hashemi et al. [15] presented MGFS, a graph-based multilabel feature selection method. Their method first forms the correlation distance matrix (CDM), which estimates the correlation distance between features and each class label. The Euclidean distance is then used with the CDM to create a complete weighted feature label network with nodes representing features. Lastly, the relevance of graph nodes is determined via the weighted PageRank algorithm. Huang et al. [4] presented LSML, a technique for learning label-specific features for multilabel classification with missing labels. First, by learning high-order label correlations, a new supplemental label matrix is created from the partial label matrix. Then, for each class label, a label-specific data representation is learned, and the multilabel learner is built concurrently using the learned high-order label correlations. Sun et al. [18] presented a cost-sensitive label ranking approach with low-rank and sparse constraints, called CORALS, to enhance missing labels and remove noisy labels at the same time. The relevance ordering of all possible labels on each instance, including both missing and noisy labels, is optimized by reducing a cost-sensitive ranking loss. Zhu et al. [12] suggested a multilabel feature selection method that addresses missing labels, multilabel learning and feature selection simultaneously. This method, however, characterizes only pairwise label associations, by generating a graph at the instance level.
Meanwhile, He et al. [10] proposed a novel multilabel classification method with label correlations, missing labels and feature selection, named MLMF. MLMF allows for combined learning of multilabel classification and label correlations as well as joint learning of independent binary classifiers. Wu et al. [8] suggested a novel methodology based on a unified network of label dependencies to solve the challenge of multilabel learning with missing labels. To convey label information from provided labels to missing labels, a uniform network of label dependencies is established using a mixed dependency graph.
Guan and Li [7] presented a novel Bayesian model with label regularization and label confidence constraints, named BM-LRC, to handle the difficulties of incomplete labels in multilabel text classification and exploit label correlations with two label constraints. The label manifold regularization might aid in the handling of incomplete labels. The label confidence constraints, contrarily, can prevent overestimation of negative labels induced by regularizing labels, resulting in a safer inference. Ibrahim et al. [9] proposed a weighted loss function to account for the confidence in each label/sample pair that can easily be incorporated to adjust a pre-trained model on missing labels or incomplete labels in multilabel text dataset problems. Pal et al. [11] presented an attention-based graph neural network (AGNET) model for capturing the attentive correlation structure between labels. A feature matrix and a correlation matrix are used by the graph attention network to capture and examine the fundamental dependencies between the labels and to build classifiers for the assignment. The generated classifiers are used on sentence feature vectors obtained from the text feature extraction network to enable end-to-end training. Label imputation in training sets was performed by Ma and Chow using their two-level label recovery approach [13]. This approach recovers the label matrix by using an instance-wise semantic relational graph and a label-wise semantic relational graph. These two graphs show that two-level semantic relationships may be reliably captured. In addition, a label-specific feature selection method was proposed for performing label prediction in testing sets. Wang et al. [14] presented new principles of multilabel information entropy and multilabel correlative information used to identify the unnecessary features, feature independence and feature interaction. 
In a multilabel text dataset, feature interaction is used to choose more valuable characteristics that could otherwise be overlooked due to the inadequate label space.
Song et al. [17] proposed a label mask multilabel text classification model (LM-MTC), which is inspired by the idea of cloze questions in language models. LM-MTC captures implicit relationships among labels through the powerful ability of pre-trained language models. To create a label-based masked language model (MLM), they assigned a separate token to each conceivable label and randomly masked the token with a certain probability. Six multilabel datasets, including the Reuters-21578 text dataset, were used to test the proposed method. The proposed method was compared against eleven other methods, including binary relevance (BR), classifier chains (CC), CNN, CNN-RNN, hierarchical attention network (HAN), HAN + label graph (LG), BERT, BERT + MLM, MEGNET and label-wise (LW). On all datasets, the proposed method outperformed the other methods in terms of the F-measure. Li and Yang [18] proposed a dependence maximization-based label embedding approach for obtaining the latent space, where the label and feature information can both be included at the same time. In addition, instead of using the encoding method, the low-rank factorization model on the label matrix is used to leverage label correlations. The Hilbert-Schmidt independence criterion increases the dependence between the feature space and label space in order to improve predictability. For multilabel text classification, a CNN integrated with a capsule network was presented by Yan et al. [19]. To extract information relating to classification outcomes in high-dimensional features, they utilized a capsule network instead of a pooling layer in the CNN. In addition, they explored joining a recurrent neural network (RNN) with a convolutional neural network (CNN) to describe the frequency and space properties of a capsule network to complete categorization. Two multilabel datasets, including the Reuters-21578 text dataset, were used to evaluate the proposed method. The results demonstrated that the proposed method outperformed Conv, Conv-Cap and Rec-Conv. In terms of the F-measure, the proposed method achieved a higher result on all the datasets.
Class labels typically have connections with one another in multilabel learning. Previous research has shown that solving the problem of missing labels can considerably increase multilabel learning performance, both theoretically and practically [7][8][9][10][11]. However, when class labels from the training data are missing, the label correlation directly acquired from the incomplete label matrix may be erroneous, greatly affecting the performance of multilabel classifiers. In the meantime, previous approaches to multilabel learning with missing labels primarily use a similar representation of the data consisting of all the features in the discrimination of all the class labels [4,16,18]. As previously described, this common strategy may be suboptimal. Therefore, the difficult problem of multilabel learning with missing labels is how to learn accurate label correlations from the incomplete label data and use them to guide the development of classification models.
In this paper, we propose GB-AS to predict the missing labels and then further improve it by combining it with label correlations to create a unified graph method, called UG-MLP, to improve the performance of multilabel learning.

Graph-Based Aggregated Similarity Weighting of Features and Labels Method (GB-AS)
This section presents the proposed graph-based aggregated similarity weighting of features and labels method, GB-AS, to aid in handling missing labels in multilabel text classification. To solve missing label problems, GB-AS recovers the underlying label matrix (Y) via transferring the label information from the feature space and label distribution of the nearest neighbor to the label space. Nearest neighbor instances often share similar features and label information, which means that it is highly likely that these instances have the same set of labels. Prior to describing the GB-AS method, the following presents an instance (document) in a weighted graph representation.

Instance-Level Feature/Label Graph Construction
A weighted graph of instances is constructed based on an assumption: an edge connecting two instances carries a notion of similarity. Thus, if instances are connected, they share similar features and label information, which means that it is highly likely that these instances have the same set of labels. The graph shown in Figure 1 demonstrates the similarity between different documents, e.g., between documents D1 and D3, there is a similarity of 70%. Therefore, this information can be used to predict missing labels. Multilabel text data can be represented using either a feature space representation or a label space representation as follows:
Feature space representation: $x_i = (f_{i1}, f_{i2}, \dots, f_{im})$, where $x_i$ is an instance (text or document), and $f_{ij}$ is the value of feature $j$ in document $x_i$.
Label space representation: $x_i = (l_{i1}, l_{i2}, \dots, l_{iz})$, where $x_i$ is an instance (text or document), and $l_{ij}$ is the value of label $j$ in document $x_i$.
Based on the representation of the instance-level feature and label space above, the handling of the missing label problem is based on the following considerations: • Graph-based missing label handling with feature similarity weighting (GB-FS); • Graph-based missing label handling with label similarity weighting (GB-LS).
With the assumption that combining information from both the features and the labels makes missing label prediction more accurate, we propose a graph-based missing label handling method that aggregates both document-level feature-based weighting and label-based weighting into one aggregated similarity weighting, GB-AS. Figure 2 illustrates the construction of the GB-AS algorithm, where the output of GB-AS is an input to predict missing labels. The following subsections describe the four phases of the graph-based aggregated similarity weighting of features and labels to predict missing labels shown in Figure 2.

Phase 1. Instance-Level Feature Space Similarity Weighting
As mentioned above, the feature space representation represents each instance (document) as a vector of feature values. The instance feature space similarity weighting is obtained as follows:
1. Given the feature space representation matrix, each instance is modeled as a vector of feature values, $x_i = (f_{i1}, \dots, f_{im})$ and $x_j = (f_{j1}, \dots, f_{jm})$, where both $x_i$ and $x_j$ are documents.
2. To compute the pairwise similarity between two documents, each represented by a numerical feature vector, the cosine similarity is used. Cosine similarity is one of the most well-known similarity measures applied to text documents [20,21] and is given by Equation (1):
$$\cos(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert} \quad (1)$$
3. The feature space is built using the k-highest similarity neighborhoods to maintain the intrinsic local correlation information. To ensure feature space representation validity in recovering label structures, the weight matrix of the feature space similarity, $W^F$, is defined as in Equation (2), where $W^F \in \mathbb{R}^{n \times n}$ and $N_k(x_i)$ denotes the k-highest similarity neighborhood of the $i$-th instance measured by $\cos(x_i, x_j)$; $\ln$ is the natural logarithm; and $\sum_{x_j \in N_k(x_i)} \cos(x_i, x_j)$ is the summation of the cosine similarities of the $i$-th instance with all its k-highest similarity neighbors. The summation is used as a normalization method to make sure that the similarities are between 0 and 1.
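The feature-space weighting of Phase 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the normalization assumed here simply divides each retained cosine similarity by the sum over the k most similar neighbors, without the logarithmic term that Equation (2) mentions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors (Equation (1))."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for an all-zero document vector
    return dot / (norm_u * norm_v)


def knn_feature_weights(docs, k):
    """Build an n x n weight matrix that keeps, for each document, only its
    k most similar neighbours and normalises the kept similarities to sum to 1.

    ASSUMPTION: plain sum-normalisation stands in for the paper's Equation (2).
    """
    n = len(docs)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = [(cosine_similarity(docs[i], docs[j]), j) for j in range(n) if j != i]
        sims.sort(key=lambda pair: pair[0], reverse=True)
        top = sims[:k]
        total = sum(sim for sim, _ in top)
        if total > 0.0:
            for sim, j in top:
                weights[i][j] = sim / total
    return weights


# Four toy documents described by three feature values each.
docs = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.9], [0.0, 1.0, 0.0], [0.1, 1.0, 0.0]]
W_feat = knn_feature_weights(docs, k=2)
```

Each row of `W_feat` then sums to 1 over the retained neighbors, which keeps every weight between 0 and 1 as the text requires.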

Phase 2. Instance-Level Label Space Similarity Weighting
The label space representation represents each instance (document), X, as a vector of its assigned labels, l, where the labels are assigned by a human annotator for training data or by a multilabel classifier for testing data.
The instance label similarity matrix is obtained as follows:
1. Given the label space representation matrix, each instance is modeled as a vector of assigned label values, $x_i = (l_{i1}, \dots, l_{iz})$ and $x_j = (l_{j1}, \dots, l_{jz})$, where both $x_i$ and $x_j$ are documents, and $l_{ik}$ is 1 if label $l_k$ is assigned to document $x_i$, 0 if $l_k$ is not assigned to document $x_i$ or −1 if $l_k$ is unknown or missing for document $x_i$.
2. The Hamming-based similarity is often used to measure label similarity relationships in multilabel datasets, and it indicates how close the label sets of two instances are [23][24][25]. $R(x_i, x_j)$ is the complement of the normalized Hamming distance between the label sets of two instances. It is defined as follows (see Equation (3)):
$$R(x_i, x_j) = 1 - \frac{|Y_i \,\Delta\, Y_j|}{z} \quad (3)$$
where $Y_i$ and $Y_j$ are the label sets of $x_i$ and $x_j$, $|Y_i \,\Delta\, Y_j|$ is the number of labels that have different assignments in $x_i$ and $x_j$, $\Delta$ is the symmetric difference between its two arguments, $|\cdot|$ is the cardinality of the resulting set and $z$ is the number of labels in the label set.

The weighted matrix of the label space, $W^L$, is constructed as in Equation (4), where $W^L \in \mathbb{R}^{n \times n}$ and $N_k(x_i)$ denotes the k-highest similarity neighborhood of the $i$-th instance measured by the Hamming-based similarity, and $\sum_{x_j \in N_k(x_i)} R(x_i, x_j)$ is the summation of the Hamming-based similarities of the $i$-th instance with all its k-highest similarity neighbors. The summation is used as a normalization method to make sure that all similarities are between 0 and 1.
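As a small illustration of the Hamming-based similarity of Equation (3), the following Python sketch compares two label vectors position by position. The function name is hypothetical, and treating a missing entry (−1) like any other value during the comparison is an assumption made here, not something the text specifies.

```python
def hamming_similarity(labels_a, labels_b):
    """Complement of the normalised Hamming distance between two label vectors.

    ASSUMPTION: entries are compared as raw values, so a missing label (-1)
    counts as different from both 0 and 1.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    z = len(labels_a)  # number of labels in the label set
    differing = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return 1.0 - differing / z


# Two documents that disagree on exactly one of four labels.
similarity = hamming_similarity([1, 0, 1, 0], [1, 1, 1, 0])
```

The resulting similarities can be restricted to the k most similar neighbors and normalized by their sum in the same way as the feature-space weights.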

Phase 3. Instance-Level Aggregated Similarity Weighting
In the previous steps (Phase 1 and Phase 2), two document-level weighting matrices are obtained. The first one is the document-level feature-based weighting matrix, in which the similarities between documents are estimated based on the weighting of their shared features (see Phase 1). The second one is the document-level label-based weighting matrix, in which the similarities between documents are calculated based on their shared labels (see Phase 2).
The similarity weighting of GB-AS is the document-level aggregated similarity weighting of both document-level feature-based weighting and document-level label-based weighting. The aggregated similarity weighting is given by Equation (5), in which two weights control the contributions of the feature space similarity matrix (see Equation (2)) and the label space similarity matrix (see Equation (4)); the weights are decided by the characteristics of these two matrices. Several experiments are conducted to find the best values of the two weights.
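One plausible reading of the aggregation step is a weighted sum of the two matrices. The Python sketch below assumes a convex combination controlled by a single hypothetical parameter `feature_weight`; the exact form of Equation (5) and the weight values used in the paper are not reproduced here.

```python
def aggregate_weights(w_feature, w_label, feature_weight=0.5):
    """Combine feature-based and label-based weight matrices entry by entry.

    ASSUMPTION: the two weights are non-negative and sum to 1, so the label
    weight is derived as 1 - feature_weight.
    """
    label_weight = 1.0 - feature_weight
    return [
        [feature_weight * f + label_weight * l for f, l in zip(row_f, row_l)]
        for row_f, row_l in zip(w_feature, w_label)
    ]


# Two-document toy example with equal weights for both views.
W_agg = aggregate_weights([[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.5], [0.5, 0.0]])
```

In practice the weight would be tuned experimentally, as the text describes.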

Phase 4: GB-AS
As shown in Algorithm 1, step 1 builds a graph using Equation (5). The steps of the algorithm are described in more detail below (see Algorithm 1):
Step 1: A graph $G = \{V, E, W\}$ is constructed, consisting of nodes $V$, which represent documents, and edges $E$, which reflect the similarity between the vertices.
Step 2: The transition matrix or weighting matrix $T$ is obtained using either the document-level feature-based weighting matrix (Equation (2)), the document-level label-based weighting matrix (Equation (4)) or the aggregated similarity weighting matrix (Equation (5)). The weight between the vertex of document $x_i$ and the vertex of document $x_j$ represents the transition probability of information between these two vertices.
Step 3: Next, the algorithm initializes the label matrix $Y$. For the label matrix $Y$, $Y_{i,k}$ is 1 if document $x_i$ is labeled as class $k$, and 0 if document $x_i$ is not labeled as class $k$ or if the label is missing.
Step 4: First iterative step in missing label prediction: In this step, the new label matrix is calculated using Equation (6) [26], where $\varphi$ is used to control the percentage of label values which are obtained from neighbors in each iteration. If $\varphi$ approaches 0, label values are obtained from the neighbors gradually (iteration after iteration); if it approaches 1, the node's label is decided by its neighbors completely.
Step 5: Second iterative step in label matrix normalization: This step fixes the label matrix by resetting the values of assigned and unassigned labels to their initial state. That is, any $Y_{i,k}$ that has a value of 1 or 0 in the original label matrix and whose value changed in the previous iteration returns to its original value of 1 or 0. Only the entries that were missing in the original label matrix keep their updated values, so they change gradually over the iterative process. At the end of the iteration process, each label that was missing in the original matrix is assigned a 1 if its final value approaches 1 and a 0 if its final value approaches 0.

Algorithm 1
Input: … (2) and (4); α: parameter used in Equation (6)
Output: Y, the labeled matrix
// Build graph
Step 1: W ← build_graph() using either Equation (2), (4) or (5)
Step 2: T ← obtain_transition_matrix(W)
Step 3: initialize the label matrix

Once GB-AS has been constructed (as in Figure 1), GB-AS can be used as an input in UG-MLP. The following describes, in detail, how GB-AS is used in UG-MLP, as well as how UG-MLP predicts missing labels.
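The iterative Steps 4 and 5 of GB-AS can be sketched in Python as a propagate-then-clamp loop. This is only an illustration under stated assumptions: the update rule (a blend of the neighbor average and the current value, controlled by `phi`), the fixed iteration count, the 0.5 threshold and the separate `known` mask are all choices made here, not the paper's Equation (6).

```python
def propagate_labels(T, Y, known, phi=0.5, iterations=20, threshold=0.5):
    """Propagate label scores over the transition matrix T and clamp known labels.

    T      -- n x n transition matrix (row i holds the weights of i's neighbours)
    Y      -- n x q initial label matrix with 0 in place of missing labels
    known  -- n x q boolean mask, True where the original label is observed
    ASSUMPTION: update = phi * (neighbour average) + (1 - phi) * current value.
    """
    n, q = len(Y), len(Y[0])
    current = [[float(v) for v in row] for row in Y]
    for _ in range(iterations):
        updated = [[0.0] * q for _ in range(n)]
        for i in range(n):
            for c in range(q):
                neighbour = sum(T[i][j] * current[j][c] for j in range(n))
                updated[i][c] = phi * neighbour + (1.0 - phi) * current[i][c]
        # Step 5: observed labels are reset to their original values.
        for i in range(n):
            for c in range(q):
                if known[i][c]:
                    updated[i][c] = float(Y[i][c])
        current = updated
    # Final decision: only originally missing entries are thresholded.
    return [
        [Y[i][c] if known[i][c] else (1 if current[i][c] >= threshold else 0)
         for c in range(q)]
        for i in range(n)
    ]


# Document 2 has no observed labels and is mostly similar to document 0.
T = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.9, 0.1, 0.0]]
Y = [[1, 0], [0, 1], [0, 0]]
known = [[True, True], [True, True], [False, False]]
recovered = propagate_labels(T, Y, known)
```

In this toy run the unlabeled document inherits the first label from its closest neighbor, while the observed labels of the other two documents are left unchanged by the clamping step.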

Unified Graph-Based Missing Label Propagation (UG-MLP)
Unlike single-label text classification methods, multilabel text classification assigns more than one label to each instance, and these labels are frequently related. Existing methods for handling missing labels in multilabel learning are built based on the assumption that missing label information of an instance can be propagated from its k-nearest neighbors [3,13]. Some of these methods use the first-order label correlation exploitation strategy which ignores label correlations [2][3][4].
Different from most of the existing algorithms [2,4,13], in multilabel classification with missing labels, this work defines the missing label recovery as a function of both (1) information propagated from its k-nearest neighbors based on the feature space-based similarity and label distribution-based similarity to the label space, and (2) label correlation and label-specific feature learning, which are combined into a single framework. A high-order label correlation is learned from the incomplete training data and applied to augment the missing label matrix and guide the construction of multilabel classification models.
To address the missing label problem, this work proposes a merged model of missing label handling by assembling a mixed graph, as shown in Figure 3. The model jointly incorporates (1) the instance-level feature space-based similarity and the label distribution-based similarity, GB-AS, and (2) accurate label correlations. The input of the UG-MLP method shown in Figure 3 is the datasets with missing labels, and its output is datasets with recovered labels. Figure 3 also illustrates the overall architecture of the graph-based integrated information missing label propagation and handling method in multilabel text classification. Each of the phases in Figure 3 is described in detail below.

Phase 1. Inferring Accurate Label Correlation Phase
Using label correlations is essential for handling missing labels in multilabel text classification, and for multilabel text classification in general. In real-world multilabel classification, labels can have strong interdependencies, and some of them may even be missing. If two or more labels have strong interdependencies, i.e., they frequently co-occur, this information can be used in a new case to predict whether one of them is missing. In this phase, the label correlation matrix is obtained using the following steps:
Step 1: Pairwise label probability correlation estimation: A pairwise label probability correlation is obtained from the dataset by calculating the probability of pairwise labels [4]. It is defined as the conditional probability of a label given another label, as shown in Equation (7). Equation (7) is computed from the number of document instances with the conditioning label and the number of document instances that simultaneously have both labels, together with a smoothness parameter that is greater than 0. It is important to note that label correlations defined in this way are not symmetric.
Step 2: Pearson's coefficient estimation: In this step, the label correlation asymmetry from the dataset is obtained by calculating Pearson's correlation coefficient [27]. The input to the algorithm is a $z \times n$ Boolean matrix, $E$, whose rows are indexed by the set of labels, $L = \{l_1, l_2, \dots, l_z\}$, and columns by the elements of a set of text instances, $D = \{d_1, \dots, d_n\}$. If label $l_i$ is assigned to document $d_k$, $e_{i,k} = 1$. The input matrix is obtained from the label space representation. An example of a Boolean matrix, $E$, is shown in Table 1. Pearson's correlation coefficient between labels $l_i$ and $l_j$ is given by Equation (8):
$$\rho(l_i, l_j) = \frac{\frac{1}{n}\sum_{k=1}^{n} (e_{i,k} - \bar{e}_i)(e_{j,k} - \bar{e}_j)}{\sigma_i \, \sigma_j} \quad (8)$$
where $n$ is the size of the dataset, $e_{i,k}$ and $e_{j,k}$ are the values of $l_i$ and $l_j$ assigned to instance document $d_k$, $\bar{e}_i$ and $\bar{e}_j$ are the averages of the values of $l_i$ and $l_j$, respectively, and $\sigma_i$ and $\sigma_j$ are the respective standard deviations of $l_i$ and $l_j$.
Step 3: Cover coefficients: As in step 2, the input to the algorithm is the $z \times n$ Boolean matrix, $E$. The output is a matrix $C$ whose entries denote pairwise cover coefficient values among the labels. The cover coefficient between two labels $l_i$ and $l_j$ is the probability that a text labeled by $l_i$ is also labeled by $l_j$. Informally, the cover coefficient of a label with respect to another denotes the extent to which the assignment profile of the first label is covered by that of the second one [28].
Let $\alpha_i$ be the reciprocal of the sum of the entries in the $i$-th row and $\beta_k$ the reciprocal of the sum of the entries in the $k$-th column of the $E$ matrix. The cover coefficient between labels $l_i$ and $l_j$ is obtained using Equation (9):
$$c_{i,j} = \alpha_i \sum_{k=1}^{n} e_{i,k} \, \beta_k \, e_{j,k} \quad (9)$$
Step 4: Final label correlation estimation: In this work, the final label correlation is calculated from the three label-to-label relation measures, namely, the pairwise label probability correlation (see step 1), Pearson's coefficient (see step 2) and the cover coefficient (see step 3), using the aggregated label correlation function in Equation (10), in which the three measures are weighted equally and the weights sum to 1.
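The four steps of this phase can be sketched together in Python on a small Boolean label-by-document matrix. The sketch is illustrative rather than the authors' code: the function names are hypothetical, the additive smoothing form used for the conditional probability is an assumption, Pearson's coefficient uses population standard deviations, and the final combination assumes equal weights of one third each.

```python
import math


def conditional_correlation(E, s=1.0):
    """P[i][j] approximates P(label j | label i).

    ASSUMPTION: smoothing of the form (both + s) / (count_i + 2 * s).
    """
    z = len(E)
    counts = [sum(row) for row in E]
    return [
        [(sum(a * b for a, b in zip(E[i], E[j])) + s) / (counts[i] + 2.0 * s)
         for j in range(z)]
        for i in range(z)
    ]


def pearson_correlation(E):
    """Pearson's coefficient between every pair of label rows of E."""
    z, n = len(E), len(E[0])
    means = [sum(row) / n for row in E]
    stds = [math.sqrt(sum((v - means[i]) ** 2 for v in E[i]) / n) for i in range(z)]
    rho = [[0.0] * z for _ in range(z)]
    for i in range(z):
        for j in range(z):
            if stds[i] > 0.0 and stds[j] > 0.0:
                cov = sum((E[i][k] - means[i]) * (E[j][k] - means[j]) for k in range(n)) / n
                rho[i][j] = cov / (stds[i] * stds[j])
    return rho


def cover_coefficients(E):
    """c[i][j] = alpha_i * sum_k e[i][k] * beta_k * e[j][k]."""
    z, n = len(E), len(E[0])
    alpha = [1.0 / sum(row) if sum(row) else 0.0 for row in E]
    col_sums = [sum(E[i][k] for i in range(z)) for k in range(n)]
    beta = [1.0 / c if c else 0.0 for c in col_sums]
    return [
        [alpha[i] * sum(E[i][k] * beta[k] * E[j][k] for k in range(n)) for j in range(z)]
        for i in range(z)
    ]


def combined_correlation(E):
    """Equal-weight average of the three measures (assumed weights of 1/3 each)."""
    P, rho, C = conditional_correlation(E), pearson_correlation(E), cover_coefficients(E)
    z = len(E)
    return [[(P[i][j] + rho[i][j] + C[i][j]) / 3.0 for j in range(z)] for i in range(z)]


# Three labels (rows) assigned across four documents (columns).
E = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1]]
LC = combined_correlation(E)
```

Because the conditional probability and the cover coefficient are both directional, the combined matrix `LC` is generally not symmetric.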

Phase 2. Prior Missing Label Induction Phase
This phase uses the output from phase 1, which is the label correlation matrix, to give a prior estimation of the missing labels in the label matrix .
Based on the problem definition, the label matrix $Y$ of multilabel text classification with missing labels is given in Equation (11). The final output of this phase is a new label matrix with a prior prediction of missing values, which is used in the missing label prediction iterative step, Equation (12).

Phase 3. UG-MLP Phase
In this phase, UG-MLP is implemented. UG-MLP addresses the missing label problem by jointly incorporating (i) the instance-level feature space-based similarity and label distribution-based similarity and (ii) accurate label correlations. The steps of the UG-MLP algorithm are described in more detail below (see Algorithm 2):
Step 1: Missing label prediction iterative step: In this step, the new label matrix is calculated using Equation (12) [10], where $\alpha$ is used to control the percentage of label values which are obtained from neighbors in each iteration. If $\alpha$ approaches 0, label values are obtained from neighbors iteratively (slowly). If $\alpha$ approaches 1, the node's label is decided by its neighbors completely.
Step 2: Second iterative step in label matrix normalization: This step fixes the label matrix by resetting the values of assigned and unassigned labels to their initial state. That is, any entry that has a value of 1 or 0 in the original label matrix and whose value changed in the previous iteration returns to its original value of 1 or 0. Only the entries that were missing in the original label matrix keep their updated values, so they change gradually over the iterative process. At the end of the iteration process, each label that was missing in the original matrix is assigned a 1 if its final value approaches 1 and a 0 if its final value approaches 0.

Implementation of GB-AS and UG-MLP in Multilabel Text Classification
To assess whether the proposed GB-AS and UG-MLP methods are effective in solving multilabel classification with missing or incomplete labels, we implemented both GB-AS and UG-MLP in multilabel text classification. Figure 4 illustrates the overall architecture of all the stages in implementing the GB-AS and UG-MLP methods, which include the following: (1) pre-processing phase, (2) missing label handling or missing label recovery phase, (3) ensemble feature selection phase, (4) multilabel text classification phase and (5) evaluation phase.

Pre-Processing
Pre-processing is a necessary step before implementing machine learning techniques. It includes four steps, namely, (1) tokenization, (2) normalization, (3) stop word removal and (4) stemming. To begin, tokenization attempts to convert a document's text into a machine learning-friendly structure. The tokenization method entails converting a text into discrete fragments separated by a space or a specific indicator, with each unit matching a single word. Next, the normalization stage seeks to clean the data by removing noise and undesirable data such as special characters. After that, the stop word task is used to eliminate superfluous words such as conjunctions, pronouns and prepositions. Finally, stemming is the process of determining the root or stem of a word. Stemming is a crucial step for dealing with high-dimensional and sparse data, especially with multilabel text data classification, because it isolates the word's root form from its inflectional or derivational variants.
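The four pre-processing steps can be sketched in a few lines of Python. The stop word list and the suffix-stripping rule below are deliberately tiny, hypothetical stand-ins: a real pipeline would use a full stop word list and an established stemmer, and the paper does not state which ones were used.

```python
import re

# ASSUMPTION: a very small illustrative stop word list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "are"}


def naive_stem(token):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenise, normalise, remove stop words and stem a raw document."""
    # Tokenisation and normalisation: lowercase, keep only alphanumeric runs,
    # which drops punctuation and other special characters.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Stop word removal followed by stemming.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The markets are closing, and traders reacted!")
```

The output tokens can then be turned into a document-term matrix, for example with term-frequency weights, before feature selection.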

Multilabel Two-Layer MI and Clustering-Based Ensemble Feature Selection Method (DMMC-EFS)
Feature selection (FS) methods improve the performance of text classification tasks in terms of their learning speed and efficacy. FS methods also decrease the number of data dimensions and eliminate data that are useless, unnecessary or noisy. In order to reduce the dimensionality and to improve the classification performance, this work used the dynamic two-layer MI and clustering-based ensemble feature selection (DMMC-EFS) method suggested in [29] to select features with strong class discrimination ability. The DMMC-EFS method considers the (1) dynamic global weight of features, (2) heterogeneous ensemble and (3) maximum dependency and relevancy and minimum redundancy of features. This method aims to overcome the high dimensionality of multilabel datasets and achieve a superior multilabel text classification performance, as defined in Equations (13) and (14).

Multilabel Classification Model
For evaluation, the AdaBoost.MH multilabel learning model was used in this work. The AdaBoost.MH model was selected because it is a state-of-the-art multilabel classification algorithm that is frequently utilized in multilabel text classification studies [29,30]. AdaBoost.MH iteratively builds several weak classifiers and then combines them into a final classifier that can estimate multiple labels for a given instance. Boosting algorithms, such as AdaBoost (adaptive boosting), transform weak classifiers into a strong one through iterative training and integration. In each iteration, AdaBoost adaptively alters the weight distribution of the training data and selects the best weak classifier under the current sample weight distribution; all weak classifiers are then integrated, each voting with a specific weight, to produce the final classifier model. The AdaBoost.MH algorithm is a multilabel variant of the AdaBoost algorithm [31].
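The re-weighting and weighted voting described above can be illustrated with a simplified sketch of discrete AdaBoost.MH. It is an illustration under assumptions, not the implementation used in the experiments: the weak learner is assumed to be a decision stump on one feature with a per-label sign vote, labels are encoded as +1/-1, and all function names are hypothetical.

```python
import math


def stump_output(x, feature, threshold):
    """Weak rule: +1 if the chosen feature exceeds the threshold, else -1."""
    return 1 if x[feature] > threshold else -1


def train_adaboost_mh(X, Y, rounds=10):
    """X: list of feature vectors; Y: list of label vectors with entries +1/-1."""
    n, k, d = len(X), len(Y[0]), len(X[0])
    # D[i][l] is the weight of the (example i, label l) pair, initially uniform.
    D = [[1.0 / (n * k)] * k for _ in range(n)]
    model = []
    for _ in range(rounds):
        best = None
        for j in range(d):
            for thr in sorted({x[j] for x in X}):
                out = [stump_output(x, j, thr) for x in X]
                # Weighted agreement of the stump with each label.
                edges = [sum(D[i][l] * Y[i][l] * out[i] for i in range(n))
                         for l in range(k)]
                r = sum(abs(e) for e in edges)
                if best is None or r > best[0]:
                    # A negative edge is turned positive by flipping that label's vote.
                    votes = [1 if e >= 0 else -1 for e in edges]
                    best = (r, j, thr, votes)
        r, j, thr, votes = best
        r = min(r, 1 - 1e-10)  # avoid an infinite alpha on a perfect stump
        alpha = 0.5 * math.log((1 + r) / (1 - r))
        model.append((alpha, j, thr, votes))
        # Misclassified (example, label) pairs gain weight; then renormalize.
        for i in range(n):
            out_i = stump_output(X[i], j, thr)
            for l in range(k):
                D[i][l] *= math.exp(-alpha * Y[i][l] * votes[l] * out_i)
        z = sum(map(sum, D))
        D = [[w / z for w in row] for row in D]
    return model


def predict(model, x, k):
    """Weighted vote of all weak classifiers, thresholded at zero per label."""
    scores = [0.0] * k
    for alpha, j, thr, votes in model:
        out = stump_output(x, j, thr)
        for l in range(k):
            scores[l] += alpha * votes[l] * out
    return [1 if s > 0 else -1 for s in scores]
```

The key difference from single-label AdaBoost is that the weight distribution is maintained over (example, label) pairs rather than over examples, so a single round can focus on the labels of an instance that are still predicted incorrectly.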

Multilabel Text Dataset
Three datasets were used in this study: Bibtex, Enron and Reuters-21578, which are described in Table 2. They are publicly available datasets for multilabel text classification problems. For each dataset, the table reports the cardinality, the number of instances, labels and attributes, and the average imbalance ratio per label (avgIR). The cardinality measures the average number of labels per instance. The density is calculated by dividing the cardinality by the total number of labels. These datasets can be downloaded from the Mulan website [32].
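As a small worked example of the two statistics defined above, the following sketch computes cardinality and density for a binary label matrix. The 0/1 list-of-lists encoding and the function names are assumptions made for illustration.

```python
def label_cardinality(Y):
    """Average number of labels assigned to each instance (Y holds 0/1 entries)."""
    return sum(sum(row) for row in Y) / len(Y)


def label_density(Y):
    """Cardinality divided by the total number of labels."""
    return label_cardinality(Y) / len(Y[0])
```

For a toy matrix with four instances and three labels, `[[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]]`, seven labels are assigned in total, so the cardinality is 7/4 = 1.75 and the density is 1.75/3.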

Evaluation Metric and Experiment Setup
The results of the experiment on multilabel classification were measured using the following three evaluation metrics: precision, recall and F-measure, using Equations (15)-(17), respectively. In this domain, these evaluation metrics are well known for drawing comparisons [29,33-35].
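For illustration, the three metrics can be computed from a true and a predicted label matrix as sketched below. The averaging scheme is an assumption: this sketch uses micro-averaging over all (instance, label) pairs, which may differ from the exact form of Equations (15)-(17). The function name and the 0/1 encoding are likewise hypothetical.

```python
def micro_precision_recall_f(Y_true, Y_pred):
    """Micro-averaged precision, recall and F-measure for 0/1 label matrices."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1  # label correctly predicted
            elif p == 1:
                fp += 1  # label predicted but not present
            elif t == 1:
                fn += 1  # label present but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

The F-measure is the harmonic mean of precision and recall, so it is high only when both are high.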
To verify the effectiveness of UG-MLP as well as GB-AS, this study used the Reuters-21578, Bibtex and Enron datasets. Experiments were conducted on the three multilabel datasets under various missing percentages: the percentage of missing labels was set to 10%, 20%, 30%, 40% and 50%, as suggested in [4]. A missing percentage of 0% indicates that the label matrix is complete. The label structure is degraded to a larger extent as the missing percentage rises.
The multilabel classification followed the diagram shown in Figure 3 and was used to classify the datasets. To assess the effectiveness of GB-AS and UG-MLP separately, we set up three multilabel classification configurations. First, as a baseline, neither GB-AS nor UG-MLP was applied, so no label recovery mechanism was used in the multilabel classification. In other words, the baseline only implements the DMMC-EFS feature selection, and thus we named the baseline DMMC-EFS. The second approach applied only GB-AS to recover labels, with DMMC-EFS as the FS method. This was to assess the effectiveness of using only the aggregated feature and label information. The third approach applied UG-MLP to recover labels, with DMMC-EFS as the FS method. This label recovery mechanism uses not only the aggregated feature and label information (as in GB-AS) but also the label correlations. UG-MLP is therefore expected to classify the datasets better than the other approaches regardless of how high the percentage of missing labels is.
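The missing-label settings used in the setup can be simulated as sketched below. The removal protocol is an assumption for illustration: positive entries of the 0/1 label matrix are chosen uniformly at random and set to 0, and the function name, the `seed` parameter and the rounding rule are hypothetical rather than taken from [4].

```python
import random


def mask_labels(Y, missing_rate, seed=0):
    """Return a copy of Y with a fraction of its positive labels set to 0."""
    rng = random.Random(seed)
    # Collect the coordinates of every assigned (positive) label.
    positives = [(i, l) for i, row in enumerate(Y)
                 for l, value in enumerate(row) if value == 1]
    n_drop = round(missing_rate * len(positives))
    masked = [row[:] for row in Y]  # leave the original matrix untouched
    for i, l in rng.sample(positives, n_drop):
        masked[i][l] = 0
    return masked
```

Calling `mask_labels(Y, rate)` for each rate in `(0.1, 0.2, 0.3, 0.4, 0.5)` yields one incomplete training label matrix per missing percentage, while a rate of 0 returns the complete matrix.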

Experimental Results
The F-measure results obtained for DMMC-EFS after label recovery with one of the four missing label handling methods are shown in Table 2 and Figure 4. The results of this experiment lead to largely consistent observations: the incompleteness of class labels significantly influences the performance of multilabel classifiers, and the approaches that model missing labels offer a better performance than DMMC-EFS in most cases. Since DMMC-EFS does not deal with missing labels, its performance degrades rapidly as the missing rate rises. The performance of the missing label handling methods, GB-AS and UG-MLP, declines relatively slowly as the missing rate increases. As expected, the UG-MLP approach outperforms the other methods, which is due to its recovery of the missing labels by exploiting label correlations. Additionally, the results of two state-of-the-art methods (i.e., LM-MTC [17] and Rec-Conv-Cap [19]) that were reviewed in Section 2 are also compared in Table 2. These results are discussed in the next section.

Discussion
From the results obtained (see Tables 3-5) for the missing label handling methods GB-AS and UG-MLP on all the datasets, the following observations were made. The missing class label problem influences the performance of multilabel text classification. Multilabel classifiers that achieve high results on datasets with complete labels, such as the multilabel classifier with DMMC-EFS, degrade rapidly as the missing rate increases. Therefore, to preserve its performance, a multilabel text classification model should include a missing label handling method. In contrast, the performance of multilabel learning with DMMC-EFS combined with a missing label handling method, i.e., GB-AS or UG-MLP, declines relatively slowly as the missing rate increases. The results for the Bibtex dataset are better than those for the Reuters-21578 dataset. In addition, comparing the proposed methods against LM-MTC [17] and Rec-Conv-Cap [19] on the Reuters-21578 dataset shows that, on average, the proposed methods achieve a better performance. This difference could be because the other methods do not utilize unified graphs. It is also worth noting that the performance of some of the algorithms used for comparison (especially DMMC-EFS) declines quickly on the datasets with a high class imbalance ratio (see Table 2). The missing label problem might be directly affected by the balance of the classes. Conversely, recovering missing labels may itself affect class imbalance, as the two are related: if a dataset initially has a high class imbalance ratio, the imbalance may become worse after the missing labels are recovered. Investigating this interaction is a promising direction for future work.
The superior performance of the proposed UG-MLP over GB-AS indicates the effectiveness of unifying label space recovery with nearest neighbor similarity learning, and of recovering the label space by exploiting sparse high-order label correlations. It also indicates the effectiveness of the proposed method in solving multilabel learning with missing labels. As the line graphs in Figures 5-7 show, UG-MLP outperforms GB-AS on all datasets, and its performance responds to an increasing missing rate in a similar way on all datasets. The solutions provided in this work might be useful in a variety of applications, as they show a performance increase compared to the baseline. For example, organizations that manage text files, such as those in the health sector, may utilize the solutions provided in this study.

Conclusions
This paper presented a scalable multilabel text classification method to handle the missing label and label correlation problems of multilabel datasets. It designed several missing label prediction methods that are used together with multilabel feature selection. First, this paper introduced the GB-AS method for multilabel text classification. Then, it proposed a new method, UG-MLP, which unifies label space recovery with nearest neighbor similarity learning and recovers the label space by exploiting sparse high-order label correlations. The results obtained with UG-MLP indicate its effectiveness in solving multilabel learning with missing labels. They also show that reducing the number of missing labels has a direct impact on the performance of multilabel text classification. However, the computational complexity of the proposed methods is higher than that of the baseline methods, as the running time increased notably. This may be considered a limitation, and investigating it is suggested as future work.

Conflicts of Interest:
The authors declare no conflict of interest.