In this section, we begin by introducing the concepts of MLC, CC, and feature selection, which establish the theoretical foundation of our proposed approach. We then review the CC-based MLC algorithms.
2.1. MLC Problem and CC Approach
MLC is a supervised learning problem in which an object is naturally associated with multiple concepts. Exploring the couplings between labels is important because they can improve the predictive performance of MLC methods. To describe our algorithm, some basic aspects of MLC and CC are outlined first. Suppose $(x, y)$ represents a multi-label sample, where $x$ is an instance and $y \subseteq L$ is its corresponding label set. $L$ is the total label set, which is defined as follows:

$L = \left\{ l_1, l_2, \cdots, l_Q \right\}.$

We assume that $\mathit{x} = \left( x^1, x^2, \cdots, x^D \right) \in X$ is the D-dimensional feature vector corresponding to $x$, where $X \subseteq R^D$ is the feature vector space and $x^d$ $\left( d = 1, 2, \cdots, D \right)$ denotes a specific feature. $\mathit{y} = \left( y^1, y^2, \cdots, y^Q \right) \in \left\{ 0, 1 \right\}^Q$ is the Q-dimensional label vector corresponding to $y$, and $y^q$ is described as:

$y^q = \begin{cases} 1, & l_q \in y \\ 0, & l_q \notin y \end{cases} \quad \left( q = 1, 2, \cdots, Q \right).$

Thus, the multi-label classifier $h$ can be defined as:

$h: X \to \left\{ 0, 1 \right\}^Q.$

We further assume that there are $m + n$ samples, of which $m$ samples form the training set $X_{train}$ and $n$ samples form the test set $X_{test}$. They are defined as follows:

$X_{train} = \left\{ \left( \mathit{x}_i, \mathit{y}_i \right) \mid i = 1, 2, \cdots, m \right\}, \qquad X_{test} = \left\{ \mathit{x}_i \mid i = m+1, \cdots, m+n \right\}.$
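As a concrete illustration of these definitions, the following minimal sketch (with made-up sizes; the variable names are ours, not part of the method) lays out $m$ training and $n$ test instances with $D$-dimensional feature vectors and $Q$-dimensional binary label vectors:

```python
# Hypothetical sizes: m training instances, n test instances,
# D features, Q labels.
m, n, D, Q = 4, 2, 3, 2

# Feature vectors x_i in R^D and binary label vectors y_i in {0,1}^Q.
X_train = [[0.1, 0.5, 0.9], [0.4, 0.2, 0.7], [0.8, 0.6, 0.1], [0.3, 0.9, 0.5]]
Y_train = [[1, 0], [0, 1], [1, 1], [0, 0]]
X_test = [[0.2, 0.4, 0.6], [0.7, 0.1, 0.3]]

assert len(X_train) == m and len(X_test) == n
assert all(len(x) == D for x in X_train + X_test)
assert all(len(y) == Q and set(y) <= {0, 1} for y in Y_train)
```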
Among MLC algorithms, CC may be the best-known method concerning label correlations. Like the binary relevance (BR) method, it involves $Q$ binary classifiers. The BR method transforms MLC into a binary classification problem for each label and trains $Q$ binary classifiers $C_j$, $j = 1, 2, \dots, Q$. In the CC algorithm, the classifiers are linked along a chain, where each classifier handles the BR problem associated with $l_j \in L$, $j = 1, 2, \dots, Q$. The feature space of each link in the chain is extended with the 0/1 label associations of all previous links. The training and prediction phases of CC are described in Algorithms 1 and 2 [13]:
Algorithm 1. The training phase of the classifier chain (CC) algorithm 
Input: $X_{train}$
Output: $C_j$, $j = 1, 2, \dots, Q$
Steps:
for $j = 1, 2, \dots, Q$ do /* the j-th binary transformation and training */
  $X'_{train} \leftarrow \left\{ \right\}$
  for $\left( \mathit{x}_i, \mathit{y}_i \right) \in X_{train}$, $i = 1, 2, \dots, m$ do
    $X'_{train} \leftarrow X'_{train} \cup \left\{ \left( \left( \mathit{x}_i, l_1, \dots, l_{j-1} \right), l_j \right) \right\}$
  train $C_j: X'_{train} \to l_j \in \left\{ 0, 1 \right\}$ /* train $C_j$ to predict the binary relevance of $l_j$ */

After the training step, a chain $C_1, \dots, C_Q$ of binary classifiers is generated. As shown in Algorithm 2, each classifier $C_j$ in the chain learns and predicts the binary association of label $l_j$, $j = 1, 2, \dots, Q$, augmented by all prior binary relevance predictions in the chain, $l_1, \dots, l_{j-1}$.
Algorithm 2. The prediction phase of the CC algorithm 
Input: $X_{test}$; $C_j$, $j = 1, 2, \dots, Q$
Output: the predicted label set of each instance in $X_{test}$
Steps:
for $\mathit{x}_i \in X_{test}$, $i = m+1, \dots, m+n$ do
  $\mathit{y}'_i \leftarrow \left\{ \right\}$
  for $j = 1$ to $Q$ do
    $\mathit{y}'_i \leftarrow \mathit{y}'_i \cup \left\{ \left( l_j \leftarrow C_j : \left( \mathit{x}_i, l_1, \dots, l_{j-1} \right) \right) \right\}$
return $\left( \mathit{x}_i, \mathit{y}'_i \right)$, $i = m+1, \dots, m+n$ /* the classified test samples */

The chaining mechanism of the CC algorithm transmits label information among the binary classifiers, thereby accounting for label couplings and overcoming the label-independence assumption of the BR method.
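The two phases above can be sketched compactly. The following is a minimal, dependency-free illustration, not the authors' implementation; `MajorityStub` is a deliberately trivial hypothetical base learner (any binary classifier exposing `fit`/`predict` could take its place):

```python
class MajorityStub:
    """Trivial binary base learner: always predicts the majority class
    seen during training (a placeholder for a real classifier)."""
    def fit(self, X, y):
        self.label = 1 if sum(y) * 2 >= len(y) else 0
        return self

    def predict(self, x):
        return self.label


def cc_train(X_train, Y_train, base=MajorityStub):
    """Algorithm 1: train Q chained binary classifiers C_1..C_Q."""
    Q = len(Y_train[0])
    chain = []
    for j in range(Q):
        # Extend each instance's features with the true labels l_1..l_{j-1}.
        X_ext = [x + y[:j] for x, y in zip(X_train, Y_train)]
        y_j = [y[j] for y in Y_train]
        chain.append(base().fit(X_ext, y_j))
    return chain


def cc_predict(chain, x):
    """Algorithm 2: predict labels one by one, feeding each prediction
    down the chain as an extra feature."""
    labels = []
    for C_j in chain:
        labels.append(C_j.predict(x + labels))
    return labels


X = [[0.1], [0.2], [0.3]]
Y = [[1, 0], [1, 1], [0, 1]]
chain = cc_train(X, Y)
print(cc_predict(chain, [0.5]))  # -> [1, 1] with this stub learner
```

With `MajorityStub` the predictions ignore the features entirely; the point of the sketch is only the chaining of label predictions into the feature space, not the quality of the base learner.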
2.3. Related Work on CC-Based Approaches
The CC algorithm uses a high-order strategy to tackle the MLC problem; however, its performance is sensitive to the choice of label order. Much of the existing research has focused on solving this problem. Read et al. [19] proposed the ensemble of classifier chains (ECC) method, in which the CC procedure is repeated several times with randomly generated orders and all the classification results are fused into the final decision by voting. Chen et al. [20] adopted kernel alignment to calculate the consistency between the labels and the kernel function and then assigned a label order according to the consistency result. Read et al. [21] presented a novel double-Monte Carlo scheme to find a good label sequence. The scheme explicitly searches the space of possible label sequences during the training stage and makes a trade-off between predictive performance and scalability. Genetic algorithms (GAs) have been used to optimize the label ordering, since a GA has the global search capability needed to explore the extremely large space of label permutations [22,23]. The difference between them is that one of the works [23] adopts multi-objective optimization to balance classifier performance by considering both predictive accuracy and model simplicity. Li et al. [24] applied community division technology to divide the label set and acquire the relationships among labels; all the labels are then ranked by their importance.
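The fusion step of the ECC method mentioned above (several chains trained under random label orders, aggregated by majority vote) can be sketched as follows; the per-chain predictions here are made-up placeholders standing in for the outputs of independently trained chains:

```python
import random


def ecc_fuse(chain_predictions, threshold=0.5):
    """Majority-vote fusion of binary label vectors predicted by
    several chains for a single instance."""
    n_chains = len(chain_predictions)
    Q = len(chain_predictions[0])
    votes = [sum(pred[j] for pred in chain_predictions) for j in range(Q)]
    return [1 if v / n_chains >= threshold else 0 for v in votes]


# Each ensemble member would train a CC under its own random label order.
rng = random.Random(42)
orders = [rng.sample(range(4), 4) for _ in range(3)]

# Hypothetical predictions from three chains for one test instance.
preds = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 1]]
print(ecc_fuse(preds))  # -> [1, 0, 0, 0] by majority vote
```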
Some of the existing literature [25,26,27,28,29,30,31,32,33] adopted graph representations to express label couplings and rank labels simultaneously. Sucar et al. [25] introduced a method of chaining Bayesian classifiers that integrates the advantages of CC and Bayesian networks (BN) to address the MLC problem. Specifically, they adopted the tree-augmented naïve (TAN) Bayesian network to represent the probabilistic dependency relationships among labels and only inserted the parent nodes of each label into the chain, according to a specific selection strategy for the tree root node. Zhang et al. [26] used mutual information (MI) to describe the label correlations and constructed a corresponding TAN Bayesian network; the authors then applied a stacking ensemble method to build the final learning model. Fu et al. [27] adopted MI to represent label dependencies and then built a related directed acyclic graph (DAG). The Prim algorithm was then used to generate the maximum spanning tree (MST). For each label, this algorithm found its parent labels from the MST and added them to the chain. Lee et al. [28] built a DAG of labels in which the correlations between parent and child nodes were maximized. Specifically, they quantified the correlations with the conditional entropy (CE) method and found a DAG that maximized the sum of CE between all parent and child nodes. They discovered that highly correlated labels can be sequentially ordered in chains obtained from the DAG. Varando et al. [29] studied the decision boundary of the CC method when Bayesian network-augmented naïve Bayes classifiers were used as base models. They found polynomial expressions for the multi-valued decision functions and proved that the CC algorithm provides a more expressive model than the BR method. Chen et al. [30] first used the affinity propagation (AP) [31] clustering approach to partition the training label set into several subsets. For each label subset, they adopted the MI method to capture label correlations and constructed a complete graph. The Prim algorithm was then applied to learn the tree-structured constraints (in MST style) among the different labels. In the end, the ancestor nodes of each label were found from the MST and inserted into the chain. Huang et al. [32] first used the k-means algorithm to cluster the training dataset into different groups. The label dependencies of each group were then expressed by the co-occurrence of label pairs, and the corresponding labels were modeled by a DAG. Finally, the parent labels of each label were inserted into the chain. Sun et al. [33] used the CE method to model label couplings and constructed a polytree structure in the label space. For each label, its parent labels were inserted into the chain for further prediction. Targeting the two drawbacks of the CC algorithm mentioned in Section 1, Kumar et al. [34] adopted the beam search algorithm to prune the label tree and find the optimal label sequence from the root to one of the leaf nodes.
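Several of the graph-based approaches above (e.g., [27,30]) share a common step: running the Prim algorithm on a matrix of pairwise label-dependence scores to obtain a maximum spanning tree. A minimal sketch of that step, using a made-up MI matrix for four labels (the matrix values are illustrative only):

```python
def max_spanning_tree(W):
    """Prim's algorithm for a maximum spanning tree over a dense,
    symmetric weight matrix W (e.g., pairwise label mutual information).
    Returns the tree edges as (parent, child) pairs, rooted at node 0."""
    Q = len(W)
    in_tree = {0}
    edges = []
    while len(in_tree) < Q:
        # Pick the heaviest edge leaving the current tree.
        u, v = max(((i, j) for i in in_tree
                    for j in range(Q) if j not in in_tree),
                   key=lambda e: W[e[0]][e[1]])
        edges.append((u, v))
        in_tree.add(v)
    return edges


# Hypothetical MI matrix for 4 labels (symmetric, zero diagonal).
MI = [[0.0, 0.9, 0.1, 0.2],
      [0.9, 0.0, 0.3, 0.8],
      [0.1, 0.3, 0.0, 0.4],
      [0.2, 0.8, 0.4, 0.0]]
print(max_spanning_tree(MI))  # -> [(0, 1), (1, 3), (3, 2)]
```

In the cited methods, the parent (or ancestor) labels read off such a tree are the ones inserted into each label's chain.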
In addition to the aforementioned graph-based CC algorithms, and considering conditional label dependence, Dembczyński et al. [35] introduced probability theory into the CC approach and outlined their probabilistic classifier chains (PCC) method. Read et al. [36] extended the CC approach to the classifier trellises (CT) method for large datasets, where the labels are placed in an ordered procedure according to the MI measure. Wang et al. [37] proposed the classifier circle (CCE) method, where each label is traversed several times (just once in CC) to adjust the classification result of each label. This method is insensitive to label order and avoids the problems caused by improper label sequences. Jun et al. [38] found that labels with higher entropy should be placed after those with lower entropy when determining the label order. Motivated by this idea, they went on to propose four ordering methods based on CE and suggested that the proposed methods do not need to train more classifiers than the CC approach. In addition, Teisseyre [39] and Teisseyre, Zufferey and Słomka [40] proposed two methods that combine the CC approach and the elastic net. The first integrated feature selection into the proposed CCnet model, and the second combined the CC method and regularized logistic regression with a modified elastic-net penalty in order to handle cost-sensitive features in some specific applications (for example, medical diagnosis).
In summary, in order to address the label ordering problem, almost all of the published CC-based algorithms adopt some ranking method to determine a specific label order (over all of the labels or just a subset of them). All of these methods are reasonable, but it is hard to judge which label order (or label ordering method) is best for a specific application. Furthermore, some of these studies adopted different measures (for example, MI, CE, conditional probability, co-occurrence, and so on) to explore label correlations, but they focused only on the label space; the coupling relationships were insufficiently exploited. In addition, the CC-based algorithms in these studies add the previous labels into the feature space to predict the current label, which results in an excessively large feature space, especially for large label sets. Thus, feature selection is a necessary stage in CC-based algorithms. In this work, we propose a novel MLC algorithm based on the CC method and feature selection that avoids the label ranking problem and exploits the coupling relationships in both the label and feature spaces.
Section 3 provides a detailed description of the proposed method.