1. Introduction
Multilabel learning extends the standard supervised learning model, which associates each sample with a single label, by allowing samples to be assigned more than one class label simultaneously. In the past, multilabel learning has been successfully applied in many different domains [
1], such as bioinformatics [
2,
3,
4], information retrieval [
5,
6], speech recognition [
7,
8], and online user reviews with negative comment classification [
9,
10].
As proposed by various researchers, one of the most intuitive ways to handle multilabel classification is to treat it as a series of independent two-class binary classification problems, known as Binary Relevance (BR) [
11,
12]. However, this approach has several limitations: its performance is relatively poor, it lacks scalability, and it cannot capture label correlations. Researchers have tried to address these issues using chains of binary classifiers [
13], feature selections [
14,
15,
16], class dependent and label specific features [
17], and data augmentation [
5]. Among these investigations, the augmentation method in [
5] has demonstrated the best performance on some multilabel datasets using Incorporating Multiple Cluster Centers (IMCC). In biomedical datasets, hML-KNN [
18] has also proven to be one of the best multilabel classifiers, its success relying on feature-based and neighbor-based similarity scores.
Generating ensembles of classifiers is yet another robust and dependable method for enhancing performance on multilabel data [
15,
19], with bagging shown to perform well in several multilabel classification benchmarks [
20,
21]. However, a common problem in bagging is pairwise label correlation. To solve this problem, [
22] proposed a stacking technique in which a learner was trained on each label. The decisions of these classifiers were then combined by a meta-level classifier whose output produced the final decision. Another algorithm of note is RAndom k-labELsets (RAkEL) [
19], which constructed ensembles by training single-label learning models on random subsets of labels. The reader is referred to [
23,
24] for a discussion of some recent RAkEL variants.
Rather quickly, multilabel systems incorporating deep learners have risen to the top in classification performance. The impact deep learning has had on this field is visible in the large number of open source and commercial APIs currently providing deep learning solutions to multilabel classification problems. Some open source APIs are DeepDetect [
25], VGG19 (VGG) [
26], Inception v3 [
27], InceptionResNet v2 [
28], ResNet50 [
29], MobileNet v2 [
30], YOLO v3 [
31], and fastai [
32]. Some of the commercially available APIs include Imagga [
33], Wolfram Alpha’s Image Identification [
34], Clarifai [
35], Microsoft’s Computer Vision [
36], IBM Watson’s Visual Recognition [
37], and Google’s Cloud Vision [
38]. A comparison of performance across several multilabel benchmarks is reported in [
39].
Despite these advances in deep multilabel learning, research using advanced techniques, including those for building ensembles, has lagged compared to work in other areas of machine learning. The few ensembles that have been built for multilabel problems apply simple techniques (we describe the state of the art in
Section 2). More innovative ensembling techniques have yet to be explored. The most advanced is proposed in [
9], where random forest [
40] was used as the ensembling technique. Only a couple of studies have explored ensembling with deep learners [
2,
41]. There is a need to investigate cutting edge deep learning ensembling methods for the multilabel problem.
The goal of this paper is to experimentally derive a more advanced ensemble for multilabel classification that combines Long Short-Term Memory networks (LSTM) [
42], GRU [
43], and Temporal Convolutional Neural Networks (TCN) [
44] using Adam variants [
45] as the means of ensuring diversity. We posted some early preliminary results combining these classifiers on ArXiv [
46]. Conceptually, a GRU is a simplified variant of the LSTM model: both have hidden temporal states and gating mechanisms. Both networks, however, rely on intermediate activations that are functions of low-level features. To offset this shortcoming, these models must be combined with classifiers that discern high-level relationships. In the research presented here, we investigate the potential of TCN as a complementary classifier since it offers the advantage of hierarchically capturing relationships across high, intermediate, and low-level timescales.
As mentioned, diversity in ensembles composed of LSTM, GRU, and TCN is assured in our approach by incorporating different Adam optimization variants. Adam finds low minima of the training loss, and many variants have been developed for augmenting Adam’s strengths and offsetting its weaknesses. As will be demonstrated, combining ensembles of LSTM, GRU, and TCN with IMCC [
5] and a bootstrap-aggregated (bagged) decision trees ensemble (TB) [
9] (carefully modified for managing multilabel data) further enhances performance.
Some of the contributions of this study are the following:
To the best of our knowledge, we are the first to propose an ensemble method for managing multilabel classification based on combining sets of LSTM, GRU, and TCN, and we are the first to use TCN on this problem.
Two new topologies of GRU and TCN are also proposed here, as well as a novel topology that combines the two.
Another advance in multilabel classification is the application of variants of Adam optimization for building our ensembles.
Finally, for comparison with future works by other researchers in this area, the MATLAB source code for this study is available at
https://github.com/LorisNanni (accessed on 1 November 2022).
The effectiveness and strength of investigating more cutting-edge ensembling techniques are demonstrated in the experimental section where we evaluate the performance of different ensembles with some baseline approaches across several multilabel benchmarks. Our best deep ensemble is compared with the best multilabel methods tested to date and shown to obtain state-of-the-art performance across many domains.
This paper is organized as follows: In
Section 2, we report on some recent work applying deep learning to multilabel classification. In
Section 3 we describe the benchmarks and performance indicators used in the experimental section. In
Section 4, each element and pre-processing in the proposed approach is detailed. This section covers a brief discussion of the preprocessing methods used and descriptions of GRU, TCN, IMCC, and LSTM networks. This section also describes pooling, training, and our method for generating the ensembles. In
Section 5, Adam optimization and all the Adam variants tested here are addressed. In
Section 6, experimental results are presented and discussed. Finally, in
Section 7, we summarize the results and outline some future directions of research.
2. Related Works
One of the first works to apply deep learning to the problem of multilabel classification is [
47]. The authors in that study proposed a simple feed-forward network using gradient descent to handle the functional genomics problem in computational biology. A growing body of research has since ensued that has advanced the field and application of deep learning to a wide range of multilabel problems. In [
48], for example, a Convolutional Neural Network (CNN) combined with data augmentation using binary cross entropy (BCE) loss and adagrad optimization was designed to tackle the problem of land cover scene categorization. In [
49], a CNN using multiclass cross entropy loss was developed to detect heart rhythm/conduction abnormalities. In that study, each element in the output vector corresponded to a rhythm class. An early ensemble was developed in [
50], that combined CNN with LSTM to handle the multilabel classification problem of protein-lncRNA interactions, and in [
2] an ensemble of LSTM combined with an ensemble of classifiers based on Multiple Linear Regression (MLR) was generated to predict a given compound’s Anatomical Therapeutic Chemical (ATC) classifications. Recurrent CNNs (RCNNs) have recently been evaluated in many multilabel problems, including identifying surgical tools in laparoscopic videos [
51] using a GRU and in recommendation systems for prediagnosis support [
52].
A growing number of researchers, in addition to [
51], have explored the benefits of adding GRUs to enhance the performance of multilabel systems. In [
53], for example, sentiment in tweets was analyzed by extracting topics with a C-GRU (Context-aware GRU). In [
54], a classifier system called NCBRPred was designed with bidirectional GRUs (BiGRUs) to predict nucleic acid binding residues based on the multilabel sequence labeling model. The BiGRUs were selected to capture the global interactions among the residues.
In terms of ensembles built to handle multilabel problems, GRUs have been shown to work well with CNNs. In [
41], an Inception model was combined with a GRU network to identify nine classes of arrhythmias. Adam was used for optimization in [
41], but no variants were used for building ensembles as in the system proposed here.
Table 1 compares the existing deep learning models discussed above as applied to the multilabel problem. The first column lists the techniques and models used in these systems and indicates whether ensembles were used in the study. An
X means that the indicated technique or model was used.
3. DataSets
The following multilabel data sets were selected to evaluate our approach. These are standard benchmarks in multilabel research and run the gamut of typical multilabel problems (music, image, biomedical, and drug classifications). The names provided below are not necessarily those reported in the original papers but rather those commonly used in the literature.
Cal500 [
55]: This dataset contains human-generated annotations, which label some popular Western music tracks. Tracks were composed by 500 artists. Cal500 has 502 instances, including 68 numeric features and 174 unique labels.
Scene [
11]: This dataset contains 2407 color images. It includes a predefined training and testing set. The images can have the following labels: beach (369), sunset (364), fall foliage (360), field (327), mountain (223), and urban (210). Sixty-three images have been assigned two category labels and one image three, making the total number of labels fifteen. The images all went through a preprocessing procedure. First, the images were converted to the CIE Luv space, which is perceptually uniform (close to Euclidean distances). Second, the images were divided into a 7 × 7 grid, which produced 49 blocks. Third, the mean and variance of each band were computed. The mean represents a low-resolution image, while the variance represents computationally inexpensive texture features. Finally, the images were transformed into a feature vector (49 × 3 × 2 = 294 dimensions).
Image [
56]: This dataset contains 2000 natural scene images. Images are divided into five base categories: desert (340 images), mountains (268 images), sea (341 images), sunset (216 images), and trees (378 images). Categorizing images into these five basic types produced a large dataset of images that belonged to two categories (442 images) and a smaller set that belonged to three categories (15 images). The total number of labels in this set, however, is 20 due to the joint categories. All images went through similar preprocessing methods as discussed in [
11].
Yeast [
57]: This dataset contains biological data. In total there are 2417 micro-array expression data and phylogenetic profiles. They are represented by 103 features and are classified into 14 classes based on function. A gene can be classified into more than one class.
Arts [
5]: This dataset contains 5000 art images, which are described by a total of 462 numeric features. Each image can be classified into 26 classes.
Liu [
15]: This dataset contains drug data used to predict side effects. In total it has 832 compounds. They are represented by 2892 features and 1385 labels.
ATC [
58]: This dataset contains 3883 ATC coded pharmaceuticals. Each sample is represented by 42 features and 14 classes.
ATC_f: This dataset is a variation of the ATC data set described above. In this dataset, however, the patterns are represented by a descriptor of 806 dimensions (i.e., all three descriptors are examined in this dataset as described in [
59]).
mAn [
4]: This dataset contains protein data represented by 20 features and 20 labels.
Bibtex: This dataset is highly sparse and was used in [
5].
Enron: This dataset is highly sparse and was used in [
5].
Health: This dataset is highly sparse and was used in [
5].
Table 2 shows a summary of the benchmarks along with their names, number of patterns, features, and labels, as well as the average number of class labels per pattern (LCard).
4. Proposed Approaches
4.1. Model Architectures
As previously indicated, the Deep Neural Network (DNN) architectures developed in this work combine LSTM, GRU, and TCN networks that have been adapted to handle multilabel classification. The general structure of each model is available in
Figure 1. GRU with N = 50 hidden units is followed by a max pooling and a fully connected layer. Multilabel classification is provided by the sigmoid output layer. TCN has a similar architecture, except that the fully connected layer is followed by the max pooling layer. These two architectures are labeled in this work GRU_A and TCN_A.
Experiments reveal that both GRU and TCN perform better in some situations when a convolutional layer is placed immediately before the network itself. Convolution modifies input features with simple mathematical operations on neighboring local features. These operations can produce better model generalization because the resulting features achieve higher spatial independence. When a convolutional layer is attached before any TCN topology, the network is labeled here as TCN_B.
In some GRU experiments, we add a batch-normalization layer immediately after the convolutional layer because batch normalization standardizes the inputs to a layer for each mini-batch, thereby stabilizing the learning process and dramatically reducing the number of training epochs required to train very deep networks. A GRU with a batch-normalization layer following a convolution is labeled GRU_B.
In addition, we investigate a sequential combination of GRU_A (without the pooling layer) followed by a TCN_A, where the sigmoid output of GRU_A becomes the input of TCN_A. This combination is labeled GRU_TCN.
The last architecture shown in
Figure 1 is a network composed first of an LSTM layer with 125 hidden units followed by a dropout layer that randomly sets input elements to zero with a probability of 0.4. This is followed by a GRU layer with 100 hidden units and another dropout layer with a probability of 0.4. The end of the architecture is composed of a fully connected layer followed by a sigmoid layer.
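To make the LSTM_GRU topology concrete, the following MATLAB sketch shows one way to assemble the layer stack described above with Deep Learning Toolbox layers; numFeatures and numLabels are placeholders for the dataset at hand, and training with the binary cross-entropy loss described next is handled separately (see the released source code).

% Minimal sketch of the LSTM_GRU topology described above (assumed names:
% numFeatures, numLabels); training with the BCE loss is handled separately.
numFeatures = 294;   % input descriptor length (dataset dependent)
numLabels   = 6;     % number of output labels (dataset dependent)
layers = [
    sequenceInputLayer(numFeatures)
    lstmLayer(125, 'OutputMode', 'sequence')   % LSTM with 125 hidden units
    dropoutLayer(0.4)                          % zero inputs with probability 0.4
    gruLayer(100, 'OutputMode', 'last')        % GRU with 100 hidden units
    dropoutLayer(0.4)
    fullyConnectedLayer(numLabels)             % one neuron per label
    sigmoidLayer];                             % independent [0, 1] score per label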
The loss function is the binary cross entropy loss between the predicted labels (the output) and the actual labels (the target). Binary cross entropy loss computes the loss of a set of $m$ observations by computing the following average:
$$\mathrm{Loss} = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y_i \cdot \log(\hat{y}_i) + (1-y_i)\cdot\log(1-\hat{y}_i)\,\right],$$
where $y_i$ and $\hat{y}_i$ are the actual and predicted label vectors of each sample $i$, respectively.
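As a quick illustration (not the authors' implementation), the loss above can be written in MATLAB as a one-line function handle; Y and Yhat are assumed to be m-by-l matrices whose rows are the actual and predicted label vectors.

% Sketch of the binary cross-entropy loss; per-sample losses are summed over
% the l labels and averaged over the m observations (an assumed convention).
bceLoss = @(Y, Yhat) -mean(sum(Y .* log(Yhat) + (1 - Y) .* log(1 - Yhat), 2));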
For details on the implementation environment, see the source code at
https://github.com/LorisNanni (accessed on 1 November 2022). All the code was tested and developed with MATLAB 2022a using a Titan RTX GPU.
4.2. Pre-Processing
For the most part, the dataset samples require no preprocessing before being fed into our proposed networks. However, some preprocessing is needed when the feature vectors are very sparse.
Two types of preprocessing were applied in our experiments:
Feature normalization in the range [0, 1] for the dataset ATC_f for IMCC [
5];
For the datasets Liu, Arts, bibtex, enron, and health, a feature transform was performed with PCA, retaining 99% of the variance. This transform is only necessary for our proposed networks and not for IMCC and TB; poor performance resulted when the original sparse data were used as input to our proposed networks.
We also discovered that LSTM_GRU does not converge if a normalization step is not performed for the ATC_f dataset, and it performs poorly even when normalization is applied. However, LSTM_GRU does converge when normalization is followed by PCA projection, where 99% of the variance is retained.
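A minimal MATLAB sketch of the two preprocessing options is given below, assuming X is the training feature matrix; the exact settings used in our experiments are in the released source code.

% (1) Min-max normalization of each feature to [0, 1].
Xn = (X - min(X)) ./ max(max(X) - min(X), eps);

% (2) PCA projection retaining 99% of the variance (used for the sparse sets).
[~, score, ~, ~, explained] = pca(X);
k  = find(cumsum(explained) >= 99, 1);   % smallest number of components reaching 99%
Xp = score(:, 1:k);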
4.3. Long Short-Term Memory (LSTM)
The LSTM layer in our topologies learns long-term dependencies between the time steps in a time series and sequence data [
61]. This layer performs additive interactions, which can help improve gradient flow over long sequences during training.
LSTM can be defined as follows. Let the output or hidden state be $h_t$ and the cell state be $c_t$ at time step $t$. The first LSTM block uses the initial state of the network and the first time step of the sequence to compute the first output and the updated cell state. At time step $t$, the block uses the current state of the network $(c_{t-1}, h_{t-1})$ and the next time step of the sequence to compute the output $h_t$ and the updated cell state $c_t$.
The state of the layer is the hidden state/output state and the cell state. The hidden state at time step t contains the output of the LSTM layer for this time step. The cell state contains information learned from previous time steps. At each time step, the layer adds to or removes information from the cell state. The layer controls these updates using gates.
The basic components of an LSTM are an input gate, a forget gate, a cell candidate, and an output gate: the first determines the level of cell state update, the second the level of cell state reset (forget), the third adds information to the cell state, and the fourth controls the level of the cell state added to the hidden state.
Given the above and letting $x_t$ be the input at time step $t$ of the input sequence $X = (x_1, \ldots, x_T)$, we can then define the input gate $i_t$, the forget gate $f_t$, the cell candidate $g_t$, and the output gate $o_t$ as:
$$i_t = \sigma_g(W_i x_t + R_i h_{t-1} + b_i),$$
$$f_t = \sigma_g(W_f x_t + R_f h_{t-1} + b_f),$$
$$g_t = \sigma_c(W_g x_t + R_g h_{t-1} + b_g),$$
$$o_t = \sigma_g(W_o x_t + R_o h_{t-1} + b_o),$$
where the $W$, $R$, and $b$ are the input weight matrices, recurrent weight matrices, and bias vectors, $\sigma_g$ denotes the gate activation function, and $\sigma_c$ the state activation function. The LSTM layer function, by default, uses the sigmoid function given by $\sigma_g(x) = (1 + e^{-x})^{-1}$ to compute the gate activation function.
We then define:
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
as the cell state, where $\odot$ is the Hadamard (component-wise) product.
The output vector is defined as:
$$h_t = o_t \odot \sigma_c(c_t).$$
The LSTM layer function, by default, uses the hyperbolic tangent function to compute the state activation function $\sigma_c$.
4.4. Gated Recurrent Units (GRU)
GRU [
43], like LSTM, is also a recurrent neural network with a gating mechanism. GRUs can handle the vanishing gradient problem and increase the length of term dependencies captured from the input. GRU has a forget gate that enables the network to learn which old information is relevant for understanding the new information [
62]. Unlike LSTM, GRU has fewer parameters because there is no output gate, yet the performance of GRU is similar to LSTM at many tasks: speech signal modeling, polyphonic music modeling, and natural language processing [
7,
61]. They also perform better on small datasets [
63] and work well on denoising tasks [
64].
The basic components of GRU are an update gate and a reset gate. The reset gate measures how much old information to forget and which information should be passed on to the output, while the update gate controls how much new versus old information is combined and kept.
Letting $x_t$ be the input at time step $t$ and $h_{t-1}$ the previous hidden state, the update gate vector $z_t$ and the reset gate vector $r_t$ can be defined as
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
where the $W$, $U$, and $b$ are weight matrices and bias vectors and $\sigma$ is the sigmoid function.
We define
$$\hat{h}_t = \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
as the candidate activation vector, where $\phi$ is the tanh activation and $\odot$ is the Hadamard (component-wise) product. The term $r_t \odot h_{t-1}$ is the amount of past information retained for the candidate activation vector. The hidden state is then updated as
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t.$$
As can be observed, the update gate vector $z_t$ measures how much new vs. old information is combined and kept [
43].
4.5. Temporal Convolutional Neural Networks (TCN)
TCNs [
65] contain a hierarchy of one-dimensional convolutions stacked over time to form deep networks that perform fine-grained detection of events in sequence inputs. Subsequent layers are padded to match the size of the convolution stack, and the convolutions of each layer make use of a dilation factor that exponentially increases over the layers. In this architecture, the first layers find short-term connections in the data while the deeper layers discover longer-term dependencies based on the features extracted by previous layers. Thus, TCNs have a large receptive field that bypasses a significant limitation of most RNN architectures.
The design of TCN blocks can vary considerably. The TCN block in this work is composed of a convolution of size three with 175 different filters, followed by a ReLU and batch normalization, followed by another convolution with the same parameters, a ReLU, and batch normalization. Four of these blocks are stacked. The dilation factor of the convolutions is $2^{l-1}$, with $l$ the index of the block, so that it increases exponentially over the layers. We use a fully connected layer on top, then a max pooling layer followed by the output layer. The output layer is a sigmoid layer for multilabel classification. For training, we use dropout with a probability of 0.05.
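The following MATLAB sketch illustrates the TCN stack described above; the exponential dilation schedule 2^(l-1), the placement of the dropout layer, and the variable names are assumptions made for illustration rather than the exact released implementation.

% Sketch of the TCN topology: four blocks, each with two dilated 1-D
% convolutions (175 filters of width 3) followed by ReLU and batch norm.
numFeatures = 294;   % input descriptor length (dataset dependent)
numLabels   = 6;     % number of output labels (dataset dependent)
layers = sequenceInputLayer(numFeatures);
for l = 1:4
    d = 2^(l - 1);                          % assumed exponential dilation schedule
    layers = [layers
        convolution1dLayer(3, 175, 'DilationFactor', d, 'Padding', 'causal')
        reluLayer
        batchNormalizationLayer
        convolution1dLayer(3, 175, 'DilationFactor', d, 'Padding', 'causal')
        reluLayer
        batchNormalizationLayer];
end
layers = [layers
    dropoutLayer(0.05)                      % dropout placement is illustrative
    fullyConnectedLayer(numLabels)          % fully connected layer on top
    globalMaxPooling1dLayer                 % max pooling along the time dimension
    sigmoidLayer];                          % sigmoid output layer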
4.6. IMCC
IMCC [
5] has two steps: (1) creating virtual samples to augment the training set and (2) performing multilabel training. As augmentation is what provides IMCC its performance boost, the remainder of this discussion focuses on the first step.
Augmentation is performed with $k$-means clustering [
66], along with the calculation of the cluster centers. Let $X = [x_1, \ldots, x_n]$ be the feature matrix and $Y = [y_1, \ldots, y_n]$ the label matrix, where $n$ is the number of samples. If all samples are partitioned into $k$ clusters $\{G_1, \ldots, G_k\}$ and $x_i$ is assigned to cluster $G_j$ (so that $x_i \in G_j$), then the average of the samples in a cluster should capture the semantic meaning of that cluster.
The center $c_j$ of each cluster $G_j$ is defined as
$$c_j = \frac{\sum_{i=1}^{n} \mathbb{1}(x_i \in G_j)\, x_i}{\sum_{i=1}^{n} \mathbb{1}(x_i \in G_j)},$$
where $\mathbb{1}(\cdot)$ is an indicator function that is equal to 1 when $x_i \in G_j$ or to 0, otherwise [
5].
A complementary training set $\{(c_j, \hat{y}_j)\}_{j=1}^{k}$ can be generated by averaging the label vectors of all instances of $G_j$, thus:
$$\hat{y}_j = \frac{\sum_{i=1}^{n} \mathbb{1}(x_i \in G_j)\, y_i}{\sum_{i=1}^{n} \mathbb{1}(x_i \in G_j)}.$$
To understand more fully how the objective function deals with the original training set and the complementary set of virtual samples, see [
5]. In this study, the hyperparameters of IMCC are chosen by five-fold cross-validation on the training set.
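As an illustration of this augmentation step (not the original IMCC code), the virtual samples can be produced in MATLAB as follows; X, Y, and the number of clusters k are assumed names for the training feature matrix, the label matrix, and a hyperparameter.

% Cluster the training features and average features and labels per cluster.
k   = 10;                                % number of clusters (hyperparameter)
idx = kmeans(X, k);                      % X: n-by-d feature matrix
C    = zeros(k, size(X, 2));             % cluster centers (virtual features)
Yhat = zeros(k, size(Y, 2));             % averaged label vectors (virtual labels)
for j = 1:k
    C(j, :)    = mean(X(idx == j, :), 1);
    Yhat(j, :) = mean(Y(idx == j, :), 1);
end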
4.7. Pooling
Pooling layers (consisting of a single max operation along the time dimension) are added to the end of the GRU and TCN blocks. In this way, the dimensionality of the processed data is reduced, only the most important information is retained, and the probability of overfitting is diminished.
4.8. Fully Connected Layer and Sigmoid Layer
The fully connected layer has l neurons, where l is the number of output labels in the given problem. This layer is fully connected with the previous layer. The activation function of this final layer is a sigmoid with outputs in the range [0, 1]. These values are interpreted as the confidence of relevance, or final probability, of each label. The output of the model is thus a multilabel classification vector, where the output of each neuron of the fully connected layer provides a score ranging from 0 to 1 for a single label in the set of labels.
4.9. Training
As noted in the introduction, training is accomplished using different Adam variants. Each of these variants is discussed below in
Section 5. The learning rate is 0.01, and the gradient decay and squared gradient decay factors are 0.5 and 0.999, respectively.
In addition, gradients are clipped with a threshold equal to one using L2 norm gradient clipping. The minibatch size is set to 30, and the number of epochs in our experiments is set to 150 for GRU, LSTM_GRU, and GRU_TCN but 100 for TCN.
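For reference, the settings above correspond to MATLAB training options along the following lines (standard Adam is shown; the diffGrad-style variants of Section 5 are implemented in the released source code).

% Sketch of the training configuration described above.
opts = trainingOptions('adam', ...
    'InitialLearnRate', 0.01, ...
    'GradientDecayFactor', 0.5, ...
    'SquaredGradientDecayFactor', 0.999, ...
    'GradientThreshold', 1, ...              % clip gradients ...
    'GradientThresholdMethod', 'l2norm', ... % ... using the L2 norm
    'MiniBatchSize', 30, ...
    'MaxEpochs', 150);                       % 100 for TCN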
4.10. Ensemble Generation
Ensembles combine the outputs of more than one model. In this work, the models are trained on the same problem, and their decisions are fused using the average rule. The reason for constructing ensembles is that they improve system performance and help prevent overfitting. It is well known that an ensemble's prediction accuracy and generalization improve when the diversity among the classifiers is increased.
A simple way of generating diversity in a set of neural networks is to initialize them randomly. However, applying different optimization strategies is a better way to strengthen diversity: by varying the optimization strategy, the networks can converge to different local minima and thus produce diverse models. In this work, we evaluate several Adam optimizers suitable for ensemble creation: the Adam optimizer [
45], diffGrad [
67] and four diffGrad variants: DGrad [
68], Cos#1 [
68], Exp [
69], and Sto.
We generate an ensemble of 40 networks using the following method: for each layer of each network, one of the optimization approaches (DGrad, Cos#1, Exp, or Sto) is randomly selected for that layer.
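A minimal sketch of the average-rule fusion used throughout this work is shown below; scores is an assumed 40-by-nSamples-by-nLabels array holding the sigmoid outputs of the ensemble members.

% Fuse the 40 networks by the average rule.
ensembleScores = squeeze(mean(scores, 1));   % nSamples-by-nLabels matrix
% A per-label decision can then be obtained by thresholding these scores.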
6. Experimental Results
The goal of the first experiment was to build and evaluate the performance of the different variants of the base models combined with all the components detailed in
Section 4 and
Section 5. All ensembles were fused by the average rule. In
Table 3, we provide a summary of each of these ensembles: the number of classifiers and hidden units, the number of training epochs, and the loss function. Given a base standalone GRU_A with 50 hidden units trained by Adam for 50 epochs (labeled Adam_sa), we generated different ensembles by incrementally increasing complexity. We do this by combining ten Adam_sa (Adam_10s), increasing the number of epochs to 150 (Adam_10), selecting different optimizers (DG_10, Cos_10, Exp_10, Sto_10), and fusing the best results in the following ways:
DG_Cos is the fusion of DG_10 + Cos_10;
DG_Cos_Exp is the fusion of DG_10 + Cos_10 + Exp_10;
DG_Cos_Exp_Sto is the fusion of DG_10 + Cos_10 + Exp_10 + Sto_10;
StoGRU is an ensemble composed of 40 GRU_A, combined by average rule, each coupled with the stochastic approach explained in
Section 4.10;
StoGRU_B is the same as StoGRU but based on GRU_B;
StoTCN is an ensemble of 40 TCN_A, combined by average rule, each coupled with the stochastic approach explained in
Section 4.10;
StoTCN_B is the same as StoTCN but based on TCN_B;
StoGRU_TCN is an ensemble of 40 GRU_TCN each coupled with the stochastic approach explained in
Section 4.10;
StoLSTM_GRU is an ensemble of 40 LSTM_GRU each coupled with the stochastic approach explained in
Section 4.10;
ENNbase is the fusion by average rule of StoGRU and StoTCN;
ENN is the fusion by average rule of StoGRU, StoTCN, StoGRU_B, StoTCN_B and StoGRU_TCN;
ENNlarge is the fusion by average rule of StoGRU, StoTCN, StoGRU_B, StoTCN_B, StoGRU_TCN and StoLSTM_GRU.
An ablation study for assessing the performance improvement that each module of our ensemble achieved is reported in
Table 4. Only tests on GRU_A are reported here; the other topologies tested in this work led to similar conclusions. In
Table 4, we compare approaches using the Wilcoxon signed-rank test.
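The statistical comparison can be reproduced with MATLAB's signrank function; apA and apB are assumed vectors holding the per-dataset average precision of the two approaches being compared.

% Paired Wilcoxon signed-rank test between two approaches.
p = signrank(apA, apB);   % p-value of the paired test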
Table 5 shows the performance of the ensembles in
Table 3 in terms of average precision. Moreover, we report the performance of IMCC [
5] and TB. For both IMCC and TB, the hyperparameters were chosen by a five-fold cross-validation on the training set. Also reported in this table are the different fusions among ENNlarge, IMCC, and TB:
ENNlarge + w × IMCC is the sum rule between ENNlarge and IMCC; before fusion, the scores of ENNlarge (notice that the ensemble ENNlarge is obtained by average rule) were normalized since they have a different range of values compared to IMCC, and the classification threshold of the ensemble is simply set to zero. The scores of IMCC were weighted by a factor of w (a sketch of this weighted fusion is given after this list).
ENNlarge + w × IMCC + TB is the same as the previous fusion, but TB is included in the ensemble. Before fusion, the scores of TB were normalized since they have a different range of values compared to IMCC.
StoLSTM_GRU + IMCC + TB is the sum rule among StoLSTM_GRU, IMCC, and TB; the scores of StoLSTM_GRU and TB are normalized before the fusion.
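The weighted sum-rule fusions above follow the pattern sketched below (with w = 3); normalizeScores is a hypothetical placeholder for the score normalization described in the text, whose exact form is in the released source code, and the score matrices are assumed names.

% Sketch of the weighted sum-rule fusion of ENNlarge, IMCC, and TB.
fusedScores = normalizeScores(ennLargeScores) ...   % normalized ensemble scores
            + 3 * imccScores ...                    % IMCC scores weighted by w = 3
            + normalizeScores(tbScores);            % normalized TB scores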
IMCC and TB do not use PCA when sparse datasets are tested, as noted in
Section 4.2.
These are some of the conclusions that can be drawn from examining
Table 5:
GRU/TCN-based methods work poorly on very sparse datasets (i.e., on Arts, Liu, bibtex, enron, and health);
StoGRU_TCN outperforms the other ensembles based on GRU/TCN; StoGRU and StoTCN perform similarly;
StoLSTM_GRU works very well on sparse datasets. On datasets that are not sparse, the performance is similar to that obtained by the other methods based on GRU/TCN. StoLSTM_GRU average performance is higher than that obtained by IMCC;
ENNlarge outperforms each method from which it was built;
ENNlarge + 3 × IMCC + TB outperforms ENNlarge + 3 × IMCC with a p-value of 0.09;
StoLSTM_GRU + IMCC + TB is the best choice for sparse datasets;
ENNlarge + 3 × IMCC + TB matches or outperforms IMCC on all the datasets (note that ENN + IMCC + TB and StoLSTM_GRU + IMCC + TB have performance equal to or lower than IMCC on some datasets). ENNlarge + 3 × IMCC + TB is our suggested approach.
In the following tests, we simplify the names of our best ensembles to reduce clutter:
Ens refers to ENNlarge + 3 × IMCC + TB;
EnsSparse refers to StoLSTM_GRU + IMCC + TB.
In
Table 6, we compare IMCC and Ens using more performance indicators. Our ensemble Ens outperforms IMCC.
In
Table 7, we compare Ens with state of the art on the mAn dataset using the performance measures covered in the literature for this dataset,
viz., coverage, accuracy, absolute true, and absolute false.
Ens outperforms IMCC on mAn on these measures; however, Ens does not achieve top performance. A recent ad hoc method [
4] does better on this dataset than does Ens.
In
Table 8, we show the performance reported in a recent survey [
71] of many multilabel methods (the meanings of the acronyms and their sources are noted in [
71] and not included here, as the intention of this table is to demonstrate the superiority of Ens compared to the scores reported in the survey). The last column reports the performance of our proposed ensemble. As can be observed, it obtains the best average performance.
Finally, in
Table 9, our best ensembles, Ens and EnsSparse, are compared with the literature across all twelve benchmarks using average precision as the performance indicator. Ens and EnsSparse obtain state-of-the-art performance using this measure on several of the multilabel benchmarks.
7. Conclusions
The system proposed in this work for multilabel classification is composed of ensembles of gated recurrent units (GRU), temporal convolutional neural networks (TCN), and long short-term memory networks (LSTM) trained with several variants of Adam optimization. This approach is combined with Incorporating Multiple Cluster Centers (IMCC) to produce an even better multilabel classification system. Many ensembles were tested across a set of twelve multilabel benchmarks representing many different applications. Experimental results show that the best ensemble constructed using our novel approach obtains state-of-the-art performance.
Future studies will focus on combining our proposed method with other topologies for extracting features. Moreover, more sparse datasets will be used to evaluate the performance of the proposed ensemble for further validation of the conclusions drawn in this work.
Another area of research will be to investigate deep ensembling techniques for multilabel classification that utilize edge intelligence. Driven by the success of AI and IoT technologies, there is an urgent need to push the AI frontiers to the network edge to fully unleash the potential of big data. Edge Computing is a promising concept to support computation-intensive AI applications on edge devices [
77]. Edge Intelligence, or Edge AI, is a combination of AI and Edge Computing; it enables the deployment of machine learning algorithms on the edge devices where the data are generated. One of the main problems in edge intelligence is how to apply ensemble systems usually developed for high-performance servers. A possible solution is to adopt and develop distillation approaches. For example, in [
78], the authors propose a framework for learning compact deep learning models.