Ensemble Deep Learning for Multilabel Binary Classification of User-Generated Content

Abstract: Sentiment analysis usually refers to the analysis of human-generated content via a polarity filter. Affective computing deals with the exact emotions conveyed through information. Emotional information most frequently cannot be accurately described by a single emotion class. Multilabel classifiers can categorize human-generated content into multiple emotional classes. Ensemble learning can improve the statistical, computational and representation aspects of such classifiers. We present a baseline stacked ensemble and propose a weighted ensemble. Our proposed weighted ensemble can use multiple classifiers to improve classification results without hyperparameter tuning or data overfitting. We evaluate our ensemble models with two datasets. The first dataset is from Semeval2018-Task 1 and contains almost 7000 Tweets, labeled with 11 sentiment classes. The second dataset is the Toxic Comment Dataset with more than 150,000 comments, labeled with six different levels of abuse or harassment. Our results suggest that ensemble learning improves classification results by 1.5% to 5.4%.


Introduction
Sentiment analysis is the process by which we uncover sentiment from information. The sentiment part could refer to polarity [1], fine-grained or not [2], or to pure emotion information [3][4][5]. The most common source of information for sentiment analysis is Online Social Networks (OSNs) [6,7]. User-generated content provides a unique combination of complexity and challenge for automated sentiment classification.
Automated classification refers to methods that can identify and classify information based on an inference process. Machine Learning (ML) studies these types of methods and can be generally separated into three parts: Modeling, Learning, and Classification. Given a classification task, an ML method has to create a model of the data, learn based on a set of pre-classified examples, and perform a classification, as required by the task. Ensemble learning refers to the combination of a finite number of ML systems to improve the classification results [8].
Various ML systems exist, most frequently characterized by the model and the training methods they employ. Artificial Neural Networks (ANNs) are one type of ML systems [9]. These networks have three layers: input, hidden, and output. In the input layer, data is initially fed into a model. The model parameters are then (re)calculated in the hidden layer, and data is classified in the output layer. Each layer consists of a set of nodes or artificial neurons which are connected to the next layer. When an ANN consists of multiple hidden layers, it is referred to as a Deep Neural Network (DNN) [10].
DNNs have been widely used in computer vision problems [11][12][13], where the goal of the classification is to identify or detect objects/items/features in an image. When the goal of the classification is to detect multiple objects in an image then the task is considered multilabel. These types of problems can be extended from computer vision to text analysis. In emotion related classification, a textual input can convey one or multiple emotions.
Traditional sentiment analysis is focused on a confined polarity or single emotion basis. Our main goal is to present the effectiveness of ensemble learning in text-based multilabel classification. In addition, we aim to draw researchers' attention to multilabel emotion classification as a significant aspect of sentiment analysis.
Our contributions are as follows. We create and present five multilabel classification architectures and two ensembles: a baseline stacked ensemble and a weighted ensemble that assigns weights based on differential evolution. Then, we highlight the effectiveness of ensemble learning in modern multilabel emotion datasets. Our results show that ensemble learning can be more effective than single DNN networks in multilabel emotion classification. In addition, we incorporate a high-level description of the most commonly used hidden layers to introduce readers to deep-learning architectures.
The remainder of our work is organized as follows. Section 2 covers some introductory bibliography alongside state-of-the-art ensemble publications. Section 3 presents in detail our diverse DNN architectures and their individual components. Section 4 describes the ensemble methods we employed as well as some key sub-components. Section 5 presents the datasets we used and some of their properties. Section 6 details our results and potential improvements. Section 7 concludes our study with a summary and future work directions.

Related Work
Sentiment analysis has been extensively studied since the early 2000s [14,15]. With the advent of the internet, OSNs soon became the most used source for sentiment analysis [16,17]. Some of the applications of sentiment analysis are marketing [18], politics [19], and more recently medicine [20,21]. Affective computing, as suggested by Picard [22], has sparked interest in the specific emotion analysis of texts [3,23].
DNNs can be used in text related applications as well. In [33] Severyn and Moschitti rerank short texts in pairs and present the per pair relation without manual feature engineering. Lai et al. [34] perform unsupervised text classification and highlight key text components in the process via a recurrent convolutional neural network. The authors of [35] perform a named entity recognition task and generate relevant word embedding with two separate DNNs. Text generation is addressed via a recurrent neural network in [36] and an extensively trained model outperforms the best non-NN models. Sentiment classification of textual sources had its own fair share of DNN implementations [31,32,37].
Learning ensembles have been used to combine different types of information, such as audio, video, and text, towards sentiment and emotional classification [38]. Araque et al. use an ensemble of classifiers that combines features and word vectors [39] for sentiment classification with greater than 80% F-Score. A soft voting ensemble is used in [40] for topic-document and document classification, with the results suggesting a significant improvement over single-model methods. The authors of [41] use a stacked two-layer ensemble of CNNs to predict the message-level sentiment of Tweets, with the addition of a distant supervision phase. A pseudo-ensemble, essentially an ensemble of similar models trained on noisy sub-data, is used for sentiment analysis purposes in [42], but is ineffective for regression classification problems.
Multilabel classification problems assign multiple classes per item. Such problems are frequently observed in the field of computer vision [43][44][45]. With regard to multilabel text-based sentiment analysis, Chen et al. [46] propose an ensemble of a convolutional neural network and a recurrent neural network for feature extraction and class prediction, respectively. The authors of [47] propose a Maximum Entropy model for multilabel classification of short texts found in OSNs. Furthermore, they present an emotion-per-term lexicon as generated by the model, based on six basic emotions. However, they calculate a micro-averaged F1 based on the top emotions per item, essentially converting each weighted label to binary format. Johnson and Zhang [48] present a combination of word order and bag of words in a CNN architecture and point out the threshold sensitivity in multilabel classification.

DNN Architectures
We create five different DNNs, with diverse architectures, suited to a multilabel classification problem. Model 1, Figure 1, is a simple CNN with one fully connected layer. Model 2, Figure 2, combines a Gated Recurrent Unit and a Convolution layer, similar to [49]. Model 3, Figure 3, uses Term Frequency-Inverse Document Frequency Embeddings and three fully connected layers, inspired by [50]. The architecture of Model 4, Figure 4, is based on the top performing single model of the Toxic Comment Classification Challenge in Kaggle https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge. Model 5, Figure 5, combines uni/bi/tri-grams followed by three interconnected CNN processes, as presented in [51]. Each of the modules used in these models is presented in this section.

Pre-Processing
For each dataset used, we perform a pre-processing that includes term lemmatization and stemming, lowercase conversion, removal of non-pure text elements (such as Uniform Resource Locators or Emotes), stop word filtering and frequency-based term exclusion. Although some information is lost, extensive term filtering is shown to improve classification results [52].
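The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the stop-word list, the URL pattern and the helper names `clean` and `drop_rare_terms` are hypothetical, and lemmatization and stemming are left out because the text does not say which tools were used.

```python
import re
from collections import Counter

# Hypothetical stop-word list and URL pattern; the actual lists are not given in the text.
STOP_WORDS = {"the", "a", "an", "and", "of", "to"}
URL_PATTERN = re.compile(r"https?://\S+")

def clean(text):
    """Lowercase the text, strip URLs, split into terms and drop stop words."""
    text = URL_PATTERN.sub(" ", text.lower())
    return [term for term in re.findall(r"[a-z']+", text) if term not in STOP_WORDS]

def drop_rare_terms(documents, min_count):
    """Keep only terms that occur at least min_count times across all documents."""
    counts = Counter(term for doc in documents for term in doc)
    return [[term for term in doc if counts[term] >= min_count] for doc in documents]
```

With `min_count=2`, a term that occurs only once in the whole corpus is removed from every document, which shortens the sentences and shrinks the vocabulary.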

Tokenization
Tokenization is performed on a term level for all five models presented. Each term is represented by a unique token. Further linguistic elements such as abbreviations and negations are cleaned, returned to their canonical form, and assigned a token. For example, the commonly used negation 'don't' is tokenized to 'do', 'not'.
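A minimal sketch of term-level tokenization with negation expansion follows. The contraction table and the function names are assumptions made for illustration. Token ids are assigned in order of first appearance, and id 0 is kept free so that it can later serve as a padding value.

```python
import re

# Hypothetical contraction table; the rules used for the paper are not listed.
CONTRACTIONS = {
    "don't": ["do", "not"],
    "can't": ["can", "not"],
    "won't": ["will", "not"],
}

def tokenize(text):
    """Lowercase, split into terms, and expand known contractions."""
    terms = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    tokens = []
    for term in terms:
        tokens.extend(CONTRACTIONS.get(term, [term]))
    return tokens

def build_vocabulary(sentences):
    """Map each unique term to an integer id, starting at 1, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab) + 1)
    return vocab
```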

Embedding
We assign a numerical value to each token in a sentence, e.g., based on the order in which the tokens appeared in our corpus, and we create a vector of numerical values. The mapping of word tokens to numerical values of vectors is referred to as embedding. There are various ways of creating word embeddings. Term Frequency-Inverse Document Frequency Tokenization creates a matrix of TF-IDF features which are used to create the embedding. Every sentence is converted to a single-dimension vector of numerical elements regardless of the tokenization method. To address variable sentence length, we define a large vector length and we fill the numerical vector with zeroes, a process known as padding. The most common and effective word embedding methods are created based on term co-occurrence throughout a large corpus [53,54].
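Padding to a fixed length can be illustrated as below. The helper name `pad_sequence` is hypothetical. Filling on the right-hand side is an assumption, since the text does not say on which side the zeroes are added.

```python
def pad_sequence(token_ids, max_length, pad_value=0):
    """Truncate or right-pad a list of token ids to exactly max_length elements."""
    clipped = token_ids[:max_length]
    return clipped + [pad_value] * (max_length - len(clipped))
```

For example, `pad_sequence([4, 7, 2], 5)` yields a vector of length five whose last two entries are zero, while a sentence longer than `max_length` is cut off.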

Dropout
Neural networks are prone to overfitting. Overfitting is essentially the exhaustive training of the model on a certain set of data, to the point that the model fails to generalize. As a result, the model cannot work effectively with new, unknown data. A solution to overfitting is to train multiple models and combine their output afterwards, which is highly inefficient.
Srivastava et al. [55] proposed randomly dropping neural units from the network during the training phase. Their results suggested an improvement of the regularization on diverse datasets. Spatial dropout refers to the exact same process, performed over a single axis of elements, rather than random neural units over each layer. Furthermore, dropout has a significant potential to reduce overfitting and provide improvements over other regularization strategies such as L-regularization and soft-weight sharing [56].
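The idea of dropping units during training can be sketched as follows. Note the assumption: this is the "inverted" variant, which rescales the surviving activations during training, whereas the original formulation rescales the weights at test time. The function name and signature are illustrative only.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    if not training or rate == 0.0:
        return list(activations)  # units are never dropped at inference time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Rescaling by `1 / keep` keeps the expected value of each activation unchanged, so the layer can be used without modification at inference time.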

LSTM and Gated Recurrent Unit
A feed-forward neural network has a unidirectional processing path, from input to hidden layer to output. A recurrent network can have information travelling in both directions by using feedback loops. Computations derived from earlier input are fed back into the network, imitating human memory. In essence, a recurrent neural network is a chain of identical neural networks that transfer the derived knowledge from one to another. That chain creates learning dependencies that decay mainly due to the size of the chained network.
Hochreiter and Schmidhuber [57] proposed Long Short-Term Memory networks to counter that decay. The novel unit of the LSTM architecture is the memory cell that forgets or remembers the information passed from the previous chain link. The Gated Recurrent Unit of Model 2 was introduced by Cho et al. [58]. Its architecture is similar to LSTM units, but without an output gate. GRU networks have been shown to perform well on Natural Language Processing tasks [59].
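For reference, a common way of writing the GRU update is given below. The symbols are assumptions introduced here: x_t is the input at step t, h_t the hidden state, z_t the update gate, r_t the reset gate, σ the logistic function and ⊙ the element-wise product. Bias terms are omitted, and some libraries swap the roles of z_t and 1 − z_t.

```latex
\begin{aligned}
z_t &= \sigma\bigl(W_z x_t + U_z h_{t-1}\bigr) \\
r_t &= \sigma\bigl(W_r x_t + U_r h_{t-1}\bigr) \\
\tilde{h}_t &= \tanh\bigl(W x_t + U (r_t \odot h_{t-1})\bigr) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
```

The hidden state h_t is exposed directly, which reflects the absence of a separate output gate.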

Convolution
The convolution layer receives as input a tensor, which is convolved, and its output, the feature map, is passed to the next layer. A tensor is a mathematical object that describes a mapping of an input set of objects to an output set. The convolution layers in all models are one-dimensional, with a convolution window of size 3 and an output space dimensionality of 128. The only exception is Model 5, where the convolution window is different for each layer to provide a differentiated architecture. The primary focus of this layer is to extract features from the input data by preserving the spatial relationship between terms. The flattening layer, on the other hand, reduces the dimensionality of the input to one. For example, a feature map with dimensions 5 × 4, when flattened, would produce a one-dimensional vector with 20 elements. The flattening layer passes the most important features of the input data to a fully connected dense layer comprising the classification neurons.
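A one-dimensional convolution with a window of size 3 and the flattening step can be sketched in plain Python. This is a simplified illustration with a single filter over scalar inputs. The layers described above operate on embedding vectors with 128 filters. As in most deep-learning libraries, the kernel is applied without flipping (cross-correlation).

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence ('valid' mode, stride 1) and sum the products."""
    k = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

def flatten(feature_map):
    """Collapse a two-dimensional feature map (list of rows) into a one-dimensional list."""
    return [value for row in feature_map for value in row]
```

A sequence of length 5 convolved with a window of size 3 gives 3 outputs, and flattening a 5 × 4 feature map gives a vector with 20 elements, matching the example in the text.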

Dense
The dense layer is composed of neurons that are fully connected both forward and backward. Every element of the input is connected with every neuron of this layer. In four out of five models, a dense layer can be seen at the end of the pipeline. The number of neurons in these layers is the number of classes in our dataset. For the third model, where TF-IDF tokenization takes place, we chose a simple DNN with three fully connected layers, with a decreasing number of neurons for each subsequent layer. DNNs with multiple dense fully connected layers are shown to perform better than shallow DNNs [60].

Classification
The output of our model is a multilabel classification vector. Each neuron in the final dense layer of the models corresponds to one class of the dataset and provides a decimal value (ranging from 0 to 1), which is then rounded for each class. The number of classes defines the number of fully connected neurons in the final dense layer.
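The rounding step can be sketched as below. Two assumptions are made: the values in the range 0 to 1 come from a sigmoid activation, which the text does not state explicitly, and rounding is implemented as a threshold at 0.5.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Turn one raw score per class into an independent 0/1 decision per class."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]
```

Because each class is decided independently, any number of classes, including none, can be active for the same input.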

Ensembles
We previously mentioned that a method to counter overfitting is to train multiple models and then combine their outputs. Ensemble learning combines the single-model outputs to improve predictions and generalization. Ensemble learning improves upon three key aspects of learning: statistics, computation and representation [61]. From a statistics perspective, ensemble methods reduce the risk of data misrepresentation: by combining multiple models, we reduce the risk of employing a single model trained with biased data. While most learning algorithms search locally for solutions, which in turn confines the optimal solution, ensemble methods can execute random seed searches with variable start points with fewer computational resources. A single hypothesis rarely represents the target function, but an aggregation of multiple hypotheses, as found in ensembles, can better approximate the target function.
We present two ensemble architectures, stacked and weighted [51]. Other popular ensemble methods include AdaBoost, Random Forest and Bagging [62]. Stacked ensembles are the simplest yet one of the most effective ensemble methods, widely used in a variety of applications [63,64]. The stacked ensemble acts as our baseline and is compared with our proposed weighted ensemble based on differential evolution, a meta-heuristic weight optimization method. Meta-heuristic weighted ensembles have achieved remarkable results in single-label text classification [65,66].

Stacked
The main idea behind stacked ensembles is to combine a set of trained models through the training of another model (meta-model). The output predictions of the meta-model are based on training on the outputs of the individual models, Algorithm 1. In our implementation we fit the models' outputs into a DNN with two hidden dense layers, Figure 6. The outputs of the models are merged with the concatenation function of Keras https://keras.io/. The input of the concatenation is a fixed-size output tensor of each model. The output of the concatenation is a single tensor, which is then used as an input to the fully connected layer. A second fully connected layer follows, similar to the final dense layer of each model.
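The data flow of the stacked ensemble can be illustrated without any deep-learning library. The sketch below uses plain Python lists instead of Keras tensors. The function names, the sigmoid activation and the weight layout are assumptions made for illustration, and training the meta-model is not shown.

```python
import math

def concatenate(model_outputs):
    """Merge the per-model output vectors for one sample into a single meta-feature vector."""
    return [p for output in model_outputs for p in output]

def dense(inputs, weights, biases):
    """Fully connected layer with sigmoid activation; weights[j] holds the weights of neuron j."""
    outputs = []
    for row, bias in zip(weights, biases):
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-pre_activation)))
    return outputs
```

With five base models and C classes, the concatenated vector has 5 × C elements, and the final layer of the meta-model again has C neurons.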

Weighted
The weighted ensemble has a similar philosophy behind it. Instead of equally merging the outputs, we merge the outputs by co-calculating a weight. Given a set of weighted tensors and a vector-like object, we sum the product of tensor elements over a single axis, as specified by the one-dimensional vector, Figure 7. The 'fitting' process of the second ensemble is a heuristic search for the best possible combination of weights, i.e., looking for the global minimum of a multivariate function. We propose differential evolution [67] to scan the large space of five distinct weights, Algorithm 2.
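A compact sketch of weight search by differential evolution is given below. It is not Algorithm 2: the loss (fraction of wrong labels at a 0.5 threshold), the DE/rand/1/bin variant, the weight bounds and all parameter values are assumptions chosen for illustration. The population is seeded with equal weights, so the result is never worse than plain averaging on the data used for the search.

```python
import random

def weighted_average(predictions, weights):
    """Combine the per-model probability vectors of one sample with normalized weights."""
    total = sum(weights)
    return [sum(w * p[c] for w, p in zip(weights, predictions)) / total
            for c in range(len(predictions[0]))]

def ensemble_error(weights, samples, targets):
    """Fraction of labels that the thresholded weighted ensemble gets wrong."""
    wrong = count = 0
    for predictions, target in zip(samples, targets):
        for p, t in zip(weighted_average(predictions, weights), target):
            wrong += int((p >= 0.5) != bool(t))
            count += 1
    return wrong / count

def differential_evolution(loss, dim, rng, pop_size=20, generations=50, f=0.8, cr=0.7):
    """Minimal DE/rand/1/bin search over weights bounded to [0.01, 1]."""
    population = [[1.0] * dim] + [[rng.uniform(0.01, 1.0) for _ in range(dim)]
                                   for _ in range(pop_size - 1)]
    scores = [loss(individual) for individual in population]
    for _ in range(generations):
        for i in range(pop_size):
            others = [p for j, p in enumerate(population) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one gene always comes from the mutant
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced:
                    trial.append(min(1.0, max(0.01, a[d] + f * (b[d] - c[d]))))
                else:
                    trial.append(population[i][d])
            trial_score = loss(trial)
            if trial_score <= scores[i]:  # greedy selection keeps the better vector
                population[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return population[best], scores[best]
```

For five base models the search would be run with `dim=5` and a loss such as `lambda w: ensemble_error(w, samples, targets)`, where `samples` holds the five model outputs for each held-out item.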

Datasets
Both datasets exhibit a level of class imbalance, Figures 8a and 9a. However, they differ not only in context, where SEM2018 is based on Twitter and TOXIC on Wikipedia, but also in the properties of the actual text. The sentence length, after the source is cleaned, is different from the original mainly due to the removal of infrequent terms, Table 1. We discussed before that the dimensions of our term embeddings need to be low. We reduced the dimension by removing the terms that appear no more than 10 times, alongside a tailored stop term removal.

Semeval 2018
The SEM2018 Train is a collection of 6838 Tweets with emotion labeling of 11 classes. The classes are: anger, anticipation, disgust, fear, joy, love, optimism, pessimism, sadness, surprise and trust. Some examples of Tweets included in SEM2018 are:
• Whatever you decide to do make sure it makes you happy.
• Nor hell a fury like a woman scorned-William Congreve
• chirp look out for them Cars coming from the Wesszzz
• Manchester derby at home revenge
The Development dataset consists of 886 Tweets with the 11 aforementioned classes and their respective labels. The class distribution in the Train dataset is skewed in favor of five emotions: anger, disgust, joy, optimism, and sadness. The same class distribution is evident in the Development dataset, which is dominated by the same five emotions, Figure 8a.
SEM2018 contains 329 unique class combinations. The frequency of these unique combinations follows a power law distribution for both Train and Development datasets, Figure 8b. The most frequent class combination for Train and Development was anger and disgust, followed by joy and optimism. One third of the class combinations appear only once and often combine contradictory emotions, such as joy and sadness.
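Counting unique class combinations amounts to treating each label vector as a single key. The sketch below assumes that every item is stored as a sequence of 0/1 labels. The helper names are hypothetical.

```python
from collections import Counter

def combination_counts(label_rows):
    """Count how often each exact combination of labels occurs."""
    return Counter(tuple(row) for row in label_rows)

def singleton_share(counts):
    """Share of distinct combinations that occur exactly once."""
    return sum(1 for n in counts.values() if n == 1) / len(counts)
```

Sorting the counts in descending order gives the frequency ranking whose shape is described above.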

Toxic Comments
The Toxic dataset consists of two datasets as well, Train and Development. The Train dataset consists of 159,571 unique comments labeled with 6 different types of toxicity: toxic, severe_toxic, obscene, threat, insult and identity_hate. Some examples of comments are:
• I don't anonymously edit articles at all.

Results
The accuracy scores are validated via 10-fold validation. The baseline neural network (NN) model [68] expectedly under-performs [69]. For the SEM2018 datasets, both ensembles outperform each individual model. The stacked ensemble provides the best results in the Train subset, while the weighted ensemble marginally outperforms the stacked ensemble in the Development subset, Table 2. The accuracy of both ensembles is limited, to a degree, by the inherent bias of the dataset. Our ensembles outperform all models submitted to the Codalab Competition https://competitions.codalab.org/competitions/17751#results.
The baseline NN performed better on the TOXIC dataset. Its performance is boosted by the large number of unclassified elements in the dataset, Table 3, which account for more than 40% of the samples, as seen in Figure 9b. The TOXIC dataset included more than 25,000 unique terms before cleaning. The number of unique terms affects the length of the tokenization and subsequently the dimension of the embedding. The required dimension reduction reduced the training time of each model but affected its performance. Our best performing model is in the top 35% of the Kaggle Competition submissions https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/leaderboard but more than 1% worse than the top performing one.
Our ensemble methods improved upon single models in six out of eight cases. The classification accuracy is improved by at least one of the ensembles across both datasets. The ensembles for the SEM2018 dataset performed excellently compared to other architectures. On the other hand, the extensive data cleaning of TOXIC, required due to computation and time constraints, hindered the performance of our models and their ensembles. Given the heavy class imbalance and the cleaning of TOXIC, the achieved accuracy of 97+% is decent. The baseline stacked ensemble under-performed our proposed weighted ensemble in three out of four cases.
All the models presented, and by extension their ensembles, can be further improved by a range of techniques. Test augmentation [70], hyperparameter optimization [71], bias reduction [72] and tailored emotional embeddings [4,73] are some techniques that could further improve the generalization capabilities of our networks. However, the computational load over multiple iterations is extensive, as the most complex models required hours of training per epoch and dataset.
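How unclassified samples can lift accuracy is easy to reproduce on toy data. The sketch assumes that accuracy is computed per label and averaged over all labels. The text does not state which definition was used, so the numbers are purely illustrative.

```python
def label_accuracy(predictions, targets):
    """Share of individual 0/1 labels that match between predictions and targets."""
    pairs = [(p, t) for pred, target in zip(predictions, targets)
             for p, t in zip(pred, target)]
    return sum(1 for p, t in pairs if p == t) / len(pairs)

def all_zero_predictions(targets):
    """A trivial classifier that never assigns any class."""
    return [[0] * len(target) for target in targets]
```

On a toy set of four items with two labels each, where only one label is active in total, the trivial classifier already matches seven of the eight labels. High accuracy on a dataset with many unlabeled items therefore needs to be read against such a baseline.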

Conclusions
We demonstrated that ensemble learning can improve the classification accuracy in multilabel text classification applications. We created and tested five different deep-learning architectures capable of handling multilabel binary classification tasks.
Our five DNN architectures were ensembled via two methods, stacked and weighted, and tested on two different datasets. The datasets used provide a similar multilabel classification but vary in size, term distribution and term frequency. The classification accuracy was improved by the ensemble models in both tasks. Our proposed weighted ensemble outperformed the baseline stacked ensemble in 75% of cases by 1.5% to 5.4%. Hyperparameter tuning, supervised or unsupervised, could further improve the results but with a heavy computational load, since each hyperparameter iteration requires the re-training/re-calculation of the ensemble.
Moving forward we aim to explore the creation and use of tailored emotional embeddings concatenated with word embeddings. Additionally, we are currently developing new data augmentation methods, tailored to text datasets. We are also exploring multilabel regression ensembles and architectures that could be considered to be the refinement of binary classification, whether multilabel or not.

Conflicts of Interest:
The authors declare no conflict of interest.