3.1. Overview
A method of combining machine learning classifiers is called ensemble learning (or an ensemble method). Ensemble learning is a technique for combining various base learners into a new classifier [22]. The new classifier is expected to outperform any of its constituent base learners. In ensemble learning, base learners of the same (homogeneous) or different (heterogeneous) types are combined using different fusing strategies. Combining classifiers in either sequential (i.e., boosting) or parallel (i.e., bagging) configurations aims to achieve better classification/regression performance than any of the individual models. A further objective of the ensemble method is to reduce variance and bias: ensemble classifiers are designed not only to achieve performance gains, but also to generalize well.
The most popular ensemble approaches are voting, bagging, boosting, and stacking [22,42,43]. Voting is the simplest of all fusing strategies: it takes the most frequently predicted class among multiple predictors. For a sample x, class i is assigned if class i is predicted most frequently. Mathematically,

ŷ = mode{h_1(x), h_2(x), …, h_N(x)}    (1)

where ŷ is the predicted class and mode is the statistical mode of the predictions of the classifiers h_1, …, h_N for sample x. Bagging is a method for creating N base learners from N corresponding samples drawn randomly (with replacement) from the training dataset; it is also called Bootstrap aggregation. Suppose that over N iterations, N random samples of the training data are generated and N base learners are trained. The final prediction of their ensemble is given by averaging the predictions of all N models using Equation (2):

ŷ = (1/N) Σ_{i=1}^{N} h_i(x)    (2)
Popular bagging algorithms are random forest and bagging meta-estimators, just to name a few.
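As a concrete illustration of Equations (1) and (2), the following minimal Python sketch implements majority voting and bagging-style averaging over the outputs of hypothetical base learners (the function names are ours, for illustration only):

```python
from collections import Counter

def vote(predictions):
    """Majority vote, Eq. (1): the statistical mode of the base-learner predictions."""
    return Counter(predictions).most_common(1)[0][0]

def bag_average(predictions):
    """Bagging aggregation, Eq. (2): average the N base-learner outputs."""
    return sum(predictions) / len(predictions)

# Three hypothetical classifiers predict for one sample x:
print(vote([1, 0, 1]))               # class 1 is predicted most frequently
print(bag_average([0.2, 0.6, 0.7]))  # averaged score of the three models
```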
Boosting is an ensemble of base learners arranged in a sequence. Each classifier starts with equal weight, but after all models are trained once, a weight is assigned to each model based on its performance. After model evaluation, a larger weight is assigned to each misclassified sample so that it receives greater focus in the next iteration, and vice versa. The final model relies on a weighted averaging method. Mathematically,

h(x) = Σ_{i=1}^{N} w_i h_i(x)    (3)

where h_i are the base learners, w_i are their weights, N is the number of classifiers, and h is the final classifier. Note that boosting weights the training data, which is one of the features that distinguishes it from bagging.
Boosting classifiers include the gradient boosting, AdaBoost, extreme gradient boosting, and light gradient boosting classifiers. Unlike bagging classifiers, boosting classifiers are sequential, i.e., the input of the next base learner depends on the output of its previous base learners.
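The weighted combination of Equation (3) can be sketched as follows for labels in {-1, +1}, where the sign of the weighted sum gives the final class (the predictions and weights below are invented for illustration):

```python
import numpy as np

def weighted_ensemble(base_preds, weights):
    """Eq. (3): h(x) = sum_i w_i * h_i(x); the sign of the weighted
    sum of the {-1, +1} base predictions gives the final class."""
    scores = np.asarray(base_preds, dtype=float)
    w = np.asarray(weights, dtype=float)
    return np.sign(w @ scores)

# Hypothetical predictions of N = 3 boosted learners for 4 samples
preds = [[ 1,  1, -1, -1],
         [ 1, -1, -1,  1],
         [-1,  1, -1, -1]]
weights = [0.5, 0.3, 0.2]  # e.g., derived from each learner's training error
print(weighted_ensemble(preds, weights))  # final ensemble labels
```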
Ensemble learning is proposed to address the bias–variance trade-off in classifier performance. Variance is the error caused by limitations of the learning data, whereas bias is caused by limitations of the algorithm itself. Boosting tries to address bias, whereas variance is addressed by bagging [44]. However, boosting is sensitive to overfitting as it tries to fit the data closely to the model [45,46].
Unlike voting methods, which rely on user-adjusted weights, stacking (or meta-learning) adjusts its weights itself. The approach proposed in this research uses a meta-learner to aggregate the predictions of the base learners.
3.3. Proposed Stacking Algorithm
This research presents sentiment classification that uses a combination of machine learning techniques as base learners. The most common machine learning algorithms (SVM, NB, and RF) are chosen as base learners with default parameters, while LR is employed as the meta-learner. The proposed ensemble learner for sentiment classification is depicted in Figure 2.
The steps of the workflow illustrated in
Figure 2 are briefly described as follows.
(1) Data Set Collection: Four data sets of comments [22], GCOA (2871), PMO (6637), EBC (2444), and ZEMEN (1440), are used for evaluation. The first three are collected from Facebook comments and the fourth from YouTube movie comments (i.e., Zemen Drama [48]). Specifically, the Facebook comments are collected from (i) the Facebook page of the Government Communication Affairs Office (GCOA [49]), (ii) the official Facebook page of the Prime Minister Office (PMO [50]), and (iii) the Facebook page of the Ethiopian Broadcasting Corporate (EBC [51]). A statistical summary of these four data sets is depicted in Table 2.
From Table 2, we can observe that the GCOA text samples are short, with an average of 8 words and 41 characters per sample. In contrast, the PMO data set has the largest average lengths, with 39 words and 192 characters per sample. As the length of a user-generated text sample increases, its features have a stronger potential to discriminate it and assign it to a certain class, whereas a shorter sample (i.e., one with fewer words) does not have enough features to discriminate it from the rest of the data set's samples. This makes it difficult for a machine learning system to extract meaningful information from such a sample.
Figure 3 shows that almost all the data sets are skewed. The number of negative-class samples is less than the number of positive-class samples in three data sets (GCOA, EBC, and PMO); there, the negative class is under-represented (i.e., the minority class), whereas the positive class is over-represented (i.e., the majority class). In both PMO and EBC, about 69% of the samples are from the majority class. In contrast, the majority of the samples (63.4%) in the ZEMEN data set belong to the negative class. If we train machine learning models in this setting, the model will be biased towards the majority class. To minimize this bias, we apply the SMOTE procedure to balance these data sets before feeding them to the machine learning algorithms.
(2) Preprocessing and Feature Extraction: As preprocessing is crucial in text mining, procedures including removing all digits, punctuation marks, and non-Amharic characters; spelling correction; stop word removal; and normalization are performed.
In this research, normalization is the process of replacing all letters that share the same sound with a single letter. Because of the many spelling variants employed, different people write certain Amharic words in various forms. For example, the word ቴሌቪዥን ('television') can be written as ቴሌቭዢን, ቴሌቭዥን, or ቴሌቪዥን [52]. As a result, Amharic texts contain many characters with the same sound that need to be substituted by a single common character. That is, (ሀ,ሃ,ሐ,ሓ,ኀ,ኃ,ኻ→ሀ), (ሰ,ሠ→ሰ), (ፀ,ጸ→ፀ), (ዐ,አ,ኣ,ዓ→ዐ), (ቆ,ቈ,ቖ→ቆ), (ቁ,ቍ→ቁ), (ኮ,ኰ→ኮ), (ጎ,ጐ→ጎ), (ኋ,ዃ,ሗ→ኋ), where the arrow (→) means 'replaced by'. If one of the left-hand symbols appears in the text, it is replaced by the symbol to the right of the arrow.
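This character-normalization step can be sketched in Python with a translation table built from the sound-variant groups above (a minimal sketch; the mapping is taken directly from the list in the text):

```python
# Map every variant character to its canonical character via str.translate.
CANONICAL = {
    'ሀ': 'ሃሐሓኀኃኻ', 'ሰ': 'ሠ', 'ፀ': 'ጸ', 'ዐ': 'አኣዓ',
    'ቆ': 'ቈቖ', 'ቁ': 'ቍ', 'ኮ': 'ኰ', 'ጎ': 'ጐ', 'ኋ': 'ዃሗ',
}
TABLE = str.maketrans({v: k for k, variants in CANONICAL.items() for v in variants})

def normalize(text: str) -> str:
    """Replace each sound variant with its single canonical character."""
    return text.translate(TABLE)

print(normalize('ሐገር'))  # → 'ሀገር'
```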
Furthermore, stop words are identified as the most common (i.e., redundant) tokens in the data sets. However, some words, such as አይደለም (it is not), ምንም (nothing), የለም (none), ሳይሆን (not happened), and የለበትም (has nothing in it), are negative words in the Amharic language. The performance of sentiment classification degrades when these words are included in the stop word list. As a result, these terms were excluded from the stop word list.
Because Amharic is morphologically rich, we found that stemming removes salient characters (i.e., the most significant features) that might aid in determining a text's sentiment class [15]. Therefore, stemming is not included in the preprocessing procedures.
After preprocessing, the text data sets are transformed into numerical features using TF-IDF vectorization, a method of converting documents into numerical features. By combining the local and global weights of a term, the TF-IDF features of a document carry more discriminant information for encoding texts. We compute TF-IDF using the formula tfidf(t, d) = tf(t, d) × log(N/df(t)), where tf(t, d) is the number of occurrences of term t in document d, df(t) is the number of documents containing term t, and N is the total number of documents. tf(t, d) captures the local weight of term t as its term frequency, whereas log(N/df(t)) captures the global weight of term t across the document collection.
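The TF-IDF computation described above can be sketched directly (a toy example with invented documents; no smoothing is applied, matching the plain tf × log(N/df) formulation):

```python
import math

def tfidf(term, doc, docs):
    """TF-IDF of a term in one document: tf(t, d) * log(N / df(t))."""
    tf = doc.count(term)                    # occurrences of t in d
    df = sum(1 for d in docs if term in d)  # documents containing t
    return tf * math.log(len(docs) / df)

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
print(tfidf("good", docs[2], docs))  # tf = 2, df = 2, N = 3
```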
After applying the Grid Search algorithm to the TF-IDF vectorizer implemented in the Scikit-learn Python package [53], the TF-IDF character (1,7)-gram feature set with a maximum of 5000 features was chosen as optimal for Amharic sentiment classification. According to the literature, TF-IDF character-level n-gram features outperform word-level gram features in several NLP applications [25,29,54]. Specifically, character n-gram features outperform word-gram features for handling negation in Amharic sentiment classification [22]. As a result, the proposed approach is tested with the TF-IDF character (1,7)-gram feature set, and the results are compared to the TF-IDF word uni-gram feature set (as a baseline).
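The selected feature configuration corresponds to the following Scikit-learn vectorizer (the example sentences are invented; note that Scikit-learn applies a smoothed IDF by default, a minor variation on the plain formula given earlier):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character (1,7)-gram TF-IDF features, capped at 5000 features,
# as selected by the grid search described in the text.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 7),
                             max_features=5000)
X = vectorizer.fit_transform(["ጥሩ ፊልም ነው", "መጥፎ ፊልም ነው"])
print(X.shape)  # (2 documents, up to 5000 character-gram features)
```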
In addition, the Synthetic Minority Oversampling Technique (SMOTE) is proposed for balancing the imbalanced data sets. Employing SMOTE for balancing imbalanced data sets improves sentiment classification tasks [31], and SMOTE is also popular for balancing imbalanced non-textual data sets in other applications [30,55]. As a result, we propose SMOTE as a strategy for balancing the vectorized sentiment data sets; SMOTE augments the minority class of samples to balance out imbalanced data sets. The average accuracy of 5-fold cross-validation (CV) is measured on each of the four data sets, with and without SMOTE, using both the TF-IDF character (1,7)-gram and TF-IDF word uni-gram features.
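The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority-class neighbours, can be sketched as follows (a simplified toy version with invented data; in practice a library implementation such as imbalanced-learn would be used):

```python
import numpy as np

def smote(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest minority neighbours, and interpolate."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all others
        neighbours = np.argsort(d)[1:k + 1]           # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                            # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy minority class of 4 samples; create 6 synthetic samples to balance it
X_minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]]
X_new = smote(X_minority, n_new=6, rng=0)
print(X_new.shape)  # (6, 2)
```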
(3) Base Learner Algorithms: Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest (RF) are the most commonly used supervised machine learning algorithms in NLP [54,56], and they were chosen for Amharic sentiment classification in this study. For the sake of simplicity and comparison, we chose Logistic Regression (LR) as the meta-learner for combining the base learners. They are briefly described as follows.
(i) SVM is one of the most powerful supervised machine learning approaches and is closely connected with neural networks. SVM is built on mapping data members into distinct output spaces and classifying them there. Support vectors are the data points that are closest to the decision hyperplane. The computational inefficiency of SVM is one of its shortcomings [57].
(ii) RF is a combination of multiple decision trees (i.e., bagging) in various configurations. This addresses a shortcoming of decision trees, which do not update themselves when new training samples are supplied. Random forest is robust because it combines multiple tree classifiers, each relying on a subset of the training set's input features. A random forest classifier is built in two phases: first, several decision trees are created; then, their predictions on the test set are combined by majority voting (averaging) to predict a new sample.
(iii) NB is a probabilistic method based on Bayes' rule, in which the input features are assumed to determine the output variable independently. Even though this method works effectively in many cases, this independence assumption rarely holds in practice. Another strength of this algorithm is that it can learn incrementally and update its probability distribution [58,59]. (iv) LR is a statistical approach for modeling binary categorical classes rather than continuous variables.
(4) Meta-Features: The meta-learner is trained using the predicted values from the base learners. The base learners' predictions are employed as meta-features, which are considered essential for discriminating the target class categories.
(5) Meta-Learner Algorithm: The meta-learner acts as the combiner in the proposed approach. However, unlike other fusing strategies, it uses a machine learning model (i.e., logistic regression in our case) rather than voting/averaging. Voting, averaging, and weighted averaging combine base learners according to Equations (1)–(3), respectively.
The procedure of the proposed ensemble learner with the stacked cross-validation algorithm is presented in Algorithm 1.
Algorithm 1: Proposed Ensemble Learning.
Input: Labeled Data Set
Output: Average Accuracy of Trained Meta-learner Model M
1 Create the 3 base learners (SVM, RF, and NB) and the meta-learner (LR)
2 With 5-fold cross-validation, partition the training set into 5 disjoint sets
3 foreach k fold in the partitioned trainingSet do
4     Randomly split the kth disjoint set into trainingSet and testingSet (80:20)
5     Apply SMOTE to the trainingSet (only in runs that balance the classes)
6     Build the stack classifier using the trainingSet
7     Store the accuracy of the model on the testingSet
8 return mean accuracy of the models on the complete 5-fold CV
Description: Algorithm 1 takes a labeled data set as input and returns the average accuracy as output. Three base learners (i.e., SVM, NB, and RF) and one meta-learner (i.e., LR) are created (line 1). In line 2, with a 5-fold CV, the data set is randomly partitioned into 5 disjoint sets and stored. For each k-fold cross-validation set (lines 3–7), the kth disjoint set is randomly split into a training and a testing set (with a ratio of 80:20, respectively) (line 4). In the experiments that evaluate the impact of balancing the classes, SMOTE is applied to the trainingSet of the respective run (line 5). Line 6 builds the stack classifier using the trainingSet, and line 7 stores the accuracy of each model on the testingSet. Finally, line 8 computes the mean accuracy of the 5 models.
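A minimal Scikit-learn sketch of the proposed stack (SVM, NB, and RF as base learners, LR as meta-learner) with 5-fold cross-validation is shown below. Synthetic data stands in for the TF-IDF features, GaussianNB stands in for the NB variant used in the paper, and the SMOTE step of line 5 is omitted for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized sentiment data
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Line 1: three base learners and the LR meta-learner
stack = StackingClassifier(
    estimators=[("svm", SVC()), ("nb", GaussianNB()),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
)

# Lines 2-8: 5-fold CV and the mean accuracy over the folds
scores = cross_val_score(stack, X, y, cv=5)
print(scores.mean())
```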
The proposed stack configuration is intended to improve the performance of Amharic sentiment classification by aggregating the prediction (meta-feature) of base learners.