Some traditional classifiers, such as the support vector machine (SVM), extreme learning machine (ELM) and decision tree (DT), have been applied successfully in emotion recognition, yet each classifier inevitably has its own shortcomings. For instance, SVM performs well under small-sample conditions but degrades on large samples, and its voting scheme may produce multiple categories with the same number of votes. The initial input parameters of ELM are generated randomly, so it requires a large number of training samples and cannot guarantee optimal parameters. DT generalizes inconsistently across different samples and its information gain is biased toward features with more values. In view of the accuracy and reliability requirements of an emotion recognition system and the uncertainty introduced by any single classifier, a team-collaboration identification strategy based on the fusion of SVM, DT and ELM is proposed. It exploits the collaborative diagnosis of multiple classifiers, thus eliminating the uncertainty brought by a single classifier and improving the recognition accuracy.
2.4.1. Support Vector Machine
Support vector machine (SVM) is a machine learning method based on the structural risk minimization principle of statistical learning theory. Given limited sample information, it seeks the best trade-off between model complexity and learning ability so as to achieve the best generalization ability [46,47,48]. Its core idea is to map a nonlinear classification or regression problem into a high-dimensional space via a kernel function, where a better classification or regression result can be obtained. When making a decision in a classification problem, the voting method is usually adopted and the category with the most votes is the class to which the sample belongs.
For a data set $T = \{(x_i, y_i)\}_{i=1}^{k}$ of two classes, where $x_i \in \mathbb{R}^n$ represents the sample, $y_i \in \{-1, +1\}$ is the class label and $k$ is the number of training samples, SVM seeks an optimal hyper-plane in the $n$-dimensional data feature space by constructing the following function:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{k}\xi_{i} \quad \text{s.t.}\quad y_{i}\big(w^{T}\varphi(x_{i}) + b\big) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0,\quad i = 1,\dots,k \tag{18}$$

where $\varphi(\cdot)$ is a mapping function from the low-dimensional space to a high-dimensional space; $\xi_{i}$ is a slack variable that ensures the correctness of the classification when the samples are not separable; $C$ is a penalty factor and a larger $C$ indicates a greater penalty for misclassification; $w$ and $b$ are the weight vector and classification threshold of the decision function $f(x) = w^{T}\varphi(x) + b$; $x_{i}$ is the input vector and $y_{i}$ is the output label.
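For concreteness, the following minimal Python sketch (assuming scikit-learn is available; the feature matrix and labels are synthetic placeholders) trains such a soft-margin SVM, where the parameter C is the penalty factor of Equation (18) and the RBF kernel plays the role of the kernel function $K(x_i, x_j)$ introduced below:

```python
# Minimal soft-margin SVM sketch; scikit-learn assumed, data synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # k = 100 samples, n = 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # placeholder binary labels

# C: penalty factor of Equation (18); a larger C punishes misclassification more.
clf = SVC(C=1.0, kernel="rbf", gamma="scale")
clf.fit(X, y)
print(clf.predict(X[:5]))
```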
Introducing the Lagrange function yields the dual optimization problem:

$$\min_{\alpha}\ \frac{1}{2}\alpha^{T}Q\alpha - e^{T}\alpha \quad \text{s.t.}\quad y^{T}\alpha = 0,\quad 0 \le \alpha_{i} \le C,\quad i = 1,\dots,k \tag{19}$$

where $e$ is the all-ones column vector; $Q$ is the semi-positive definite matrix with entries $Q_{ij} = y_{i}y_{j}K(x_{i}, x_{j})$; $K(x_{i}, x_{j}) = \varphi(x_{i})^{T}\varphi(x_{j})$ is the kernel function; $\alpha_{i}$ is the Lagrange multiplier; $y$ is the sample label vector; $\alpha$ is the Lagrange multiplier vector.
Computing Equation (19), the optimal solution is:

$$w^{*} = \sum_{i=1}^{k}\alpha_{i}^{*}y_{i}\varphi(x_{i}), \qquad b^{*} = y_{j} - \sum_{i=1}^{k}\alpha_{i}^{*}y_{i}K(x_{i}, x_{j}) \tag{20}$$

where $x_{j}$ is any support vector satisfying $0 < \alpha_{j}^{*} < C$.
The optimal hyper-plane decision function is:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{k}\alpha_{i}^{*}y_{i}K(x_{i}, x) + b^{*}\right) \tag{21}$$
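As a sanity check on Equation (21), the sketch below (scikit-learn assumed, synthetic data) evaluates the decision function by hand from a fitted binary SVC, whose dual_coef_ attribute stores the products $\alpha_{i}^{*}y_{i}$ for the support vectors:

```python
# Evaluating the decision function of Equation (21) manually (sketch).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3))
y = np.where(X[:, 0] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5).fit(X, y)
x = X[0]
# RBF kernel values K(x_i, x) for all support vectors x_i.
K = np.exp(-0.5 * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
f = clf.dual_coef_[0] @ K + clf.intercept_[0]  # sum of alpha_i* y_i K(x_i, x) + b*
print(np.sign(f), clf.predict([x])[0])         # both yield the same class
```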
SVM can be extended to multi-classification problems by constructing multiple two-class SVM classifiers; the construction methods include the direct method, one-versus-one and one-versus-rest. Among them, the one-versus-one method classifies the $k$ classes of sample data by constructing $k(k-1)/2$ binary classifiers, which has a fast solving speed and is widely used in practice. The classification principle is the "voting mechanism," that is, each classifier votes for its preference and the final result is the category with the most votes. This method can be expressed as:

$$\min_{w^{ij},\,b^{ij},\,\xi^{ij}}\ \frac{1}{2}\|w^{ij}\|^{2} + C\sum_{t=1}^{S}\xi_{t}^{ij} \quad \text{s.t.}\quad y_{t}\big((w^{ij})^{T}\varphi(x_{t}) + b^{ij}\big) \ge 1 - \xi_{t}^{ij},\quad \xi_{t}^{ij} \ge 0 \tag{22}$$

where $w^{ij}$ and $b^{ij}$ are the weight vector and threshold obtained when designing the two-class classifier for the $i$-th class and the $j$-th class respectively; $\xi_{t}^{ij}$ is the slack variable; $x_{t}$ is the training sample vector; $y_{t}$ is the sample label; $S$ is the total number of samples in the $i$-th class and the $j$-th class.
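The voting mechanism can be made concrete as follows: the sketch below (scikit-learn assumed, synthetic data) recovers the per-class vote counts from the $k(k-1)/2$ pairwise decision values of a one-versus-one SVC, where a positive value is a vote for the first class of the pair:

```python
# One-versus-one voting over k(k-1)/2 pairwise SVMs (sketch).
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4))
y = rng.integers(0, 3, size=150)        # k = 3 classes -> 3 pairwise models

clf = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)
d = clf.decision_function(X[:1])[0]     # one decision value per pairwise model

votes = np.zeros(3, dtype=int)
for value, (i, j) in zip(d, combinations(range(3), 2)):
    votes[i if value > 0 else j] += 1   # each pairwise model casts one vote
print("votes per class:", votes, "-> predicted class", votes.argmax())
```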
2.4.2. Decision Tree
Decision tree (DT) classifier is an instance-based inductive learning algorithm that uses inductive algorithm to generate readable decision trees and rules and then uses the decision tree to classify new data [
49]. DT is an inverted tree structure similar to the flow chart, which mainly focuses on the two core problems of growth and pruning. The structure diagram of DT is shown in
Figure 2. The knowledge acquired by DT is a formal representation of the tree, including the regression tree and the classification tree. The results of classification or prediction are reflected in the leaf nodes of DT. The average value of the output variables is the prediction result in the samples contained in the leaf nodes of the regression tree, while the mode of the output variable is the classification result in the samples contained in the leaf nodes of the classification tree.
Each non-leaf node in the figure represents an input attribute of the training data set, an attribute value represents a value taken by that attribute and a leaf node holds the value of the target category attribute. Yes and No represent positive and negative examples respectively.
The DT classifier is computed as follows:
Input—training set $D$, feature set $A$ and threshold $\varepsilon$; Output—decision tree $T$.
(1) If all samples in $D$ belong to the same class $C_k$, then $T$ is a single-node tree; $C_k$ is taken as the class of the node and $T$ is returned.
(2) If $A$ is an empty set, then $T$ is a single-node tree; the class $C_k$ with the largest number of samples in $D$ is taken as the class of the node and $T$ is returned.
(3) Otherwise, calculate the information gain ratio (GA) of each feature in $A$ according to Equation (23) and select the feature $A_g$ with the largest value (a code sketch of this computation follows the procedure):

$$\mathrm{GA}(D, A) = H(D) - \sum_{V \in V(A)} \frac{|S_{V}|}{|D|} H(S_{V}), \qquad H(D) = -\sum_{i=1}^{C} p_{i}\log_{2}p_{i} \tag{23}$$

where $p_{i}$ is the proportion of samples of the $i$-th state in the subset; $V(A)$ is the range of the attribute $A$; $S_{V}$ is the subset of $D$ whose value on the attribute $A$ is $V$; $H(D)$ is the entropy of $D$ relative to the $C$ states.
(4) If the information gain ratio of $A_g$ is less than $\varepsilon$, then $T$ is a single-node tree; the class $C_k$ with the largest number of samples in $D$ is taken as the class of the node and $T$ is returned.
(5) Otherwise, for each possible value $a_i$ of $A_g$, divide $D$ into several non-empty subsets $D_i$ according to $A_g = a_i$, mark each sub node with the class having the largest number of samples in $D_i$, form a tree $T$ from the node and its sub nodes, and return $T$.
(6) For the $i$-th sub node, take $D_i$ as the training set and $A - \{A_g\}$ as the feature set, and call steps (1) to (5) recursively to obtain the subtree $T_i$; return $T_i$.
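As an illustration of step (3), the following NumPy sketch computes the entropy $H(D)$ and the gain of Equation (23) for each feature of a small hypothetical data set; variable names are illustrative:

```python
# Entropy and information gain of Equation (23) (illustrative sketch).
import numpy as np

def entropy(labels):
    """H(D) = -sum_i p_i log2(p_i) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(feature_values, labels):
    """GA(D, A): entropy of D minus the weighted entropy of the subsets S_V."""
    total = entropy(labels)
    for v in np.unique(feature_values):   # V(A): range of the attribute A
        mask = feature_values == v        # S_V: subset of D with value V on A
        total -= mask.mean() * entropy(labels[mask])
    return total

# Step (3): choose the splitting feature A_g with the largest gain.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1, 1])
print([gain(X[:, a], y) for a in range(X.shape[1])])
```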
2.4.3. Extreme Learning Machine
Traditional learning algorithms such as the back-propagation neural network easily fall into local minima, train slowly and require careful adjustment of the learning rate. To overcome these drawbacks, reference [50] proposed the Extreme Learning Machine (ELM), which consists of only an input layer, a hidden layer and an output layer. The brief network structure of the algorithm is shown in Figure 3. ELM randomly generates the connection weights between the input layer and the hidden layer as well as the thresholds of the hidden layer neurons, and these need not be adjusted during the training process. Because a Gaussian kernel is applied, the hidden layer mapping need not be known explicitly. The optimal solution can be obtained by setting the number of hidden layer neurons, which is related to the number of features, and the best values of the positive regularization coefficient and the Gaussian kernel parameter were found empirically after several experiments.
Suppose there are $N$ training samples $(x_i, t_i)$, $i = 1, \dots, N$, where $x_i$ is the input and $t_i$ the output. The mathematical model of a standard single-hidden-layer feed-forward neural network with $M$ hidden layer nodes is:

$$o_{i} = \sum_{j=1}^{M}\beta_{j}\,g(w_{j}\cdot x_{i} + b_{j}), \quad i = 1,\dots,N \tag{24}$$

where $w_{j}$ is the input weight vector connecting the input neurons and the $j$-th hidden layer neuron; $\beta_{j}$ is the output weight vector connecting the $j$-th hidden layer neuron and the output neurons; $o_{i}$ is the actual output vector; $b_{j}$ is the bias of the hidden layer neurons; $g(\cdot)$ is the activation function of the hidden layer neurons.

If the model can approximate the outputs $t_{i}$ of the training samples with zero error, which means $\sum_{i=1}^{N}\|o_{i} - t_{i}\| = 0$, then there exist $\beta_{j}$, $w_{j}$ and $b_{j}$ that make the following formula hold:

$$\sum_{j=1}^{M}\beta_{j}\,g(w_{j}\cdot x_{i} + b_{j}) = t_{i}, \quad i = 1,\dots,N \tag{25}$$
Equation (25) can be simplified to:

$$H\beta = T \tag{26}$$

where $H$ is called the output matrix of the hidden layer of the neural network, $\beta$ is the output weight matrix and $T$ is the matrix of target outputs.
When the activation function of the neurons is arbitrarily differentiable, the training error of the single-hidden-layer feed-forward neural network can approach an infinitely small positive number $\varepsilon$. In this case, the input weight vectors $w_{j}$ and the hidden layer biases $b_{j}$ can remain unchanged during the training process and can simply be assigned randomly. Therefore, the training process is equivalent to finding the least-squares solution $\hat{\beta}$ of the linear system:

$$\min_{\beta}\ \|H\beta - T\| \tag{27}$$
The solution is $\hat{\beta} = H^{\dagger}T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $H$.
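The whole training procedure therefore reduces to one random initialization and one pseudo-inverse computation, as the following NumPy sketch illustrates; the layer sizes, targets and the sigmoid activation are illustrative choices:

```python
# Minimal ELM training sketch: random hidden layer, pseudo-inverse output weights.
import numpy as np

rng = np.random.default_rng(2)
N, n, M, m = 200, 6, 40, 3              # samples, inputs, hidden nodes, outputs
X = rng.normal(size=(N, n))
T = rng.normal(size=(N, m))             # placeholder target outputs t_i

W = rng.normal(size=(n, M))             # input weights w_j: fixed after random init
b = rng.normal(size=M)                  # hidden biases b_j: likewise never trained

H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # hidden layer output matrix H of Eq. (26)
beta = np.linalg.pinv(H) @ T            # beta = H^+ T, the least-squares solution
print(((H @ beta - T) ** 2).mean())     # training mean squared error
```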
2.4.4. Team-Collaboration Identification Strategy
As mentioned above, each classifier is based on a different principle, so each has its own advantages and disadvantages. A given sample may be easily misclassified by one classifier yet easily identified by the others. In order to reduce the limitations of a single classifier and improve the recognition accuracy, a team-collaboration identification strategy model combining SVM, DT and ELM is proposed. In this strategy, the SVM model is regarded as the major decision expert, while DT and ELM are employed to provide decision suggestions for the samples that SVM easily misclassifies. The core idea of SVM-DT-ELM is to select the samples possibly misclassified by SVM, employ DT and ELM to conduct a referral for these samples, and finally confirm the emotion class of each sample according to the designed decision-making mechanism. The main procedures of the suggested SVM-DT-ELM algorithm are as follows:
(1) Firstly, the training sets are used to train the SVM, DT and ELM classification models respectively. In this research, when training the SVM model, the radial basis function (RBF) kernel is selected and the grid search method is used to optimize the SVM model parameters to achieve better performance.
(2) Selecting samples that may be misclassified by SVM. According to the trained SVM model and its self-classification accuracy, the selection conditions identifying possibly misclassified samples are determined. Analysis shows that samples distributed near the support vectors, or samples for which several categories receive the same number of votes during the voting process, are easily misclassified when SVM is used for classification. The SVM adopts one-versus-one multi-classification: if $k$ is the number of classes, $k(k-1)/2$ models are generated, each involving only two classes. Focusing on the above problems, the following conditions are set (a code sketch of these conditions and of the decision principles follows this procedure):
a) When using SVM for classification, if two or more categories tie for the highest number of votes during the voting process, the sample is regarded as possibly misclassified and is picked out for referral.
b) If the decision values of the test sample satisfy the following condition after it is input to SVM, the sample is selected for re-diagnosis:

$$\frac{u_{\min}}{h_{\max}} < \lambda_{1} \quad \text{or} \quad \frac{v_{\min}}{s_{\max}} < \lambda_{2}$$

where $u_{\min}$ represents the smallest absolute decision value in the case of three votes; $h_{\max}$ is the largest absolute decision value in the case of three votes; $v_{\min}$ represents the smallest absolute decision value in the case of two votes; $s_{\max}$ represents the largest absolute decision value in the case of two votes; $\lambda_{1}$ and $\lambda_{2}$ are conditional parameters, which are determined by the performances of the trained SVM model.
(3) Decision principles. When applying the SVM-DT-ELM team-collaboration strategy for emotion identification, the following principles are adopted:
ⅰ) If the test sample is classified by SVM with full votes and falls outside the set conditions, DT and ELM are not employed for further consultation, and the output emotional category is based on the result of SVM.
ⅱ) Samples other than those satisfying principle ⅰ) are also classified by DT and ELM. If either the DT or the ELM result is the same as that of SVM, the output emotional class follows the principle of the minority obeying the majority.
ⅲ) If the results of DT and ELM both differ from the category with SVM's highest number of votes, and either referral result is consistent with the category with SVM's second-highest number of votes, the final diagnosis category is based on that referral result.
(4) According to the principle of step (3), the emotional categories of the test samples are confirmed.
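A minimal sketch of steps (2) and (3) follows, assuming the emotion classes are encoded as integers, the SVC was trained with decision_function_shape="ovo", and dt_pred and elm_pred are the referral predictions; the exact ratio form of condition b) and the thresholds lam1 and lam2 are illustrative stand-ins for the conditional parameters above:

```python
# Selection conditions of step (2) and decision principles of step (3) (sketch).
import numpy as np
from itertools import combinations

def svm_votes(clf, x, n_classes):
    """Per-class vote counts and pairwise decision values of a one-vs-one SVC."""
    d = clf.decision_function(x.reshape(1, -1))[0]
    votes = np.zeros(n_classes, dtype=int)
    for value, (i, j) in zip(d, combinations(range(n_classes), 2)):
        votes[i if value > 0 else j] += 1
    return votes, d

def needs_referral(votes, d, lam1=0.5, lam2=0.5):
    if (votes == votes.max()).sum() > 1:            # condition a): tied top votes
        return True
    a = np.abs(d)
    lam = lam1 if votes.max() == len(votes) - 1 else lam2
    return a.min() / a.max() < lam                  # condition b): assumed ratio form

def team_decision(votes, uncertain, dt_pred, elm_pred):
    first, second = np.argsort(votes)[-1], np.argsort(votes)[-2]
    if not uncertain:
        return first          # principle i): accept the SVM result directly
    if dt_pred == first or elm_pred == first:
        return first          # principle ii): minority obeys majority
    if dt_pred == second or elm_pred == second:
        return second         # principle iii): adopt the referral result
    return first              # fallback, not specified in the text
```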
The flow chart of the emotion recognition algorithm based on SVM-DT-ELM is shown in Figure 4.