2.2. Multiscale Principal Component Analysis (MSPCA)
Let us consider the input signal matrix $X \in \mathbb{R}^{n \times m}$, where $n$ is the number of measurements (samples) and $m$ is the number of signals. Principal Component Analysis (PCA) converts this input matrix into two new matrices, $L$ and $S$, such that the following holds:

$$X = S L^{T}$$
(1)

where $L$ represents the matrix of principal component loadings and $S$ is the matrix of principal component scores. Each column of $L$ contains the coefficients for one principal component. The columns are arranged in descending order of the principal component variances, i.e., the eigenvalues, $\lambda_i$, of the covariance matrix of $X$. Matrix $S$ is the representation of $X$ in the principal component space. The eigenvalues define how much variance the related principal component accounts for, and thus indicate how many variables are truly important, so that the dataset can be reduced. The contribution of principal components whose eigenvalues are close to zero is usually negligible, and they can be excluded [10,11,12,13,17].
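As a brief illustration of Equation (1), the following sketch (NumPy on random data; the matrix sizes are arbitrary assumptions) computes the loadings from the covariance eigendecomposition and verifies the factorization:

```python
# PCA as X = S L^T: the columns of L are eigenvectors of the covariance
# of X, ordered by descending eigenvalue; S holds the scores.
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 6                             # arbitrary sizes for the demo
X = rng.normal(size=(n, m))
X -= X.mean(axis=0)                       # center the measurements

eigvals, L = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]         # descending eigenvalues
eigvals, L = eigvals[order], L[:, order]

S = X @ L                                 # scores: X in the PC space
print(np.allclose(X, S @ L.T))            # True: X = S L^T
print(eigvals / eigvals.sum())            # variance explained per component
```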
The continuous wavelet transform (CWT), $\omega(s,\tau)$, of a continuous-time signal, $x(t)$, is defined by comparing the signal, $x(t)$, to the probing function, $\psi_{s,\tau}(t)$:

$$\omega(s,\tau) = \int_{-\infty}^{\infty} x(t)\, \frac{1}{\sqrt{s}}\, \psi^{*}\!\left(\frac{t-\tau}{s}\right) dt$$
(2)

creating a two-dimensional mapping onto the time-scale domain.
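For illustration, the CWT of a test signal can be computed with PyWavelets; the Morlet wavelet, the signal, and the scale range below are arbitrary choices, not the paper's settings:

```python
# CWT per Equation (2): compare the signal against scaled and shifted
# copies of a probing wavelet; the result lives on the time-scale plane.
import numpy as np
import pywt

t = np.linspace(0, 1, 1024)
x = np.sin(2 * np.pi * 20 * t) + np.sin(2 * np.pi * 80 * t)  # test signal

scales = np.arange(1, 64)
coeffs, freqs = pywt.cwt(x, scales, "morl")  # Morlet probing function
print(coeffs.shape)  # (63, 1024): one row per scale, one column per shift
```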
The discrete wavelet transform (DWT) usually achieves coefficient frugality by restricting the variation in scale and translation to powers of 2. If the selected wavelet belongs to an orthogonal family, the DWT even represents a non-redundant, bilateral transform.
Dyadic sampling of the two wavelet parameters is defined as

$$s = 2^{j}, \quad \tau = k \cdot 2^{j}$$
(3)

where $j$ and $k$ are both integers. Therefore, the discretized probing function becomes:

$$\psi_{j,k}(t) = 2^{-j/2}\, \psi\!\left(2^{-j} t - k\right)$$
(4)

Inserting Equation (4) into Equation (2), the discrete wavelet transform (DWT) is derived:

$$\omega(j,k) = \int_{-\infty}^{\infty} x(t)\, \psi_{j,k}(t)\, dt$$
(5)

The original signal is recovered by the inverse DWT, or the wavelet series expansion,

$$x(t) = \sum_{j} \sum_{k} \omega(j,k)\, \psi_{j,k}(t)$$
(6)

where $\psi_{j,k}(t)$ is a set of orthonormal basis functions [18].
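A short PyWavelets sketch of Equations (5) and (6): a dyadic decomposition with an orthogonal wavelet followed by perfect reconstruction (the Daubechies-4 wavelet and the depth of 5 levels are arbitrary choices):

```python
# DWT and inverse DWT with an orthogonal wavelet: the dyadic coefficients
# are non-redundant, and the signal is recovered exactly (Equation (6)).
import numpy as np
import pywt

rng = np.random.default_rng(0)
x = rng.normal(size=1024)

coeffs = pywt.wavedec(x, "db4", level=5)  # dyadic DWT over 5 scales
x_rec = pywt.waverec(coeffs, "db4")       # inverse DWT
print(np.allclose(x, x_rec))              # True: perfect reconstruction
```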
Multiscale PCA (MSPCA) uses the capability of PCA to eliminate the cross-correlation between variables. To combine the strengths of PCA and wavelets, the signals are decomposed by the wavelet transform. This transforms the data matrix $X$ into a data matrix $WX$, where $W$ is an $n \times n$ orthonormal matrix representing the orthonormal wavelet transform. The stochastic behavior in each signal is almost decorrelated in $WX$, while the deterministic behavior in every signal in $X$ is preserved in a fairly small number of coefficients in $WX$. Since the wavelet decomposition does not change the fundamental relationship between the variables at any scale, the selection of the number of principal components to be retained at each scale is unaffected by the transform [10,11,19].
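A simplified MSPCA sketch follows (PyWavelets plus scikit-learn). Retaining a fixed two components per scale is an assumption made for brevity; a real implementation would select the number of components per scale by a criterion on the eigenvalues:

```python
# Simplified MSPCA: wavelet-decompose each signal, apply PCA to the
# coefficient matrix at every scale, and reconstruct the signals from
# the retained scores. Wavelet, depth, and component count are assumed.
import numpy as np
import pywt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))   # n = 256 measurements, m = 8 signals

level = 3
coeffs = [pywt.wavedec(X[:, i], "db4", level=level) for i in range(X.shape[1])]

filtered = []
for scale in range(level + 1):
    # Coefficient matrix at this scale: one column per signal.
    C = np.column_stack([c[scale] for c in coeffs])
    pca = PCA(n_components=2)   # retain 2 components per scale (assumed)
    filtered.append(pca.inverse_transform(pca.fit_transform(C)))

# Rebuild every signal from its PCA-filtered coefficients at all scales.
X_hat = np.column_stack([
    pywt.waverec([filtered[s][:, i] for s in range(level + 1)], "db4")
    for i in range(X.shape[1])
])
print(X_hat.shape)  # (256, 8): filtered version of the input matrix
```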
2.4. Machine Learning Methods—Single Classifiers
Artificial Neural Networks (ANN) are machine learning models that try to solve problems in the same way the human brain does. Instead of biological neurons, an ANN uses artificial neurons, also known as perceptrons. In the human brain, neurons are connected by axons, while in an ANN, weight matrices are used for the connections between artificial neurons. Information travels through the network along these connections; from one neuron, the information travels to all neurons connected to it. By adjusting the weights between neurons, the system can be trained from input examples [25]. An artificial neural network is organized into multiple layers, where each layer contains multiple neurons. The information inside the network travels from the input layer to the output layer; between the input and output layers, the network can have zero or more hidden layers. The number of layers and the number of neurons inside the network is called the architecture of the neural network [25]. The smallest building block of an artificial neural network is the perceptron, which has multiple weighted inputs, a bias input, and an activation function. Propagating a signal from input to output is called forward propagation, while propagating a signal from output to input is called backward propagation. The most popular algorithm for training artificial neural networks is the backpropagation algorithm [26].
The types of artificial neural networks differ in architecture, neuron activation function, loops in the architecture, learning algorithm, and other attributes. There are also types of artificial neural networks that are capable of learning without human interaction.
Backpropagation is the most widely used algorithm for training artificial neural networks. It is based on an optimization method called gradient descent. In a nutshell, the algorithm has three phases: forward propagation, error calculation, and weight updates. When an input data sample is propagated through the artificial neural network, the output is calculated. This output is compared with the expected output, and the error is calculated. The error is then propagated backward to update the weights in all layers so that the error becomes minimal. This process is repeated for every data sample of the input, and training continues until the mean square error of the network reaches the desired level [26].
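A minimal sketch of the three phases on a one-hidden-layer network follows; the layer sizes, learning rate, and toy XOR data are illustrative assumptions, not the paper's configuration:

```python
# The three backpropagation phases: forward propagation, error
# calculation, and weight updates via gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR targets

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    # Phase 1: forward propagation
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Phase 2: error calculation (mean square error)
    mse = np.mean((y - out) ** 2)
    # Phase 3: backward propagation and weight updates
    d_out = (out - y) * out * (1 - out)      # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # gradient at the hidden layer
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(f"final MSE: {mse:.4f}")  # approaches zero as training converges
```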
k-Nearest Neighbor (k-NN): In the machine learning area, the k-NN algorithm is one of the oldest classification and regression algorithms. The algorithm uses very simple logic to determine the output of the system based on the k nearest neighbors of an input from the feature space. It belongs to the category of lazy learning algorithms because computation is postponed as much as possible: the input dataset is not processed to build a model, and the model performs all computation at testing time [27,28].
The input data of the k-NN algorithm contains feature vectors and a label that represents the category of each vector. During classification, the inputs to the model are the number k, which specifies how many neighbors to consult, and the unlabeled input vector. The k-NN model produces its output based on the k nearest neighbors: the majority label among those k neighbors becomes the output label for the input vector. As explained before, the main attribute of the k-NN algorithm is the distance metric for input vectors. The most commonly used metric is the Euclidean distance. For variables from a discrete set, such as text and voice, the Hamming distance is used. For problems from biology, such as microarrays, different metrics can be used, such as the Pearson or Spearman correlation metrics. There are implementations of the k-NN algorithm that use more advanced learned metrics such as NCA (Neighborhood Component Analysis) and LMNN (Large Margin Nearest Neighbor) [23].
The main problem of the k-NN machine learning algorithm is that majority voting based on the k nearest neighbors is problematic when the categories are skewed. The most frequent categories then dominate over the less frequent ones, leading to a large classification error for this type of algorithm [23]. To overcome this issue, weights are introduced inside the metric function: closer neighbors have more impact on the classification than those farther away, with each neighbor's vote weighted by the inverse of its distance.
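As a brief illustration, the following scikit-learn sketch contrasts plain majority voting with inverse-distance weighting; the synthetic, deliberately skewed dataset and k = 5 are assumptions for demonstration:

```python
# k-NN with uniform (majority-vote) vs. inverse-distance weighting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           weights=[0.8, 0.2],  # skewed class frequencies
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for w in ("uniform", "distance"):  # plain vote vs. 1/distance weighting
    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean", weights=w)
    knn.fit(X_tr, y_tr)
    print(w, knn.score(X_te, y_te))
```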
The machine learning area has other algorithms that try to overcome this issue, like self-organizing maps. To determine the best value of the parameter k, it is necessary to know some heuristics about the input data. The general rule is that larger values of k result in better classification accuracy, but the boundaries between classes become less distinct. Methods based on metaheuristics, like hyper-parameter optimization, can be used to choose the number k for the k-NN algorithm. The performance of the algorithm mainly depends on the input data, and it is very common for the input data to be pre-processed by feature scaling, noise reduction, or a similar method before k-NN classification. When choosing the number k, it is desirable that it be an odd number so that majority voting cannot end in a draw. Evolutionary algorithms are often used to pre-process input data and to determine the number k [29]. In the case of high redundancy in the input vector space, feature extraction is usually applied to remove unnecessary noise: the input vector space is transformed into a reduced new input space. The most frequently used methods for feature extraction are the Haar transform, mean shift tracking analysis, Principal Component Analysis, Fisher Linear Discriminant Analysis (LDA), and others [29].
Support Vector Machines: A classifier from the category of supervised machine learning algorithms is the Support Vector Machine (SVM), also known as a Support Vector Network. SVM can be used for solving classification and regression problems. For an input dataset, SVM makes a binary decision, deciding to which of the two categories an input sample belongs. The SVM algorithm is trained to separate the labeled input data into two categories divided by the widest area possible; because of this wide area between category labels, SVM is also known as a large margin classifier. The SVM model is a linear classifier, and it cannot map nonlinear functions without the use of a kernel [25].
The classification of data is one of the most common problems in machine learning. The SVM algorithm represents input data samples as vectors in an n-dimensional space and builds a linear formula for a hyperplane in the (n−1)-dimensional space that separates the label categories. There is an infinite number of hyperplanes that can separate the labels in space, but the SVM algorithm chooses the hyperplane that has the maximum distance to both label categories [30,31]. Vapnik [32] recommended that the hyperplane method, $f(x) \in F$, be selected from a class of functions with sufficient capacity. In particular, $F$ covers the functions of linearly separable hyperplanes through the following universal approximator:

$$f(x) = \sum_{i=1}^{n} w_i \langle x_i, x \rangle + b$$
(8)

Every training example, $x_i$, is given a weight, $w_i$, and the hyperplane is a weighted combination of the whole set of training samples [25].
Assume that the explicit form of the nonlinear function, $\phi(\cdot)$, is avoided by the use of a kernel function, properly determined as the dot product of the nonlinear functions,

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$$
(9)

and Equation (8) transforms into

$$f(x) = \sum_{i=1}^{n} w_i K(x_i, x) + b$$
(10)

This technique is generally attributed to Mercer's theorem [30]. The classifier which is trained then has the following form:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} w_i K(x_i, x) + b\right)$$
(11)
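As a hedged illustration of the kernel trick (scikit-learn's SVC on a synthetic dataset; the RBF kernel and all parameters are assumptions for demonstration, not the paper's settings):

```python
# Linear hyperplane (Equation (8)) vs. kernelized SVM (Equations (9)-(11)):
# the RBF kernel replaces the explicit nonlinear mapping phi(.).
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # plain linear hyperplane
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)  # kernelized classifier

print("linear accuracy:", linear.score(X, y))  # poor: not linearly separable
print("rbf accuracy:", rbf.score(X, y))        # high: kernel handles it
```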
Decision Tree (DT) Algorithms: Nodes in a decision tree test the value of a feature. A node test sometimes compares the value of a feature to a constant; in other cases, two features are compared to each other, or a function of one or more features is applied. A new sample is classified by routing it down the tree according to the feature values tested in consecutive nodes. Once a leaf has been reached, the sample is classified according to the class associated with that leaf [33]. The reduced-error pruning tree (REPTree) is an algorithm that builds a tree using information gain and prunes it by reduced-error pruning. Speed optimization is achieved by sorting the values of numeric features and by handling missing values in the same way as C4.5 [31]. The alternating decision tree (ADTree) is applied to two-class problems and creates a decision tree modified by boosting [31]. The LogitBoost ADTree (LADTree) is an alternating decision tree algorithm applied to problems with multiple classes using the LogitBoost algorithm; like in ADTree, the number of iterations is a parameter that defines the size of the generated tree [31]. The classification and regression tree (CART) applies the minimum cost-complexity strategy for pruning. For numeric attributes, every test may be a linear combination of attribute values; hence, the output tree represents a hierarchy of linear models [31].
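A minimal sketch of a CART-style tree with minimum cost-complexity pruning, using scikit-learn (whose DecisionTreeClassifier implements an optimized CART variant); the dataset and the ccp_alpha value are illustrative assumptions:

```python
# CART-style decision tree with minimum cost-complexity pruning.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

# Pruning trades a little training accuracy for a much simpler tree.
print("unpruned leaves:", unpruned.get_n_leaves())
print("pruned leaves:  ", pruned.get_n_leaves())
```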
Random Forest: Breiman [34,35] presented the random forest classifier, which contains many classification trees. Every tree acts as a classifier, and the classification outputs of all trees define the overall classification output; this overall output is obtained by selecting the mode of all the trees' classification outputs. The phases of constructing random forests (RF) may be summarized as follows:
The learning process of the RF classifier starts with the construction of each tree.
A training dataset is generated for every tree. This training set is named the in-bag dataset, and it is made using the bootstrapping method; it contains the same number of samples as the original training dataset. In the in-bag set, around 30% of the samples are repeated, so the remaining part of the training dataset, named the out-of-bag set, is used to check the tree's classification performance. Each tree is constructed from a different bootstrap sample.
For every tree, a random subset of features is selected and used to build the nodes and leaves of the random tree classifier. According to Breiman [35], the random selection of features raises the classification accuracy, reduces the sensitivity to noise in the data, and reduces the correlation between features [36].
In the last step, the random tree is grown using the whole of its training dataset.
The leading phase involves selecting a feature at the root node and then dividing the training data into subsets, creating one branch for every potential feature value. The splitting at the root node is governed entirely by the information gain of the candidate feature split. The information gain ($IG$) of separating the training dataset ($Y$) into subsets ($Y_i$) may be defined as follows:

$$IG = E(Y) - \sum_{i} \frac{|Y_i|}{|Y|}\, E(Y_i)$$
(12)

The operator $|\cdot|$ denotes the size of a set, and $E(Y_i)$ is the entropy statistic [37] of the set $Y_i$, defined as:

$$E(Y_i) = -\sum_{j=1}^{N} p_j \log_2 p_j$$
(13)

where $N$ is the number of sleep stages to be classified ($N = 5$) and $p_j$ is the proportion of sleep stage $j$ in the set $Y_i$. If the information gain is positive, the node is divided; if it is negative, the node stays the same and becomes a leaf node that is assigned a class label. The feature to split on is chosen as the one with the highest information gain among the remaining features, and the splitting procedure continues until all features have been used. The classification output is the most frequently occurring sleep stage in the training subset reaching the node [36].
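To make Equations (12) and (13) concrete, here is a small, self-contained sketch; the example labels and split are invented for illustration:

```python
# Entropy and information gain as in Equations (12) and (13).
import numpy as np

def entropy(labels):
    """E(Y) = -sum_j p_j * log2(p_j) over the class proportions p_j."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    """IG = E(Y) - sum_i |Y_i|/|Y| * E(Y_i)."""
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

# Hypothetical sleep-stage labels (N = 5 stages) split into two subsets:
parent = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
left, right = parent[:5], parent[5:]
print("IG of this split:", information_gain(parent, [left, right]))
```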
2.5. Ensemble Machine Learning Methods
The Random Tree classifier is an ensemble-learning algorithm that generates a number of individual learners. In a standard tree, every node is split using the best split among all features; in a random tree, each node is split using the best split among a randomly chosen subset of features [31,38].
Bagging: Combining the decisions of different models means amalgamating the various outputs into a single decision. For classification, the simplest way to do this is voting; for the prediction of numeric values, it is averaging.
Individual decision trees can be regarded as experts, and trees can be combined by voting on every test case: a class is taken as correct if it receives more votes than the others. Predictions obtained by voting become more reliable as more votes are taken into account. Including new training datasets, building trees for them, and letting the predictions of these trees contribute to the vote seldom makes the decisions worse. Bagging tries to neutralize the instability of learning techniques by simulating this process without requiring a new training set each time. Instead of sampling a fresh training dataset each time, the original training data is modified by deleting some samples and duplicating others: a new dataset is generated by randomly sampling instances from the original dataset with replacement. This sampling process inevitably removes some of the instances and replicates others. Bagging thus only resamples the native training data instead of generating independent datasets from the domain. The datasets obtained by resampling differ from one another but are certainly not independent, because they are all based on one dataset. Nevertheless, bagging produces a combined model that often performs significantly better than a single model built from the native training data and is never substantially worse. For numeric prediction, the only difference is that the individual predictions, which are real numbers, are averaged instead of voted on [31].
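A brief scikit-learn illustration (assuming scikit-learn ≥ 1.2, where the base learner is passed as estimator; the dataset and ensemble size are arbitrary):

```python
# Bagging: bootstrap-resampled training sets, one tree per sample,
# and a majority vote at prediction time.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(estimator=single, n_estimators=50,
                           bootstrap=True, random_state=0)

print("single tree :", cross_val_score(single, X, y).mean())
print("bagged trees:", cross_val_score(bagged, X, y).mean())
```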
Boosting is a technique that combines multiple models by explicitly seeking models that complement each other. Like bagging, boosting uses voting or averaging to combine the output of each model [31].
AdaBoost, like bagging, may be used with any learning algorithm, provided that the algorithm can handle weighted instances, where each weight is a positive number. The instance weights are used to calculate the classifier's error; by doing so, the learning algorithm is made to concentrate on the particular set of instances with high weights. The AdaBoost algorithm starts by assigning the same weight to all training data instances, then makes the learning method create a classifier for this data, and reweights every instance according to the output of the classifier [31].
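A short scikit-learn illustration (again assuming scikit-learn ≥ 1.2; the decision stump base learner and ensemble size are arbitrary choices):

```python
# AdaBoost: start from uniform instance weights, train a weak learner,
# then upweight the misclassified instances for the next round.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)  # a weak base learner
ada = AdaBoostClassifier(estimator=stump, n_estimators=100, random_state=0)
ada.fit(X, y)
print("training accuracy:", ada.score(X, y))
```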
MultiBoosting: Bagging and AdaBoost appear to work through dissimilar mechanisms, and the two methods can be combined to obtain the benefits of both. Bagging mainly decreases variance, while AdaBoost decreases both bias and variance; moreover, bagging achieves a greater effect than AdaBoost at decreasing variance [39], so their combination can retain AdaBoost's bias reduction while increasing bagging's variance reduction. Since bagging reduces the number of training examples available to form each sub-committee, wagging [39] can be used to eliminate this drawback. As a result, MultiBoosting can be defined as wagging committees formed by AdaBoost. To set the number of sub-committees, MultiBoosting uses a single committee size argument and then defines a target final sub-committee member index by assigning an index to each member of the final committee, starting from one. Moreover, MultiBoosting has a potential computational advantage over AdaBoost in that the sub-committees can be learned in parallel, although this would necessitate an adjustment to how premature termination of learning in a sub-committee is handled. AdaBoost itself is intrinsically sequential, which reduces the potential for parallelization, whereas every classifier learned with wagging is independent of the others, allowing parallelization, a gain that MultiBoosting obtains at the sub-committee level [40].
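A rough, hedged sketch of the idea (not Webb's exact algorithm): each sub-committee is an AdaBoost ensemble trained on wagged, i.e., randomly re-weighted, instances, and the sub-committees are combined by vote. The Poisson weighting, committee sizes, and binary-label vote are assumptions for illustration:

```python
# MultiBoosting-style sketch: wagging committees formed by AdaBoost.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)

committees = []
for _ in range(5):  # 5 sub-committees, each an AdaBoost ensemble
    w = rng.poisson(lam=1.0, size=len(X)) + 1e-3  # wagging weights
    sub = AdaBoostClassifier(n_estimators=20, random_state=0)
    sub.fit(X, y, sample_weight=w)
    committees.append(sub)

# Final decision: majority vote across the sub-committees (binary labels).
votes = np.stack([c.predict(X) for c in committees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```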
Rotation Forest: Classification accuracy can usually be improved by considering a group of several classifiers, a procedure known as a classifier ensemble. Rotation Forest is a technique for constructing classifier ensembles based on feature extraction. The feature set is randomly divided into K subsets, and PCA is carried out on each subset. All principal components are kept, so the variability in the data is preserved. Consequently, the new features for a base classifier are formed by rotations of the K axes. The rotation sustains diversity within the ensemble through feature extraction for every base classifier. The term "forest" is used because Decision Trees (DTs) were selected as base classifiers due to their sensitivity to rotation. Keeping all principal components looks after individual accuracy; accuracy is further pursued because the whole dataset is used for training every base classifier. Rotation Forest ensembles are able to form more accurate and diverse individual classifiers than those built with the AdaBoost, Random Forest, and Bagging approaches [41].
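A simplified Rotation Forest sketch follows (NumPy plus scikit-learn). It omits the per-subset bootstrap and centering details of the full method; K, the ensemble size, and the binary-label vote are illustrative assumptions:

```python
# Rotation Forest sketch: split the features into K subsets, run PCA on
# each subset (keeping all components), assemble a block-diagonal
# rotation matrix, and train a decision tree on the rotated data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=12, random_state=0)
rng = np.random.default_rng(0)
K, n_trees = 3, 10
trees, rotations = [], []

for _ in range(n_trees):
    perm = rng.permutation(X.shape[1])
    subsets = np.array_split(perm, K)        # K random feature subsets
    R = np.zeros((X.shape[1], X.shape[1]))   # block-diagonal rotation
    for idx in subsets:
        pca = PCA().fit(X[:, idx])           # keep all components
        R[np.ix_(idx, idx)] = pca.components_.T
    rotations.append(R)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X @ R, y))

# Majority vote over the ensemble (binary labels assumed).
votes = np.stack([t.predict(X @ R) for t, R in zip(trees, rotations)])
pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (pred == y).mean())
```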
The Random Subspace Method (RSM) is a combining method introduced in Reference [42]. In the RSM, the training data is also modified, but the modification is performed in the feature space: an r-dimensional subspace is selected at random from the original p-dimensional feature space, so the modified training set consists of r-dimensional training objects. One may then construct classifiers in the random subspaces, and the final decision rule combines them via a majority vote. The RSM is beneficial because random subspaces are used both for constructing and for aggregating the classifiers. When the number of training objects is small compared to the dimensionality of the data, constructing classifiers in random subspaces can resolve the small sample size problem: the dimensionality of the subspace is smaller than that of the original feature space, while the number of training objects remains the same [43].
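The RSM can be illustrated with scikit-learn's BaggingClassifier by disabling instance bootstrapping and subsampling the features instead (assuming scikit-learn ≥ 1.2; r = 5 of p = 20 is an arbitrary choice):

```python
# Random Subspace Method: every tree is trained on all training objects
# but only on a random r-dimensional subset of the p features; the
# ensemble decision is a majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

rsm = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50,
                        bootstrap=False,   # keep all training objects
                        max_features=5,    # random 5-dimensional subspaces
                        random_state=0)
print("RSM accuracy:", cross_val_score(rsm, X, y).mean())
```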