A Holistic Strategy for Classification of Sleep Stages with EEG

Manual sleep stage scoring is usually performed by sleep specialists through visual inspection of the patient's neurophysiological signals. Because this is a laborious task, automated sleep stage classification systems have been developed in the past, and researchers continue to advance them. These automated systems identify the various stages of sleep, an important step in assisting doctors with the diagnosis of sleep-related disorders. In this work, a holistic strategy named clustering and dimensionality reduction with feature extraction cum selection for classification along with deep learning (CDFCD) is proposed for the classification of sleep stages from EEG signals. Although the methodology follows a structural flow similar to past works, many advanced and novel techniques are proposed under each stage of this workflow. Initially, clustering is applied using hierarchical clustering, spectral clustering, and the proposed principal component analysis (PCA)-based subspace clustering. The dimensionality of the data is then reduced with the proposed singular value decomposition (SVD)-based spectral algorithm and the standard variational Bayesian matrix factorization (VBMF) technique. Features are then extracted and selected with two novel proposed techniques: the sparse group lasso technique with dual-level implementation (SGL-DLI) and the ridge regression technique with limiting weight scheme (RR-LWS). Finally, classification is performed with the less explored multiclass Gaussian process classification (MGC), the proposed random arbitrary collective classification (RACC), and a deep learning technique using long short-term memory (LSTM), along with other conventional machine learning techniques.
This methodology is validated on the Sleep-EDF database, and the results obtained surpass those of previous studies in terms of classification accuracy, reaching 93.51% even for the six-class classification problem.


Introduction
Sleep is one of the most important functions of the brain, and it plays a vital role in a person's life, affecting factors such as learning ability, concentration, and memory [1]. Sleep renders an individual partially or fully unconscious, thereby making the brain a less complicated network. Conditions such as insomnia and obstructive sleep apnea are quite frequent, and they greatly affect physical health [2]. Sleep issues cause depression, fatigue, lack of interest in academics or office work, headache, frequent colds, and joint problems, and can sometimes even lead to death. Many road traffic accidents and fatalities are caused by drowsiness [3]. Therefore, automatic detection and analysis of sleep patterns are quite important for tracing sleep-related conditions, including fatigue, drowsiness, apnea, insomnia, and so forth [3]. For the analysis of human sleep, sleep stage scoring is the gold standard, and it helps to identify the sleep stages that are important in treating sleep disorders. The standard protocol is to extract features from the obtained polysomnographic (PSG) recordings, with or without dimensionality reduction, and later classify them with a classification procedure. For feature extraction, time domain methods, frequency domain methods, nonlinear complex methods, and so forth are utilized widely [29]. The common time domain methods used in the past for feature extraction are standard statistical methods, such as mean, variance, standard deviation, skewness, kurtosis, threshold percentile, median, Shannon entropy, Renyi entropy, zero crossing, Hjorth parameters, detrended fluctuation analysis, mutual information, and Tsallis entropy. The common frequency domain methods utilized in the past include nonparametric analysis, parametric analysis, higher-order spectra (HOS), median frequency, harmonic parameters, coherence analysis, Itakura distance, spectral entropy, and so forth [29].
The time-frequency domain methods include wavelet transform, signal decomposition, short-time Fourier transform, empirical mode decomposition, energy distribution, and the Choi-Williams technique. Other nonlinear parameters used in the past for feature extraction include correlation dimension, Lempel-Ziv complexity, Lyapunov exponent, fractal dimension, approximate entropy, sample entropy, autoregressive coefficients, phase space components, Hurst exponent, energy operators, permutation operator, and multiscale entropy [29]. Feature selection techniques used in the past for sleep stage classification include fuzzy C-means clustering, minimum redundancy maximum relevance, sequential techniques, artificial immune clustering, large margin neural network, fast correlation-based filter, Fisher score, t-test, recursive feature elimination, principal component analysis, the ReliefF method, and metaheuristic algorithms such as differential evolution, genetic algorithm, and particle swarm optimization (PSO) feature selection methods. Many machine learning techniques, such as linear discriminant analysis (LDA), support vector machine (SVM), artificial neural networks (ANN), naïve Bayesian classifier (NBC), quadratic discriminant analysis (QDA), K-nearest neighbor (KNN), decision trees (DT), Adaboost, K-means classifier, Gaussian mixture model (GMM), hidden Markov model (HMM), and bagging, have been utilized for sleep stage classification in the past [30]. Several concerns always exist with automatic sleep stage classification systems, such as deviation in the ranges of classification accuracy along with sensitivity and specificity measures, the careful selection of versatile feature extraction and selection methods, and the execution of advanced classification techniques.
Many other considerations, such as strong mathematical modelling, good computational time, high generalization and stability, and prospects for good hardware implementation in real-time situations, are also weighed while developing automated sleep stage classification systems. After analyzing the past literature, several novel components were implemented in this paper, and the major contributions of the work are as follows: (a) Initially, clustering was applied to the EEG signals; the clustering methodology incorporates hierarchical clustering, spectral clustering, and the proposed PCA-based subspace clustering techniques, and this is the first work to implement all three techniques in EEG signal processing for automated sleep stage classification. (b) The dimensionality of the signals was then reduced with the help of the proposed SVD-based spectral algorithm and the standard VBMF. Although VBMF already exists in the literature, very few works have reported its application for reducing the dimensionality of EEG, and so it is considered in this work along with the proposed SVD-based spectral technique. (c) The features were extracted and selected with the proposed sparse group lasso technique with dual-level implementation (SGL-DLI) and the proposed ridge regression technique with limiting weight scheme (RR-LWS). Both of these novel techniques have been successfully utilized in our work. (d) Finally, classification is performed with the less explored multiclass Gaussian process classification (MGC), the proposed RACC method, and a deep learning technique using LSTM, and the performance is compared with other conventional machine learning techniques as well.
A very good mathematical modelling is provided for all the proposed techniques, and the interesting factor is the novel convergence of all the proposed techniques, which makes the experiment straightforward to perform and yields better results than the previous works. The workflow of the methodology is shown in Figure 1. The organization of the work is as follows: In Section 2, the clustering and dimensionality reduction techniques are discussed, followed by the feature extraction and selection techniques in Section 3. Section 4 discusses the classifiers, followed by the results and discussion in Section 5 and the conclusion in Section 6.


Clustering Techniques
Clustering is the task of grouping a set of objects in such a manner that objects in the same cluster/group are more similar to each other than to those in other clusters/groups. In this work, hierarchical clustering, spectral clustering, and the proposed PCA-based subspace clustering are utilized, and they are applied to the signals once preprocessing is done with the help of independent component analysis (ICA).

Hierarchical Clustering
One of the famous and strong manifestations of the curse of dimensionality is that points drawn from high-dimensional distributions lie far from their nearest neighbors, and addressing the associated noise and outliers becomes a huge challenge [31]. To model low-dimensional structure, various assumptions are imposed on the data, such as requiring the clusters to be drawn from affine subspaces. Spectral clustering is usually applied when the cluster shape is unknown or deviates severely from a linear structure; with respect to the geometry of the clusters in the presence of noise and outliers, it is a very popular and effective approach. Spectral clustering requires an initial distance or similarity measure, as it operates on a graph constructed between the assessed neighbors with weights dependent on such distances. Assessing the groupings within the data and, based on these groupings, assigning labels to the data points without supervision is the procedure termed clustering. In some circumstances it performs well, but in others it does not, as observed with K-means clustering, fuzzy C-means clustering, and so forth [31]. Statistical assumptions are usually placed on the data so that a good performance assessment is possible. The most famous clustering technique is K-means, along with its variants, and it is utilized with many feature extraction techniques in EEG signal processing. In this paper, however, a different attempt to utilize other kinds of clustering is made, and the notation for the clustering concept is as follows: the data points are denoted by X = {x_i}_{i=1}^n ⊂ ℝ^D, the intrinsic dimension of the cluster sets is expressed by d, the number of clusters is denoted by K, and the discrete data clustering is denoted by {X_l}_{l=1}^K.
The noisy data are represented as X, and the denoised data as X_N. The number of points that remain in a cluster is denoted n_min, W represents the weight matrix, and ρ represents an arbitrary metric. A family of clusters is built at distinct hierarchical levels by the hierarchical clustering algorithm [31]. The individual points are initiated as their own clusters and then merged iteratively until a stopping criterion is reached; the algorithms are therefore agglomerative. Agglomerative techniques determine which clusters to merge at a given iteration by utilizing a clustering dissimilarity metric ρ_c: for two clusters C_i, C_j, a small ρ_c(C_i, C_j) implies that the clusters are strong candidates for merging. For the data points in X, let the metric be expressed as ρ_X; together with ρ_c, it defines the corresponding clustering technique. Spectral clustering utilizes a spectral decomposition of a Laplacian matrix to define an embedding of the data, and the embedded data are then clustered using a standard algorithm such as K-means. On the data, a weighted graph is constructed that specifies the local relationships: points that are far apart receive very low edge weights, and points that are very close together receive very high edge weights. The graph is then partitioned into clusters so that small edge weights run between the clusters and large edge weights lie within each cluster. A kernel function is denoted f_σ : ℝ → [0, 1] with a specific scale parameter σ. For a given metric ρ, the respective weight matrix is W_ij = f_σ(ρ(x_i, x_j)). The degree of point x_i is expressed as d_i = Σ_{j=1}^n W_ij, and the diagonal degree matrix is defined by D_ii = d_i and D_ij = 0 for i ≠ j.
The graph Laplacian is defined by L = D − W and is normalized to obtain the symmetric Laplacian L_SYM = I − D^{−1/2} W D^{−1/2}. It can also be normalized to obtain the random-walk Laplacian L_RW = I − D^{−1} W. Using the eigenvectors of L to define an embedding leads to unnormalized spectral clustering, while the eigenvectors of L_SYM lead to normalized spectral clustering. The within-cluster similarity is always maximized by normalized spectral clustering, which is generally utilized in practice.
The spectral clustering with L_SYM is considered, and then the spectral embedding is constructed [32]. To indicate the matrix L_SYM computed on the dataset X with metric ρ and kernel f_σ, the notation L_SYM(X, ρ, f_σ) is used. The eigenvalues of L_SYM are specified by λ_1 ≤ ... ≤ λ_n, and the respective eigenvectors are denoted by φ_1, ..., φ_n. To cluster the data into K groups, an n × K matrix Φ is first formed whose columns are {φ_i}_{i=1}^K; these K eigenvectors are called the K principal eigenvectors. The matrix V is obtained by normalizing the rows of Φ, expressed as V_ij = Φ_ij / (Σ_{k=1}^K Φ_ik^2)^{1/2}. The rows of V are specified by {v_i}_{i=1}^n ∈ ℝ^K. If g : ℝ^D → ℝ^K denotes the spectral embedding, then v_i = g(x_i). Finally, {v_i}_{i=1}^n is clustered into K groups by applying K-means, which yields the partition of the data points {x_i}_{i=1}^n. Similarly, L_RW can be utilized. A vital aspect of spectral clustering is the choice of K. The eigenvalues of L_SYM are often used to estimate the total number of clusters through the largest eigengap, K̂ = argmax_i (λ_{i+1} − λ_i). It should also be observed that λ_{K̂+1} − λ_{K̂} is maximal and that λ_i should be close to zero for i ≤ K̂. The spectral clustering algorithm utilized in this work is given in Algorithm 1. This algorithm can also be utilized with a sparse Laplacian, where W is defined by a sparse nearest-neighbor graph. Algorithm 1: Spectral clustering process with a metric ρ.
Input: (data) X, (kernel function) f_σ, and (scaling parameter) σ > 0. Output: Y (labels) with clustered values. 1. The weight matrix W with W_ij = f_σ(ρ(x_i, x_j)) is computed. 2. The diagonal degree matrix D with D_ii = Σ_j W_ij is computed. 3. The symmetric Laplacian L_SYM = I − D^{−1/2} W D^{−1/2} is formed. 4. The eigendecomposition of L_SYM is computed and sorted so that 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n. 5. The number of clusters is estimated as K̂ = argmax_k (λ_{k+1} − λ_k). 6. The row-normalized spectral embedding V is defined, and the labels Y are computed by K-means utilizing K̂ as the total number of clusters, thereby implementing the concept of clustering successfully.
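As an illustration, the steps of Algorithm 1 can be sketched in NumPy. The Gaussian kernel, the farthest-point K-means initialization, and the restriction of the eigengap search to the first n/2 eigenvalues are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def spectral_clustering(X, sigma=1.0, K=None, iters=50):
    """Normalized spectral clustering with L_SYM (sketch of Algorithm 1)."""
    n = X.shape[0]
    # Steps 1-2: Gaussian-kernel weight matrix and degree matrix
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # Steps 3-4: symmetric Laplacian and its sorted eigendecomposition
    L_sym = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    lam, phi = np.linalg.eigh(L_sym)           # eigh returns ascending order
    # Step 5: eigengap heuristic when K is not supplied
    if K is None:
        K = int(np.argmax(np.diff(lam[: n // 2])) + 1)
    # Step 6: row-normalized embedding, then plain K-means (Lloyd) on its rows
    Phi = phi[:, :K]
    V = Phi / np.linalg.norm(Phi, axis=1, keepdims=True)
    centers = [V[0]]                            # farthest-point initialization
    for _ in range(K - 1):
        dist = ((V[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(V[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((V[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = V[labels == k].mean(axis=0)
    return labels
```

Passing K explicitly skips the eigengap heuristic of step 5; with K=None the heuristic picks the number of clusters from the largest eigengap.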

Proposed PCA-Based Subspace Clustering
Subspace clustering is an important technique, and the mainstream approach has two important phases: calculation of the affinity matrix, followed by the application of spectral clustering [33]. To enhance the scalability of sparse subspace clustering, many techniques have been proposed in the literature. A random subset of the whole dataset is clustered, and these clusters are then used to group the remaining data points. This technique scales well whether the random subset is small or large. From the raw dataset X, which has N observations of various parameters as input, the clustering assignment of each point in the dataset is the output. In the sample clustering stage, a subset X̄ of n points is drawn from the N points. For every point x_i ∈ X̄, its (d_max + 1) nearest neighbor points in X̄ are found. The index set of these points is denoted by C_i, and the subcluster of x_i is therefore X̄_{C_i}. The affinity matrix D is computed elementwise: [D]_ij is the similarity computed between X̄_{C_i} and X̄_{C_j}. By eliminating spurious connections through the implementation of principal component analysis (PCA), the affinity matrix is sparsified [34]. With the sparsified affinity matrix, spectral clustering is conducted on X̄. A classifier is fit to the clustered points in X̄, and the remaining points may be classified; in our case, however, the dimensionality is still reduced afterwards, and the best features are extracted and selected so that a very good classification accuracy can be obtained at a later stage. A total of n subclusters are formulated with the help of the sampled dataset, and these n subclusters are divided and grouped into k clusters. The linear model of subspaces is central to this clustering concept. Around every sampled point, the neighborhood of points is computed by applying a threshold based on the similarity score of inner products.
There is a strong dependence on the self-representative property of linear subspaces here, and therefore a notion of distance between the subclusters is developed to build an affinity matrix. The proposed PCA-based subspace clustering is expressed in Algorithm 2 as follows. Input: data X, number of subspaces k, sampling size n, regularization parameters λ_1 and λ_2, neighborhood threshold d_max, residual minimization parameter m, affinity threshold t_max. Output: the label vector l of all points in X with clustered values.
1. The uniform sampling of n points X̄ from X is performed. 2. The subclusters are constructed. 3. PCA is implemented on the subclusters. 4. The affinity matrix is constructed. 5. The adjacency matrix is sparsified. 6. The sampled points in X̄ are clustered by implementing spectral clustering on D. 7. A classifier is fit to the clustered sample points, and the labels l_out of the remaining points are obtained. 8. The labels of X̄ are indicated by l_in. 9. The label of the entire dataset X is obtained by combining l_in and l_out so that the entire l is obtained and the clustering is performed successfully.
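The steps above can be sketched in a deliberately simplified NumPy toy, under the assumption that each cluster lies near a one-dimensional linear subspace. The affinity between two subclusters is taken as the |cosine| between their first principal directions, connected components of the sparsified affinity graph stand in for the spectral clustering of step 6, and nearest-sample assignment stands in for the classifier of step 7; all of these are illustrative substitutions, not the paper's exact procedure:

```python
import numpy as np

def pca_subspace_clustering(X, n_sample=40, d_max=5, t_max=0.9, seed=0):
    """Toy sketch of Algorithm 2 for clusters near 1-D linear subspaces."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_sample, len(X)), replace=False)
    S = X[idx]                                  # sampled subset (step 1)
    sims = np.abs(S @ S.T)                      # inner-product similarity
    dirs = []
    for i in range(len(S)):
        nbrs = np.argsort(-sims[i])[: d_max + 1]   # (d_max+1) nearest points
        C = S[nbrs] - S[nbrs].mean(axis=0)
        _, _, Vt = np.linalg.svd(C, full_matrices=False)  # PCA (step 3)
        dirs.append(Vt[0])                      # first principal direction
    dirs = np.array(dirs)
    A = np.abs(dirs @ dirs.T)                   # affinity matrix (step 4)
    A[A < t_max] = 0.0                          # sparsification (step 5)
    # Step 6 stand-in: connected components of the sparsified affinity graph
    labels_in = -np.ones(len(S), dtype=int)
    c = 0
    for i in range(len(S)):
        if labels_in[i] >= 0:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if labels_in[j] >= 0:
                continue
            labels_in[j] = c
            stack.extend(np.nonzero(A[j] > 0)[0].tolist())
        c += 1
    # Steps 7-9 stand-in: each point takes the label of its most similar sample
    return labels_in[np.argmax(np.abs(X @ S.T), axis=1)]
```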

Dimensionality Reduction Techniques
To reduce the overall dimension, dimensionality reduction techniques are highly useful; the techniques incorporated here are the proposed SVD-based spectral algorithm and the standard variational Bayesian matrix factorization technique. Once the signals are clustered using the above three techniques, their dimensionality is reduced so that the aim of achieving a high classification accuracy can be met later.

SVD-Based Spectral Algorithm
In this approach, an undirected graph G = ([n], E) is initially assembled along with an unknown score vector r ∈ ℝ^n, where r_i expresses the score associated with node i for the obtained clustered values. G is assumed to follow the model G(n, p), where the edges between the vertices appear independently with probability p. For each i, the assumption is that r_i is drawn uniformly. A noisy and independent measurement R_ij is obtained for every {i, j} ∈ E. The parameter η ∈ [0, 1] is used to control the noise level, and the noise level is indicated explicitly by γ = 1 − η. The parameters η and p are not assumed to be known to the algorithm.
The measurement matrix H ∈ ℝ^{n×n} is formed from the independent measurements R_ij. The main intention is to recover the score vector r and also the ranking π induced by r. For the complete graph G under noise-free measurements, H = r e^T − e r^T, which is a rank-2 skew-symmetric matrix. By specifying α = r^T e / n, it can be seen that the two nonzero singular values are equal, with left singular vectors spanning the plane of (r − αe)/∥r − αe∥_2 and e/√n. Finding a vector in any orthonormal basis of span{u_1, u_2} that is orthonormal to e/√n yields the candidate solutions ±(r − αe)/∥r − αe∥_2. To recover the scale information of r, these candidates are multiplied by σ_1/√n. The sign ambiguity is resolved easily by selecting the candidate that has the highest consistency with the measurements. Ranking and synchronization is thus implemented here, as it is a famous spectral technique to recover the ranks and scores of items. The application of SVD for ranking and synchronization is expressed in Algorithm 3 as follows.
Algorithm 3: SVD for ranking and synchronization.
Input: measurement graph G = ([n], E) and pairwise measurements R_ij for {i, j} ∈ E assigned to the clustered values. Output: rank estimates π̂ and score estimates r̂ ∈ ℝ^n considered as the dimensionally reduced values.
1. Form the measurement matrix H ∈ ℝ^{n×n} by utilizing R_ij. 2. Trace the top-2 left singular vectors of H, namely û_1, û_2. 3. Obtain the vector ū_1 as the orthogonal projection of u_1 = e/√n onto span{û_1, û_2}. 4. Obtain the unit vector ū_2 ∈ span{û_1, û_2} orthogonal to ū_1. 5. Rank recovery: the ranking π induced by ū_2 is obtained. 6. Reconcile its global sign by minimizing the number of upsets. 7. The ranking estimate π̂ is found. 8. Score recovery: to recover the scale τ ∈ ℝ, ū_2 and H are utilized and the output r̂ is expressed, giving the dimensionally reduced values.
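A minimal NumPy sketch of Algorithm 3, assuming a complete measurement graph so that H is formed directly from all pairwise measurements; the global sign is reconciled by agreement with the measurements rather than by explicitly counting upsets:

```python
import numpy as np

def svd_rank_sync(H):
    """Sketch of Algorithm 3: given the (possibly noisy) skew-symmetric
    measurement matrix H ~ r e^T - e r^T, recover score estimates r_hat
    (up to the additive constant alpha) and the induced ranking."""
    n = H.shape[0]
    U, s, _ = np.linalg.svd(H)
    span = U[:, :2]                          # top-2 left singular vectors
    e = np.ones(n) / np.sqrt(n)
    u1 = span @ (span.T @ e)                 # project e/sqrt(n) onto the span
    u1 /= np.linalg.norm(u1)
    w = span[:, 0] - (span[:, 0] @ u1) * u1  # vector in span, orthogonal to u1
    if np.linalg.norm(w) < 1e-8:             # fall back if nearly parallel
        w = span[:, 1] - (span[:, 1] @ u1) * u1
    u2 = w / np.linalg.norm(w)
    r_hat = (s[0] / np.sqrt(n)) * u2         # restore the scale of r - alpha*e
    # resolve the global sign by consistency with the measurements
    if np.sum(np.sign(np.subtract.outer(r_hat, r_hat)) * np.sign(H)) < 0:
        r_hat = -r_hat
    return r_hat, np.argsort(-r_hat)         # scores and ranking (best first)
```

On noise-free data the recovered pairwise differences r̂_i − r̂_j reproduce H exactly, since r̂ equals r − αe up to numerical error.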

Variational Bayesian Matrix Factorization
To uncover a low-rank latent structure of data, matrix factorization is utilized, where the data matrix is approximated by a product of two factor matrices [36]. For collaborative prediction, the most famous technique utilized is matrix factorization, where the user and item factor matrices are used to predict the unknown ratings; the approximation of a user-item matrix as their product can thus be analyzed well. Assuming that Z ∈ ℝ^{P×Q} indicates a user-item rating matrix, its (p, q) entry Z_pq indicates the rating of user p on item q. The factor matrices U = [u_1, ..., u_P] ∈ ℝ^{K×P} and V = [v_1, ..., v_Q] ∈ ℝ^{K×Q} are determined by the matrix factorization so that the rating matrix Z is approximated by U^T V.
Here, the rank of the factor matrices is denoted by K.
The regularized squared error loss is minimized and expressed as: min_{U,V} Σ_{(p,q)∈Ω} (Z_pq − u_p^T v_q)^2 + λ(∥U∥_F^2 + ∥V∥_F^2), (7) where Ω represents the collection of indices of the observed entries in Z and λ represents the regularization parameter. Problem (7) can be solved efficiently by stochastic gradient descent or alternating least squares. The metaparameters, such as the learning rate, regularization parameters, and total number of iterations, should be carefully tuned to avoid overfitting the training data. By applying the Bayesian concept to matrix factorization, all the model parameters are integrated over, so the overfitting problem is alleviated successfully; complex models can thus be learned easily without extensive parameter tuning. Side information can be incorporated easily by Bayesian matrix factorization through Gaussian priors on the user and item factor matrices, with each prior regressed onto the respective side information so that the time and space complexity is reduced. With respect to the rank K, VBMF has cubic time and quadratic space complexity, as the variational distributions are considered matrixwise independent. If side information is incorporated into VBMF, an additional cubic time and quadratic space cost arises, depending on the size of the feature vectors obtained from the side information, which prohibits the usage of rich, high-dimensional side information; the dimensionally reduced values are thus obtained. To satisfy element-wise independence, the variational distribution is fully factorized.
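For concreteness, the regularized loss in (7) can be minimized by alternating least squares, as sketched below. This is the point-estimate baseline that VBMF improves on by maintaining variational posteriors over U and V, which this sketch does not do:

```python
import numpy as np

def mf_als(Z, mask, K=2, lam=1e-3, iters=200, seed=0):
    """Alternating least squares for the loss in (7):
    sum over observed (p,q) of (Z_pq - u_p^T v_q)^2 + lam*(||U||_F^2+||V||_F^2).
    `mask` marks the observed entries Omega."""
    P, Q = Z.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(K, P))
    V = rng.normal(scale=0.1, size=(K, Q))
    for _ in range(iters):
        for p in range(P):                  # ridge update of user factor u_p
            q_idx = np.nonzero(mask[p])[0]
            Vq = V[:, q_idx]
            U[:, p] = np.linalg.solve(Vq @ Vq.T + lam * np.eye(K),
                                      Vq @ Z[p, q_idx])
        for q in range(Q):                  # ridge update of item factor v_q
            p_idx = np.nonzero(mask[:, q])[0]
            Up = U[:, p_idx]
            V[:, q] = np.linalg.solve(Up @ Up.T + lam * np.eye(K),
                                      Up @ Z[p_idx, q])
    return U, V
```

Each inner update is a small ridge regression, so every iteration decreases the loss in (7); λ keeps the K × K systems well conditioned when a row or column has few observed entries.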

Feature Extraction and Selection Techniques
Once the dimensionality is reduced, the features have to be extracted and selected, and therefore, the two techniques proposed here are the sparse group lasso technique with dual-level implementation and a ridge regression technique with limiting optimal weight scheme.

Proposed Sparse Group Lasso Technique with Dual-Level Implementation
To identify the significant groups and features simultaneously, the most powerful regression technique utilized is the sparse group lasso (SGL). The SGL combines the lasso and the group lasso so that sparsity is yielded at both the individual and group feature levels [37]. SGL has been implemented in machine learning, bioinformatics, signal processing, and so forth. A two-layer feature screening technique with dual-level implementation is proposed here. The inactive groups and features are quickly identified by this method, which are guaranteed to have zero coefficients in the solution. To deal efficiently with multiple sparsity-inducing regularizers, a dual-level technique is widely used. Through the framework of Fenchel duality, the dual feasible solution of SGL is developed [38]. The upper bounds should be estimated so that an efficient dual-level technique is developed.
Let ∥·∥_1, ∥·∥, and ∥·∥_∞ denote the ℓ1, ℓ2, and ℓ∞ norms, respectively. The unit ℓ1, ℓ2, and ℓ∞ norm balls in ℝ^n are denoted by B_1^n, B^n, and B_∞^n, respectively. For a set C, int C denotes its interior. Γ_0(ℝ^n) is the class of proper closed convex functions on ℝ^n. The domain of f is the set dom f := {w : f(w) < ∞}. [w]_i denotes the i-th component of w ∈ ℝ^n. G ⊂ {1, 2, ..., n} is an index set, and the corresponding subvector of w is denoted by [w]_G ∈ ℝ^{|G|}, where |G| is the number of elements in G.
Assume that y ∈ ℝ^N is the response vector and X ∈ ℝ^{N×q} is the matrix of features. With the group information available, the SGL problem is expressed as: min_β (1/2)∥y − Σ_g X_g β_g∥^2 + λ_1 Σ_g √n_g ∥β_g∥ + λ_2 ∥β∥_1, (8) where n_g is the number of features in the g-th group, X_g ∈ ℝ^{N×n_g} contains the predictors in the group with the respective coefficient vector β_g, and λ_1, λ_2 are positive regularization parameters. Without loss of generality, assume λ_1 = αλ and λ_2 = λ with α > 0. Equation (8) can therefore be written as: min_β (1/2)∥y − Σ_g X_g β_g∥^2 + λ(α Σ_g √n_g ∥β_g∥ + ∥β∥_1). (9) The dual problem of SGL can be obtained using Lagrangian techniques, subject to the dual feasibility constraints X_g^T θ ∈ D_α^g. The dual feasible set of the lasso is an intersection of closed half-spaces; using Fenchel's duality theorem, the dual feasible set of SGL can be analyzed as well. For every X_g^T θ ∈ D_α^g, Fenchel duality leads to an explicit decomposition X_g^T θ = b_1 + b_2, where b_1 belongs to the scaled ℓ2 ball α√n_g B and b_2 belongs to B_∞. The procedure for developing dual-level feature extraction and selection is expressed in Algorithm 4 as follows.
Algorithm 4: Procedure for developing dual-level feature extraction and selection.
1. The upper bounds on the dual variables are estimated.
2. The two optimization problems are solved.
3. The dual feature screening then identifies the group-level and feature-level coefficients that are guaranteed to be zero, where β*(λ, α) expresses the optimal solution of SGL in (9), giving the best extracted and selected features.
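The screening in Algorithm 4 accelerates the solver without changing the solution; as a reference point for the SGL objective itself, in the form of (9) with λ_1 = αλ and λ_2 = λ, a proximal gradient sketch follows. The composition of the ℓ1 soft-threshold with the groupwise shrinkage is the exact proximal operator of the combined sparse group lasso penalty:

```python
import numpy as np

def sgl_prox_grad(X, y, groups, lam, alpha, iters=2000):
    """Proximal gradient for min_b 0.5*||y - X b||^2
    + lam*(alpha * sum_g sqrt(n_g)*||b_g||_2 + ||b||_1),
    i.e., form (9) with lambda_1 = alpha*lambda and lambda_2 = lambda."""
    n, p = X.shape
    lr = 1.0 / np.linalg.norm(X, 2) ** 2        # step size 1/L for the loss
    b = np.zeros(p)
    for _ in range(iters):
        z = b - lr * X.T @ (X @ b - y)          # gradient step on the loss
        # l1 prox (soft threshold), then group-l2 prox: together these give
        # the exact prox of the combined sparse group lasso penalty
        z = np.sign(z) * np.maximum(np.abs(z) - lr * lam, 0.0)
        for g in groups:
            sqrt_ng = np.sqrt(len(g))
            norm_g = np.linalg.norm(z[g])
            if norm_g > 0:
                z[g] *= max(0.0, 1.0 - lr * lam * alpha * sqrt_ng / norm_g)
        b = z
    return b
```

Because both proximal steps threshold exactly to zero, inactive features and whole inactive groups come out identically zero, which is the sparsity pattern the screening rules exploit.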

Proposed Ridge Regression Technique with Limiting Optimal Weight Scheme
For the dimensionally reduced values, a subset of samples is considered initially, and then ridge regression is trained on these local data. The local dataset is arranged into a feature matrix X_i, where every row is a sample or data point, along with an outcome vector Y_i, where each entry is an outcome. The local ridge regression estimates [39] are computed as: β̂_i = (X_i^T X_i + n_i λ_i I_p)^{−1} X_i^T Y_i, where λ_i is the regularization parameter. Their aggregation is performed by a weighted combination, constructing a single-shot distributed ridge estimator: β̂_dist = Σ_{i=1}^q w_i β̂_i, where q represents the number of sites. Distributed ridge regression can be studied well by a finite-sample analysis of the estimation error in linear models. The standard linear model Y = Xβ + ε is considered here. For n independent samples, the n-dimensional continuous outcome vector is represented as Y ∈ ℝ^n. X is the n × p design matrix holding the values of p features for each sample. The p-dimensional vector of unknown regression coefficients is expressed as β = (β_1, ..., β_p)^T ∈ ℝ^p. This technique is used to predict the outcome variable of future samples and to firmly estimate the respective coefficients. The outcome vector is affected by random noise ε = (ε_1, ..., ε_n)^T ∈ ℝ^n, whose coordinates are assumed to be independent random variables with zero mean and variance σ^2. For estimation and prediction in linear models, ridge regression is the most widely used estimator. The ridge estimator of β is recalled as follows [39]: β̂(λ) = (X^T X + nλ I_p)^{−1} X^T Y, where λ denotes a tuning parameter. Many justifications exist for this estimation: improved estimation is achieved because the coefficients of the ordinary least squares estimator are shrunk.
Supposing that the samples are distributed across q different sites or machines, the data are partitioned as (X_i, Y_i), i = 1, ..., q. Ridge regression estimation is therefore widely used for approximation in a distributed setting. When ridge regression is performed locally on every subset of the data, the focus is on a one-shot weighting technique, and the regression coefficients are finally aggregated by a weighted sum. The weighting technique is utilized as it serves as a useful initialization for iterative techniques; moreover, a variety of new phenomena about one-shot weights can be discovered easily. Therefore, for every dataset (X_i, Y_i), the local ridge estimators are defined with a regularization parameter λ_i and expressed as: β̂_i(λ_i) = (X_i^T X_i + n_i λ_i I_p)^{−1} X_i^T Y_i. The local ridge estimators are combined by a weighted one-shot distributed summation: β̂_dist = Σ_{i=1}^q w_i β̂_i. Unlike ordinary least squares (OLS), the local ridge estimators are always well defined. As the ridge estimators are biased, it is not necessary to impose constraints on the weights. The proposed algorithm works well for designs X with arbitrary covariance structures Σ, and the n samples are assumed to be distributed equally across the sites. A local ridge estimator β̂_i is computed along with the local estimators σ̂_i^2 and α̂_i^2 of the noise level and the SNR. The quantities necessary to find the optimal weights are m, m and λ. The procedure of ridge regression with limiting optimal weights is expressed in Algorithm 5 as follows.
The tuning parameter λ is chosen by a grid search. The limiting optimal weights are thus estimated successfully by this algorithm, and the best features are extracted and selected through this technique.
Input: Data matrices X_i (n_i × p) and outcome vectors Y_i (n_i × 1), (X_i, Y_i), distributed across q sites.
Output: Distributed ridge estimator β̂_dist of the regression coefficients β, indicating the best features extracted and selected.
1. For i ← 1 to q do:
2.    Compute the local ridge estimator β̂_i(λ) = (X_i^T X_i + n_i λ I_p)^(−1) X_i^T Y_i.
3.    Compute the weight w_i(λ) for the i-th local estimator from the local estimates σ̂_i² and α̂_i², and pass β̂_i(λ) and w_i(λ) to the global data center.
4. End for. At the global data center, form the distributed ridge estimator β̂_dist(λ) = Σ_{i=1}^{q} w_i(λ) β̂_i(λ).
5. Select the best tuning parameter λ*.
6. Output the respective distributed ridge estimator β̂_dist(λ*).
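The one-shot distributed procedure can be sketched as follows. The limiting optimal weights of the paper depend on the local noise/SNR estimates and are not reproduced here; as a simplifying assumption, this sketch weights each site proportionally to its sample size.

```python
import numpy as np

def local_ridge(X, Y, lam):
    # Local ridge estimator at one site (step 2 of Algorithm 5).
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

def one_shot_distributed_ridge(sites, lam):
    """One-shot aggregation of local ridge estimators.

    `sites` is a list of (X_i, Y_i) pairs. Weights proportional to the
    local sample size are a simplifying stand-in for the paper's
    limiting optimal weights.
    """
    betas = [local_ridge(X, Y, lam) for X, Y in sites]
    sizes = np.array([X.shape[0] for X, _ in sites], dtype=float)
    w = sizes / sizes.sum()
    return sum(wi * bi for wi, bi in zip(w, betas))

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.0, 0.0, 1.5])
sites = []
for _ in range(4):  # q = 4 sites with equal sample sizes
    X = rng.standard_normal((50, 4))
    Y = X @ beta + 0.1 * rng.standard_normal(50)
    sites.append((X, Y))

beta_dist = one_shot_distributed_ridge(sites, lam=0.01)
```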

Classification Techniques
The features extracted and selected are then fed to the classification stage. The classification techniques employed in this work are the multiclass Gaussian process classification (MGC), the proposed random arbitrary collective classification (RACC), deep learning methods, and other standard conventional techniques. A standard 10-fold cross-validation scheme was utilized for all the implemented pattern recognition and machine learning techniques.
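A 10-fold split as used here can be sketched with plain NumPy (the sample count stands in for the real feature sets):

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(2000, k=10))
# Every sample appears in exactly one held-out fold.
all_test = np.sort(np.concatenate([te for _, te in splits]))
```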

Multiclass Gaussian Process Classification
Multiclass classification problems can be addressed well by Gaussian processes [40], as follows: consider a dataset comprising N instances with X = (x_1, ..., x_N)^T as the observed explanatory attributes and y = (y_1, ..., y_N)^T as the target class labels, where y_i ∈ {1, ..., C} and C > 2 is the number of classes. The primary task of interest is making predictions about the label y* of a new instance x*, given the observed data X and y. Every class label is obtained by utilizing the following labelling rule for multiclass classification with Gaussian processes:

y_i = arg max_{c ∈ {1,...,C}} f_c(x_i),

where f_c(·), for c = 1, ..., C, are latent functions, each corresponding to one of the class labels. The class label is thus obtained by identifying the latent function with the highest value at the data point x_i. Under this labelling rule, the likelihood of the values of the latent functions at a training point is expressed by

p(y_i | f_1(x_i), ..., f_C(x_i)) = ∏_{c ≠ y_i} Θ(f_{y_i}(x_i) − f_c(x_i)),

where Θ(·) denotes the Heaviside step function. By analyzing and marginalizing noise present around the latent functions f_c(·), other likelihood functions, such as the softmax likelihood, are obtained; here, Gaussian noise around each f_c is considered. To account for labelling errors, the actual class label y_i related to x_i could have been replaced with some other class label with a particular probability ε, so the likelihood becomes

p(y_i | f_1(x_i), ..., f_C(x_i)) = (1 − ε) ∏_{c ≠ y_i} Θ(f_{y_i}(x_i) − f_c(x_i)) + (ε / (C − 1)) [1 − ∏_{c ≠ y_i} Θ(f_{y_i}(x_i) − f_c(x_i))].

A GP prior is assumed for every latent function f_c(·) so that multiclass classification with GPs can be addressed. It is represented as p(f_c) ∼ GP(0, k_{θ_c}(·, ·)), where k_{θ_c}(·, ·) represents a covariance function with hyperparameters θ_c.
A famous example of a covariance function is the squared exponential covariance function, represented as

k_{θ_c}(x, x') = σ² exp(−Σ_{j=1}^{d} (x_j − x'_j)² / (2 l_j²)) + σ_0² I[x = x'],

where I[·] is the indicator function and θ_c = {σ², σ_0², (l_j)_{j=1}^{d}} are the hyperparameters. The length scales are represented by the l_j, the amplitude parameter by σ², and the level of the additive Gaussian noise around f_c by σ_0². The hyperparameters of each latent function f_c(·) can be quite different from each other.
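The squared exponential covariance with per-dimension length scales can be sketched as a kernel-matrix computation (parameter names here are illustrative):

```python
import numpy as np

def sq_exp_kernel(X1, X2, amp=1.0, noise=0.1, lengths=None):
    """Squared exponential covariance:
    k(x, x') = amp * exp(-sum_j (x_j - x'_j)^2 / (2 l_j^2)) + noise * I[x == x'].
    """
    d = X1.shape[1]
    lengths = np.ones(d) if lengths is None else np.asarray(lengths)
    diff = X1[:, None, :] - X2[None, :, :]        # pairwise differences
    sq = np.sum((diff / lengths) ** 2, axis=-1)   # scaled squared distances
    K = amp * np.exp(-0.5 * sq)
    K += noise * np.all(diff == 0, axis=-1)       # indicator term I[x == x']
    return K

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = sq_exp_kernel(X, X, amp=1.0, noise=0.1)       # 3 x 3 covariance matrix
```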
The posterior distribution of the latent functions, p(f | y), is computed so that predictions about the potential class label of a new data point x* can be made. This distribution summarizes the latent function values that are compatible with the observed data. Using Bayes' rule, the posterior distribution is computed as

p(f | y) = p(y | f) ∏_{c=1}^{C} p(f_c) / p(y),

where f_c = (f_c(x_1), ..., f_c(x_N))^T and p(f_c) = N(f_c | 0, K_c) is a multivariate Gaussian distribution with zero mean and covariance matrix K_c with (K_c)_{i,j} = k_{θ_c}(x_i, x_j). The normalization constant p(y) = ∫ p(y | f) p(f) df is known as the marginal likelihood; it can be maximized to obtain good values for the model hyperparameters θ_c. Computing the marginal likelihood exactly is difficult, so inference methods are utilized to approximate the posterior. A popular choice is variational inference, whose main advantage is that it transforms the approximate inference problem into an optimization objective, which can be solved easily using stochastic optimization techniques.
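The GP prior and the argmax labelling rule can be illustrated by sampling one latent function per class from a GP and labelling each point by the largest latent value (a toy demonstration, not the variational inference procedure itself):

```python
import numpy as np

def sq_exp(X, length=1.0):
    # Squared exponential kernel matrix over 1-D inputs.
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50)
C = 3                                    # number of classes
K = sq_exp(X) + 1e-8 * np.eye(50)        # jitter for numerical stability

# One latent function f_c ~ GP(0, k) per class, labels via argmax.
F = rng.multivariate_normal(np.zeros(50), K, size=C)   # shape (C, 50)
y = np.argmax(F, axis=0)                               # labelling rule
```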

Proposed Random Arbitrary Collective Classification
For classification tasks, a very popular framework is ensemble classification, which combines the outputs of many weak learners to obtain the ultimate classification [41]. The stability and accuracy of the weak classifiers are greatly improved, usually leading to a higher performance than any individual weak classifier. Here, a random arbitrary collective classification (RACC) is proposed that can be hybridized with any base classifier; the base classifier used here is SVM. RACC is thus a very flexible ensemble classification framework. Assume that the j-th (j ∈ {1, ..., B_1}) weak learner is associated with randomly generated subspaces, from which the optimal one S_j* is chosen based on a criterion described below. The training of the weak learner is performed by utilizing only a portion of the training samples restricted to this subspace S_j*. To form the decision function, the B_1 weak classifiers C_n^{S_1*}, ..., C_n^{S_{B_1}*} are aggregated as

v_n(x) = (1/B_1) Σ_{j=1}^{B_1} C_n^{S_j*}(x), with predicted label 1(v_n(x) > α),

where α indicates a threshold that has to be determined. A flexible framework is admitted here, where any selected classification technique can act as the base classifier, such as LDA, KNN, SVM, QDA, or DT. The ensemble process also yields a ranking of the significance of the variables appearing in the B_1 chosen subspaces (S_j*)_{j=1}^{B_1}. The minimal discriminative set for every learner can be covered easily with this procedure, and the methodology is as follows.
Assume that n pairs of observations {(x_i, y_i), i = 1, ..., n} ∼ (x, y) ∈ X × {0, 1} are present, where X denotes an open subset of ℝ^q, q is a positive integer, and y ∈ {0, 1} is the class label. The whole feature set is denoted S_FULL = {1, ..., q}. For classes 0 (y = 0) and 1 (y = 1), the marginal densities of x are denoted f^(0) and f^(1), and the probability measures they induce are denoted Q^(0) and Q^(1). The joint distribution of (x, y) can be expressed using the mixture model

x | y ∼ f^(y), so that marginally x ∼ π_0 f^(0) + π_1 f^(1),

where y is a Bernoulli variable with success probability π_1 = 1 − π_0 ∈ (0, 1). For any subspace S, |S| expresses its cardinality. The probability measure induced by the marginal distribution of x is Q_x = π_0 Q^(0) + π_1 Q^(1). When restricted to a feature subspace S, the respective marginal densities of classes 0 and 1 are f_S^(0) and f_S^(1). To train each weak learner, B_2 independent arbitrary subspaces S_{j1}, ..., S_{jB_2} are generated. The optimal subspace S_j* is then selected based on an importance criterion, and the weak learner is trained only in S_j*. The procedure is summarized as follows.

1. Generate B_2 random subspaces S_jk independently from the subspace distribution D.
2. Choose the optimal subspace S_j* from the S_jk.
3. Train the weak learner C_n^{S_j*} on the training samples restricted to S_j*, repeating steps 1-3 for j = 1, ..., B_1.
4. Set the threshold α̂.
5. Output the predicted label C_n^RACC(x) = 1(v_n(x) > α̂), together with the chosen proportion of every feature, η = (n_1, ..., n_q)^T.

A hierarchical uniform distribution is used as the subspace distribution D: the subspace size d is first generated from the uniform distribution over {1, ..., D}, and a subspace of that size is then drawn uniformly. The subspace distribution could be adjusted if sufficient details about the data structure are available.
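The steps above can be sketched as a random-subspace ensemble. To keep the sketch dependency-free, a nearest-centroid rule replaces the SVM base learner, and subspace "optimality" is scored by between-class centroid separation; both are simplifying assumptions, not the paper's exact criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

def centroid_classifier(Xtr, ytr, Xte):
    # Simple base learner (nearest class centroid); the paper uses SVM.
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    d0 = np.linalg.norm(Xte - c0, axis=1)
    d1 = np.linalg.norm(Xte - c1, axis=1)
    return (d1 < d0).astype(int)

def racc_predict(Xtr, ytr, Xte, B1=25, B2=5, alpha=0.5):
    """Simplified RACC-style ensemble: for each of B1 learners, draw B2
    candidate subspaces (sizes uniform over {1,...,q}) and keep the one
    with the largest between-class separation."""
    q = Xtr.shape[1]
    votes = np.zeros(len(Xte))
    for _ in range(B1):
        best, best_sep = None, -1.0
        for _ in range(B2):
            d = rng.integers(1, q + 1)                 # subspace size
            S = rng.choice(q, size=d, replace=False)   # random subspace
            sep = np.linalg.norm(Xtr[ytr == 0][:, S].mean(0)
                                 - Xtr[ytr == 1][:, S].mean(0))
            if sep > best_sep:
                best, best_sep = S, sep
        votes += centroid_classifier(Xtr[:, best], ytr, Xte[:, best])
    return (votes / B1 > alpha).astype(int)            # 1(v_n(x) > alpha)

# Two well-separated synthetic classes.
X0 = rng.standard_normal((60, 8)) - 2.0
X1 = rng.standard_normal((60, 8)) + 2.0
Xtr = np.vstack([X0, X1])
ytr = np.r_[np.zeros(60, int), np.ones(60, int)]
pred = racc_predict(Xtr, ytr, Xtr)
```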

LSTM Recurrent Network
LSTM is one of the well-known recurrent neural networks for time series [42] and is highly useful for predicting the time series of important events. Historical information can be easily retained by this network, so the learning of long-term dependencies is readily realized. An input gate, a forget gate, and an output gate are the gates most commonly contained in an LSTM block, and a cell unit is utilized to update and retain the historical information. Figure 2 shows the structure of an LSTM block.

A simple single neuron controls the forget gate f_t in the LSTM memory block; it assesses which historical information must be retained or discarded, thereby enabling historical information storage. The input gate i_t is the part where the current neurons and the previous memory unit jointly affect the LSTM block; it is activated to assess the historical information of the block. Using a tanh neuron, the candidate update content c_in is calculated. The current memory cell state value c_t is computed by utilizing the current candidate cell c_in, the input gate information i_t, the forget gate information f_t, and the previous time state c_{t−1}. The output gate generates o_t for the LSTM block at the current time, and the output a_t determines the amount of information about the current cell state that is exposed.
The activation of every gate and the update of the current cell state are calculated as follows:

f_t = σ(W_f · [a_{t−1}, x_t] + b_f),
i_t = σ(W_i · [a_{t−1}, x_t] + b_i),
c_in = tanh(W_c · [a_{t−1}, x_t] + b_c),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c_in,
o_t = σ(W_o · [a_{t−1}, x_t] + b_o),
a_t = o_t ⊙ tanh(c_t),

where σ(·) is the logistic sigmoid, ⊙ denotes element-wise multiplication, and the W and b terms are the trainable weights and biases. For each position, the hidden vector is computed, and the last hidden vector is taken as the EEG signal representation. It is fed to a linear layer, and finally a softmax output layer is utilized to classify the EEG. A four-layer LSTM architecture was used in this paper, which includes an input layer, an LSTM layer, and two fully connected (FC) layers. The proposed LSTM for EEG signal feature extraction and classification is illustrated in Figure 3.
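The gate updates described above can be sketched as a single LSTM time step in NumPy (randomly initialized weights, purely illustrative; the 64-cell size matches the configuration used later in this work):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, W, b):
    """One LSTM time step; W/b hold the four gate weight matrices/biases."""
    z = np.concatenate([a_prev, x_t])       # [a_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate
    c_in = np.tanh(W["c"] @ z + b["c"])     # candidate cell content
    c_t = f_t * c_prev + i_t * c_in         # cell state update
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate
    a_t = o_t * np.tanh(c_t)                # block output
    return a_t, c_t

rng = np.random.default_rng(0)
n_h, n_x = 64, 16                            # 64 LSTM cells, 16-dim input
W = {g: 0.1 * rng.standard_normal((n_h, n_h + n_x)) for g in "fico"}
b = {g: np.zeros(n_h) for g in "fico"}

a, c = np.zeros(n_h), np.zeros(n_h)
for t in range(30):                          # run over a 30-step sequence
    x_t = rng.standard_normal(n_x)
    a, c = lstm_step(x_t, a, c, W, b)        # final `a` = sequence representation
```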

Focal Loss
One of the most effective ways to deal with imbalanced datasets is focal loss [43], obtained by transforming the cross-entropy (CE) loss function. CE is computed as

CE(ŷ) = −log(ŷ),

where ŷ is the model's estimated probability for the true class. Focal loss is a dynamically scaled CE, where the scaling factor decays to zero as the confidence of the classification increases. This scaling factor automatically down-weights the contribution of easy EEG examples so that the model training focuses on the hard examples. FL is computed as

FL(ŷ) = −(1 − ŷ)^γ log(ŷ),

where (1 − ŷ)^γ denotes the modulating factor and γ the focusing parameter. When an EEG example is misclassified and the value of ŷ is very small, the modulating factor is close to 1 and the loss is barely affected; for well-classified examples the factor approaches 0 and their loss is down-weighted. For the network parameters, optimization is important; many gradient descent optimization algorithms exist, such as Adam, Nadam, Adagrad, and Adadelta. In this work, Adam is utilized.
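The focal loss above is straightforward to implement; setting γ = 0 recovers the ordinary cross-entropy, which gives a quick sanity check:

```python
import numpy as np

def focal_loss(y_hat, gamma=2.0, eps=1e-12):
    """Focal loss FL = -(1 - y_hat)^gamma * log(y_hat).

    `y_hat` is the predicted probability of the true class;
    gamma = 0 recovers the plain cross-entropy loss.
    """
    y_hat = np.clip(y_hat, eps, 1.0)
    return -((1.0 - y_hat) ** gamma) * np.log(y_hat)

probs = np.array([0.05, 0.5, 0.95])          # hard -> easy examples
ce = focal_loss(probs, gamma=0.0)            # plain cross-entropy
fl = focal_loss(probs, gamma=2.0)            # gamma = 2, as in this work
```

With γ = 2, the easy example (ŷ = 0.95) is scaled by (0.05)² = 0.0025 of its CE value, while the hard example (ŷ = 0.05) keeps about 90% of it.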

Dataset Description
The sleep-EDF database contains raw physiological data in 61 recordings taken from 42 Caucasian subjects [44,45]. The first 39 recordings come from 20 healthy volunteers (SC-PSG.edf files) without any sleep-related disease; there were 10 males and 10 females, aged between 25 and 34 years at the time of recording. The remaining 22 recordings were obtained from 22 participants (ST-PSG.edf files), 7 males and 15 females aged 18-79 years, who had difficulty falling asleep. Dual-channel EEG from Fpz-Cz and Pz-Oz is included in this database with a sampling rate of 100 Hz, along with many other physiological signals, such as EMG, EOG, and oronasal respiration. The dual-channel EEG data are utilized in this work as they are effective for sleep stage classification. The 30 s epochs were scored manually based on the R and K standard, with the primary annotations AWA, REM, S1, S2, S3, S4, 'Movement Time', and 'Unscored'. The total number of samples per class under the R and K criteria is given in Table 1; after the Movement Time and Unscored categories are discarded, the total number of samples is 127,658.

Table 1. Total number of samples in the sleep-EDF dataset (R and K criteria).

Classes   AWA      REM      S1      S2       S3      S4
6         74,827   11,848   4848    27,292   5070    3773
5         74,827   11,848   4848    27,292   8843 (S3+S4)
4         74,827   11,848   32,140 (S1+S2)   8843 (S3+S4)
3         74,827   11,848   40,983 (S1-S4)
2         74,827   52,831 (REM+S1-S4)
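The class mergings in Table 1 can be checked arithmetically (Movement Time and Unscored are already excluded from these counts):

```python
# Per-class sample counts from Table 1 (6-class row, R and K criteria).
counts = {"AWA": 74827, "REM": 11848, "S1": 4848,
          "S2": 27292, "S3": 5070, "S4": 3773}

total = sum(counts.values())                 # overall sample count
s34 = counts["S3"] + counts["S4"]            # 5-class row merges S3+S4
s12 = counts["S1"] + counts["S2"]            # 4-class row merges S1+S2
s1234 = s12 + s34                            # 3-class row merges S1-S4
non_awa = counts["REM"] + s1234              # 2-class row: AWA vs. the rest
```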

Number of Classes
Once the clustering is done, a total of 90,000 samples are obtained, and after dimensionality reduction, a total of 30,000 samples remain, as the dimensionality is reduced by a factor of three. When the feature extraction and selection techniques are applied, a total of 2000 samples are selected, and these are finally fed to classification under a 10-fold cross-validation scheme. For the deep learning model, all 90,000 samples are provided to it directly after clustering, and the classification results are obtained. The hyperparameters set for the LSTM deep learning model are as follows: the number of LSTM cells is 64, the number of network layers is 4, the optimizer is Adam, the dropout rate is 0.1 (after several trial-and-error experiments), the batch size is 128, the cost function is focal loss, and the value of the focusing parameter γ is set to 2, again after several trial-and-error experiments. Table 2 shows the results of hierarchical clustering with the SVD-based spectral algorithm dimensionality reduction technique and its performance analysis with SGL-DLI and RR-LWS for the different classifiers. When MGC is utilized, a high classification accuracy of 97.73% is obtained for two classes, 94.43% for three classes, 93.73% for four classes, 92.73% for five classes, and 92.16% for six classes under the SGL-DLI technique. Similarly, when RACC is utilized, a high classification accuracy of 97.96% is obtained for two classes, 94.56% for three classes, 92.99% for four classes, 92.96% for five classes, and 92.72% for six classes under the SGL-DLI technique. Likewise, when MGC is utilized, a high classification accuracy of 96.68% is obtained for two classes, 92.84% for three classes, 92.31% for four classes, 92.67% for five classes, and 91.45% for six classes under the RR-LWS technique.
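The stated LSTM hyperparameters can be collected into a single configuration block (a framework-agnostic sketch; the key names are illustrative):

```python
# LSTM hyperparameters as reported in this work (key names illustrative).
lstm_config = {
    "lstm_cells": 64,        # number of LSTM cells
    "num_layers": 4,         # input layer, LSTM layer, and two FC layers
    "optimizer": "adam",
    "dropout_rate": 0.1,     # chosen by trial and error
    "batch_size": 128,
    "loss": "focal_loss",
    "focal_gamma": 2,        # focusing parameter gamma
}
```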
Similarly, when RACC is utilized, a high classification accuracy of 97.55% is obtained for two classes, 91.78% for three classes, 93.56% for four classes, 91.34% for five classes, and 92.12% for six classes under the RR-LWS technique. All the present results surpassed the previous results to a great extent. Table 3 shows the results of spectral clustering with the SVD-based spectral algorithm dimensionality reduction technique and its performance analysis with SGL-DLI and RR-LWS for the different classifiers. When MGC is utilized, a high classification accuracy of 95.87% is obtained for two classes and 92.56% for three classes; overall, with suitable classifiers, this methodology reaches 93.81% for four classes, 93.07% for five classes, and 93.51% for six classes. Table 8 shows the results of the clustering methodology with the deep learning LSTM method. For the two-class classification, the accuracies produced are 96.35% for hierarchical clustering, 97.85% for spectral clustering, and 97.38% for subspace clustering. For the three-class classification, the accuracies are 95.11% for hierarchical clustering, 95.99% for spectral clustering, and 96.78% for subspace clustering. For the four-class classification, the accuracies are 94.71% for hierarchical clustering, 95.37% for spectral clustering, and 96.42% for subspace clustering. For the five-class classification, the accuracies are 90.65% for hierarchical clustering, 93.35% for spectral clustering, and 93.22% for subspace clustering. For the six-class classification, the accuracies are 90.42% for hierarchical clustering, 92.31% for spectral clustering, and 92.47% for subspace clustering.

Performance Comparison with Previous Works
The results obtained in this work are compared with previous works in Table 9. In the literature, only one recently published paper reported results from the same two channels (Pz-Oz and Fpz-Cz), so the present results were compared with it. Most papers have concentrated on a single channel of EEG or EOG, and some have combined EEG and EOG as reported in [21]; the current results therefore cannot be compared with them. Moreover, many results have utilized the extended version of the sleep-EDF database, which contains about 197 recordings and was released in 2018. It is a very large database that is rarely analyzed as a whole; most reports have analyzed only a small subset of it. Therefore, the current results cannot be compared with those either, as that database is a completely different, extended version of the one used here. With these points in mind, hierarchical clustering with the SVD-based spectral algorithm and suitable classifiers produced a classification accuracy of 97.96% for two classes, 94.56% for three classes, 93.73% for four classes, 92.96% for five classes, and 92.72% for six classes. Spectral clustering with the SVD-based spectral algorithm and suitable classifiers produced a classification accuracy of 95.87% for two classes, 92.56% for three classes, 93.81% for four classes, 93.07% for five classes, and 93.51% for six classes. Subspace clustering with the SVD-based spectral algorithm and suitable classifiers produced a classification accuracy of 98.41% for two classes, 97.56% for three classes, 96.61% for four classes, 94.78% for five classes, and 93.12% for six classes. Hierarchical clustering with VBMF produced a classification accuracy of 96.11% for two classes, 92.22% for three classes, 91.45% for four classes, 90.03% for five classes, and 92.24% for six classes.
Spectral clustering with VBMF produced a classification accuracy of 93.66% for two classes, 90.55% for three classes, 90.15% for four classes, 90.34% for five classes, and 89.65% for six classes. Subspace clustering with VBMF produced a classification accuracy of 97.33% for two classes, 96.98% for three classes, 95.85% for four classes, 93.84% for five classes, and 92.03% for six classes. The clustering methodology with the deep learning LSTM produced a classification accuracy of 97.85% for two classes, 96.78% for three classes, 96.42% for four classes, 93.35% for five classes, and 92.47% for six classes. All the results obtained surpassed the previous results, which shows that the present work is quite a versatile methodology. The statistical significance of the results was also analyzed. Cohen's kappa coefficient was computed for the extracted and selected features, and the values ranged from 0.6 to 1, indicating good, and sometimes very good, agreement. The Friedman test was also conducted, and distinct values were obtained, confirming the distinctness of the selected features. The standard two-sided Wilcoxon test was conducted as well, and the obtained p-value was less than 0.05 in our experiment, indicating that the results are statistically significant at a high confidence level.

Conclusions and Future Work
Sleep disorders are a very common symptom of many neurological disorders and greatly affect the quality of life. Some common sleep disorders are insomnia, narcolepsy, sleep-related breathing disorders, and sleep-related movement disorders. PSG recordings are the physiological signals of a subject obtained during an entire night of sleep; as PSG is a multivariate system, signal recordings such as EEG, ECG, EOG, and EMG are found in it. Once the recording is done, sleep experts evaluate and grade the sleep stages by scoring the PSG recordings. The manual determination of sleep stages through visual inspection of PSG signals is very complex and costly, and detecting EEG signal variations is hard due to their random and chaotic nature. As a result, automated sleep detection systems are developed to assist the experts. The most widely used PSG signals for sleep stage classification are one or more channels of EEG data; EEG is widely preferred as it can be obtained using wearable technologies and carries the most important information. In the EEG signal processing phase, factors such as dimensionality reduction, feature extraction, and feature selection are quite important, and on this basis, a novel methodological workflow is proposed in this paper. Three clustering techniques, followed by two dimensionality reduction techniques and two feature extraction cum selection techniques, were utilized and classified with around 10 classifiers to conduct an exhaustive performance analysis. Among all the results, the best were obtained when subspace clustering with the SVD-based spectral algorithm and suitable classification was performed for the two-class classification problem, reporting a classification accuracy of 98.41%.
Future work aims to explore many other modified versions of clustering algorithms, dimensionality reduction schemes, and feature extraction techniques, along with other deep learning techniques, to obtain a higher classification accuracy and a faster execution time with easier applicability.

Data Availability Statement: All the programming codes developed can be obtained upon request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.