Efficient Noisy Sound-Event Mixture Classification Using Adaptive-Sparse Complex-Valued Matrix Factorization and OvsO SVM

This paper proposes a solution for classifying sound events from a single noisy mixture that consists of two major steps: sound-event separation and sound-event classification. Traditional complex nonnegative matrix factorization (CMF) is extended with an optimal adaptive L1 sparsity to decompose a noisy single-channel mixture. The proposed adaptive L1-sparsity CMF algorithm encodes the spectral patterns and estimates the phases of the original signals in the time-frequency representation, and these features make the temporal decomposition process more efficient. A support vector machine (SVM) with a one-versus-one (OvsO) strategy is then applied to a mean supervector to assign each demixed sound to its matching sound-event class. The first step of the multi-class SVM (MSVM) method is to segment the separated signal into blocks with a sliding window and to encode three features for each block: Mel frequency cepstral coefficients, short-time energy, and short-time zero-crossing rate. The mean supervector is computed from these features and learned over the multiple sound-event classes by the OvsO SVM. The proposed method was evaluated in both separation and classification scenarios on real-world single-channel recordings and compared with state-of-the-art separation methods. Experimental results confirm that the proposed method outperforms these methods.


Introduction
Surveillance systems have become increasingly ubiquitous in our living environment. These systems are used in various applications, including CCTV for traffic and site monitoring, and navigation. Automated surveillance is currently based on the video sensory modality and machine intelligence. Recently, intelligent audio analysis has been incorporated into surveillance to improve monitoring via the detection, classification, and recognition of sounds in a scene. In real-world situations, however, background noise interferes with both the image and the sound of a surveillance system and hinders its performance. Hence, an automatic signal separation and event classification algorithm is proposed to improve the surveillance system by classifying the observed sound events in noisy scenarios. The proposed noisy sound separation and event classification method is based on two approaches (i.e., blind signal separation and sound classification), which are introduced in the following sections.
The classical problem of blind source separation (BSS), the so-called "cocktail party problem", is a psycho-acoustic phenomenon that refers to the remarkable human-auditory ability to selectively focus on and identify a sound source, such as a particular speaker, within a scene. The interference is produced by competing speech sounds or a variety of noises that are often assumed to be independent of each other. In the case where only a single microphone is available, this reduces to single-channel blind source separation (SCBSS) [1][2][3][4]. The majority of SCBSS algorithms work in the time-frequency domain, for example, binary masking [5][6][7] or nonnegative matrix factorization (NMF) [8][9][10][11]. NMF has been continuously developed with great success for decomposing the underlying original signals when only a sole sensor is available. NMF was first developed using the multiplicative update (MU) algorithm to solve its parameter optimization based on a cost function such as the Kullback-Leibler divergence or the least-squares distance. Later, other families of cost functions were proposed, for example, the Beta divergence [12], Csiszar's divergences, and the Itakura-Saito divergence [13]. Additionally, iterative gradient updates were presented in which a sparsity constraint can be included in the objective function through regularization, either by minimizing penalized least squares [14] or by using different sparsity constraints for the dictionary and the code [15]. Complex nonnegative matrix factorization (CMF) extends the NMF model by combining a sparse representation with the complex-spectrum domain to improve audio separability. CMF can extract the recurrent patterns of the phase estimates and magnitude spectra of the constituent signals [16][17][18]. Nevertheless, CMF lacks a general mechanism for controlling the sparseness of the code: in the methods above, the sparsity parameter must be determined manually.
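To make the role of the L1 penalty in the multiplicative updates concrete, the following is a minimal NumPy sketch of sparse NMF with a least-squares cost. This is a generic Hoyer-style formulation, not the paper's CMF algorithm; the function name and default parameters are illustrative choices.

```python
import numpy as np

def sparse_nmf(V, K, lam=0.1, n_iter=200, eps=1e-9, seed=0):
    """Sparse NMF, V (F x T) ~= W (F x K) @ H (K x T), least-squares cost
    plus an L1 penalty lam * sum(H) on the temporal code."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        # Multiplicative update for H; the L1 weight lam enters the denominator
        H *= (W.T @ V) / (W.T @ (W @ H) + lam + eps)
        # Standard multiplicative update for W, then normalize its columns
        # so the sparsity penalty acts on H alone
        W *= (V @ H.T) / ((W @ H) @ H.T + eps)
        W /= W.sum(axis=0, keepdims=True) + eps
    return W, H
```

Multiplying each factor by a ratio of nonnegative terms keeps W and H nonnegative throughout, which is the defining property of the MU scheme.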
Appropriate sparsity is an important consideration, as sparse representations capture the essential information. Many sparse solutions have been proposed in the last decade [19][20][21][22][23][24][25]. Nonetheless, the optimal sparse solution remains an open issue.
Sound event classification (SEC) has been widely studied. Sound can be categorized into speech, music, noise, environmental sound, or daily living sound [26]. Sound events occur in all of these classes, for example, a car horn, traffic, walking, or knocking [27,28]. Sound events carry significant information that can be used to describe what has happened or to predict what will happen next. Most SEC algorithms are adapted from sound classification approaches such as sparse coding, deep learning, and the support vector machine (SVM). These approaches have been used to categorize sound events in both indoor and outdoor scenarios. In recent years, the deep learning approach has been used to classify sound events. A deep learning framework can be established with two convolutional neural networks (CNNs) and a deep multi-layer perceptron (MLP) with rectified linear units (ReLU) as the activation function [29,30]. A softmax function, as the final activation, is used to classify the sound into its corresponding class; the softmax function can be considered a generalization of the logistic function to multiple classes. One advantage of deep learning is that it does not require hand-crafted feature extraction for the input sound. However, a deep neural network requires large training samples and, despite a plethora of research, there is a general consensus that deep neural networks remain difficult to fine-tune and to generalize to test data. Moreover, a deep network does not readily explain why a certain decision is made. Apart from the deep learning framework, another SEC approach is the support vector machine [31,32], which has been applied to classification problems in various fields. The SVM algorithm relies on supervised learning, using the fundamental concepts of statistical learning and risk minimization.
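The relation between the softmax and the logistic function mentioned above can be seen in a few lines (a generic NumPy sketch, not tied to the paper's networks):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max before exponentiating."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

# For two classes, softmax([a, 0]) reduces to the logistic (sigmoid) of a,
# which is why softmax is viewed as its multi-class generalization.
```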
The main process of the SVM is to place the optimal separating hyperplane as the decision boundary such that the margin of separation between classes is maximized. The SVM is a supervised learning algorithm comprising two stages: (1) a training stage to model the feature space and an optimal hyperplane, and (2) a testing stage that uses the SVM model to classify the observed data. The margin denotes the distance between the closest instance and the hyperplane. SVM has two desirable properties. First, it requires only a hyperplane to discriminate two classes, and that hyperplane can be constructed to suit an individual problem, even in the nonlinear case, by selecting a kernel. Second, SVM provides a unique solution, since it solves a convex optimization problem.
The rest of this paper is organized as follows. Section 2 presents the proposed noisy sound separation and event classification method. Next, Section 3 demonstrates and analyzes the performance of the proposed method. Finally, conclusions are drawn in Section 4.

Background
Noisy mixed signals observed via a recording device can be stated as y(t) = s_1(t) + s_2(t) + n(t), where s_1(t) and s_2(t) denote the original sounds and n(t) is noise. This research focuses on two sound events in a single recorded signal. The proposed method consists of two steps, noisy sound separation and sound event classification, as illustrated in Figure 1, where y(t) and Y(ω, t) denote a sound-event mixture in the time domain and time-frequency domain, respectively. The terms W_k(ω), H_k(t), and φ_k(ω, t) are the spectral basis, temporal code (weight matrix), and phase information, respectively. The term λ_k(t) represents the sparsity, and ŝ_j(t) is an estimated sound-event source. The abbreviations MFCC, STE, and STZCR stand for Mel frequency cepstral coefficients, short-time energy, and short-time zero-crossing rate, respectively. The proposed method is elaborated step by step in the following parts.
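The mixing model above can be made concrete with synthetic stand-in signals (all values here, e.g., the 16 kHz rate, the tone frequencies, and the 20 dB SNR, are illustrative, not the paper's data):

```python
import numpy as np

fs = 16_000                       # sampling rate (Hz); illustrative value
t = np.arange(fs) / fs            # one second of samples

s1 = 0.8 * np.sin(2 * np.pi * 440 * t)          # stand-in for event s1(t)
s2 = 0.5 * np.sign(np.sin(2 * np.pi * 3 * t))   # stand-in for event s2(t)

rng = np.random.default_rng(0)
snr_db = 20
sig = s1 + s2
noise = rng.standard_normal(len(t))
# Scale the noise so the mixture sits at the target SNR
noise *= np.sqrt(np.mean(sig**2) / (10**(snr_db / 10) * np.mean(noise**2)))
y = sig + noise                   # observed mixture y(t) = s1(t) + s2(t) + n(t)
```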

Single-Channel Sound Event Separation
The problem formulation in the time-frequency (TF) representation is given by an observed complex spectrum, Y(ω, t) ∈ C, from which the optimal parameters θ = {W, H, φ} of the model are estimated. A new factorization algorithm, named the adaptive L_1-sparse complex non-negative matrix factorization (adaptive L_1-SCMF), is derived in the following section. The generative model is Y(ω, t) = Σ_k W_k(ω) H_k(t) Z_k(ω, t) + ε(ω, t), where Z_k(ω, t) = e^{jφ_k(ω,t)} and the reconstruction error ε(ω, t) ∼ N_C(0, σ²) is assumed to be independently and identically distributed (i.i.d.) white noise with zero mean and variance σ². The term ε(ω, t) denotes the modeling error for each source. The likelihood of θ = {W, H, φ} is written accordingly. It is assumed that the prior distributions for W, H, and φ are independent, which yields P(θ) = P(W) P(H|λ) P(φ). The prior P(H|λ) corresponds to the sparsity cost, for which a natural choice is a generalized Gaussian prior. When p = 1, P(H|λ) promotes L_1-norm sparsity. L_1-norm sparsity has been shown to be probabilistically equivalent to the pseudo-norm L_0, which is the theoretically optimal sparsity [33,34]. However, the L_0-norm is non-deterministic polynomial-time (NP) hard to optimize and is not practical for large datasets such as audio. Given Equation (3), the posterior density [35,36] defines a maximum a posteriori probability (MAP) estimation problem, which leads to minimizing the following objective with respect to θ. The equations for the Gaussian prior and the maximum a posteriori probability (MAP) estimation are expressed in Appendix A.
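For concreteness, the generative model and the MAP objective implied by the text can be written out as follows. This is a reconstruction consistent with the symbols defined above (a Kameoka-style CMF objective with a generalized Gaussian sparsity prior), not a verbatim copy of the paper's numbered equations.

```latex
% Generative model (cf. the text around Equation (1))
Y(\omega,t) \;=\; \sum_{k=1}^{K} W_k(\omega)\, H_k(t)\, e^{j\phi_k(\omega,t)} \;+\; \epsilon(\omega,t)

% MAP objective minimized over \theta = \{W, H, \phi\} (cf. Equation (4)), with p = 1
f(\theta) \;=\; \sum_{\omega,t} \Bigl|\, Y(\omega,t) - \sum_{k} W_k(\omega)\, H_k(t)\, e^{j\phi_k(\omega,t)} \Bigr|^{2}
\;+\; 2\sigma^{2} \sum_{k,t} \lambda_k(t)\, \bigl|H_k(t)\bigr|^{p}
```

The first term is the negative log-likelihood under the complex Gaussian noise model; the second is the negative log of the sparsity prior, which for p = 1 is the L_1 penalty discussed above.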
The CMF parameters are updated through an iterative process using an efficient auxiliary function. An auxiliary function f⁺(θ, θ̄) for f(θ) is any function of the auxiliary variables θ̄ that satisfies f(θ) ≤ f⁺(θ, θ̄), with equality when θ̄ = θ.

Estimation of the Spectral Basis and Temporal Code
In Equation (4), the update rules for θ are derived by differentiating f⁺(θ, θ̄) partially with respect to W_k(ω) and H_k(t) and setting the derivatives to zero. The update rule for the phase, φ_k(ω, t), can be derived by reformulating Equation (4) into a phase-dependent term plus a term A that collects all quantities irrelevant to φ_k(ω, t). The derivation of Equation (9) is elucidated in Appendix B.

Estimation of L 1 -Optimal Sparsity Parameter λ k (t)
This section aims to facilitate spectral dictionaries with adaptive sparse coding. First, the CMF model is rewritten in vectorized form using the phase terms e^{jφ_1(t)}, ..., e^{jφ_K(t)}, where "⊗" and "∘" denote the Kronecker product and the Hadamard product, respectively, vec(·) denotes column vectorization, and I is the identity matrix. The goal is then to compute the regularization parameter λ_k(t) associated with each H_k(t). To this end, the parameter p in Equation (4) is set to 1 to obtain an expression that is linear in λ_k(t). Taking the noise variance σ² into account, Equation (4) can be rewritten concisely in terms of the vectors h and λ of dimension R × 1 (i.e., R = F × T × K), where the superscript 'T' denotes the complex Hermitian transpose (i.e., vector or matrix transpose followed by complex conjugation). The expectation-maximization (EM) algorithm is used to determine λ, with h as the hidden variable, so that the log-likelihood function can be optimized with respect to λ. By applying Jensen's inequality, the log-likelihood admits a lower bound for any distribution Q(h) [12]. One can verify that the distribution maximizing the right-hand side of Equation (15) is the posterior distribution of h, which is proposed in the form of the Gibbs distribution. The term F(h) in Equation (16), the energy function of the Gibbs distribution, is essential for simplifying the adaptive optimization of λ. The maximum-likelihood (ML) estimation of λ can then be decomposed accordingly. In the same way, each element of H is required to be exponentially distributed with an independent decay parameter, which gives p(h|λ) = Π_g λ_g exp(−λ_g h_g); substituting this into Equation (17) yields Equation (19). The term h denotes the dependent variable of the distribution Q(h), whereas the other parameters are assumed to be constant.
As such, the λ optimization in Equation (19) is derived by differentiating the terms inside the integral with respect to h. As a result, the functional optimization [37] of λ is obtained for g = 1, 2, ..., R, where λ_g denotes the g-th element of λ. Notice that the solution h naturally splits its elements into two distinct subsets, h_M and h_P, consisting of the components m ∈ M such that h_m > 0 and the components p ∈ P such that h_p = 0. The sparsity parameter is then obtained as presented in Equation (21), together with its covariance in Equation (23). The core procedure of the proposed CMF method is based on these L_1-optimal sparsity parameters. The estimated sources are recovered by multiplying the respective rows of the W_k(ω) components with the corresponding columns of the H_k(t) weights and the time-varying phase spectrum e^{jφ_k(ω,t)}. The separated source ŝ_j(t) is obtained by converting the time-frequency representations of the sources back into the time domain. The derivation of the L_1-optimal sparsity parameter is elucidated in Appendix C.
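The reconstruction step described here (rows of W_k(ω) times columns of H_k(t), with the estimated phase reattached) can be sketched in NumPy; the array shapes chosen below are our assumption:

```python
import numpy as np

def reconstruct_source(W, H, phi, k):
    """Rebuild source k's complex spectrogram: the outer product of its
    spectral basis column W[:, k] and temporal code row H[k, :] gives the
    magnitude, and the estimated phase phi[:, :, k] is reattached."""
    mag = np.outer(W[:, k], H[k, :])            # F x T magnitude
    return mag * np.exp(1j * phi[:, :, k])      # complex TF source

def reconstruct_mixture(W, H, phi):
    """Summing the K reconstructed sources approximates the mixture model."""
    K = W.shape[1]
    return sum(reconstruct_source(W, H, phi, k) for k in range(K))
```

The time-domain signal ŝ_j(t) would then follow from an inverse STFT of each reconstructed spectrogram.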

Sound Event Classification
Once the separated sound signal is obtained, the next step is to identify the sound event. A multiclass support vector machine (MSVM) is employed to achieve this goal. The MSVM comprises two phases: a learning phase and an evaluation phase. The MSVM is based on the one-versus-one (OvsO) strategy, which splits the observed c classes into c(c−1)/2 binary classification sub-problems. To train the ℓ-th MSVM model, the MSVM constructs hyperplanes that discriminate each observed datum into its corresponding class by executing the series of binary classifications. Starting from the learning phase, sound signatures are extracted from the training dataset in the time-frequency domain. The sound signatures studied in this research were the Mel frequency cepstral coefficients (MFCC: MF), the short-time energy (STE: E(t)), and the short-time zero-crossing rate (STZCR: STZ(t)), each computed over a windowing function f_w(t). The training signals are segmented into small blocks, and the three signature parameters are extracted from each block. The mean supervector is then computed as the average of each feature over all blocks of a sound-event input. Thus, the mean feature supervector O is paired with its corresponding sound-event-label vector w (i.e., ψ(O, w)) and supplied to the MSVM model. In the discriminant function, O_{i|β}, w_i, i = 1, ..., M represent the i-th separated sound signals; a weight vector α_ℓ is employed for each class to compute a discriminant score for O; the index i refers to the block order β; and the function α_ℓ^T ψ(O, w; β) measures the linear discriminant distance between the hyperplane and the feature vector extracted from the observed data.
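Two of the three block features (STE and STZCR) and the mean supervector can be sketched directly in NumPy. MFCCs, the third feature, would normally come from an audio library and are omitted here; the block and hop lengths are illustrative:

```python
import numpy as np

def short_time_energy(block):
    """STE: mean squared amplitude of the (windowed) block."""
    return float(np.mean(block ** 2))

def short_time_zcr(block):
    """STZCR: fraction of consecutive samples whose sign changes."""
    signs = np.sign(block)
    signs[signs == 0] = 1
    return float(np.mean(signs[1:] != signs[:-1]))

def mean_supervector(x, block_len, hop):
    """Slide a window over x, extract [STE, STZCR] per block, and average
    the per-block features into one mean supervector. (MFCCs, which the
    paper also uses, would be appended here via an audio library.)"""
    feats = []
    for start in range(0, len(x) - block_len + 1, hop):
        b = x[start:start + block_len] * np.hanning(block_len)
        feats.append([short_time_energy(b), short_time_zcr(b)])
    return np.mean(np.array(feats), axis=0)
```

Averaging over blocks gives one fixed-length vector per sound event regardless of its duration, which is what makes the supervector suitable as SVM input.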
For the MSVM based on the OvsO strategy, the hyperplane between the ℓ-th class and the others, α_ℓ^T ψ(O, w; β) + b_ℓ, is learned by minimizing over α_ℓ and ξ, where ξ_i denotes a penalty (slack) term that trades off a large margin against a small error penalty. The optimal hyperplane is determined by minimizing (1/2)||α_ℓ||² subject to the condition α_ℓ^T ψ(O, w; β) + b_ℓ ≥ 1 − ξ_i. If the conditional term is greater than 1 − ξ_i, then the estimated sound event belongs to the ℓ-th class; otherwise, it is assigned to another class.
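The OvsO construction, c(c−1)/2 pairwise hyperplanes plus majority voting, can be sketched with a minimal hinge-loss linear classifier. This is a self-contained NumPy stand-in for illustration, not the kernel SVM used in the paper:

```python
import numpy as np
from itertools import combinations

def train_binary_svm(X, y, lam=1e-3, lr=0.05, epochs=300):
    """Linear SVM (hinge loss + L2 penalty) trained by full-batch
    subgradient descent. Labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1]); b = 0.0
    n = len(X)
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                                  # margin violators
        grad_w = lam * w - (y[mask, None] * X[mask]).sum(axis=0) / n
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def ovo_fit(X, y):
    """One-vs-one: train c(c-1)/2 pairwise classifiers."""
    models = {}
    for a, c in combinations(sorted(set(y)), 2):
        idx = (y == a) | (y == c)
        ypm = np.where(y[idx] == a, 1.0, -1.0)
        models[(a, c)] = train_binary_svm(X[idx], ypm)
    return models

def ovo_predict(models, x):
    """Majority vote across the pairwise hyperplanes."""
    votes = {}
    for (a, c), (w, b) in models.items():
        winner = a if (x @ w + b) >= 0 else c
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)
```

In practice a library implementation with an RBF kernel (as in the paper) would replace `train_binary_svm`; only the pairing and voting logic is the OvsO strategy itself.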
The overview of the proposed algorithm is presented in the following table as Algorithm 1.

Algorithm 1
Overview of the proposed algorithm.
(4) Update the parameters via Equations (21) and (23) until convergence is reached, as determined by the rate of change of the parameter updates falling within a pre-determined threshold. (5) Estimate each source by multiplying the respective rows of the spectral components W_k(ω) with the corresponding columns of the mixture weights H_k(t) and the time-varying phase spectrum e^{jφ_k(ω,t)}, and construct the binary TF mask for the i-th source.
(6) Convert the time-frequency representations of the sources into the time domain to obtain the separated sources ŝ_j(t). (7) Classify the ℓ-th sound event by computing the optimal hyperplane α_ℓ^T ψ(O, w; β) + b_ℓ via the minimization described above.

Experimental Results and Analysis
The performance was evaluated on recorded sound-event signals in a low-noise environment at a 20 dB signal-to-noise ratio (SNR). The sound-event database had a total of 500 recorded signals covering four event classes: speech (SP), door open (DO), door knocking (DK), and footsteps (FS). An overview of the experimental setup is as follows: all signals had a 16-bit resolution and a sampling frequency of 44.1 kHz. A 2048-point Hanning window with 50% overlap was used for signal processing. A nonlinear SVM with a Gaussian RBF kernel was used for constructing the MSVM learning model. Other kernels, such as polynomial, sigmoid, and even linear functions, were tested, but the best performance was delivered by the Gaussian kernel. A 4-fold cross-validation strategy was used in the training phase for tuning the classifier parameters, using 80% of the recorded signals (n = 400) from the sound-event database.
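The 4-fold cross-validation split described above (400 training signals) can be sketched generically; this shows only the index bookkeeping, not the SVM training itself:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle n sample indices and split them into k disjoint folds,
    yielding (train, test) index arrays for cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

With n = 400 and k = 4, each fold holds out 100 signals for validation and trains on the remaining 300, matching the setup described above.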
The performance of the proposed noisy sound separation and event classification (NSSEC) method is presented in the following two sections: (1) the separation performance, and (2) the MSVM classifier.

Sound-Event Separation and Classification Performance
Event mixtures consisted of two sound-event signals in a low-noise environment at 20 dB SNR. One hundred sound-event signals from the four classes were randomly selected and mixed to generate 120 mixtures of six types (i.e., DO + DK, DO + FS, DO + SP, DK + FS, DK + SP, and FS + SP). Separation performance was measured by the signal-to-distortion ratio (SDR), SDR = 10 log_10(||s_target||² / ||e_interf + e_noise + e_artif||²), where e_interf, e_noise, and e_artif denote the interference, noise, and artifact error terms, respectively. The SDR represents the ratio of the energy of the target signal to the distortion introduced by interference from the other sources. The proposed separation method was compared with state-of-the-art NMF approaches (i.e., the CMF [38], NMF-ISD [14,39], and SNMF [40][41][42] methods). The cost function was least squares with a maximum of 500 iterations.
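A simplified version of this SDR computation, which lumps interference, noise, and artifacts into a single residual after least-squares projection onto the target (rather than the full BSS-Eval decomposition), looks like:

```python
import numpy as np

def sdr_db(s_target, s_est):
    """SDR = 10 log10(||target||^2 / ||e||^2), where the target part of the
    estimate is its least-squares projection onto s_target and e is the
    remaining residual (interference + noise + artifacts combined)."""
    alpha = np.dot(s_est, s_target) / np.dot(s_target, s_target)
    target = alpha * s_target            # scaled target component
    e = s_est - target                   # everything that is not the target
    return 10 * np.log10(np.dot(target, target) / np.dot(e, e))
```

A cleaner estimate leaves a smaller residual e and hence a higher SDR; the full BSS-Eval metrics additionally split e into its three components.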

Variational Sparsity Versus Fixed Sparsity
In this implementation, several experiments were conducted to investigate the effect of sparsity regularization on source separation performance. The proposed separation method was evaluated against fixed sparsity in two cases: (1) uniform constant sparsity with low sparseness (e.g., λ_k(t) = 0.01) and (2) uniform constant sparsity with high sparseness (e.g., λ_k(t) = 100). The hypothesis is that the proposed variational sparsity will yield a significant improvement in audio source separation compared with fixed sparsity.
To investigate the impact of the uniform sparsity parameter, a set of sparsity regularization values from 0 to 10 in 0.5 intervals was evaluated in each experiment over 60 mixtures of the six types. The results of uniform regularization for the various sparsity values (i.e., λ_k(t) = 0, 0.5, ..., 10) are illustrated in Figure 2.

Figure 2 illustrates that the best performance of the unsupervised CMF was obtained for sparsity in the range of 1.5-3, which yielded the highest SDR of over 8 dB. When λ_k(t) was set too high, the low spectral values of the sound-event signals were overly sparsified; this overfitting of the sparsity on H_k(t) tended to degrade the separation performance. Conversely, underfitting occurred when λ_k(t) was set too low: the coding parameter H_k(t) could not distinguish between the two sound-event signals. It was also noticed that if the factorization is non-regularized, the separation results contain a mixed sound. According to the uniform sparsity results in Figure 2, the separation performance of the proposed method varies with the assigned sparsity value. It is therefore challenging, given the indistinctness among the sound-event sources in the TF representation, to determine the optimal degree of sparseness, which underlines the importance of determining the optimal λ for separation. Table 1 examines the effect of the sparsity value on separation performance by comparing the proposed method with variational sparsity against the uniform sparsity scheme. The average improvement of the proposed adaptive CMF method over uniform constant sparsity was 1.32 dB SDR.
The SDR results clearly indicate that adaptive sparsity yielded superior separation performance over the constant sparsity scheme. Hence, the proposed variational sparsity improves the recovery of the original sound-event signals by selecting appropriate sparsity parameters that are individually adapted for each element of the code (i.e., each λ_g). Consequently, the optimal sparsity refines the estimated spectral dictionary via the estimated temporal code. Quantitative measures of separation performance were computed to assess the proposed single-channel sound event separation method; the overall average signal-to-distortion ratio (SDR) was 8.62 dB, as illustrated in Figure 3.
Each sound-event signal has its own temporal pattern that can be clearly observed in the TF representation. Examples of sound-event signals in the TF domain are illustrated in Figure 4. Through the adaptive L_1-SCMF method, the proposed single-channel separation method can recover complex temporal patterns such as speech. Thus, the separation results clearly indicate that the noisy source separation achieves high SDR values.

Comparison of the Proposed Adaptive CMF with Other SCBSS Methods Based on NMF
This section presents the adaptive CMF separation performance against the state-of-the-art NMF methods (i.e., CMF, SNMF, and NMF-ISD). For the compared methods, the time-frequency representation was computed using the short-time Fourier transform (1024-point Hanning window with 50% overlap). The number of factors was two, with a sparsity weight of 1.5. One hundred random realizations of twenty-second event mixtures were executed. The resulting average SDRs are presented in Table 2. The proposed adaptive CMF method yielded the best separation performance over the CMF, SNMF, and NMF-ISD methods, with an average SDR improvement of 2.13 dB. The estimated door open signals obtained the highest SDR among the four event categories. The sparsity parameter was carefully adapted using the proposed adaptive L_1-SCMF method, exploiting the phase information and temporal code of the sources, which are inherently ignored by SNMF and NMF-ISD; this led to an improved performance of about 2 dB in SDR. In contrast, the parts decomposed by the CMF, SNMF, and NMF-ISD methods were unable to capture the phase spectra and the temporal dependency of the frequency patterns within the audio signal.
Additionally, the CMF and NMF-ISD solutions are unique only when the signal adequately spans the positive octant; thus, a rotation of W and an opposite rotation of H can produce the same results. The CMF method can also easily over- or under-sparsify the factorization because its sparsity value is determined manually.
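The analysis front end used in this comparison (a short-time Fourier transform with a Hanning window and 50% overlap) can be sketched as follows; the frame layout is our choice:

```python
import numpy as np

def stft(x, n_fft=1024, hop=512):
    """STFT with a Hanning window and 50% overlap (hop = n_fft // 2):
    frame the signal, window each frame, and take the one-sided FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1).T   # shape: freq x time
```

The resulting complex matrix Y(ω, t) is what the factorization methods above decompose; its magnitude feeds SNMF/NMF-ISD, while CMF-style methods also model the phase.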

Performance of Event Classification Based on MSVM Algorithm
This section elucidates the features and performance of the MSVM learning model. The MSVM learning model was investigated to obtain the optimal size of the sliding window and then to determine the significant features that drive the classification performance. Finally, the efficiency of the MSVM model was evaluated. These topics are presented in order in the following parts.

Determination of the Optimal Window Length for Feature Encoding
For the MSVM method, sound-event signals are segmented into small blocks for encoding the feature parameters using a fixed-length sliding window. The sets of feature vectors are reduced to the mean supervector and then passed to the MSVM model for learning and constructing the hyperplane. The block size affects the information in the feature vectors, and thereby the classifier performance: it affects the weight vector α_ℓ, so modifying the block size influences the learning efficiency of the MSVM model. Therefore, in order to obtain the optimal value of α_ℓ, the optimal block size was determined by training the MSVM model with various window lengths (i.e., 0.5, 1, 1.5, and 2.0 s) on the 400 noisy sound-event signals of the four event classes with cross-validation.
The experimental results are plotted in Figure 5, where the block size varies from 0.5 to 2.0 s in 0.5 s increments. The MSVM model with the 1.5 s block size yielded the best sound-event classification, at 100% accuracy. The sliding-window function allows the SVM to learn an unknown sound event by generating a set of blocks from the observed event, which is treated as a number of observed events. As a result, a set of sound-event characteristics was computed for each block (i.e., O_i|β and w_i in Equation (24)).
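The block segmentation described above can be sketched as follows. The paper fixes the window length (0.5-2.0 s tested, 1.5 s chosen); the 50% hop and the sampling rate in this NumPy sketch are assumptions for illustration only.

```python
import numpy as np

def segment_blocks(signal, fs, block_sec=1.5, hop_sec=0.75):
    """Cut a 1-D signal into fixed-length blocks with a sliding window.
    block_sec is the window length tuned in the paper; the 50% hop
    (hop_sec) is an illustrative assumption."""
    block = int(block_sec * fs)
    hop = int(hop_sec * fs)
    starts = range(0, len(signal) - block + 1, hop)
    return np.stack([signal[s:s + block] for s in starts])

fs = 8000
x = np.random.default_rng(2).standard_normal(fs * 6)  # 6 s of dummy audio
blocks = segment_blocks(x, fs)
print(blocks.shape)  # (7, 12000): seven overlapping 1.5 s blocks
```

Each row of `blocks` then yields one feature vector (MFCCs, STE, STZCR), and the set of rows plays the role of the multiple observed events mentioned above.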

The optimal window length captures the signature of the sound event; if the window is too short, the encoded features deviate from the character of the event. In addition, the mean supervector is computed from the features of all blocks and can be regarded as the mean of the probability distribution of the features. This mean supervector gives the MSVM an advantage over the conventional SVM in reducing misclassifications. Hence, the window function of the STFT was set at 1.5 s in all experiments.
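The mean-supervector computation itself is just an average of the per-block feature vectors, which is what makes it behave like the mean of the feature distribution. A minimal NumPy sketch, where the feature layout (one row per block, features concatenated per row) is an assumption:

```python
import numpy as np

def mean_supervector(block_features):
    """Average the per-block feature vectors into a single supervector.
    block_features: (n_blocks, n_features) array, e.g. MFCC/STE/STZCR
    values concatenated per block (layout assumed for illustration)."""
    return np.mean(np.asarray(block_features, dtype=float), axis=0)

# Three blocks, four features each -> one 4-dimensional mean supervector.
feats = np.array([[1.0, 2.0, 0.1, 5.0],
                  [3.0, 2.0, 0.3, 7.0],
                  [2.0, 2.0, 0.2, 6.0]])
print(mean_supervector(feats))  # column-wise means: 2.0, 2.0, 0.2, 6.0
```

Averaging over blocks smooths out block-level outliers, which is the stated reason the mean supervector reduces misclassifications relative to feeding raw per-block vectors to a conventional SVM.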

Determination of Sound-Event Features
Each sound-event signal was encoded with three features: Mel-frequency cepstral coefficients (MFCCs), short-time energy (STE), and short-time zero-crossing rate (STZCR). MFCCs are a frequency-domain feature evaluated in a manner similar to the human ear (i.e., logarithmic frequency perception). STE is the total spectral power of an observed event. The STZCR denotes the number of times the signal amplitude crosses zero within an interval (the indicator is 1 if the condition is true and 0 otherwise). The STZCR features of the four sound-event classes are illustrated in Figure 6.
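The two time-domain features are straightforward to compute per block; a NumPy sketch follows (MFCC extraction, which needs a filterbank, is omitted). The function names are illustrative.

```python
import numpy as np

def short_time_energy(frame):
    """STE: total power of the frame (sum of squared samples)."""
    return float(np.sum(np.asarray(frame, dtype=float) ** 2))

def short_time_zcr(frame):
    """STZCR: count of sign changes between consecutive samples,
    i.e. the 0/1 indicator condition described in the text."""
    frame = np.asarray(frame, dtype=float)
    signs = np.sign(frame)
    return int(np.sum(signs[:-1] * signs[1:] < 0))

x = np.array([0.5, -0.5, 0.5, -0.5, 1.0, 1.0])
print(short_time_energy(x))  # 3.0
print(short_time_zcr(x))     # 4 sign changes
```

Impulsive events such as door knocking concentrate energy in few frames (high STE peaks, low STZCR), while noise-like events cross zero far more often, which is why these two cheap features already separate most of the classes.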

The STZCR feature represents a unique pattern for each of the four sound-event classes; the four patterns differ in both shape and data range. Similarly, the MFCC and STE features extract distinctive patterns for all event classes, except between door knocking and footstep, as illustrated in Figure 7. Figure 7a presents the five orders of MFCC features to compare the patterns of door knocking and walking, while the STE features are shown in Figure 7b.
The proposed method separated the six categories of mixtures and then classified each estimated sound-event signal into its corresponding class. The classified results of the six categories are presented as confusion matrices. As shown in Figure 8, the MSVM model given MFCCs and STZCR yielded the best classification accuracy at 100%, with less deviation than the other cases. Therefore, the separated signals were classified by the proposed MSVM method using the MFCC and STZCR vectors and the 1.5 s window function. The computational complexity of the proposed method was analyzed in two steps. First, the adaptive L1-SCMF problem is NP-hard; its complexity depends on the spectral basis (m), the temporal code (n), and the phase information, which relies on the components (k), so the Big-O of the separation step is O(mnk^2). Second, the MSVM step is a polynomial algorithm with a Big-O of O(n^3). Therefore, the overall computational complexity of the proposed method is O(mnk^2). All experiments were performed on a PC with an Intel® Core™ i7-4510U CPU at 2.00 GHz and 8 GB RAM, with MATLAB as the programming platform.
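Reading per-class and overall accuracy off a confusion matrix works as sketched below. The counts here are purely illustrative placeholders, not the paper's results; only the row/column convention (rows = true class, columns = predicted class) is assumed.

```python
import numpy as np

# Hypothetical 4-class confusion matrix (rows: true class, cols: predicted).
# These counts are made up for illustration; they are NOT the reported data.
classes = ["door open", "door knocking", "footstep", "speech"]
cm = np.array([[50,  5,  3,  2],
               [ 4, 48,  6,  2],
               [ 2,  7, 49,  2],
               [ 1,  3,  4, 52]])

per_class_acc = np.diag(cm) / cm.sum(axis=1)  # recall per event class
overall_acc = np.trace(cm) / cm.sum()         # fraction on the diagonal
for name, acc in zip(classes, per_class_acc):
    print(f"{name}: {acc:.2%}")
print(f"overall: {overall_acc:.2%}")
```

Off-diagonal cells localize the errors (e.g., door knocking confused with footstep), which is exactly the information the per-feature comparison in Figure 8 summarizes.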

Performance of MSVM Classifier
The MSVM-classifier performance is presented as the percentage of correctly classified sound events. The 240 separated signals of four classes from the proposed separation method were individually identified by the MSVM classifier. Figure 9 compares the classification performance for the four classes of individual sound events. The best classification accuracy was achieved for door open, followed by footstep, door knocking, and speech. The classification results for the mixed sound events are illustrated in Figure 10, where the MSVM model delivered its highest performance on the door-open event, at 84% accuracy.

From the above experiments, the proposed method yields an average classification accuracy of 76.67%. The MSVM method discriminates and classifies the mixed event signals well (i.e., the mixtures of door open with door knocking and of door knocking with speech were correctly classified with above 80% accuracy), because the MFCC and STZCR features of the individual events form clearly distinguishable patterns, as shown in the example STZCR plots in Figure 6. Although the SDR scores of the separated signals for door open and door knocking were relatively low (as given in Figure 3), the MSVM still yielded the highest classification accuracy for the door open with door knocking mixture (DO + DK), even though interference remaining in the separated event signals causes the extracted MFCC and STZCR vectors to deviate from their original sound-event vectors.


Conclusions
A novel solution for classifying noisy mixtures recorded with a single microphone was presented. The complex matrix factorization was extended by adaptively tuning the sparse regularization, so the desired L1-optimal sparse decomposition was obtained. In addition, the phase estimates of the CMF could extract the recurrent pattern of the magnitude spectra, and the update equations were derived through an auxiliary function. For classification, the multi-class support vector machine was used with a mean supervector encoding the sound-event signatures. The proposed noisy sound separation and event classification method was demonstrated using four sets of noisy sound-event mixtures: door open, door knocking, footsteps, and speech. Based on the experimental results, the optimal STFT window length was first identified, with a 1.5 s sliding window yielding the best classification performance; second, two significant features, STZCR and MFCCs, were determined. These parameters were then used to examine the proposed method, which achieved outstanding results in both separation and classification. In future work, the proposed method will be evaluated on a public dataset such as DCASE 2016, along with comparisons to other machine learning algorithms.

where A denotes the terms that are irrelevant to φ_k(ω, t), and B_k(ω, t) and Ω_k(ω, t) are defined accordingly. The auxiliary function f⁺(θ, θ̄) in (A4) is minimized when cos φ_k(ω, t) = cos Ω_k(ω, t) and sin φ_k(ω, t) = sin Ω_k(ω, t), which eventually leads to the update formula for e^{jφ_k(ω, t)}. The update formulas for β_k(ω, t) and H_k(t) are then obtained by projection onto the constraint space, where "⊗" and "•" denote the Kronecker product and the Hadamard product, respectively, vec(·) denotes column vectorization, and I is the identity matrix.
The goal is then to compute the regularization parameter λ_k(t) associated with each H_k(t). To achieve this, the parameter p in Equation (A3) was set to 1 to acquire an expression linear in λ_k(t). Taking the noise variance σ² into consideration, Equation (A3) can be rewritten concisely, where the h and λ terms are vectors of dimension R × 1 (i.e., R = F × T × K) and the superscript 'T' denotes the complex Hermitian transpose (i.e., vector or matrix transpose followed by complex conjugation). The Expectation-Maximization (EM) algorithm is used to determine λ, with h as the hidden variable, so that the log-likelihood function can be optimized with respect to λ. Applying Jensen's inequality for any distribution Q(h), the log-likelihood satisfies the bound in [12]. One can verify that the posterior distribution of h maximizing the right-hand side of Equation (A19) is Q(h) = p(h | y, λ, A, σ²), which is proposed in the form of a Gibbs distribution. The term F(h) in Equation (A16), as the function of the Gibbs distribution, is essential for simplifying the adaptive optimization of λ. The maximum-likelihood (ML) estimation of λ can then be decomposed as λ_ML = arg max_λ ∫ Q(h) ln p(h | λ) dh. Each element of H is required to be exponentially distributed with an independent decay parameter, which delivers p(h | λ) = ∏_g λ_g exp(−λ_g h_g); Equation (A20) then follows. The term h denotes the dependent variable of the distribution Q(h), whereas the other parameters are assumed constant. As such, the λ optimization in (A19) is derived by differentiating the parameters within the integral with respect to h, which yields the functional optimum of λ, where g = 1, 2, ..., R and λ_g denotes the g-th element of λ.
The iterative update for σ²_ML is obtained similarly, where p(y | h, A, σ²) = (πσ²)^(−N₀/2) exp(−(1/2σ²)‖y − Ah‖²) and N₀ = K × T. However, the integral forms in Equations (A20) and (A21) are complex to compute and analyze analytically, so an approximation to Q(h) is exploited. Notice that the solution h naturally splits its elements into the distinct subsets h_M and h_P, consisting of the components m ∈ M such that h_m > 0 and the components p ∈ P such that h_p = 0; hence Q(h) factorizes accordingly, with Z_M = ∫ exp(−F(h_M, λ_M)) dh_M and Z_P = ∫ exp(−F(h_P, λ_P)) dh_P. To characterize Q_P(h_P), some positive deviation of h_P must be allowed, whereas all negative values are rejected because the CMF only accepts zero and positive values; thus h_P admits zero and positive values in Q_P(h_P). The distribution Q_P(h_P) is then approximated by a Taylor expansion around the maximum a posteriori (MAP) estimate. Therefore, with h_MAP, one obtains an expression in which C_P = (1/σ²) A_P^T A_P and C = (1/σ²) A^T A. The integration of Q_P(h_P) in Equation (A24) is hard to derive in closed form for analytical evaluation, which prohibits inference of the sparsity parameters; a fixed-form distribution is therefore employed to compute a variational approximation Q̂_P(h_P), from which a closed-form expression is obtained.
Since h_P only takes on nonnegative values, a suitable fixed-form distribution is the factorized exponential distribution Q̂_P(h_P) = ∏_{p ∈ P} (1/u_p) exp(−h_p/u_p). By minimizing the Kullback-Leibler divergence between Q_P and Q̂_P, the variational parameters u = {u_p}, ∀p ∈ P, are derived as u = arg min_u ∫ Q̂_P(h_P) [ln Q̂_P(h_P) − ln Q_P(h_P)] dh_P. Solving Equation (A26) for u_p leads to the update in [37]. The approximate distribution for the components h_M is obtained by substituting F(h_M, λ_M) into Q_M(h_M), and its covariance X is given accordingly. Similarly, the inference for σ² can be computed from Equation (A24). The core procedure of the proposed CMF method is based on the L1-optimal sparsity parameters: the estimated sources are recovered by multiplying the respective rows of the W_k(ω) components with the corresponding columns of the H_k(t) weights and the time-varying phase spectrum e^{jφ_k(ω, t)}, and the separated sources ŝ_j(t) are obtained by converting the time-frequency representations back into the time domain.
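The final reconstruction step can be sketched numerically: each source's complex spectrogram is built from its rank-one magnitude factor W_k(ω) H_k(t) combined with the estimated phase e^{jφ_k(ω, t)}, after which an inverse STFT would return it to the time domain. In this NumPy sketch the dimensions and random factors are illustrative, and each component is assumed to belong to one source.

```python
import numpy as np

rng = np.random.default_rng(3)
F, T, K = 5, 8, 2                           # freq bins, time frames, components

W = rng.random((F, K))                      # spectral bases  W_k(w)
H = rng.random((K, T))                      # temporal codes  H_k(t)
phi = rng.uniform(-np.pi, np.pi, (F, T, K)) # phase estimates phi_k(w, t)

# One complex spectrogram per component: W_k(w) * H_k(t) * e^{j phi_k(w, t)}.
# Here each component is treated as one estimated source.
sources = [np.outer(W[:, k], H[k, :]) * np.exp(1j * phi[:, :, k])
           for k in range(K)]

# Their sum approximates the mixture spectrogram; an inverse STFT (omitted)
# would convert each source back to a time-domain signal s_j(t).
X_hat = sum(sources)
print(X_hat.shape, np.iscomplexobj(X_hat))  # (5, 8) True
```

Note that, unlike magnitude-only NMF, the per-component phase tensor lets each source carry its own phase, which is the property the adaptive L1-SCMF exploits.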