Disentangled Feature Learning for Noise-Invariant Speech Enhancement

Abstract: Most of the recently proposed deep learning-based speech enhancement techniques have focused on designing the neural network architectures as a black box. However, it is often beneficial to understand what kinds of hidden representations the model has learned. Since real-world speech data are drawn from a generative process involving multiple entangled factors, disentangling the speech factor can encourage the trained model to achieve better speech enhancement performance. Motivated by the recent success in learning disentangled representations with neural networks, we explore a framework for disentangling speech and noise, which has not been exploited in conventional speech enhancement algorithms. In this work, we propose a novel noise-invariant speech enhancement method which manipulates the latent features to distinguish between the speech and noise features in the intermediate layers using an adversarial training scheme. To compare the performance of the proposed method with other conventional algorithms, we conducted experiments in both the matched and mismatched noise conditions using the TIMIT and TSPspeech datasets. Experimental results show that our model successfully disentangles the speech and noise latent features. Consequently, the proposed model not only achieves better enhancement performance but also offers a more robust noise-invariant property than the conventional speech enhancement techniques.


Introduction
Speech enhancement techniques aim to improve the quality and intelligibility of a given speech degraded by certain additive noise in the background. In a variety of applications, speech enhancement is considered as an essential pre-processing step. This technique can be directly employed to improve the quality of mobile communications [1] in noisy environments or to enhance speech signals for hearing aid devices [2,3] before amplification. Speech enhancement has also been widely used as a pre-processing technique in automatic speech recognition (ASR) [4,5] and speaker recognition systems [6] for more robust performances.
Over the past several decades, a myriad of approaches have been developed in the speech research community for better speech enhancement. The spectral subtraction method [7] suppresses stationary noise from the input noisy speech by subtracting the spectral noise bias computed during non-speech activity periods. The minimum mean-square error (MMSE)-based spectral amplitude estimator [8,9] showed promising results in terms of reducing residual noise as compared to the spectral subtraction method or the Wiener filtering-based algorithm [10]. The least mean square adaptive filtering (LMSAF)-based speech enhancement approaches can attain filtering performance close to that of the Wiener filter; moreover, they do not need a priori knowledge and can adapt to the external environment through self-learning. However, these approaches have some disadvantages, including slow convergence, strong sensitivity to non-stationary noise, and a trade-off between convergence speed and stability [11,12]. The minima controlled recursive averaging (MCRA) noise estimation was also introduced in [13], of which the performance is known to be reasonably competitive under environments with relatively high signal-to-noise ratios (SNR). However, since these statistical models are constructed based on a stationarity assumption, their performance generally tends to deteriorate in low SNR or highly non-stationary noise conditions. Non-negative matrix factorization (NMF) is one of the most common template-based approaches to speech enhancement [14][15][16], which models noisy observations as a weighted sum of non-negative source bases. NMF-based speech enhancement methods are more robust to non-stationary noise conditions as compared to the statistical model-based methods. These approaches, however, often result in signal distortion in the enhanced speech since they are based on an unrealistic assumption that speech spectrograms are linear combinations of the basis spectra.
Due to the complex nature of the noise corruption process, non-linear models such as deep neural networks (DNNs) have been suggested as an alternative choice for modeling the relationship between the noisy and the corresponding clean speech utterances. DNNs have been successful in solving the speech enhancement tasks under various noise environments since its introduction. Early literature using DNNs as a nonlinear mapping function for estimating clean speech had reported better enhancement results [17][18][19][20] compared to the NMF-based algorithms. Various neural network structures have been employed for speech enhancement, such as multi-context stacking networks for ensemble learning [21], recurrent neural networks (RNNs) [22][23][24], and convolutional neural networks (CNNs) [25,26].
More recently, generative adversarial network (GAN) [27] has become popular in the area of deep learning, and it has been also applied to speech enhancement. Pascual et al. proposed end-to-end speech enhancement GAN (SEGAN) in which the generator learns to model the mapping from the noisy speech samples to their clean counterparts, while the discriminator learns to distinguish between the enhanced and the target clean samples within the context of a mini-max game [28]. The underlying idea of GAN has been adopted in many GAN-based speech enhancement algorithms including the time-frequency mask estimation using the minimum mean square error GAN (MMSE-GAN) [29] and the conditional GAN (cGAN) [30].
Though deep learning-based speech enhancement models have achieved considerable improvements, their performance usually degrades in mismatched conditions caused by different noise types or SNR levels between the training and test set samples. Moreover, the performance varies depending on the type of noise. In order to address such issues, disentangled feature learning can be considered as a possible solution. Most of the previous studies [17][18][19][20][21][22][23][24][25][26][27][28][29][30], which have focused mainly on the mapping between the noisy and the clean speech, rarely consider how input features are learned in the hidden layers. A model based on disentangled feature learning, on the other hand, manipulates the latent features to distinguish between the speech and noise in the intermediate layers, hence resulting in better enhancement performance even in the mismatched noise conditions. Moreover, the noise-invariant property of the model can also be improved.
In this paper, we propose a novel deep learning-based noise-invariant speech enhancement algorithm which employs an adversarial training framework designed to disentangle the latent features of speech and noise, under the concept of domain adversarial training (DAT) [31]. Although DAT was originally introduced for the domain adaptation task, the proposed algorithm exploits the DAT framework for use in the regression task, i.e., speech enhancement. Experimental results show that the proposed method successfully disentangles the speech and noise latent features. Moreover, the results also reveal that our model outperforms the conventional DNN-based algorithms. The main contributions of this paper are summarized as follows:

•
We modify the DAT framework in order to solve the speech enhancement task in a supervised manner. The proposed model achieves better performance in speech enhancement as compared to the baseline models under both the matched and mismatched noise conditions.

•
By reducing the performance gap among different noise types, we show that our method is more robust to noise variability.

•
By visualizing feature representations, we demonstrate that our model successfully disentangles speech and noise latent features.
The remainder of this paper is organized as follows: Section 2 reviews past studies related to the proposed method. The proposed model is elaborately described in Section 3. Section 4 reports the results obtained from the experiments and discusses the details. Finally, Section 5 concludes the paper.

Masking-Based Speech Enhancement
When training neural networks in a supervised manner, it is essential to define a proper training target in order to ensure effective learning. The training targets for speech enhancement can be mainly categorized into two groups: (i) mapping-based, and (ii) masking-based approaches. The mapping-based methods learn a regression function relating a noisy speech directly to the corresponding clean speech, while the masking-based methods estimate time-frequency masks given a noisy speech. A variety of training targets have been studied; Wang et al. evaluated and compared the performance of various mapping-based and masking-based targets [32]. It may be controversial to argue which method is better, yet in many cases the masking-based methods (e.g., ideal ratio masks) have been shown to perform better than the mapping-based methods [21,32,33] in terms of enhancement results. In this work, we design the proposed model within a masking-based framework. We use the time-frequency masking functions as an extra layer in the neural network [22]. This way, the model implicitly incorporates the masking functions when optimizing the network, which will be detailed in Section 3.1.

Domain Adversarial Training
Domain adaptation [34] addresses the problem of mismatch between the training and test datasets by transferring the knowledge learned from the source domain to a robust model in the target domain. DAT is one of the approaches that attempt to match the data distributions across different domains. In [31], DAT exploits an adversarial training method in order to learn intermediate features which are invariant to shifts in data from different domains. Here, the neural network learns two different classifiers: (i) a classifier for the main classification task, and (ii) a domain classifier. The training objective of the domain classifier, in particular, is to learn whether the input sample is from the source or the target domain, given features extracted using labeled data from the source domain and unlabeled data from the target domain. The feature extractor is shared by both the main task and the domain classifiers. In implementation, a gradient reversal layer (GRL) is employed to act as an identity transform in the forward-propagation and to reverse the gradient by multiplying it with a negative scalar during the back-propagation [31]. Consequently, the GRL encourages the latent features to be discriminative for the main classification task, yet indiscriminate towards the shifts across different domains. In other words, the feature extractor is trained so that the model maps data from different domains to latent features with similar distributions via adversarial learning.
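The GRL behavior can be sketched as a small stand-alone operator (a hypothetical numpy illustration, not the implementation of [31]): an identity transform in the forward pass and gradient negation, scaled by a reversal weight λ, in the backward pass.

```python
import numpy as np

class GradientReversalLayer:
    """Identity in the forward pass; multiplies the incoming
    gradient by -lambda in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        # Acts as an identity transform.
        return x

    def backward(self, grad_output):
        # Reverses (and scales) the gradient flowing back to the feature extractor.
        return -self.lam * grad_output

grl = GradientReversalLayer(lam=0.3)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)            # identical to x
g = grl.backward(np.ones(3))  # each gradient component becomes -0.3
```

Placing such a layer between a feature extractor and a domain classifier makes ordinary gradient descent on the classifier loss push the extractor in the opposite, domain-confusing direction.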
Many speech processing frameworks have adopted the idea of DAT in order to extract domain-invariant features. Under the noise robust speech recognition scheme, the clean speech was regarded as the source domain data and was used to train the senone label classifier, while the noisy speech played the role of the target domain data to be adjusted by the feature extractor [35,36]. DAT was also used to learn speaker-invariant senone labels, as shown in [37] where the adversarial training successfully aligned the feature representation of different speakers. In [38], the authors demonstrated that accent-invariant features could be learned for the ASR system. For speaker recognition tasks, DAT was adopted to tackle the channel mismatch problem. In particular, the latent features were extracted in order to learn channel-invariant, yet speaker-discriminative representations [39]. In [40], the authors showed that DAT was able to adapt multiple forms of mismatches (e.g., speaker, acoustic conditions, and emotional content) when solving the acoustic emotion recognition task. As for the speech enhancement problems, a noise adaptive method exploiting DAT was proposed in [41]. In their work, however, DAT was only used to classify stationary and non-stationary noises, and the authors did not make use of various noise components for domain-invariant regression.

Proposed Method
In this section, we propose a method to disentangle the speech and noise features for noise-invariant speech enhancement. We present (1) the proposed model architecture, (2) objective functions, and (3) the adversarial learning process.

Neural Network Architecture
Our neural network consists of five sub-networks: (i) an encoder (E), (ii) a speech decoder (D_s), (iii) a noise decoder (D_n), (iv) a noise disentangler (DE_n), and (v) a speech disentangler (DE_s). The overall architecture of the proposed model is illustrated in Figure 1. We extracted the magnitude spectra as the raw features of all signal components. Only the magnitude spectra were estimated, while the phase of the noisy speech was kept intact. Let us denote the magnitude spectra of the noisy speech, clean speech, and noise as x ∈ R^{F×(2τ+1)}, s ∈ R^{F×1}, and n ∈ R^{F×1}, respectively, where F denotes the number of frequency bins and τ represents an input context expansion parameter (i.e., one current frame, τ previous and τ next frames). The encoder E learns a function, defined by the neural network parameters θ_E, that maps x into speech and noise latent features:

(z_s, z_n) = E(x; θ_E), (1)

where z_s ∈ R^{M×1} and z_n ∈ R^{M×1} indicate the M-dimensional speech and noise latent features, respectively. Similarly, D_s and D_n learn mappings parameterized by θ_{D_s} and θ_{D_n}, respectively:

m̂_s = D_s(z_s; θ_{D_s}), m̂_n = D_n(z_n; θ_{D_n}), (2)

where m̂_s ∈ R^{F×1} and m̂_n ∈ R^{F×1} denote the estimated speech and noise masks, respectively. The time-frequency masks are constrained such that the sum of the estimated speech and noise equals the input noisy speech. Given the masks from both of the decoders, we can obtain the predicted speech and noise through a deterministic layer [22]. Given m̂_s and m̂_n, the predicted magnitude spectra of speech ŝ ∈ R^{F×1} and noise n̂ ∈ R^{F×1} can be calculated as

ŝ = (m̂_s / (m̂_s + m̂_n)) ⊗ x_c, n̂ = (m̂_n / (m̂_s + m̂_n)) ⊗ x_c, (3)

where x_c ∈ R^{F×1} denotes the current (center) frame of x, and the addition, division, and product (⊗) operators are executed element-wise. Finally, DE_n and DE_s are trained to separate the noise attributes from the speech latent features, and vice versa. DE_n and DE_s are respectively parameterized by θ_{DE_n} and θ_{DE_s}:

ñ = DE_n(z_s; θ_{DE_n}), s̃ = DE_s(z_n; θ_{DE_s}), (4)

where s̃ ∈ R^{F×1} and ñ ∈ R^{F×1} represent the speech and noise components, respectively, estimated from the latent features.
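The deterministic masking layer described above can be sketched in numpy (a minimal illustration with made-up mask values; `x0` stands for the current noisy frame): normalizing the two masks guarantees that the speech and noise estimates sum to the noisy input.

```python
import numpy as np

F = 257  # number of frequency bins
rng = np.random.default_rng(0)

x0 = rng.uniform(0.1, 1.0, F)   # noisy magnitude of the current frame (made up)
m_s = rng.uniform(0.0, 1.0, F)  # raw speech mask from D_s (sigmoid output)
m_n = rng.uniform(0.0, 1.0, F)  # raw noise mask from D_n (sigmoid output)

eps = 1e-12                     # avoids division by zero
den = m_s + m_n + eps
s_hat = (m_s / den) * x0        # predicted speech magnitude
n_hat = (m_n / den) * x0        # predicted noise magnitude
# By construction, s_hat + n_hat reproduces the noisy input x0.
```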
Note that s̃ and ñ differ from ŝ and n̂ in Equation (3). s̃ and ñ are generated by the disentanglers, which are trained to predict the speech and noise from the opposite latent features, while the encoder is trained to make this prediction difficult. The GRLs are inserted between the encoder and the disentanglers to establish this adversarial setting.
On the other hand, ŝ and n̂ are estimated by the corresponding decoders.
In the final speech enhancement stage, after obtaining ŝ from the decoders, the estimated clean speech spectrum Ŝ is reconstructed by

Ŝ = ŝ ⊗ e^{j∠x}, (5)

where ∠x denotes the phase of the corresponding input noisy speech. Ŝ is then transformed to the time-domain signal through the inverse discrete Fourier transform (IDFT). Finally, an overlap-add method as in [42] is used to synthesize the waveform of the enhanced speech.
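The reconstruction step can be sketched as follows (a hedged numpy illustration with random stand-in tensors; the exact windowing of the overlap-add method in [42] is assumed, not quoted): the enhanced magnitude is combined with the noisy phase, each frame is inverted by an inverse FFT, and the frames are overlap-added.

```python
import numpy as np

n_fft, hop = 512, 256  # 512-point frames with 50% overlap (Section 4.2)
num_frames = 20
rng = np.random.default_rng(1)

# Stand-ins for the model output and the phase of the noisy-input STFT.
s_hat = rng.uniform(0.0, 1.0, (num_frames, n_fft // 2 + 1))        # enhanced magnitude
noisy_phase = rng.uniform(-np.pi, np.pi, (num_frames, n_fft // 2 + 1))

# Combine the enhanced magnitude with the noisy phase, then invert each frame.
S_hat = s_hat * np.exp(1j * noisy_phase)
frames = np.fft.irfft(S_hat, n=n_fft, axis=1)

# Overlap-add synthesis of the time-domain waveform.
out = np.zeros((num_frames - 1) * hop + n_fft)
for t in range(num_frames):
    out[t * hop:t * hop + n_fft] += frames[t]
```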

Training Objectives
Given the estimates ŝ and n̂ of the clean speech s and noise n, we optimize the neural network parameters of the encoder and decoders by minimizing the mean squared errors defined as follows:

L_{D_s} = (1/K) Σ_{k=1}^{K} ‖ŝ_k − s_k‖², L_{D_n} = (1/K) Σ_{k=1}^{K} ‖n̂_k − n_k‖², (6)

where ‖·‖ indicates the l2-norm, K is the mini-batch size, and ŝ_k (n̂_k) is the estimate of the k-th speech (noise) sample s_k (n_k) in the mini-batch. Similarly, we also train the encoder and the disentanglers by using the following objective functions:

L_{DE_n} = (1/K) Σ_{k=1}^{K} ‖ñ_k − n_k‖², L_{DE_s} = (1/K) Σ_{k=1}^{K} ‖s̃_k − s_k‖², (7)

where ñ_k and s̃_k are obtained through Equation (4). To obtain disentangled features, we minimize L_{DE_n} and L_{DE_s} defined in Equation (7) with respect to θ_{DE_n} and θ_{DE_s}, while simultaneously maximizing them with respect to θ_E. Combining Equations (6) and (7), the total loss of the proposed network is formulated as

L_total = L_{D_s} + α L_{D_n} − λ_1 L_{DE_n} − λ_2 L_{DE_s}, (8)

where λ_1 and λ_2 are positive hyper-parameters which control the amount of gradient reversal in the back-propagation step, and α denotes the weight controlling the contribution of the noise estimate.
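The training objectives above can be sketched in numpy on toy tensors. Note that the combination of the four terms is written here with the disentangler losses entering negatively, which is one plausible reading of the gradient-reversal setup; the helper `mse` and all tensor values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
K, F = 10, 257  # mini-batch size, frequency bins

# Targets and (made-up) network outputs.
s, n = rng.uniform(size=(K, F)), rng.uniform(size=(K, F))
s_hat = s + 0.01 * rng.standard_normal((K, F))   # decoder estimates, near targets
n_hat = n + 0.01 * rng.standard_normal((K, F))
s_tilde = rng.uniform(size=(K, F))               # disentangler outputs
n_tilde = rng.uniform(size=(K, F))

def mse(a, b):
    # Mean over the mini-batch of squared l2-norms.
    return np.mean(np.sum((a - b) ** 2, axis=1))

L_Ds, L_Dn = mse(s_hat, s), mse(n_hat, n)        # decoder losses
L_DEn, L_DEs = mse(n_tilde, n), mse(s_tilde, s)  # disentangler losses

lam1, lam2, alpha = 0.3, 0.3, 0.4
# Disentangler terms enter negatively: minimizing the total w.r.t. the
# encoder maximizes them, as the gradient reversal intends.
L_total = L_Ds + alpha * L_Dn - lam1 * L_DEn - lam2 * L_DEs
```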
In recent studies [35][36][37][38][39][40][41], GRL has only been used for domain predictions under narrowly restricted settings (with only two possible domains, e.g., the source and the target) or for classifications of channels, speakers, and noise types. The proposed model distinguishes itself from the past approaches by using two GRLs to disentangle the speech and noise latent features in a regression manner.

Adversarial Training for Disentangled Features
Neural network parameters are optimized using the objective function given in Equation (8) via adversarial learning. D_s and D_n are trained to minimize L_{D_s} and L_{D_n}, and DE_n and DE_s are trained to minimize L_{DE_n} and L_{DE_s}. As for the optimization of E, it is essential to ensure that it produces disentangled features. This idea is implemented by minimizing L_{D_s} and L_{D_n} while maximizing L_{DE_n} and L_{DE_s} in an adversarial manner with respect to the encoder parameters θ_E. Such a mini-max competition eventually converges to the point where the encoder network generates the noise-confusing latent feature z_s and the speech-confusing latent feature z_n, disentangled in the latent feature space. D_s and D_n then use z_s and z_n as inputs, respectively, and produce the noise-invariant speech ŝ. In summary, the optimization of the network parameters is given by

(θ̂_E, θ̂_{D_s}, θ̂_{D_n}) = argmin_{θ_E, θ_{D_s}, θ_{D_n}} L_total(θ_E, θ_{D_s}, θ_{D_n}, θ̂_{DE_n}, θ̂_{DE_s}),
(θ̂_{DE_n}, θ̂_{DE_s}) = argmax_{θ_{DE_n}, θ_{DE_s}} L_total(θ̂_E, θ̂_{D_s}, θ̂_{D_n}, θ_{DE_n}, θ_{DE_s}), (9)

where θ̂_{(·)} denotes the optimal parameters for each given network (·). The network parameters defined by Equation (9) can be found as a stationary point of the following gradient updates:

θ_E ← θ_E − μ (∂L_{D_s}/∂θ_E + α ∂L_{D_n}/∂θ_E − λ_1 ∂L_{DE_n}/∂θ_E − λ_2 ∂L_{DE_s}/∂θ_E),
θ_{D_s} ← θ_{D_s} − μ ∂L_{D_s}/∂θ_{D_s},
θ_{D_n} ← θ_{D_n} − μ α ∂L_{D_n}/∂θ_{D_n},
θ_{DE_n} ← θ_{DE_n} − μ λ_1 ∂L_{DE_n}/∂θ_{DE_n},
θ_{DE_s} ← θ_{DE_s} − μ λ_2 ∂L_{DE_s}/∂θ_{DE_s}, (10)

where μ indicates the learning rate. The updates in Equation (10) are very similar to stochastic gradient descent (SGD) updates for a feed-forward deep model that comprises the encoder fed into the decoders and the disentanglers. The difference is that the gradients from the disentanglers, weighted by λ_1 and λ_2, are subtracted in the encoder update instead of being summed. The negative coefficients −λ_1 and −λ_2 enable the encoder to maximize L_{DE_n} and L_{DE_s} by reversing the gradients during the back-propagation. If both λ_1 and λ_2 are set to zero, the neural network structure presented in Figure 1 becomes equivalent to the conventional DNN structure. The optimized networks E, D_s, and D_n are then used during the test stage for generating the clean speech estimates given the noisy test speech samples.
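The parameter updates can be sketched with toy scalar "parameters" and made-up per-loss gradients (a hypothetical illustration of the sign pattern, not the actual training code): the encoder descends on the decoder gradients while the disentangler gradients enter with reversed sign.

```python
mu, lam1, lam2, alpha = 1e-3, 0.3, 0.3, 0.4

# Toy scalar parameters and made-up per-loss gradients: g[loss][param].
theta = {"E": 1.0, "Ds": 1.0, "Dn": 1.0, "DEn": 1.0, "DEs": 1.0}
g = {
    "Ds":  {"E": 0.5, "Ds": 0.2},
    "Dn":  {"E": 0.4, "Dn": 0.1},
    "DEn": {"E": 0.3, "DEn": 0.2},
    "DEs": {"E": 0.6, "DEs": 0.3},
}

# Encoder: decoder gradients descend, disentangler gradients are reversed.
theta["E"] -= mu * (g["Ds"]["E"] + alpha * g["Dn"]["E"]
                    - lam1 * g["DEn"]["E"] - lam2 * g["DEs"]["E"])
# Decoders and disentanglers: ordinary descent on their own losses.
theta["Ds"]  -= mu * g["Ds"]["Ds"]
theta["Dn"]  -= mu * alpha * g["Dn"]["Dn"]
theta["DEn"] -= mu * lam1 * g["DEn"]["DEn"]
theta["DEs"] -= mu * lam2 * g["DEs"]["DEs"]
```

With λ1 = λ2 = 0 the encoder update reduces to plain SGD on the decoder losses, matching the remark above about the conventional DNN structure.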

Experiments and Results
In this section, we evaluate the performance of the proposed model compared to the baseline systems using various metrics. For performance comparison, we conducted experiments in both the matched and mismatched noise conditions.

Dataset
We used 6300 utterances of clean speech data from the TIMIT database [43] to train the neural networks. The TIMIT database consists of 630 English speakers, each reading 10 sentences. In order to ensure that various noisy utterances were considered during the simulations, we selected 10 different noise types: car, construction, office, railway, cafeteria, street, incar, train, and bus from the ITU-T recommendation P.501 database [44], and white noise from the NOISEX-92 database [45]. In the case of the matched noise conditions, two-thirds of each noise clip was used for training and the rest for testing. For each pair of the clean speech utterance and the noise waveform, a noisy speech utterance was artificially generated with an SNR value randomly chosen from −3 to 6 dB in 1 dB steps. As a result, a total of 63,000 utterances (about 54 h) were used so that the entire database was mixed with each noise type.
The test set consisted of 1400 utterances of clean speech data from the TSPspeech database [46], spoken by 12 male and 12 female English speakers. For the experiments in the matched noise conditions, we used the same noise types as used for training. For the experiments in the mismatched noise conditions, the kids, traffic, metro, and restaurant noises from the ITU-T recommendation P.501 database were applied. Noisy speech utterances were generated with SNR values ranging from −6 to 9 dB in 3 dB steps, in which the −6 and 9 dB cases represented unseen SNR conditions.
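The mixture generation can be sketched as follows (a hypothetical helper; the actual mixing scripts are not described in the paper beyond the SNR ranges): the noise is scaled so that the speech-to-noise power ratio matches the target SNR before addition.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Scale `noise` so that mixing it with `speech` yields the target SNR."""
    if rng is None:
        rng = np.random.default_rng()
    # Pick a random noise segment as long as the speech.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = rng.standard_normal(16000)   # 1 s of 16 kHz "speech" (stand-in)
noise = rng.standard_normal(160000)   # 10 s noise clip (stand-in)
snr_db = rng.integers(-3, 7)          # random SNR in [-3, 6] dB, 1 dB steps
noisy = mix_at_snr(speech, noise, snr_db, rng)
```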

Feature Extraction
The input and target features of the networks were extracted in the following way. First, we extracted the magnitude spectra from the noisy speech, the corresponding clean speech, and the noise. A 512-point Hamming window with 50% overlap was applied to the audio signals, sampled at 16 kHz, and then the short-time Fourier transform (STFT) was applied. The 512-point STFT magnitudes were reduced to 257 points by removing the symmetric half. F and τ were fixed at 257 and 5, respectively. Thus, the input feature vectors were extracted by concatenating 11 consecutive frames in a similar manner as in [19].
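The feature extraction pipeline can be sketched in numpy (edge padding at utterance boundaries is an assumption; the paper does not specify how the first and last frames are context-expanded):

```python
import numpy as np

fs, n_fft, hop, tau = 16000, 512, 256, 5
rng = np.random.default_rng(4)
audio = rng.standard_normal(fs)  # 1 s of audio at 16 kHz (stand-in)

# Frame with a 512-point Hamming window at 50% overlap, then take STFT magnitudes.
window = np.hamming(n_fft)
num_frames = 1 + (len(audio) - n_fft) // hop
frames = np.stack([audio[t * hop:t * hop + n_fft] * window
                   for t in range(num_frames)])
mag = np.abs(np.fft.rfft(frames, axis=1))  # (num_frames, 257) magnitudes

# Context expansion: concatenate tau past and tau future frames per input
# (edge frames are repeated at the boundaries — an assumption).
padded = np.pad(mag, ((tau, tau), (0, 0)), mode="edge")
inputs = np.stack([padded[t:t + 2 * tau + 1].reshape(-1)
                   for t in range(num_frames)])  # (num_frames, 11 * 257 = 2827)
```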

Network Setup
The network architecture of the proposed model is presented in Figure 1; we refer to it as the speech-noise disentangled training (snDT) model. The encoder E was constructed by stacking two hidden layers with 2048 leaky rectified linear units (ReLUs) [47] in each layer. The number of input nodes of E was 257 × 11 = 2827. The output layer generated two separate outputs of 512 nodes each (i.e., the dimension M of z_s and z_n) with leaky ReLUs.
The decoders D_s and D_n also had two hidden layers with 2048 leaky ReLUs in each layer. The numbers of input and output nodes in each network were 512 and 257, respectively. For the output activations, the sigmoid function was used to restrict the output mask values (m̂_s and m̂_n) to [0, 1], while ŝ and n̂ were determined implicitly by Equation (3). The structures of DE_n and DE_s were identical to that of D_s except for the output activation functions; ReLUs were used for the output magnitudes (s̃ and ñ).
The snDT model was trained with the Adam optimizer [48], with a learning rate of 1 × 10^−3, using a mini-batch size of 10 utterances. Batch normalization [49] was applied to all of the hidden and output layers for regularization and stable training. As for the hyper-parameters λ_1 and λ_2 of Equation (8), we took an approach similar to [31]: λ_1 and λ_2 were initialized to 0 for the first 50K training iterations, and then their values were gradually increased until reaching (λ_1, λ_2) = (0.3, 0.3) by the end of the training. α in Equation (8) was fixed at 0.4. Figure 2 shows the training losses obtained from the snDT model, and it can be seen that the model was trained properly. Through the adversarial training defined by Equation (9), the speech and noise estimation losses decreased, and the disentangling losses increased gradually until convergence. To evaluate the performance of the disentangled feature learning technique, we implemented three baseline models for comparison:

•
Speech training (sT) model, as shown in Figure 3a, was a deep denoising autoencoder [17], and it took a regression approach closely resembling [19].

•
Speech-noise training (snT) model, as shown in Figure 3b, utilized noise components to construct the time-frequency masks. This approach was similar to the method suggested in [22]. Unlike the snDT model, however, the snT model did not exploit disentangled feature learning.

•
Noise disentangled training (nDT) model, as shown in Figure 3c, was trained so that the noise components were disentangled from the speech latent features without using noise latent features.
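The λ_1, λ_2 warm-up described in this section can be sketched as a schedule function (a hedged interpretation: the paper states only the 50K-iteration warm-up and the final value of 0.3, so the linear ramp and the total iteration budget are assumptions):

```python
def lambda_schedule(step, warmup=50_000, total=200_000, lam_max=0.3):
    """0 during warm-up, then a ramp to lam_max by the end of training.

    `total` is a hypothetical iteration budget; the paper only states the
    warm-up length (50K iterations) and the final value (0.3). A linear
    ramp is assumed here.
    """
    if step < warmup:
        return 0.0
    frac = (step - warmup) / (total - warmup)
    return min(lam_max, lam_max * frac)

# lambda_schedule(0) -> 0.0; lambda_schedule(200_000) -> 0.3
```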
The baseline models were configured similarly in terms of hyper-parameters and the number of layers and nodes in each module to ensure a fair comparison with the snDT model. We implemented all the networks using TensorFlow [50].

Objective Measures
For the evaluation of the models' performances, we considered four different aspects: speech quality, noise reduction, speech intelligibility, and speech distortion. The tested objective measures are summarized as follows:

•
PESQ: Perceptual evaluation of speech quality, defined in the ITU-T P.862 standard [51].

•
segSNR: Segmental SNR, which is the average of the per-frame SNR between the two speech signals [52].

•
eSTOI: Extended short-time objective intelligibility [53].

•
SDR: Signal-to-distortion ratio [54].
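Segmental SNR can be sketched as below (a common formulation assumed here: per-frame SNR in dB clipped to [−10, 35] dB before averaging; the exact definition in [52] may differ in detail):

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=512, hop=256, lo=-10.0, hi=35.0):
    """Average per-frame SNR (dB) between clean and enhanced signals,
    with each frame's SNR clipped to [lo, hi] dB."""
    eps = 1e-10
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        err = np.sum((c - e) ** 2) + eps
        snr = 10.0 * np.log10(np.sum(c ** 2) / err + eps)
        snrs.append(np.clip(snr, lo, hi))
    return float(np.mean(snrs))

rng = np.random.default_rng(5)
clean = rng.standard_normal(16000)
noisy = clean + 0.1 * rng.standard_normal(16000)  # roughly 20 dB SNR
```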
All metric values for the enhanced speech were compared with the corresponding clean reference of the test set.

Performance Evaluation
In the case of the matched noise conditions, we measured the objective metrics and averaged them over each SNR environment to evaluate the performance for ten different noise types. Table 1 presents the PESQ scores, segSNR, eSTOI, and SDR values obtained in the matched noise conditions, where the column "noisy" refers to the results measured between the clean reference and the unprocessed noisy speech. The cases with SNR equal to −6 and 9 dB indicate the unseen SNR conditions that were not included during the training phase. Firstly, we investigated whether the use of noise information improves the speech enhancement performance. The results show that the snT model, which constructed the masks using both speech and noise information, performed better than the sT model, whose prediction was based only on speech components. Similarly, the snDT model with noise estimates reported better performance in terms of all the metrics compared to the nDT model.
The nDT model, which disentangles the noise components in the latent feature space, resulted in lower performance improvements in comparison with the snT model. This confirms that even though the nDT model incorporated disentangled feature learning, it was not able to exploit the noise information to construct the masks during the speech enhancement process. On the other hand, in order to examine the sole effect of disentangled feature learning, the nDT model should be compared to the sT model, whose structure is identical to that of the nDT model except for the noise disentangler. As can be seen in the results, the nDT model outperformed the sT model in terms of all the metrics. Furthermore, the comparison of the snDT model to the snT model, both of which adopted the same masks except that the snDT model additionally applied disentangled feature learning, reported better performance improvements for the snDT model. In summary, the proposed model showed better performance than all the other baseline models in terms of speech quality, intelligibility, noise reduction, and speech distortion, indicating that the disentanglement between speech and noise features in the latent feature space was more effective for the prediction of the clean speech.

Table 1. Results of perceptual evaluation of speech quality (PESQ), segmental signal-to-noise ratio (segSNR), extended short-time objective intelligibility (eSTOI), and signal-to-distortion ratio (SDR) values of the proposed and baseline networks in the matched noise conditions, where the −6 and 9 dB cases are unseen SNR conditions.

In the case of the mismatched noise conditions, we evaluated the performance given four different noise types and averaged the results over each SNR environment. Table 2 presents the PESQ scores, segSNR, eSTOI, and SDR values obtained under the mismatched noise conditions.
The results show that the snDT model outperformed the baseline methods, implying that it was more robust to the unseen noise types. Since the snDT model learned how to disentangle speech components from the latent features, the disentangled features could be obtained even in the mismatched noise conditions. From the perspective of noise reduction, in particular, it is quite noteworthy that the models using disentangled feature learning showed relatively competitive performance improvements in the mismatched noise conditions compared to the matched conditions. In the matched noise conditions, the relative improvement of segSNR was 16.31% for the nDT model when compared against the sT model, and 11.21% for the snDT model against the snT model. In the mismatched noise conditions, however, the relative improvements of segSNR of the nDT and snDT models were 38.79% and 15.95%, respectively. It can be seen that the proposed approach is particularly effective in terms of noise reduction. Additionally, Figure 4 shows the spectrograms of an utterance enhanced by the snT and snDT models in the mismatched noise conditions. The figure shows that the proposed algorithm effectively reduced the noise from the original noisy speech while minimizing speech distortion.

Subjective Test Results
We also conducted a listening test to compare the subjective performance of the proposed algorithm with the conventional scheme. In the test, 18 listeners participated and were presented with 42 randomly selected sentences corrupted by the 14 different noises at SNR values of −3, 0, and 3 dB. Each listener was provided with speech samples enhanced by the snT and snDT models. Listeners could listen to each enhanced speech sample as many times as they wanted and were asked to choose the preferred one from each pair of speech samples in terms of perceptual speech quality. If the quality of the two samples was indistinguishable, listeners could select no preference. The two samples in each pair were given in arbitrary order.
The results are shown in Figure 5. It can be seen that the quality of the speech enhanced by the proposed model was better than the conventional model in all SNR values. With respect to the averaged results, the snDT model was preferred to the snT model in 52.78% of the cases, while the opposite preference was 8.20% (no preference in 39.02% of the cases). These results imply that the proposed algorithm enhances not only the objective measures but also the perceived quality.

Analysis of Noise-Invariant Speech Enhancement
As the network is trained with different types of noise, it can easily be anticipated that the performance may vary depending on the noise type even for the same SNR value. This could be problematic, especially under various real-world noise environments, because lower performance improvements for certain noise types could degrade the overall performance of the entire system. Figure 6 shows the variances of the PESQ scores obtained from different noise types. We separately measured the PESQ scores for each noise type and computed the variances over the 14 different noise types used in the matched and mismatched noise conditions. The results show that the proposed algorithm yielded the smallest performance gap among the noise types in all of the SNR environments. It is noted that the snDT model produced much smaller variances at the low SNR levels compared to the baseline models. This demonstrates that the proposed model was less sensitive to different noise types during the enhancement process because it disentangled the speech attributes well from the noisy speech in the latent feature space. The experimental results, therefore, suggest that the proposed model is a speech enhancement system with an improved noise-invariant property.

Disentangled Feature Representations
We further explored the effect of disentangled feature learning by visualizing the speech latent feature (z_s) using t-distributed stochastic neighbor embedding (t-SNE) [55]. t-SNE is a popular data visualization method which projects high-dimensional data into a subspace with a smaller dimension. The projection serves as a useful tool to visually inspect feature representations learned by the model. We extracted speech latent features from a subset of the test samples through the trained models and projected the 512-dimensional z_s into the two-dimensional space using t-SNE. Figure 7 visualizes the speech latent feature representations obtained in the matched noise conditions. Figure 7d, in particular, shows that by using two disentanglers for adversarial learning, the distribution of z_s became almost indistinguishable across noise types. This implies that the noise attributes were highly likely to be disentangled from z_s. In contrast, without disentangled feature learning, as shown in Figure 7a,b, each type of noise cluster could easily be separated in the latent feature space, indicating that the noise attributes remained intact in z_s. Figure 7c shows that the nDT model disentangled the noise components more clearly than the sT and snT models, yet not as much as the snDT model. Finally, Figure 8 shows the speech latent feature representations in the mismatched noise conditions. Even though these noise types were not included in the training data, the proposed model disentangled the noise components more clearly in the latent feature space compared to the conventional DNN-based models.

Conclusions
In this paper, we proposed a novel speech enhancement method in which speech and noise latent features are disentangled via adversarial learning. In order to explore the disentangled representation, which has not been exploited in the conventional speech enhancement algorithms, we designed a model using GRLs. The proposed architecture is composed of five sub-networks, where the decoders and the disentanglers are trained in an adversarial manner to encourage the encoder to produce noise-invariant features. The speech latent features generated by the encoder reduced the variability among different noise types while keeping the speech information intact. Experimental results showed that the proposed model outperformed the conventional DNN-based speech enhancement algorithms in terms of various measures in both the matched and mismatched noise conditions. Moreover, the proposed model achieved a more robust noise-invariant property through disentangled feature learning. Visualization of the speech latent features demonstrated that the proposed method was able to disentangle the speech attributes from the noisy speech in the latent feature space.
In our future work, we will further develop novel structures and training techniques for a better representation of disentangled speech and noise features than the current model. In addition, we will consider a model that can disentangle the various factors that occur in voice communication systems.