Integrating Dilated Convolution into DenseLSTM for Audio Source Separation

Herein, we proposed a multi-scale multi-band dilated time-frequency densely connected convolutional network (DenseNet) with long short-term memory (LSTM) for audio source separation. Because the spectrogram of the acoustic signal can be thought of as images as well as time series data, it is suitable for convolutional recurrent neural network (CRNN) architecture. We improved the audio source separation performance by applying the dilated block with a dilated convolution to CRNN architecture. The dilated block has the role of effectively increasing the receptive field in the spectrogram. In addition, it was designed in consideration of the acoustic characteristics that the frequency axis and the time axis in the spectrogram are changed by independent influences such as speech rate and pitch. In speech enhancement experiments, we estimated the speech signal using various deep learning architectures from a signal in which the music, noise, and speech were mixed. We conducted the subjective evaluation on the estimated speech signal. In addition, speech quality, intelligibility, separation, and speech recognition performance were also measured. In music signal separation, we estimated the music signal using several deep learning architectures from the mixture of the music and speech signal. After that, the separation performance and music identification accuracy were measured using the estimated music signal. Overall, the proposed architecture shows the best performance compared to other deep learning architectures not only in speech experiments but also in music experiments.


Introduction
In a real environment, humans hear several mixed signals simultaneously. In these situations, we can selectively attend to a signal we want, effectively segregating a target from the perceived mixture. This is the so-called auditory scene analysis or cocktail party problem [1], which is the main topic of this paper. In particular, when signals such as unwanted noise are mixed with the target signal, system performance is degraded, which underlines the need for this study [2,3]. Audio source separation is used to estimate the target signal, such as speech or music, when the target signal and other signals are mixed. If the target signal to be estimated is speech, it is the speech enhancement task; if the target signal is music, it is the music signal separation task.
In the conventional method of audio source separation with linear characteristics, non-negative matrix factorization (NMF) [4] is used not only for speech enhancement but for music source separation tasks [5][6][7]. NMF is an algorithm that decomposes a signal into two non-negative matrices that are a basis matrix representing independent characteristics and an activation matrix for each characteristic [4]. Each component of the signal can be separated using the basis matrix learned in the training process. Recently, deep learning, which has non-linear characteristics, showed better results than the traditional NMF method in audio source separation [8][9][10][11][12][13][14]. Deep learning has several representative architectures, such as the fully connected neural network (FCNN) [15], convolutional neural network (CNN) [16], and recurrent neural network (RNN) [17]. Often, these architectures are modified or combined to design a new architecture suitable for the task. FCNN architecture, which is the most basic deep learning architecture, is an architecture in which the nodes of each layer are fully connected [15], while the CNN designed for image processing is an architecture that convolves filters as a weight sharing method [16]. In addition, an RNN suitable for processing time series data is characterized by accumulating and using the information of the input data in chronological order within the architecture [17].
Many studies have shown better performance than the FCNN by using bidirectional long short-term memory (BLSTM) and CNN architectures [18][19][20][21][22][23][24][25]. Long short-term memory (LSTM) has a memory cell involved in inputs and outputs to store longer time series information than vanilla RNN [26]. BLSTM architecture determines the output by combining the forward time series information and the backward time series information of the LSTM [27]. The CNN architecture that shares weights can be used to design deeper layers of networks with the same parameters, which means that CNNs can learn more complex patterns of filters than FCNN architecture.
Recently, the convolutional recurrent neural network (CRNN) architecture [28], which combines CNNs suited to image processing and RNNs suited to time series data processing in a single architecture, has shown better performance than other CNN, RNN, and BLSTM architectures [29][30][31]. In the CRNN architecture, the LSTM or BLSTM variants of the RNN family, which perform better than the vanilla RNN, are used. The parameters of the CNN are learned to estimate filters for the target, and the parameters of the RNN are learned to store the target information using time series information [28]. In the spectrogram of the acoustic signal, the target pattern shows slightly different characteristics depending on the frequency axis but appears at various locations along the time axis. Therefore, it is suitable for the translation-equivariance characteristics of the CNN architecture. In addition, since the spectrogram is time series data along the time axis, it is suitable for the RNN architecture [29][30][31]. When the spectrogram is the input, the CRNN architecture first obtains a feature map through the CNN; because the time axis of the feature map still preserves the time series information, the RNN can then extract features that use this information. Therefore, a CRNN architecture is well suited to learning these patterns jointly in an acoustic signal, and it shows good performance [29][30][31].
The CNN-based architectures for music source separation have an encoder-decoder architecture through down-sampling and up-sampling [23][24][25]. This encoder-decoder style architecture is intended to effectively increase the receptive field [32] and utilize the contextual information extracted from a wider time range of input data. In our previous study [33], we created a dilated block that effectively increased the receptive field by using a dilated convolution [34], which is suitable for the acoustic characteristics. The dilated convolution has the advantage of having a larger receptive field with the same number of parameters by adding an empty space between the filter nodes [34]. In order to apply the dilated convolution more appropriately to the spectrogram, the dilated block of the previous study was designed to arrange the dilated convolution of the time axis, the dilated convolution of the frequency axis, and the standard convolution in parallel [33]. The architecture where the dilated block is added in front of the dense block is called a dilated dense block. A dilated time-frequency DenseNet (DilDenseNet), which we designed using a dilated block, confirmed that it improves the performance in a music signal separation task [33].
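The effect of dilation on the receptive field can be checked with a few lines of arithmetic. The following is a minimal sketch for one-dimensional, stride-1 convolutions; the kernel sizes and dilation rates are illustrative assumptions and are not the values used in the dilated block of [33].

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a 1-D kernel when (dilation - 1) gaps separate its taps."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 1-D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs, applied in order.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += effective_kernel_size(kernel_size, dilation) - 1
    return field


# Three standard 3-tap layers versus three dilated 3-tap layers.
# Both stacks have the same number of weights (illustrative values only).
standard = receptive_field([(3, 1), (3, 1), (3, 1)])
dilated = receptive_field([(3, 1), (3, 2), (3, 4)])
print(standard, dilated)
```

With the same number of filter weights, the dilated stack in this toy setting covers roughly twice as many input positions as the standard stack.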
In this study, we proposed a multi-scale multi-band dilated time-frequency DenseNet with LSTM (MMDilDenseLSTM) architecture for source separation. In addition to the encoder-decoder style, we applied a dilated block to CRNN architecture in order to expand the receptive field more effectively. We experimented with speech enhancement and music signal separation to evaluate the performance of MMDilDenseLSTM in comparison with other deep learning architectures. In the speech enhancement task, we performed subjective evaluation for the enhanced speech signal. In addition, we also measured speech quality, intelligibility, separation performance, and speech recognition accuracy. In the music separation task, separation performance and music identification performance were tested. We found that separation performance did not always correlate with speech recognition performance or music identification performance. To investigate the cause of this uncorrelated relationship, we analyzed the separated signal by plotting an intuitive feature of the music identification system on the spectrogram. In the speech and music experiments, the proposed architecture, which effectively increased the receptive field in CRNN architecture, showed the best performance overall compared to the existing deep learning architectures.
We first introduce related works in Section 2 and then explain the proposed MMDilDenseLSTM in detail in Section 3. In Section 4, we present our experimental results. Finally, we offer conclusions in Section 5.

Related Works
Gated residual network (GRN), which is a CNN-based architecture, consists of a frequency-dilated module that extends the receptive field on the frequency axis, a time-dilated module that increases the receptive field on the time axis, and, finally, a prediction module that outputs a mask of the same size as the input [35]. In particular, the GRN architecture is a deeper network that has a wide receptive field along the time axis by arranging several time-dilated modules, and this architecture has shown better performance than vanilla RNN and BLSTM based on a fully connected layer in speech enhancement [35].
Recently, the multi-scale multi-band DenseNet (MMDenseNet) architecture using densely connected convolutional networks (DenseNet) [36] showed good performance in music source separation [25]. In the DenseNet architecture, since the final output feature map includes the input and the output feature map of each layer, a lot of information is carried through the feed-forward process. In addition, the vanishing gradient problem can be alleviated because the error is transmitted directly to each layer, without passing through intermediate nodes, in the error backpropagation process [36]. MMDenseNet arranges a multi-scale DenseNet (MDenseNet) architecture in parallel for each divided frequency band. The MDenseNet architecture obtains low-dimensional features by repeating a dense block of DenseNet and down-sampling, and then restores the original size by repeating a dense block and up-sampling. For simplicity, the process of representing the input data as a low-dimensional feature is called the encoding process, and the process of restoring the low-dimensional feature obtained in the encoding process to its original size is called the decoding process. Recently, the best performance in the music source separation task has been achieved by MMDenseNet with LSTM (MMDenseLSTM) [31], which combines BLSTM with the CNN-based MMDenseNet [25]. MMDenseLSTM is a CRNN architecture in which a BLSTM architecture is added after several dense blocks of MMDenseNet. As in speech enhancement tasks, the CRNN architecture is currently the best performing architecture.
In the time domain, end-to-end methods that do not rely on the spectrogram have been successfully applied to audio source separation [37,38]. There are two drawbacks when the spectrogram is used as input. First, it takes time to extract the spectrogram and restore it to a signal. Second, the phase information of the mixture used in the restoration process causes distortion. Of these end-to-end methods, Wave-U-Net did not show good performance in music signal separation [37], whereas Conv-TasNet showed very good performance in the speaker separation task [38]. Based on these previous results, we took the approach of using the spectrogram as an input and will pursue end-to-end methods in future research.

Proposed Architecture
The proposed architecture is presented in three steps. First, we introduced a novel dilated dense block that combines the dilated block [33] with the dense block of DenseNet. The dilated dense block was integrated into the multi-scale DenseLSTM (MDenseLSTM) architecture [31] to create a multi-scale dilated time-frequency DenseLSTM (MDilDenseLSTM). Finally, we combined multiple MDilDenseLSTM covering different frequency bands into the proposed architecture, MMDilDenseLSTM.

DenseNet and Dilated Dense Block
The dilated dense block architecture had the dilated block on the left and the dense block on the right, as shown in Figure 1. The dilated block can consider a wide receptive field and the dense block outputs a feature map containing more accurate target information while passing through several layers. Therefore, we placed the dense block after the dilated block in order to naturally inherit the influence considering the wide receptive field. The dense block of DenseNet concatenates the output feature maps, which is the output of the CNN filter convolving with the input data [39,40], as shown in the right block of Figure 1 in order to exploit the advantage of efficient information transmission in the feed-forward and error back-propagation processes. The dense block is composed of several composite functions [36], and the composite function consists of a sequence of batch normalization (BN) [41], rectified linear unit (ReLU) [42], and 3 × 3 convolution (Conv). The equation below represents the concatenation of the dense block.
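$$x_l = H_l\left([x_0, x_1, \ldots, x_{l-1}]\right)$$
where x_l represents the output of the l-th layer, [x_0, x_1, ..., x_{l-1}] is the concatenation of the feature maps produced by layers 0 to l-1, and H_l represents the composite function of the l-th layer.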
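The bookkeeping implied by this concatenation can be sketched as follows. The sketch tracks channel counts only, not tensors, and assumes that every composite function emits the same number of feature maps (the growth rate k used later in the text); the numeric values are hypothetical.

```python
def dense_block_channels(m0: int, growth_rate: int, num_layers: int):
    """Track channel counts through a dense block.

    Each composite function H_l receives the concatenation
    [x_0, ..., x_{l-1}] and emits `growth_rate` new feature maps.
    Returns the input width seen by each H_l and the width of the
    final concatenated output.
    """
    widths = [m0]          # channels of x_0, then of each x_l
    inputs_seen = []       # channels entering H_1, H_2, ...
    for _ in range(num_layers):
        inputs_seen.append(sum(widths))  # size of the concatenation
        widths.append(growth_rate)       # x_l = H_l(concatenation)
    return inputs_seen, sum(widths)


# Hypothetical values: 32 input maps, growth rate 12, 4 composite functions.
inputs_seen, total = dense_block_channels(m0=32, growth_rate=12, num_layers=4)
print(inputs_seen, total)
```

Each later layer sees a wider input, which is why the text that follows introduces a compression step to keep the number of feature maps bounded.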

Multi-Scale Dilated Time-Frequency DenseLSTM
MDilDenseLSTM is an architecture consisting of dilated dense blocks (DDB), compression blocks ("Compr."), LSTM blocks, down-sampling, and up-sampling, as shown in Figure 2. MDilDenseLSTM has four down-sampling (DS) and four up-sampling (US) stages. The down-sampling uses 2 × 2 average pooling and the up-sampling uses 2 × 2 transposed convolution. LSTM blocks are placed before the up-sampling that produces scale 1 and before the up-sampling located at the smallest scale. The benefit of efficient information and error transmission is further enhanced through inter-block skip connections that connect outputs of the same size during encoding and decoding [25]. The compression block compresses the information of many feature maps into a small number of feature maps. The down-sampling and up-sampling reduce or increase the time-frequency size of the compressed feature map. The LSTM block, which is located between the compression block and the up-sampling, outputs the target sequence information along the time axis. By combining a CNN-based dilated dense block with an RNN-based LSTM block, MDilDenseLSTM forms a CRNN architecture and is suitable when the input, such as a spectrogram, is both an image and time series data.
The purpose of the dilated block is to increase the receptive field on a spectrogram more effectively, in combination with down-sampling and up-sampling. The dilated block is an architecture in which the frequency dilated convolution (FDConv), time dilated convolution (TDConv), and 3 × 3 standard convolution are configured in parallel after BN and ReLU, as shown in the left block of Figure 1. The dilated convolution [34] was used for the FDConv and TDConv, and the kernel size and dilation rate were adjusted to suit the frequency and time axes, respectively, as shown in Figure 1 [33].
In image tasks, the dilated convolution uses the same dilation rate on the horizontal and vertical axes because the size of a target in an image changes at the same ratio in width and height according to the distance [34]. However, in the spectrogram, since the acoustic characteristics of the time axis and the frequency axis change under independent influences, the FDConv and TDConv are arranged in parallel. For example, the speech rate affects the time axis, and the gender-dependent pitch affects the frequency axis. The FDConv broadens the receptive field of the frequency axis, and the TDConv broadens the receptive field of the time axis. The dilated block is located in front of the dense block, and this whole architecture is called a dilated dense block. The number of output feature maps in the dilated block is m_0 + 3k, and the number of output feature maps in the dilated dense block is m_0 + (3 + L)k, where m_0 is the number of input feature maps, k is the growth rate, and L is the number of composite functions [36].

The compression block [36] consists of the BN, ReLU, and 1 × 1 convolution, and is placed behind the dilated dense block to appropriately limit the number of output feature maps from the dilated dense block and compress information at the same time.
In the dilated dense block, which is a DenseNet-based architecture, the output feature maps of each layer are concatenated, so the number of output feature maps grows in proportion to the number of layers. The compression rate θ has a value of 0 < θ ≤ 1. With a compression block placed behind the dilated dense block, the number of output feature maps can be expressed as (m_0 + (3 + L)k) × θ; when θ = 1, the number of feature maps is maintained without compression.
As shown in Figure 3, the LSTM block consists of a sequence of a 1 × 1 convolution, a BLSTM layer, and an FCNN layer, and the input feature maps are concatenated to the output feature map. The LSTM block turns the CNN-based DenseNet architecture into a CRNN architecture. In the CRNN architecture, features are first extracted with the CNN, and then the RNN uses time series information to extract features and classify the input into the target class [28]. Since the feature maps obtained from the spectrogram retain the time series information as it is, adding the RNN architecture can reflect the time series information for all frames on the time axis. In addition, the receptive field can be widened to cover all frames of the time axis. The 1 × 1 convolution reduces the number of input feature maps to 1 and feeds the result into the BLSTM layer. Since the frequency dimension of the output feature map varies according to the number of nodes in the BLSTM layer, the FCNN adjusts the frequency dimension of the output feature map to match that of the input feature map. Since the input feature maps and the output feature map of the LSTM block are concatenated, the number of output feature maps after the dilated dense block, compression block, and LSTM block can be expressed as (m_0 + (3 + L)k) × θ + 1.
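The feature map counts given for the dilated dense block, the compression block, and the LSTM block can be chained in a short sketch. The values of m_0, k, L, and θ below are hypothetical and are not taken from Table 1, and rounding the compressed count down to an integer is an assumption, since the text does not state how fractional counts are handled.

```python
import math


def dilated_block_channels(m0: int, k: int) -> int:
    """Input maps plus k maps from each of FDConv, TDConv, and the 3 x 3 conv."""
    return m0 + 3 * k


def dilated_dense_block_channels(m0: int, k: int, L: int) -> int:
    """Dilated block output plus k maps from each of the L composite functions."""
    return m0 + (3 + L) * k


def compressed_channels(channels: int, theta: float) -> int:
    """Apply the compression rate theta, with 0 < theta <= 1."""
    if not 0 < theta <= 1:
        raise ValueError("theta must satisfy 0 < theta <= 1")
    return math.floor(channels * theta)  # rounding down is an assumption


def lstm_block_channels(channels: int) -> int:
    """The LSTM block concatenates one output map to its input maps."""
    return channels + 1


# Hypothetical hyper-parameters, for illustration only.
m0, k, L, theta = 32, 14, 5, 0.5
after_dilated = dilated_block_channels(m0, k)
after_ddb = dilated_dense_block_channels(m0, k, L)
after_compr = compressed_channels(after_ddb, theta)
after_lstm = lstm_block_channels(after_compr)
print(after_dilated, after_ddb, after_compr, after_lstm)
```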

Multi-Scale Multi-Band Dilated Time-Frequency DenseLSTM
The proposed architecture, MMDilDenseLSTM, has the advantage of having a wide receptive field in the time axis as well as the frequency axis of the spectrogram by combining the dilated block designed in the previous study [33] with the MMDenseLSTM introduced in [31]. In addition, the model used in our previous study, based on a CNN architecture, had limitations in utilizing time series information in spectrograms. MMDenseLSTM extracts CNN features from dense blocks and then extracts features considering time series information from LSTM blocks. We can extract a CNN feature considering a wider range by adding a dilated block to the dense block, and then the LSTM block takes over the influence by using the CNN feature as the input. It is an architecture in which the spectrogram is divided into three bands and MDilDenseLSTM is arranged in parallel in each band and the full band, as shown in Figure 4. Since we used the audio sampled at 16 kHz as our dataset, we used a spectrogram with a frequency range of 8 kHz as input. Therefore, the frequency band was divided by the boundary of 2 kHz and 4 kHz, and the ratio was equal to that of MMDenseLSTM [31]. Moreover, 0~2 kHz is a low band, which is a frequency band where the speech signals mainly exist, 2~4 kHz is called the middle band, and 4~8 kHz is called the high band. The output of each MDilDenseLSTM is combined into one tensor, and a mask of the same size as the input is outputted through the dilated dense block, the compression block, and the last 3 × 3 convolution.
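The band boundaries can be translated into spectrogram bin indices as in the sketch below. It assumes an FFT size of 320 points (equal to the window length reported in the experimental setup) and that the Nyquist bin is kept in the high band; neither detail is stated in the text, so the resulting indices are illustrative.

```python
def band_edges_in_bins(sample_rate_hz: int, n_fft: int, boundaries_hz):
    """Convert band boundaries in Hz to half-open bin ranges [start, stop).

    Assumes a one-sided spectrum with n_fft // 2 + 1 bins.
    """
    bin_width_hz = sample_rate_hz / n_fft
    n_bins = n_fft // 2 + 1
    edges = [0] + [round(b / bin_width_hz) for b in boundaries_hz] + [n_bins]
    return list(zip(edges[:-1], edges[1:]))


# 16 kHz audio, assumed 320-point FFT, boundaries at 2 kHz and 4 kHz.
low, middle, high = band_edges_in_bins(16000, 320, [2000, 4000])
print(low, middle, high)
```

Under these assumptions, the low and middle bands each cover 40 bins and the high band covers the remaining 81, while the full band spans all 161 bins.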

For each MDilDenseLSTM, the growth rate and the number of composite functions in the dilated dense block, and the number of LSTM blocks were applied differently.
Table 1 shows the detailed architecture of MMDilDenseLSTM. The growth rate, the number of composite functions, and the scale of each band, which are hyper-parameters of MMDilDenseLSTM, follow those of MMDenseLSTM [31]. Since the low frequency band is important due to the characteristics of the acoustic signal, the growth rate k and the number L of composite functions were set larger in the low band and the full band, which include the low frequency band, than in the other bands. In addition, in the low and full bands, the LSTM block was applied before the up-sampling that produces scale 1 and before the up-sampling located at the smallest scale. However, in the high and middle bands, the LSTM block was applied only before the up-sampling located at the smallest scale. In the full band, since the input size is more than twice that of the other bands, the receptive field must be large; therefore, the scale is lower than that of the other bands. To combine the output feature maps along the frequency axis from each MDilDenseLSTM except for the full band, the number of output feature maps must be the same; therefore, it is equalized by adjusting the compression rate applied to each band.
Table 1. Details of the proposed architecture (scale, the number of output feature maps). If the bidirectional long short-term memory (BLSTM) layer was applied, the number of nodes is indicated in parentheses. The "L" of the full band is shown in Figure 2. The lack of values indicates no architecture. Therefore, DDB and CP4 of the low band are connected to US3 and CC3.
To learn the proposed architecture, the following loss function was used.
$$\mathcal{L} = \left\| X \odot \hat{M} - Y \right\|_1$$
where X is the magnitude spectrogram of the input data, Y is the magnitude spectrogram of the reference data, ⊙ is element-wise multiplication, M̂ is a mask estimated by the neural network, and ‖·‖_1 is the 1-norm. The estimated M̂ is multiplied by the input spectrogram X to estimate the clean speech data, and the difference from Y is measured by the 1-norm.
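The masked 1-norm loss described here can be written out on small nested lists as a minimal sketch. The function name and the toy values are hypothetical. The sum over all time-frequency bins follows the definition of the 1-norm; whether the training code additionally averages over bins or over the batch is not stated in the text.

```python
def masked_l1_loss(mixture, mask, reference) -> float:
    """Sum of |X * M - Y| over all time-frequency bins.

    `mixture`, `mask`, and `reference` are equally sized 2-D lists
    (frequency x time) of magnitudes or mask values.
    """
    total = 0.0
    for x_row, m_row, y_row in zip(mixture, mask, reference):
        for x, m, y in zip(x_row, m_row, y_row):
            total += abs(x * m - y)  # element-wise product, then absolute error
    return total


# Toy 2 x 2 example (hypothetical values).
X = [[2.0, 4.0], [1.0, 0.5]]
M = [[0.5, 0.25], [1.0, 0.0]]
Y = [[1.0, 1.5], [1.0, 0.25]]
loss = masked_l1_loss(X, M, Y)
print(loss)
```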

Experiments
In order to find out how well the proposed architecture separated the music and speech signals with different characteristics, we conducted experiments for music and speech, respectively. In the speech experiment, when speech, noise, and music were mixed, clean speech signals were estimated, and then a speech recognition experiment was performed using the estimated speech signals. In the music experiment, when mixed music and speech were extracted from broadcast contents of various genres, a music signal was separated out, and then a music identification experiment was performed using the separated music signal. Section 4.1 describes the speech domain experiment, and Section 4.2 describes the music domain experiment.

Dataset for Speech Experiment
For the speech enhancement experiment, we mixed the music and noise signal with the speech signal. We used 115 noises database (DB) [43], ESC-50 [44], and NOISEX-92 [45] as the noise DB, WSJ1 [46] as the speech DB, and MUSDB [47] as the music DB.
The speech, music, and noise DBs are organized as shown in Table 2. MUSDB is already divided into training, validation, and test datasets, and NOISEX-92 is composed of 6 noise types (babble, destroyer engine, destroyer operation, factory1, factory2, m109). WSJ1, a speech DB, is already divided into the training, validation, and test datasets. For the test dataset of the speech DB, the eval93 dataset, which is the "si_et_h1" folder in WSJ1, was used. Since each public noise DB is small, several public noise DBs were used. To cope with various noise types, the deep learning model used the 115 noise DB and the ESC-50 DB, which contain many noise types, for training, while the remaining NOISEX-92 DB was used for the test. Because the public noise DB and music DB have less data than the target speech DB, they are mixed repeatedly. Since each speech signal is used only once, every mixed signal is different. Ideally, the noise and music DBs would be as large as the speech DB, but it is not easy to collect as much noise and music data as speech data.
Mixed training, validation, and test datasets are created by using each training, validation, and test datasets of music, noise, and speech databases. The training data mixes the noise and music signals to have a random signal-to-noise ratio (SNR) and signal-to-music ratio (SMR) from −10 to 20 dB based on the speech signal. The validation data mixes through the same process as training. To create the test dataset for each SNR and SMR combination, the music is mixed with an SMR of −10, −5, 0, 5, 10, 20, and 30 dB, and the noise is mixed with an SNR of −10, 0, and 10 dB. As a result, the total training dataset and validation dataset were 30 h and 5 h, respectively. In addition, the test dataset for each SNR and SMR combination is 30 min.
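The mixing step can be sketched as follows: the interference is scaled so that the ratio of speech power to scaled interference power matches the target SNR or SMR. The helper names are hypothetical, and drawing the training ratios from a uniform distribution is an assumption, since the text only says "random".

```python
import math
import random


def power(signal) -> float:
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def gain_for_target_ratio(speech, interference, ratio_db: float) -> float:
    """Gain g such that 10*log10(P_speech / P_(g*interference)) equals ratio_db."""
    return math.sqrt(power(speech) / (power(interference) * 10 ** (ratio_db / 10)))


def mix(speech, noise, music, snr_db: float, smr_db: float):
    """Add noise at the given SNR and music at the given SMR to the speech."""
    g_noise = gain_for_target_ratio(speech, noise, snr_db)
    g_music = gain_for_target_ratio(speech, music, smr_db)
    return [s + g_noise * n + g_music * m for s, n, m in zip(speech, noise, music)]


# Toy signals (hypothetical values).
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
music = [0.25, -0.25, -0.25, 0.25]

# Training-style draw from the -10 to 20 dB range (uniform is an assumption).
rng = random.Random(0)
snr_db = rng.uniform(-10.0, 20.0)
smr_db = rng.uniform(-10.0, 20.0)
mixture = mix(speech, noise, music, snr_db, smr_db)
```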

Setup for Speech Experiment
We used the spectrogram magnitude of the 16 kHz single-channel audio signal as the model input. Spectrograms can be obtained through a short-time Fourier transform (STFT) with Hanning window of 320 samples and 50% overlap. The number of frames in the input spectrogram is 256. The learning rate was 0.001 and the optimizer used Adam [48]. The batch size was 16 and the number of epochs was 20. The validation dataset was tested at every epoch and the model with the best signal-to-distortion ratio (SDR) [49] was finally selected and tested. To evaluate speech enhancement performance, we performed a subjective speech quality test. In addition, we also calculated objective measures. As objective measures, SDR, which is the signal separation performance, and perceptual evaluation of speech quality (PESQ) [50], which is performance related to speech quality, and short-time objective intelligibility (STOI) [51], normalized-covariance measure (NCM) [52], coherence-based speech intelligibility index (CSII) [53], which are speech intelligibility measures, were computed. PESQ ranges from −0.5 to 4.5, STOI, NCM, and CSII range from 0 to 1, and SDR had no fixed range. Larger values of the performance evaluation indicators represented better performance. In addition, to evaluate speech recognition performance, we used the nnet3 (chain) model of the Kaldi toolkit [54]. The Kaldi speech recognition model was trained with the WSJ1's clean speech, the SI-284 training dataset. The speech recognition performance is computed by word error rate (WER), and lower WER indicates better performance.
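The STFT settings above imply the following frame geometry. This is a rough sketch that assumes no padding at the signal edges and an FFT size equal to the window length; neither detail is stated in the text.

```python
def stft_geometry(sample_rate_hz: int, window_length: int, overlap: float, num_frames: int):
    """Derive hop size, durations, and bin count from the STFT settings."""
    hop = int(window_length * (1 - overlap))
    samples_covered = (num_frames - 1) * hop + window_length  # no padding assumed
    return {
        "hop_samples": hop,
        "window_ms": 1000 * window_length / sample_rate_hz,
        "hop_ms": 1000 * hop / sample_rate_hz,
        "input_seconds": samples_covered / sample_rate_hz,
        "frequency_bins": window_length // 2 + 1,  # assumes n_fft == window_length
    }


geometry = stft_geometry(sample_rate_hz=16000, window_length=320, overlap=0.5, num_frames=256)
print(geometry)
```

Under these assumptions, each 256-frame input covers roughly 2.6 s of audio, with a 20 ms window advanced in 10 ms steps.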

Experimental Results of the Subjective Quality Measure for Speech Enhancement
We tested four models for speech enhancement: GRN [35], MMDenseLSTM [31], DilDenseNet [33], and the proposed architecture. GRN, which was proposed for speech enhancement, has been used for performance comparison in several recent studies [55,56]. We implemented all deep learning models ourselves because no open-source code was available. For DilDenseNet, which we designed in our previous work [33], we did not reverse the feature map in the multi-band block and removed the multi-band block from the decoding process for speech enhancement. In addition, the same band division at 2 kHz and 4 kHz was applied to DilDenseNet, MMDenseLSTM, and the proposed architecture. Experiments confirmed that these modifications gave better speech enhancement performance.
For the subjective listening evaluation of speech quality, we conducted a relative preference test [57] on GRN, MMDenseLSTM, DilDenseNet, and the proposed architecture. The four models yield six pairwise model combinations. Each subject was tested under a total of 18 conditions (six model combinations at three different noise and music levels). Each condition had five pairwise comparisons, so each subject performed a total of 90 pairwise comparisons. To prevent subjects from inferring information such as the condition and speech material of each sample, we presented the conditions and speech materials in random order, and each sample used in the preference test was 2 to 3 s long. In addition, each subject performed the evaluation independently so that opinions were not shared. In each pairwise comparison, the mixture and a pair of enhanced speech samples produced by the two models under comparison were presented, and the mixture was always heard first.
A total of 30 subjects participated in the subjective listening test, so 150 preference results were obtained for each condition. Listeners could either select a preferred sample (score of 1) or choose "can't decide" (score of 0.5), and the preference score and significance were calculated by combining the preference results of all listeners. We determined the preference score by averaging the preference results. In addition, we calculated the one-tailed significance of the binomial test. Table 3 shows the preference scores of the subjective speech quality listening test for the proposed architecture and the other deep learning architectures. In the subjective evaluation, the speech quality of the proposed architecture was rated better than that of DilDenseNet, MMDenseLSTM, and GRN. DilDenseNet and MMDenseLSTM were rated as having the same speech quality. GRN was rated as having the lowest speech quality among the deep learning architectures. In addition, the comparisons that included GRN showed high significance in all conditions, which indicates a clear difference in performance from the other deep learning architectures. Likewise, the "Total" results of all comparisons other than the comparison between DilDenseNet and MMDenseLSTM showed high significance and thus represent reliable results.
Table 3. Result of subjective listening test. "Hard" indicates an environment in which music and noise are mixed at 0 dB, "Medium" at 5 dB, and "Easy" at 10 dB ("n.s.": not significant, "*": p < 0.05, "**": p < 0.01, "***": p < 0.001).
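The scoring of one model pair in one condition can be sketched as below. The counts are hypothetical and are not taken from Table 3, and the treatment of "can't decide" responses in the binomial test is an assumption (here they are excluded), since the text does not specify it.

```python
from math import comb


def preference_score(n_prefer_a, n_prefer_b, n_undecided):
    """Average score for model A: 1 per preference, 0.5 per "can't decide"."""
    total = n_prefer_a + n_prefer_b + n_undecided
    return (n_prefer_a + 0.5 * n_undecided) / total


def one_tailed_binomial_p(n_prefer_a, n_prefer_b):
    """P(X >= n_prefer_a) for X ~ Binomial(n, 0.5), n = decided comparisons.

    Assumption: undecided responses are dropped before testing.
    """
    n = n_prefer_a + n_prefer_b
    return sum(comb(n, k) for k in range(n_prefer_a, n + 1)) / 2 ** n


# Hypothetical condition: 30 subjects x 5 comparisons = 150 responses.
n_a, n_b, n_undecided = 90, 40, 20
score = preference_score(n_a, n_b, n_undecided)
p_value = one_tailed_binomial_p(n_a, n_b)
```

A score above 0.5 means model A was preferred on average; the p-value indicates how unlikely that preference would be if listeners had chosen at random.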

Experimental Results of the Objective Measures for Speech Enhancement and Recognition
For speech recognition after speech enhancement, objective indicators (PESQ, SDR, STOI, NCM, CSII, and WER) were measured for GRN, MMDenseLSTM, DilDenseNet, the proposed architecture, and MMDenseLSTM+, a variant of MMDenseLSTM in which only the number of parameters was increased. Table 4 shows the number of parameters for each architecture. The proposed architecture had 50% more parameters than MMDenseLSTM because it adds dilated blocks to MMDenseLSTM. Since performance might improve simply by increasing the number of parameters of MMDenseLSTM, we designed MMDenseLSTM+ with almost the same number of parameters as the proposed architecture to check this. MMDenseLSTM+ has the same architecture as MMDenseLSTM, with the hyper-parameters (growth rate, compression rate, and number of composite functions) adjusted accordingly.
Figure 5 compares the PESQ, SDR, STOI, NCM, CSII, and WER results for each deep learning architecture and for the unprocessed mixture signal in the 0 dB SMR and SNR environment. MMDenseLSTM+ showed the best SDR among the deep learning architectures and the same or better performance than MMDenseLSTM in PESQ, STOI, and CSII, but it showed lower NCM and WER performance than MMDenseLSTM, DilDenseNet, and the proposed architecture. Therefore, simply increasing the number of parameters did not improve performance. In addition, GRN had more than twice as many parameters as the proposed architecture but showed the lowest performance among the deep learning architectures. Lastly, the proposed architecture showed lower SDR than MMDenseLSTM and MMDenseLSTM+, but it had the best performance in the other measures. In particular, the proposed architecture achieved a relative WER improvement of 14.4% over MMDenseLSTM and showed the best WER among the deep learning architectures.
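To make the WER figures concrete, the following sketch computes WER from a word-level edit distance and the relative improvement between two systems. It is a simplified stand-in for the Kaldi scoring pipeline, the function names are hypothetical, and the example sentences and WER values are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumes a non-empty reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))   # edit distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, improved_wer):
    """Relative WER reduction of the improved system over the baseline."""
    return (baseline_wer - improved_wer) / baseline_wer


# Illustrative values only: one substitution and one deletion in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gain = relative_improvement(25.0, 20.0)   # hypothetical WERs in percent
```

A relative improvement is therefore a ratio of WERs, not a difference in percentage points.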
Overall, PESQ, STOI, CSII, and NCM showed a higher correlation with the WER results than SDR did, and NCM had the highest correlation. The matched pairs sentence-segment word error test [58] using the NIST speech recognition scoring toolkit (SCTK) (https://github.com/usnistgov/SCTK) was performed to confirm the statistical significance of the WER results in Figure 5. In the comparison between the proposed architecture and DilDenseNet, the p-value was less than 0.01, and in the comparisons between the proposed architecture and the other deep learning architectures, the p-value was less than 0.001 in all cases. Therefore, in the WER results of Figure 5, the difference in performance between the proposed architecture and the other deep learning architectures was statistically significant.
Figure 6 shows the speech recognition results for the unprocessed mixture signal and the signals enhanced by the deep learning models. Figure 6a-c show the WER results at SNRs of −10, 0, and 10 dB, respectively. In Figure 6a, MMDenseLSTM+ shows the best performance. However, Figure 6b,c show that the performance of MMDenseLSTM+ deteriorated progressively as the SNR increased. GRN showed lower overall performance than the others, followed in order of increasing performance by MMDenseLSTM and DilDenseNet. The proposed architecture showed the second-best performance after MMDenseLSTM+ at −10 dB SNR, although the difference from MMDenseLSTM+ was not large, and it showed the best performance in the 0 dB and 10 dB SNR environments, as can be seen in Table 5.
Table 5. Average word error rate (WER) over the entire signal-to-music ratio (SMR) at each signal-to-noise ratio (SNR). Lower WER indicates better performance.

Figure 6. Speech recognition results using unprocessed mixture signals and mixture signals enhanced using GRN, etc. WER results according to signal-to-music ratio (SMR): (a) when noise is mixed at −10 dB signal-to-noise ratio (SNR), (b) when noise is mixed at 0 dB SNR, (c) when noise is mixed at 10 dB SNR. Lower WER indicates better performance.

Music Experiment
We evaluated the proposed architecture on the task from our previous study [33]. Figure 7 shows the overall structure of the music experiment. The music separation model separated a mixture of music and speech into a clean music signal, and music identification was then attempted using the separated music signal. Speech collected from broadcast content was mixed with music at a music-to-speech ratio (MSR) of −30 to 0 dB; the mixed signal was used as the input and the music signal as the target to train the music separation model.
Fingerprints were extracted from all the songs used for evaluation and stored in the fingerprint database. The fingerprint of the separated music signal was then extracted and matched against all the songs in the fingerprint database to find the matching music. For the fingerprint, the landmark-based audio fingerprinting method [59] was used. A landmark-based fingerprint is created by first extracting peak points from the spectrogram of the query sample, taking their density into account, and then connecting the peak points [59]. How well the peak points of the estimated music signal are preserved is therefore an important factor in music identification performance. In addition, for a deep learning model that divides the frequency band of the spectrogram, it is necessary to examine how the frequency band boundaries affect the fingerprint.
Table 6 shows the configuration of the training, validation, and test datasets for the music experiment. The music DB had 9118 popular songs from various countries and genres. The speech DB had 12 h of broadcast content of various genres (drama, entertainment, documentary, and kids). The training data were mixed at a random MSR between −30 and 0 dB, reflecting the characteristics of broadcast content, in which the speech signal is louder than the music signal. The test data were mixed at −10 dB and 0 dB MSR. There were 14,590 query samples per MSR case. When creating the training and test datasets, the speech was randomly selected and mixed with the music because there was less speech data than music data. Both the speech and music data were recorded at a sampling rate of 44,100 Hz.
We used a mono signal down-sampled to 16 kHz for the experiment. The input of the deep learning architectures was a spectrogram computed using a 1024-point STFT with a Hanning window and 75% overlap. The spectrogram estimated by the deep learning model was reconstructed as a waveform and used as the input to the identification program. For music identification, the landmark-based identification program (https://github.com/dpwe/audfprint) was used. For the Wave-U-Net architecture, we used the open-source code (https://github.com/f90/Wave-U-Net) provided by the author. We implemented the other deep learning architectures ourselves for the experiment.
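The idea behind landmark-based fingerprinting can be illustrated with a toy sketch. This is not the audfprint implementation: the peak picker is a plain 3x3 local-maximum test without density control, the target-zone sizes are arbitrary, and all function names are hypothetical.

```python
from collections import Counter

import numpy as np


def find_peaks(spec, threshold):
    """Return (time, freq) points that are local maxima in a 3x3 neighbourhood."""
    n_freq, n_time = spec.shape
    peaks = []
    for t in range(1, n_time - 1):
        for f in range(1, n_freq - 1):
            patch = spec[f - 1:f + 2, t - 1:t + 2]
            if spec[f, t] >= threshold and spec[f, t] == patch.max():
                peaks.append((t, f))
    return peaks


def landmarks(peaks, max_dt=10, max_df=20):
    """Connect each peak to later peaks in a target zone.

    Each landmark is (hash, anchor_time) with hash = (f1, f2, dt).
    """
    marks = []
    for t1, f1 in peaks:
        for t2, f2 in peaks:
            if 0 < t2 - t1 <= max_dt and abs(f2 - f1) <= max_df:
                marks.append(((f1, f2, t2 - t1), t1))
    return marks


def matched_count(query_marks, ref_marks):
    """Number of query landmarks matching the reference at the most common time offset."""
    ref_index = {}
    for h, t_ref in ref_marks:
        ref_index.setdefault(h, []).append(t_ref)
    offsets = Counter(t_ref - t_query
                      for h, t_query in query_marks
                      for t_ref in ref_index.get(h, []))
    return max(offsets.values(), default=0)


# Synthetic reference with four isolated peaks; the query starts 4 frames later.
ref_spec = np.zeros((64, 40))
for f, t in [(10, 5), (20, 8), (15, 12), (30, 20)]:
    ref_spec[f, t] = 1.0
query_spec = ref_spec[:, 4:]

ref_marks = landmarks(find_peaks(ref_spec, threshold=0.5))
query_marks = landmarks(find_peaks(query_spec, threshold=0.5))
n_matched = matched_count(query_marks, ref_marks)
```

Because each hash depends only on peak positions, any separation artifact that removes or shifts spectral peaks directly lowers the number of matched landmarks.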

Experimental Results for Music Signal Separation and Identification
To compare the objective performance of the deep learning architectures, we performed music signal separation and music identification experiments with several deep learning architectures in the same environment. The separation performance was measured by SDR and reported using median and mean statistics. The music identification performance was measured by identification accuracy. The performance of the deep learning architectures other than the proposed architecture was the same as in the previous study [33]. In addition, we analyzed the fingerprints used in the music identification system. Since a fingerprint is composed of indexes, it can be plotted over the spectrogram, which helps intuitive interpretation. Table 7 shows the separation performance. The proposed architecture improved the separation performance of MMDenseLSTM and showed the best performance among the deep learning architectures. Table 8 shows the music identification (MI) performance and the average number of matched fingerprints (MF) in the identified queries. Likewise, the proposed architecture, in which the dilated block was added to MMDenseLSTM, improved the identification performance of MMDenseLSTM by a relative 16.9% at 0 dB MSR and 10.5% at −10 dB MSR, and it showed the best performance among the existing deep learning architectures. The statistical significance of the MI results in Table 8 was tested using SCTK. The proposed architecture showed a p-value of less than 0.001 in all comparisons with the other deep learning architectures. Thus, in both the 0 dB and −10 dB MSR environments, the difference in performance between the proposed architecture and the other deep learning architectures was statistically significant. Apart from the identification performance, MF showed how well the peaks were estimated. Overall, MF was correlated with the identification performance. However, at 0 dB MSR, the MF for the unprocessed signal was quite high considering its identification performance. This showed that the landmark-based fingerprinting scheme was robust to noise at 0 dB MSR. The proposed architecture showed the largest MF, except for Oracle, at both 0 dB and −10 dB MSR.
Figure 8 shows the fingerprints of the music signals separated by MMDenseLSTM and by the proposed architecture. In both results, the proposed architecture had a larger number of matched fingerprints than MMDenseLSTM. This means that the peak points of the estimated music signal were better preserved by the proposed architecture than by MMDenseLSTM. In Result 1, the matched fingerprints in the low band were the same for MMDenseLSTM and the proposed architecture, and more fingerprints were matched in the middle and high bands for the proposed architecture. In Result 2, the proposed architecture showed more matched fingerprints over wider areas of the middle and high bands than MMDenseLSTM.
In the previous study, we found that distortion occurs at the frequency band boundaries of the output spectrogram when a deep learning architecture is designed by independently placing excessive parameters in each frequency band of the input spectrogram [33]. We investigated whether such distortion interferes with fingerprint extraction and how much distortion occurs in each deep learning architecture. To quantify the effect of this distortion, the average number of all fingerprints across the frequency boundaries (AcrAF) and the average number of matched fingerprints across the frequency boundaries (AcrMF) were measured, as shown in Table 9. In our experiment, the deep learning architectures in which many parameters were placed independently in each frequency band were MMDenseNet, MMDenseLSTM, and the proposed architecture, and the frequency band boundaries were 2 and 4 kHz. MMDenseNet had the smallest values for both indicators, which could explain why MMDenseNet showed higher separation performance than Wave-U-Net and MDenseNet but lower music identification performance. For identification performance, AcrMF, which counts matched fingerprints, was more important than AcrAF, which counts all fingerprints. The proposed architecture had a higher AcrMF value than the other deep learning architectures.
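One way to count fingerprints across the band boundaries, as in the AcrAF and AcrMF indicators, is sketched below. The criterion is an assumption: a landmark is counted as crossing a boundary when its two peaks fall in different bands of the 2 kHz and 4 kHz split. The bin-to-frequency conversion assumes the 16 kHz, 1024-point STFT setup, and the landmark list is made up.

```python
SAMPLE_RATE = 16000
N_FFT = 1024
BOUNDARIES_HZ = (2000.0, 4000.0)


def bin_to_hz(freq_bin):
    """Centre frequency of an STFT bin (15.625 Hz per bin in this setup)."""
    return freq_bin * SAMPLE_RATE / N_FFT


def band_index(freq_hz):
    """0 for below 2 kHz, 1 for 2 to 4 kHz, 2 for 4 kHz and above."""
    return sum(freq_hz >= boundary for boundary in BOUNDARIES_HZ)


def crosses_boundary(f1_bin, f2_bin):
    """True if the two peaks of a landmark lie in different bands."""
    return band_index(bin_to_hz(f1_bin)) != band_index(bin_to_hz(f2_bin))


def count_crossing(landmark_bins):
    """Count landmarks, given as (f1_bin, f2_bin) pairs, that straddle a boundary."""
    return sum(crosses_boundary(f1, f2) for f1, f2 in landmark_bins)


# Hypothetical landmarks: bin 128 is 2 kHz and bin 256 is 4 kHz.
example = [(120, 130), (100, 110), (250, 260), (300, 310)]
n_crossing = count_crossing(example)
```

Applying `count_crossing` to all fingerprints of a query would give an AcrAF-style count, and applying it only to the matched fingerprints would give an AcrMF-style count, with both averaged over queries.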

Conclusions
We proposed an MMDilDenseLSTM for speech recognition or music identification after audio source separation. MMDilDenseLSTM is a CRNN-based MMDenseLSTM and it has a dilated block that effectively increases the receptive field in consideration of acoustic characteristics in the spectrogram.
In the speech recognition experiment after speech enhancement, subjective evaluation was performed on the enhanced speech, and various objective indicators (PESQ, SDR, STOI, CSII, NCM, and WER) were measured. The encoder-decoder style architectures (MMDenseLSTM, DilDenseNet, and the proposed architecture) showed better performance with fewer parameters than GRN. In addition, it was confirmed that simply increasing the number of parameters did not improve performance. The WER had a higher correlation with NCM than with the other indicators, and the proposed architecture showed the best performance among the deep learning architectures in the preference score, PESQ, STOI, CSII, NCM, and WER.
In the music identification experiment after music signal separation, the separation performance was measured using SDR. Although SDR and identification accuracy were not correlated across all deep learning architectures, the proposed architecture showed the best performance among the deep learning architectures in both SDR and identification accuracy. In addition, when the fingerprints of the query and reference were plotted together, it was confirmed that more fingerprints were matched over wider areas for the proposed architecture than for MMDenseLSTM.
In conclusion, the proposed architecture greatly improved the performance of MMDenseLSTM in all experiments for speech and music signals, and it showed the best performance among the deep learning architectures. In addition, it was shown that the separation performance was not well correlated with the overall system performance. When determining the architecture of the separation model, the characteristics of the system to which the separated signal will be applied should be taken into consideration. Based on these results, we expect that the proposed architecture can be successfully applied to music identification and speech recognition systems in noisy environments, e.g., speech recognition in cars or automatic music identification in stores. However, deep learning models operating in the spectrum domain, such as our proposed method, have a clear limitation because the phase information of the mixture causes distortion in the signal restoration process. In future work, we plan to alleviate this problem by taking a model operating in the waveform domain as the baseline.