Unsupervised Single-Channel Singing Voice Separation with Weighted Robust Principal Component Analysis Based on Gammatone Auditory Filterbank and Vocal Activity Detection

Singing voice separation is the task of separating a singing voice from its musical accompaniment. In this paper, we propose a novel unsupervised methodology for extracting a singing voice from the background in a musical mixture. The method is a modification of robust principal component analysis (RPCA) that separates the singing voice by using weighting based on a gammatone filterbank and vocal activity detection. Although RPCA is a helpful method for separating voices from a music mixture, it fails when one singular value, such as the one produced by drums, is much larger than the others (e.g., those of the accompanying instruments). The proposed approach therefore assigns different weighted values to the low-rank (background) and sparse (singing voice) matrices. Additionally, we propose an extended RPCA on the cochleagram that utilizes coalescent masking on the gammatone representation. Finally, we utilize vocal activity detection to enhance the separation outcomes by eliminating the lingering music signal. Evaluation results reveal that the proposed approach provides superior separation outcomes to RPCA on the ccMixter and DSD100 datasets.


Introduction
Singing voice separation (SVS) has drawn a great deal of interest in many downstream applications [1][2][3][4]. It deals with the technique of separating a singing voice or background from a music mixture, which is a crucial step for singer identification [5,6], music information retrieval [7,8], lyric recognition and alignment [9][10][11][12], song language identification [13,14], and chord recognition [15][16][17]. Recent separation techniques, however, fall well short of the capabilities of human hearing. Solving SVS remains challenging because of the instruments involved and the spectral overlap between the voice and background music [11,[18][19][20][21]. In daily life, human listeners generally have the remarkable ability to distinguish sound streams from a mixture of sounds, but this remains a difficult task for machines, particularly in the monaural case, which lacks the spatial cues that can be exploited when two or more microphones are used. Additionally, techniques for speech separation do not directly translate to singing separation. Speaking and singing voices have many similarities but also differ in important ways, and these distinctions pose significant challenges for separation. The nature of the other accompanying sounds is the key distinction between singing and speech in terms of their independence from a background. The background noise that mingles with speech may be harmonic or nonharmonic, narrowband or broadband, and is often unrelated to the speech. The musical accompaniment of a song, by contrast, is typically harmonic, wideband, and closely associated with the singing voice. Moreover, although RPCA has been effectively applied to SVS, it fails when one singular value, such as the one produced by drums, is significantly greater than the others, which lowers the separation results, particularly when drums are included in the mixed music signal.
Although all of these approaches can produce effective separation results, they ignore the characteristics of the human auditory system, which are crucial for enhancing the quality of the separation outcomes. To overcome this problem, in previous studies we proposed a novel unsupervised approach that extends RPCA with a rank-1 constraint for the SVS task [37]. A recent study found that the cochleagram, a time-frequency (T-F) representation built on gammatone filters, is more effective for audio signal separation than the spectrogram [38][39][40]. In the cochleagram representation, a gammatone filterbank simulates the frequency selectivity of the human cochlea. Yuan et al. [38] proposed a data augmentation method using chromagram-based and pitch-aware methods for SVS. Chromagrams, or chroma-based features, are a popular and effective tool for synchronizing and aligning music, and they are closely connected to the twelve distinct pitch classes. The fundamental concept is to aggregate each pitch class across octaves for a specific local time window, producing a 1-D vector that represents how the harmonic content within the timeframe is distributed over the 12 chroma bands. As the time frame is moved across the song, a 2-D time-chroma representation is produced. Because of its great resilience to timbral fluctuations and tight relationship to musical harmony, the chromagram correlation across song sections can serve as a metric of song similarity. Gao et al. [39] proposed an optimized nonnegative matrix factorization (NMF) method for SVS, with a cost function created specifically for the factorization of nonstationary signals with temporally dependent frequency patterns. Moreover, He et al.
[40] suggested a method that overcomes the previously mentioned drawbacks of two-dimensional sparse nonnegative matrix factorization (SNMF). The suggested model allows for many spectral and temporal changes that are not inherent in the NMF and SNMF models, enabling an overcomplete representation. To provide distinctive and accurate representations of nonstationary audio signals, sparsity must be imposed.
Additionally, the singing voice behaves rather differently from the background music on the cochleagram. For a singing voice, the spectral energy concentrates in a small number of time-frequency units, so we may presume that it is sparse [41]. The musical accompaniment in the cochleagram, in contrast, exhibits comparable patterns and structures that can be represented by basis spectral vectors. An example of a blind monaural SVS system is described in Figure 1. The underlying low-rank and sparsity hypotheses, however, do not always hold. Both the decomposed low-rank matrix and the decomposed sparse matrix may include vocal sounds in addition to instrumental sounds (such as percussion): some background music is still audible when listening to the separated singing voice, and, similarly, a portion of the singing voice is mistakenly categorized as background music. To improve the separation accuracy, additional techniques must be used to refine the RPCA output. Therefore, in this paper, to address the existing problems of RPCA for SVS, we provide a varying-values approach to characterize the low-rank and sparse matrices. This method is referred to as weighted RPCA (WRPCA) [41], and it assigns different weighted values to the separated low-rank and sparse matrices. Meanwhile, as the first step of WRPCA, we simulate the human auditory system by using the gammatone filterbank. To further remove the nonseparated background music, we combine harmonic masking and T-F masking [42][43][44]. The time-frequency (T-F, or spectrogram) representation of the mixed signal, approximated from the waveform with the short-time Fourier transform (STFT), has been employed in the majority of prior speech separation techniques. The goal of speech separation in the T-F domain is to approximate the clean spectrograms of the separate sources from the mixed spectrogram.
In this procedure, nonlinear regression techniques may be used to directly approximate each source's spectrogram from the mixture. Finally, we utilize vocal activity detection (VAD) [45][46][47] to remove any remaining background music. In summary, the key contributions of this work are outlined below.

•	We offer WRPCA, an extension of RPCA that uses different weighted values to achieve improved separation performance.
•	We combine a gammatone auditory filterbank with vocal activity detection for SVS. Gammatone filterbanks are designed to imitate the human auditory system.
•	We build coalescent masking by fusing harmonic masking and T-F masking, which removes nonseparated background music. Additionally, we restrict the temporal segments that can include the singing voice by using VAD.
•	Extensive monaural SVS experiments reveal that the proposed approach achieves greater separation performance than the RPCA method.
The remainder of this paper is arranged as follows. Section 2 reviews RPCA and its use for SVS tasks. The proposed WRPCA on the cochleagram with VAD is illustrated in Section 3. The proposed approach is assessed on two different datasets in Section 4. Finally, Section 5 draws conclusions and outlines ideas for further study.

Related Work
This section discusses the RPCA and its application in SVS.

Overview of RPCA
RPCA was first introduced by Candès et al. [48] to divide a matrix M ∈ R^{m×n} into a low-rank matrix L ∈ R^{m×n} plus a sparse matrix S ∈ R^{m×n}. The optimization model is defined as

min_{L,S} ‖L‖_* + λ‖S‖_1, s.t. M = L + S,

where ‖L‖_* stands for the sum of the singular values of L (the nuclear norm) and ‖S‖_1 is the sum of the absolute values of the entries of S. Following the previous study, we set λ = 1/√max(m, n). The convex program can be solved by the accelerated proximal gradient (APG) or augmented Lagrange multiplier (ALM) [49] algorithms. In our work, a baseline experiment was conducted by using an inexact version of ALM.
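To make the alternating updates concrete, the following is a minimal NumPy sketch of RPCA via the inexact ALM: singular value thresholding for the low-rank part alternates with elementwise soft thresholding for the sparse part. The function name, default parameters, and stopping rule follow the common inexact-ALM recipe and are our own choices, not the exact implementation used in this work.

```python
import numpy as np

def rpca_ialm(M, lam=None, tol=1e-7, max_iter=500):
    """Decompose M into low-rank L and sparse S with inexact-ALM RPCA."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    norm_fro = np.linalg.norm(M)
    mu = 1.25 / np.linalg.norm(M, 2)          # spectral norm of M
    mu_bar, rho = mu * 1e7, 1.5
    J = M / max(np.linalg.norm(M, 2), np.abs(M).max() / lam)
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # L-update: singular value thresholding with threshold 1/mu
        U, sig, Vt = np.linalg.svd(M - S + J / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # S-update: elementwise soft thresholding with threshold lam/mu
        T = M - L + J / mu
        S = np.sign(T) * np.maximum(np.abs(T) - lam / mu, 0.0)
        # Dual update on the constraint M = L + S
        J = J + mu * (M - L - S)
        mu = min(mu * rho, mu_bar)
        if np.linalg.norm(M - L - S) / norm_fro < tol:
            break
    return L, S
```

In the SVS setting, M is the magnitude spectrogram (or cochleagram) of the mixture; L is taken as the accompaniment and S as the singing voice.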

RPCA for SVS
Music is typically made up of a variety of blended sounds, including both the human singing voice and the background music. The conventional RPCA approach can solve the SVS task [34]. The magnitude spectrogram of a song can be decomposed by RPCA and thought of as the superposition of a sparse matrix and a low-rank matrix, where the sparse matrix corresponds to the singing voice and the low-rank matrix to the background music. In light of these assumptions, RPCA may be used to solve the singing/accompaniment separation problem.
Because musical instruments can replicate the same sounds over time, the spectrogram of the accompaniment is conceptualized with a low-rank structure. The harmonic structure of the singing voice part, by contrast, varies widely and has a sparse distribution, producing a spectrogram with a sparse matrix structure. Figure 2 shows the separation process of SVS with the low-rank and sparse model. Music is a low-rank signal because musical instruments can recreate the same sounds each time a piece is performed, and music generally has an underlying recurring melodic pattern. Voices, on the contrary, are relatively sparse in the time and frequency domains but have greater diversity (higher rank). The singing voice can thus be seen as an element of the sparse matrix. With RPCA, we anticipate that the sparse matrix S will contain the voice signal and the low-rank matrix L will include the backing music. Consequently, in this study, we may divide an input matrix into low-rank and sparse matrices by using the RPCA approach. Nevertheless, RPCA makes strong assumptions. Drums, for instance, may not be low rank but rather lie in the sparse subspace, which lowers the results, especially when drums are included in the mixed music.

Proposed Method
This section first presents the WRPCA approach. Then, the gammatone filterbank and vocal activity detection are introduced as postprocessing for SVS. Finally, we provide the architecture of the proposed SVS approach.

Overview of WRPCA
WRPCA is an extension of RPCA that assigns different scale values to the sparse and low-rank matrices. The corresponding model can be defined as

min_{L,S} ‖L‖_{w,*} + λ‖S‖_1, s.t. M = L + S,

where ‖L‖_{w,*} is the weighted nuclear norm of the low-rank matrix L and S is the sparse matrix. M ∈ R^{m×n} is made up of L ∈ R^{m×n} and S ∈ R^{m×n}, and the parameter λ = 1/√max(m, n) is used [48]. The weighted nuclear norm ‖L‖_{w,*} is defined as

‖L‖_{w,*} = Σ_i w_i σ_i(L),

where w_i denotes the weight assigned to the i-th singular value σ_i(L).
In this paper, we also adopted an efficient, inexact version of the augmented Lagrange multiplier (ALM) method [49] to solve this convex model. The corresponding augmented Lagrange function is defined as

L(L, S, J, µ) = ‖L‖_{w,*} + λ‖S‖_1 + ⟨J, M − L − S⟩ + (µ/2)‖M − L − S‖_F²,

where J is the Lagrange multiplier, ⟨·, ·⟩ denotes the matrix inner product, and µ is a positive scalar. The process corresponding to the separation of the mixed music signal is given in Algorithm 1 (WRPCA for SVS). The value of M is the mixed music signal from the observed data. After separation by WRPCA, we obtain a sparse matrix S (singing voice) and a low-rank matrix L (music accompaniment).
(Algorithm 1 iterates these updates until convergence. Output: S ∈ R^{m×n}, L ∈ R^{m×n}.)
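The only change WRPCA introduces inside this ALM loop is the L-update: instead of shrinking every singular value by the same threshold 1/µ, each σ_i is shrunk by its own weight w_i/µ. A minimal sketch of this weighted singular value thresholding step (a hypothetical helper, not the paper's exact code):

```python
import numpy as np

def weighted_svt(X, weights, mu):
    """Weighted singular value thresholding: shrink sigma_i by weights[i]/mu.

    With all weights equal to 1 this reduces to the plain SVT step of RPCA.
    """
    U, sig, Vt = np.linalg.svd(X, full_matrices=False)
    sig = np.maximum(sig - np.asarray(weights) / mu, 0.0)
    return (U * sig) @ Vt
```

Substituting this step for the uniform thresholding in an RPCA solver yields a WRPCA solver, with the weights held fixed (or refreshed once per iteration) as in [50].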

Weighted Values
The standard nuclear norm minimization regularizes each singular value equally to pursue the convexity of the objective function. However, the RPCA method simply ignores the differences between the scales of the sparse and low-rank matrices. In order to solve this problem, and inspired by the success of weighted nuclear norm minimization [50], we adopted different weighted value strategies to trim the low-rank matrix during the SVS processing. This enables the features of the separated matrices to be better represented.
Following [50], the weights can be set inversely proportional to the singular values,

w_i = C / (δ_i(M) + ε),

where C is a positive constant, ε is a small positive value that avoids division by zero, and δ_i(M) represents the i-th singular value of M.
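For illustration, the WNNM-style inverse-magnitude rule can be sketched as below; the constants C and ε are hypothetical tuning parameters of our own choosing. The effect is that a dominant component such as a loud drum hit is shrunk less aggressively than under the uniform nuclear norm.

```python
import numpy as np

def singular_value_weights(M, C=1.0, eps=1e-6):
    """Inverse-magnitude weights: w_i = C / (delta_i(M) + eps).

    Large singular values receive small weights, so the weighted nuclear
    norm penalizes dominant components less than the uniform nuclear norm.
    """
    sig = np.linalg.svd(M, compute_uv=False)   # returned in descending order
    return C / (sig + eps)
```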

Gammatone Filterbank
The information obtained from primary auditory fibers is characterized by the gammatone function [52], which models physiological impulse-response data gathered from primary auditory fibers in the cat. Gammatone filterbanks were designed to model the human auditory system: the modeling process mimics the organization of the peripheral sound processing stage by using a physiologically based strategy [53]. In our work, we first pass the mixed music signal through the gammatone filterbank. The impulse response function is defined as

g(t) = A t^{N−1} e^{−2πbt} cos(2π f_c t + ϕ), t ≥ 0,

where A is an arbitrary amplitude factor, N is the filter order, b is the decay parameter that determines both the duration of the impulse response and the filter's bandwidth, f_c is the center frequency, and ϕ is the phase of the tone.
In the human auditory system, there are around 3000 inner hair cells along the 35-mm spiral path of the cochlea. Each hair cell resonates at a certain frequency within a suitable critical bandwidth, which means that there are approximately 3000 bandpass filters in the human auditory system. This high resolution can be approximated by specifying a certain overlap between contiguous filters. The impulse response of each filter follows the gammatone function shape, and the bandwidth of each filter is determined according to the auditory critical band, i.e., the bandwidth of the human auditory filter at different characteristic frequencies along the cochlear path [54]. Figure 3 depicts the gammatone filterbank: a speech signal (left panel) is passed through a bank of 16 gammatone filters spaced between 80 Hz and 8000 Hz, and the output of each individual filter is shown in the right panel.
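The impulse response above is straightforward to generate. The sketch below sets the decay parameter b from the Glasberg-Moore equivalent rectangular bandwidth (ERB), a conventional choice for gammatone filterbanks that we assume here rather than a detail stated in the paper:

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.05, order=4, phase=0.0):
    """Gammatone impulse response g(t) = A t^(N-1) e^(-2 pi b t) cos(2 pi fc t + phi)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # Glasberg-Moore ERB in Hz
    b = 1.019 * erb                           # bandwidth/decay parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.max(np.abs(g))              # normalize peak amplitude to 1
```

A multichannel filterbank, such as the 128-channel one used later in the experiments, is obtained by spacing the f_c values on the ERB scale and convolving the input signal with each impulse response.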

T-F Masking
After obtaining the sparse matrix S and the low-rank matrix L by using WRPCA, we apply T-F masking to further improve the separation performance. The ideal binary mask (IBM) and ideal ratio mask (IRM) are defined as

IBM_{i,j} = 1 if |S_{i,j}| > |L_{i,j}|, and 0 otherwise,

and

IRM_{i,j} = |S_{i,j}| / (|S_{i,j}| + |L_{i,j}|),

where S_{i,j} and L_{i,j} denote the complex spectral values of the singing voice and accompaniment, respectively.
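Both masks follow directly from the decomposed matrices. A minimal sketch operating on magnitude matrices (the small eps guarding against all-zero T-F units is our addition):

```python
import numpy as np

def ideal_binary_mask(S_mag, L_mag):
    """IBM: 1 where the voice magnitude dominates the accompaniment."""
    return (S_mag > L_mag).astype(float)

def ideal_ratio_mask(S_mag, L_mag, eps=1e-12):
    """IRM: soft fraction of voice energy in each T-F unit."""
    return S_mag / (S_mag + L_mag + eps)
```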

F0 Estimation
In this work, we use F0 estimation to enhance the effectiveness of the separation results. Because F0 varies over time and is a property of the parts played by the singing voice and the background accompaniment, it can greatly improve separation quality by removing the spectral components of nonrepeating instruments (e.g., bass and guitar). The salience function is defined as

P_s(t, s) = Σ_{n=1}^{N} h_n P(t, s + 1200 log_2(n)),

where t is the sequence index, s is the logarithmic frequency (in cents), N is the number of harmonic components, and h_n is the decaying factor of the n-th harmonic.
The melody contour C can be calculated by maximizing

Σ_t (log a_t H(t, s_t) + log T(s_t, s_{t+1})),

where a_t is the normalization factor that brings the salience values to a sum of 1, and T(s_t, s_{t+1}) is a transition probability that denotes the likelihood of the current F0 moving to the next F0 in the following sequence. The optimal melody contour C is obtained by utilizing the Viterbi search approach.
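The salience computation above can be sketched as follows, assuming the log-frequency axis s is sampled at one bin per cent so that the n-th harmonic sits 1200·log2(n) bins above the fundamental; the decay h_n = 0.8^(n−1) is an illustrative choice, not the paper's value:

```python
import numpy as np

def salience(P, t, s, n_harmonics=5, decay=0.8):
    """Harmonic-sum salience at frame t and log-frequency bin s (in cents)."""
    total = 0.0
    for n in range(1, n_harmonics + 1):
        shift = int(round(1200 * np.log2(n)))  # n-th harmonic offset in cents
        if s + shift < P.shape[1]:
            total += decay ** (n - 1) * P[t, s + shift]
    return total
```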

Harmonic Masking
Following our prior study [55], the harmonic mask is defined as

M_h(t, f) = 1 if |f − n F_t| ≤ w/2 for some harmonic index n, and 0 otherwise,

where w is the frequency width used to extract the energy surrounding each harmonic, n is the harmonic's index, and F_t is the vocal F0 at sequence t.

Coalescent Masking
We are interested in constructing coalescent masking by combining the harmonic mask M_h and the IBM. The corresponding formulation is

M_c = IBM ⊗ M_h,

where ⊗ is the elementwise multiplication operator, and IBM and M_h denote the time-frequency mask and the harmonic mask, respectively.
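The two masks combine elementwise. The sketch below pairs a simple harmonic mask (keeping T-F units within a fixed tolerance of a multiple of the frame's F0; the exact mask shape in [55] may differ) with the elementwise product that defines M_c:

```python
import numpy as np

def harmonic_mask(freqs, f0_track, n_harmonics=10, width=40.0):
    """M_h: 1 for T-F units within `width` Hz of a harmonic n * F0(t)."""
    mask = np.zeros((len(f0_track), len(freqs)))
    for t, f0 in enumerate(f0_track):
        if f0 <= 0:                       # unvoiced frame: mask stays 0
            continue
        for n in range(1, n_harmonics + 1):
            mask[t, np.abs(freqs - n * f0) <= width] = 1.0
    return mask

def coalescent_mask(ibm, m_h):
    """M_c = IBM (elementwise product) M_h."""
    return ibm * m_h
```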

Vocal Activity Detection
To remove the residual music signal and restrict the assignment of voice and accompaniment, we use a VAD approach. The output s_o can be described as

s_o = s_v if Ω ≥ k, and s_o = s_a otherwise,

where s_v is the state of the singing voice, s_a is the state of the background music, and k is a threshold. Following the vocal F0 estimation method [56], the function Ω is defined as

Ω(t) = Σ_f H_f P(t, s),

where P(t, s) denotes the power value and H_f is the sum of harmonics for each frequency.
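Conceptually, the VAD acts as a frame-level gate: frames whose harmonic measure Ω falls below the threshold k are treated as accompaniment, and their separated-voice content is returned to the background. A minimal sketch, where the routing scheme is our illustrative reading of the rule above:

```python
import numpy as np

def apply_vad(voice, accomp, omega, k):
    """Gate separated voice frames by the harmonic measure omega.

    voice, accomp: (frames, bins) magnitude matrices from the separator.
    omega: per-frame harmonic measure; k: decision threshold.
    """
    active = np.asarray(omega) >= k            # True where voice is present
    out_voice = voice * active[:, None]
    out_accomp = accomp + voice * (~active)[:, None]
    return out_voice, out_accomp
```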
The architecture of our proposed method for the blind monaural SVS system is illustrated in Figure 4. We first apply a gammatone filterbank to the test dataset's mixed music signal to obtain the cochleagram, and then utilize the proposed WRPCA approach to separate it into L and S. By combining T-F masking and harmonic masking, we create coalescent masking to eliminate nonseparated music, and VAD is used to further enhance separation performance. Finally, we synthesize the voice and background music; the separated signals can be synthesized following Wang et al. [57]. In this work, we randomly selected a 30-s audio sample from ccMixter. The spectrograms of the singing voice and background portions isolated from the mixed musical signal are illustrated in Figure 5 and contrasted with the originals for the different separation approaches. The original clean singing voice and music spectrograms are shown in (a) and (b), whereas (c) and (d) exhibit the signals separated by RPCA. Panels (e) and (f) show the signals separated by WRPCA in Proposed 1, and, similarly, (g) and (h) show the separation results of Proposed 2.
Figure 5. Example spectrograms from the ccMixter dataset. The left four spectrograms show the singing voice, whereas the right four show the corresponding musical accompaniment.
From the abovementioned spectrograms, we can see that Figure 5c retains the strongest residual background music (accompaniment), whereas Figure 5g retains the least. In other words, the latter is preferable to the former for SVS.

Experimental Evaluation
This section discusses the two SVS experiments, conducted on the ccMixter [58] and DSD100 [59] datasets, respectively, and presents a comparison and analysis of the experimental results.

Datasets
The first was ccMixter, from which we selected 43 full stereo tracks and used the same 30-s excerpt (from 30 s to 1 min) of each track, a segment in which each piece contains the singing voice. Each mixture song is made up of three components: the voice, the background, and their combination.
The second was the DSD100 dataset, which consists of 36 development tracks and 46 test tracks. To reduce dimensionality and speed up processing, we likewise used 30-s fragments (from 1 min 45 s to 2 min 15 s) for all data.

Settings
In our work, we focused on single-channel source separation, which is more difficult than multichannel source separation because less information is available. The two-channel stereo mixtures in the datasets were downmixed to mono by averaging the two channels.
To evaluate our proposed approach, the spectrogram was computed by a 1024-point STFT with a hop size of 256 samples. The experimental data were sampled at 44.1 kHz and converted to mono. For the cochleagram analysis, we used 128 channels with center frequencies ranging from 40 to 11,025 Hz and a frame length of 256 samples.
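The preprocessing pipeline with these settings can be sketched as follows; the helper name is ours, and scipy's stft stands in for whatever STFT routine the authors used:

```python
import numpy as np
from scipy.signal import stft

def preprocess(stereo, fs=44100, n_fft=1024, hop=256):
    """Downmix stereo to mono by channel averaging, then magnitude STFT
    with a 1024-point window and 256-sample hop, as in the settings above."""
    mono = stereo.mean(axis=1) if stereo.ndim == 2 else stereo
    _, _, Z = stft(mono, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(Z)                 # (n_fft // 2 + 1) frequency bins
```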
To confirm the effectiveness of our proposed algorithm, we assessed its separation quality in terms of the source-to-distortion ratio (SDR) and the source-to-artifact ratio (SAR) using the BSS-EVAL 3.0 metrics [60,61], as well as the normalized SDR (NSDR). The estimated signal Ŝ(t) is decomposed as

Ŝ(t) = S_target(t) + S_interf(t) + S_artif(t),

where S_target(t) is the target audio subject to permissible distortion, S_interf(t) accounts for disturbances from unwanted sources, and S_artif(t) is a potential artifact introduced by the separation technique. The metrics are then defined as

SDR = 10 log_10 ( ‖S_target‖² / ‖S_interf + S_artif‖² ),
SAR = 10 log_10 ( ‖S_target + S_interf‖² / ‖S_artif‖² ),
NSDR(v̂, v, x) = SDR(v̂, v) − SDR(x, v),

where v̂ is the estimated signal, v stands for the isolated reference, and x for the mixed music. The NSDR measures the overall increase in SDR from x to v̂. All measurement units are dB.
Higher values of the SDR, SAR, and NSDR indicate better source separation performance. The SDR represents the quality of the separated target sound signals, and the SAR represents the absence of artificial distortion. All metrics are expressed in dB.
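For intuition, a simplified single-reference version of these metrics projects the estimate onto the reference to obtain S_target and treats the residual as combined interference plus artifact. Full BSS-EVAL additionally separates interference from artifact using all source signals; this sketch does not.

```python
import numpy as np

def sdr(est, ref):
    """Simplified SDR: energy of the projection onto the reference vs. residual."""
    s_target = (est @ ref) / (ref @ ref) * ref
    residual = est - s_target
    return 10 * np.log10((s_target @ s_target) / (residual @ residual))

def nsdr(est, ref, mix):
    """NSDR: SDR improvement from the raw mixture to the estimate."""
    return sdr(est, ref) - sdr(mix, ref)
```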

Experiment Results
Based on the proposed WRPCA, we present the following two approaches, referred to as Proposed 1 and Proposed 2. More specifically, Proposed 1 utilizes WRPCA with T-F masking, whereas Proposed 2 adopts WRPCA with coalescent masking. Both use the VAD technique:
• Proposed 1: WRPCA with T-F masking;
• Proposed 2: WRPCA with coalescent masking.
We first evaluated them on ccMixter. The comparative outcomes for RPCA, MLRR, WRPCA, CRPCA, RPCA using IRM, WRPCA using IRM, CRPCA using IRM, RPCA using IBM, WRPCA using IBM, CRPCA using IBM, Proposed 1, and Proposed 2 are shown in Figure 6. To completely confirm the efficacy of our proposed methodology, we designed multiple comparative experiments: RPCA, MLRR, WRPCA, CRPCA, RPCA using IRM, RPCA using IBM, CRPCA using IRM, and CRPCA using IBM were evaluated on the spectrogram, whereas WRPCA using IRM, WRPCA using IBM, Proposed 1, and Proposed 2 were evaluated on the cochleagram. The SDR and SAR results show that WRPCA performs better, especially with VAD on the cochleagram, whereas the standard RPCA performed less well than the others. Additionally, we assessed WRPCA on the DSD100 dataset. The comparative outcomes for RPCA, MLRR, WRPCA, CRPCA, RPCA using IRM, WRPCA using IRM, CRPCA using IRM, RPCA using IBM, WRPCA using IBM, CRPCA using IBM, Proposed 1, and Proposed 2 are shown in Figure 7. Similarly, RPCA, WRPCA, CRPCA, RPCA using IRM, RPCA using IBM, CRPCA using IRM, and CRPCA using IBM were evaluated on the spectrogram, whereas WRPCA using IRM, WRPCA using IBM, Proposed 1, and Proposed 2 were evaluated on the cochleagram. Again, the SDR and SAR results show that WRPCA performs better, especially with VAD on the cochleagram, and the standard RPCA performed less well than the others in Figures 6 and 7. Figure 8 exhibits the NSDR results obtained with WRPCA on the ccMixter and DSD100 datasets; the NSDR reflects the improvement in removal efficiency in SVS and the overall gain in SDR. These results demonstrate that Proposed 2 produced the best outcomes.
As a consequence, based on the results in Figures 6-8, we confirm that WRPCA on the cochleagram offers higher sensitivity and selectivity than RPCA under similar circumstances, with or without T-F masking. Additionally, WRPCA delivered superior outcomes to RPCA when using the gammatone filterbank and T-F masking. Across all evaluation modalities, our proposed strategies offer improved separation outcomes.

Conclusions
In this work, we proposed an extension of RPCA that applies weighting on the cochleagram. The cochleagram of the mixed signal was decomposed into low-rank and sparse matrices by WRPCA, and the coalescent masking was constructed by integrating the harmonic and T-F masking. Finally, we constrained the temporal segments that could include the singing voice by utilizing VAD. Evaluations on the ccMixter and DSD100 datasets reveal that WRPCA performs better than RPCA for SVS, especially when WRPCA is applied on the cochleagram with the gammatone filterbank and the VAD approach.
In future work, to further expand the functionality of our system, we will research a vocal augmentation option. Additionally, because of the modest size of the public datasets that contain both pure vocal samples and their related F0 annotations, we will explore unsupervised training that relies on the complementary nature of these two tasks.
Author Contributions: F.L., conceptualization, methodology, formal analysis, project administration, experiments, and writing. Y.H., investigation, data curation, validation, and visualization. L.W., experimental data processing and funding acquisition. All authors have read and agreed to the published version of the manuscript. Data Availability Statement: The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this paper:

SVS	Singing voice separation
RPCA	Robust principal component analysis
WRPCA	Weighted robust principal component analysis
NMF	Nonnegative matrix factorization
SNMF	Sparse nonnegative matrix factorization
T-F	Time-frequency
STFT	Short-time Fourier transform
VAD	Vocal activity detection
IBM	Ideal binary mask
IRM	Ideal ratio mask
ALM	Augmented Lagrange multiplier
APG	Accelerated proximal gradient
F0	Fundamental frequency
SDR	Source-to-distortion ratio
SAR	Source-to-artifact ratio
NSDR	Normalized source-to-distortion ratio