Attention-Driven Time-Domain Convolutional Network for Source Separation of Vocal and Accompaniment
Abstract
1. Introduction
2. Related Works
2.1. Traditional Music-Separation Methods
2.2. Frequency-Domain Music-Separation Methods
2.3. Time-Domain Music-Separation Methods
3. Attention-Driven Time-Domain Convolutional Network
3.1. Separation Network
3.2. Embedding Attention Module
3.3. Efficient Convolutional Block Attention Module
3.4. Decoder
3.5. Optimization
4. Experimental Results and Analysis
4.1. Experimental Environment and Dataset
4.2. Evaluation Indicators
4.3. Implementation Details
4.4. Baselines
4.5. Performance Comparisons and Analyses
4.6. Ablation Study
4.7. Computational Complexity
- Standard 1D convolution: $O(C_{\mathrm{in}} \cdot C_{\mathrm{out}} \cdot K \cdot T)$ multiply–accumulate operations, where $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ are the input and output channel counts, $K$ is the kernel size, and $T$ is the number of time frames.
- Pointwise $1 \times 1$ convolution: $O(C_{\mathrm{in}} \cdot C_{\mathrm{out}} \cdot T)$.
- Depthwise-separable 1D convolution: $O(C_{\mathrm{in}} \cdot T \cdot (K + C_{\mathrm{out}}))$.
- Depthwise-separable conv: depthwise ($O(C_{\mathrm{in}} \cdot K \cdot T)$) + pointwise $1 \times 1$ ($O(C_{\mathrm{in}} \cdot C_{\mathrm{out}} \cdot T)$).
- Two bottleneck $1 \times 1$ projections per TCN block: $O(B \cdot H \cdot T)$ and $O(H \cdot B \cdot T)$.
- Example (first encoder layer, mono input $C_{\mathrm{in}} = 1$, kernel $L = 16$, $N = 512$ output channels): $1 \times 16 \times 512 = 8192$ multiply–accumulate operations per output frame.
- The remaining encoder and decoder layers follow the same formulas; their combined cost is small compared with the TCN backbone, and the overhead of the EAM and E-CBAM modules is negligible. A small illustrative MAC calculation is sketched below.
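To make the per-layer accounting concrete, the following minimal sketch evaluates the formulas above with the hyperparameters listed in the parameter table (N = 512, L = 16, B = 128, H = 512, Q = 3). The exact operation counting used in the paper (biases, activations, normalization) is not specified here and is assumed to be excluded; the block structure used in the example is the Conv-TasNet-style layout the network builds on, not a line-by-line reproduction of the authors' code.

```python
# Minimal sketch: per-frame multiply-accumulate (MAC) counts for the
# convolution types discussed above. Hyperparameter values follow the
# parameter table (N, L, B, H, Q); biases/activations are not counted.

def macs_standard_conv1d(c_in: int, c_out: int, kernel: int) -> int:
    """MACs per output time frame for a standard 1D convolution."""
    return c_in * c_out * kernel

def macs_pointwise_conv1d(c_in: int, c_out: int) -> int:
    """MACs per output time frame for a pointwise (1x1) convolution."""
    return c_in * c_out

def macs_depthwise_separable_conv1d(c_in: int, c_out: int, kernel: int) -> int:
    """Depthwise conv (one kernel per channel) + pointwise 1x1 projection."""
    return c_in * kernel + c_in * c_out

# First encoder layer: mono input (1 channel), kernel L = 16, N = 512 filters.
print(macs_standard_conv1d(1, 512, 16))            # 8192 MACs per frame

# One TCN block (assumed Conv-TasNet-style structure):
# bottleneck B = 128, hidden H = 512, depthwise kernel Q = 3.
bottleneck_in = macs_pointwise_conv1d(128, 512)    # 1x1 projection B -> H
depthwise = 512 * 3                                # depthwise conv over H channels
bottleneck_out = macs_pointwise_conv1d(512, 128)   # 1x1 projection H -> B
print(bottleneck_in + depthwise + bottleneck_out)  # per-frame MACs of one block

# Savings of depthwise-separable vs. standard convolution at H = 512 channels:
std = macs_standard_conv1d(512, 512, 3)
sep = macs_depthwise_separable_conv1d(512, 512, 3)
print(f"separable/standard cost ratio: {sep / std:.3f}")  # ~0.335
```

The ratio in the last line illustrates why the depthwise-separable blocks dominate far less of the budget than standard convolutions of the same width would.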
5. Conclusions
5.1. Practical Applications
- Professional music production: Clean stems support remixing, remastering, and restoration of legacy recordings. Isolated vocals enable a cappella creation and singer re-synthesis, while accompaniment stems facilitate sampling and mashups with minimal bleed and artifacts.
- Live and interactive audio: Real-time or near-real-time separation allows on-stage vocal enhancement, instrument emphasis/suppression in live mixing, and audience-personalized sound fields in immersive events.
- Music education and practice: Track-level isolation lets learners solo or mute parts to study arrangement, articulation, and intonation. Educators can create stem-based exercises and feedback tools that target pitch, rhythm, and blend.
- Consumer applications: Karaoke and singing apps rely on robust vocal removal/extraction; streaming platforms can provide interactive controls (e.g., “vocals up,” “drums down”) to personalize listening. Clean stems also improve AI cover generation and user-driven remixes.
- Accessibility and health: Speech/vocal foregrounding improves intelligibility for hearing-impaired users. Customized mixes can reduce listening fatigue and support therapeutic or rehabilitation scenarios.
- MIR and downstream AI tasks: Separated sources improve chord recognition, beat/tempo estimation, melody transcription, singer identification, and timbre analysis. Cleaner inputs reduce error propagation in tagging, recommendation, and generative modeling.
5.2. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
ADTDCN | Attention-driven time-domain convolutional network |
BSS_EVAL | Blind source separation evaluation toolbox |
CASA | Computational auditory scene analysis |
CBAM | Convolutional block attention module |
EAM | Embedding attention module |
E-CBAM | Efficient convolutional block attention module |
GSAR | Global source-to-artifacts ratio |
GSDR | Global signal-to-distortion ratio |
GSIR | Global source-to-interference ratio |
IBM | Ideal binary mask |
MIR | Music information retrieval |
NMF | Non-negative matrix factorization |
RNNs | Recurrent neural networks |
SDR | Source-to-distortion ratio |
SI-SNR | Scale-invariant signal-to-noise ratio |
TCN-Block | Temporal convolutional network block |
TCN | Time-domain convolutional network |
VAT-SNet | Vocal and accompaniment time-domain separation network |
References
- Li, W.; Li, Z.; Gao, Y. Understanding Digital Music: A Review of Music Information Retrieval Technologies. Fudan Univ. J. (Nat. Sci.) 2018, 57, 271–313. [Google Scholar] [CrossRef]
- Kang, J.; Wang, H.; Su, G.; Liu, L. A Survey on Music Emotion Recognition. Comput. Eng. Appl. 2022, 58, 64–72. [Google Scholar] [CrossRef]
- Kum, S.; Nam, J. Joint detection and classification of singing voice melody using convolutional recurrent neural networks. Appl. Sci. 2019, 9, 1324. [Google Scholar] [CrossRef]
- You, S.D.; Liu, C.H.; Chen, W.K. Comparative study of singing voice detection based on deep neural networks and ensemble learning. Hum.-Centric Comput. Inf. Sci. 2018, 8, 34. [Google Scholar] [CrossRef]
- Sharma, B.; Das, R.K.; Li, H. On the Importance of Audio-Source Separation for Singer Identification in Polyphonic Music. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 2020–2024. [Google Scholar] [CrossRef]
- Weintraub, M. A computational model for separating two simultaneous talkers. In Proceedings of the ICASSP’86—IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, 7–11 April 1986; Volume 11, pp. 81–84. [Google Scholar] [CrossRef]
- Bregman, A.S. Progress in the study of auditory scene analysis. In Proceedings of the 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, USA, 21–24 October 2007; pp. 122–126. [Google Scholar] [CrossRef]
- Rosenthal, D.F.; Okuno, H.G.; Okuno, H.; Rosenthal, D. Computational Auditory Scene Analysis: Proceedings of the IJCAI-95 Workshop; CRC Press: Boca Raton, FL, USA, 2021. [Google Scholar]
- Roman, N.; Wang, D.; Brown, G.J. Speech segregation based on sound localization. J. Acoust. Soc. Am. 2003, 114, 2236–2252. [Google Scholar] [CrossRef] [PubMed]
- Hu, G.; Wang, D. Monaural speech segregation based on pitch tracking and amplitude modulation. IEEE Trans. Neural Netw. 2004, 15, 1135–1150. [Google Scholar] [CrossRef]
- Hu, G.; Wang, D. Auditory segmentation based on onset and offset analysis. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 396–405. [Google Scholar] [CrossRef]
- Jin, Z.; Wang, D. A supervised learning approach to monaural segregation of reverberant speech. IEEE Trans. Audio Speech Lang. Process. 2009, 17, 625–638. [Google Scholar] [CrossRef]
- Hu, G.; Wang, D. A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 2067–2079. [Google Scholar] [CrossRef]
- Lee, D.D.; Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef]
- Chanrungutai, A.; Ratanamahatana, C.A. Singing voice separation for mono-channel music using non-negative matrix factorization. In Proceedings of the 2008 International Conference on Advanced Technologies for Communications, Hanoi, Vietnam, 6–9 October 2008; pp. 243–246. [Google Scholar] [CrossRef]
- Rafii, Z.; Pardo, B. A simple music/voice separation method based on the extraction of the repeating musical structure. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 221–224. [Google Scholar] [CrossRef]
- Rafii, Z.; Pardo, B. Repeating pattern extraction technique (REPET): A simple method for music/voice separation. IEEE Trans. Audio Speech Lang. Process. 2012, 21, 73–84. [Google Scholar] [CrossRef]
- Liutkus, A.; Rafii, Z.; Badeau, R.; Pardo, B.; Richard, G. Adaptive filtering for music/voice separation exploiting the repeating musical structure. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 53–56. [Google Scholar] [CrossRef]
- Wang, Y.; Wang, D. Towards scaling up classification-based speech separation. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 1381–1390. [Google Scholar] [CrossRef]
- Huang, P.S.; Kim, M.; Hasegawa-Johnson, M.; Smaragdis, P. Deep learning for monaural speech separation. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 1562–1566. [Google Scholar] [CrossRef]
- Jansson, A.; Humphrey, E.; Montecchio, N.; Bittner, R.; Kumar, A.; Weyde, T. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference, Suzhou, China, 23–27 October 2017. [Google Scholar]
- Park, S.; Kim, T.; Lee, K.; Kwak, N. Music source separation using stacked hourglass networks. arXiv 2018, arXiv:1805.08559. [Google Scholar] [CrossRef]
- Lu, W.T.; Wang, J.C.; Kong, Q.; Hung, Y.N. Music source separation with band-split rope transformer. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 481–485. [Google Scholar]
- Tong, W.; Zhu, J.; Chen, J.; Kang, S.; Jiang, T.; Li, Y.; Wu, Z.; Meng, H. SCNet: Sparse compression network for music source separation. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 1276–1280. [Google Scholar]
- Takahashi, N.; Mitsufuji, Y. D3Net: Densely connected multidilated DenseNet for music source separation. arXiv 2020, arXiv:2010.01733. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar] [CrossRef]
- Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266. [Google Scholar] [CrossRef] [PubMed]
- Rouard, S.; Massa, F.; Défossez, A. Hybrid transformers for music source separation. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Qiao, X.; Luo, M.; Shao, F.; Sui, Y.; Yin, X.; Sun, R. VAT-SNet: A convolutional music-separation network based on vocal and accompaniment time-domain features. Electronics 2022, 11, 4078. [Google Scholar] [CrossRef]
- Xu, L.; Wang, J.; Yang, W.; Luo, Y. Multi Feature Fusion Audio-visual Joint Speech Separation Algorithm Based on Conv-TasNet. J. Signal Process 2021, 37, 1799–1805. [Google Scholar]
- Hasumi, T.; Kobayashi, T.; Ogawa, T. Investigation of network architecture for single-channel end-to-end denoising. In Proceedings of the 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 24–28 August 2020; pp. 441–445. [Google Scholar] [CrossRef]
- Zhang, Y.; Jia, M.; Gao, S.; Wang, S. Multiple Sound Sources Separation Using Two-stage Network Model. In Proceedings of the 2021 4th International Conference on Information Communication and Signal Processing (ICICSP), Shanghai, China, 24–26 September 2021; pp. 264–269. [Google Scholar] [CrossRef]
- Jin, R.; Ablimit, M.; Hamdulla, A. Speech separation and emotion recognition for multi-speaker scenarios. In Proceedings of the 2022 3rd International Conference on Pattern Recognition and Machine Learning (PRML), Chengdu, China, 22–24 July 2022; pp. 280–284. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Haase, D.; Amthor, M. Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14600–14609. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar] [CrossRef]
- Hsu, C.L.; Jang, J.S.R. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Trans. Audio Speech Lang. Process. 2009, 18, 310–319. [Google Scholar] [CrossRef]
- Vincent, E.; Gribonval, R.; Févotte, C. Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1462–1469. [Google Scholar] [CrossRef]
- Raffel, C.; McFee, B.; Humphrey, E.J.; Salamon, J.; Nieto, O.; Liang, D.; Ellis, D.P. MIR_EVAL: A Transparent Implementation of Common MIR Metrics. In Proceedings of the ISMIR, Taipei, Taiwan, 27–31 October 2014. [Google Scholar]
- Sun, C.; Yu, Q.; Gong, X.; Luo, R. Vocal and Accompaniment Separation Algorithm Based on Hourglass Network Using Speech Signal Characteristics. Comput. Appl. Softw. 2023, 40, 89–95. [Google Scholar] [CrossRef]
- Chen, Y.; Hu, Y.; He, L.; Huang, H. Multi-stage music separation network with dual-branch attention and hybrid convolution. J. Intell. Inf. Syst. 2022, 59, 635–656. [Google Scholar] [CrossRef]
Symbol | Parameter Interpretation | Parameter Value |
---|---|---|
N | Number of channels in the encoder | 512 |
J | Number of convolutional layers in the encoder | 4 |
L | Size of convolution kernel in the encoder | 16 |
B | Number of channels in the bottleneck 1 × 1 convolutions | 128 |
H | Number of channels in the convolutional blocks | 512 |
Q | Size of the convolution kernel in the 1D convolutional blocks | 3 |
R | Number of ResNetBlocks in the music extractor | 3 |
Z | Number of times the encoded music signal Y is fused with the target music embedding | 4 |
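For readers reproducing the setup, the hyperparameters in the table above can be grouped into a single configuration object. The sketch below is illustrative only: the field names (and the reading of B as the bottleneck width) are assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class ADTDCNConfig:
    """Hyperparameters from the parameter table; field names are illustrative."""
    encoder_channels: int = 512      # N
    encoder_layers: int = 4          # J
    encoder_kernel: int = 16         # L
    bottleneck_channels: int = 128   # B (assumed bottleneck width)
    block_channels: int = 512        # H
    block_kernel: int = 3            # Q
    resnet_blocks: int = 3           # R (music extractor)
    fusion_times: int = 4            # Z

cfg = ADTDCNConfig()
print(cfg)
```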
Vocal separation performance (dB):

Model | GSDR | GSIR | GSAR |
---|---|---|---|
Huang [20] | 7.21 | 17.84 | 9.32 |
U-Net [21] | 10.09 | 11.96 | 11.30 |
SH-4stack [22] | 11.29 | 15.93 | 13.52 |
Hourglass Network [40] | 10.69 | 16.02 | 12.79 |
MSSN [41] | 11.96 | 18.07 | 13.57 |
Conv-TasNet [27] | 12.16 | 17.38 | 14.29 |
VAT-SNet [29] | 14.57 | 25.53 | 15.12 |
ADTDCN (EAM) | 15.59 | 26.31 | 15.66 |
ADTDCN (E-CBAM) | 15.36 | 25.78 | 15.47 |
ADTDCN | 16.07 | 26.49 | 15.82 |
Accompaniment separation performance (dB):

Model | GSDR | GSIR | GSAR |
---|---|---|---|
Huang [20] | 7.49 | 16.89 | 9.97 |
U-Net [21] | 10.32 | 11.65 | 11.42 |
SH-4stack [22] | 12.07 | 15.21 | 14.95 |
Hourglass Network [40] | 10.01 | 14.21 | 12.68 |
MSSN [41] | 10.64 | 14.82 | 13.41 |
Conv-TasNet [27] | 12.57 | 16.87 | 13.64 |
VAT-SNet [29] | 15.88 | 24.77 | 15.47 |
ADTDCN (EAM) | 16.79 | 25.31 | 15.95 |
ADTDCN (E-CBAM) | 16.57 | 25.87 | 15.84 |
ADTDCN | 16.98 | 26.27 | 16.07 |
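The GSDR/GSIR/GSAR values in the two tables above are BSS_EVAL-style metrics. The sketch below shows one common way to obtain them using the mir_eval implementation of BSS_EVAL (cited in the references); treating the "global" scores as length-weighted averages of per-track scores is an assumption about the paper's aggregation, not a confirmed detail.

```python
import numpy as np
import mir_eval


def track_bss_eval(reference: np.ndarray, estimate: np.ndarray):
    """Per-track SDR/SIR/SAR via BSS_EVAL (mir_eval implementation).

    reference, estimate: arrays of shape (n_sources, n_samples).
    """
    sdr, sir, sar, _perm = mir_eval.separation.bss_eval_sources(reference, estimate)
    return sdr, sir, sar


def global_metric(per_track_scores, track_lengths):
    """Length-weighted average of per-track scores (one common GSDR/GSIR/GSAR
    convention; whether the paper weights by track length is an assumption)."""
    scores = np.asarray(per_track_scores, dtype=float)
    weights = np.asarray(track_lengths, dtype=float)
    return float(np.sum(scores * weights) / np.sum(weights))


# Toy usage: two random sources, 1 s at 16 kHz, with a slightly noisy estimate.
rng = np.random.default_rng(0)
ref = rng.standard_normal((2, 16000))
est = ref + 0.1 * rng.standard_normal((2, 16000))
sdr, sir, sar = track_bss_eval(ref, est)
print(sdr, sir, sar)
```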
Method | SI-SNR (dB) |
---|---|
Conv-TasNet [27] | 9.73 |
VAT-SNet [29] | 10.87 |
ADTDCN (EAM) | 11.28 |
ADTDCN (E-CBAM) | 11.23 |
ADTDCN | 11.39 |
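The SI-SNR values reported above follow the standard scale-invariant signal-to-noise ratio. Below is a minimal NumPy sketch of the usual definition (zero-mean signals, target rescaled by projection onto the reference); the paper's exact segmentation and averaging over the test set are assumed rather than reproduced.

```python
import numpy as np


def si_snr(estimate: np.ndarray, target: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB (standard definition)."""
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled reference.
    s_target = (np.dot(estimate, target) / (np.dot(target, target) + eps)) * target
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target**2) + eps) / (np.sum(e_noise**2) + eps))


# Toy usage: a slightly noisy copy of a 440 Hz sine wave.
t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.05 * np.random.randn(t.size)
print(f"SI-SNR: {si_snr(noisy, clean):.2f} dB")
```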
Position | SI-SNR (dB) |
---|---|
AP1 | 10.92 |
AP2 | 11.15 |
AP3 | 11.39 |
AP4 | 10.52 |
AP5 | 10.61 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).