Two-Stage Unet with Gated-Conv Fusion for Binaural Audio Synthesis
Abstract
1. Introduction
- We decompose the binaural audio representation into two components to accurately model the sound propagation process; the two components are extracted and reconstructed separately on the basis of time-domain warping.
- To improve the accuracy of the synthesized binaural audio, we propose the gated-conv fusion module (GCFM), which integrates the reconstructed components by suppressing less informative features so that only useful information is passed on for further binaural audio synthesis (see the sketch after this list).
- To capture the spatial perception induced by source movement, we employ a position–orientation self-attention module (POS-ORI self-attention, POSA), which effectively combines audio features with spatial localization cues.
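The gating idea behind the GCFM can be illustrated with a short, self-contained sketch. This shows only a generic gated-convolution fusion (a sigmoid gate branch modulating a feature branch before the two reconstructed components are merged); the class name, channel count, kernel size, and activations are assumptions for illustration and not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GatedConvFusion(nn.Module):
    """Illustrative gated-conv fusion: a sigmoid gate suppresses less
    informative features of the two reconstructed components before they
    are merged for further binaural synthesis. Channel count and kernel
    size are placeholders, not values from the paper."""

    def __init__(self, channels: int = 64, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # Both branches see the concatenated component features.
        self.feature_conv = nn.Conv1d(2 * channels, channels, kernel_size, padding=padding)
        self.gate_conv = nn.Conv1d(2 * channels, channels, kernel_size, padding=padding)

    def forward(self, comp_a: torch.Tensor, comp_b: torch.Tensor) -> torch.Tensor:
        # comp_a, comp_b: (batch, channels, time) feature maps of the two components.
        x = torch.cat([comp_a, comp_b], dim=1)
        features = torch.tanh(self.feature_conv(x))
        gate = torch.sigmoid(self.gate_conv(x))  # values in (0, 1): soft feature selection
        return features * gate                   # only the gated ("useful") information passes


if __name__ == "__main__":
    fusion = GatedConvFusion(channels=64)
    cp = torch.randn(8, 64, 16000)  # e.g., reconstructed common-portion features
    dp = torch.randn(8, 64, 16000)  # e.g., reconstructed differential-portion features
    print(fusion(cp, dp).shape)     # torch.Size([8, 64, 16000])
```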
2. Related Work
2.1. Binaural Audio Synthesis
2.2. UNet
2.3. Attention Mechanism
3. Method
3.1. Signal Model
3.2. Overall Structure
3.3. Downsampling and Upsampling
3.4. POS-ORI Self-Attention Module
3.5. Spatial Information Fusion Module
3.6. Gated-Conv Fusion Module
3.7. Loss Function
4. Experiment
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
- Wave-L2: The MSE between the temporal waveforms of the synthesized binaural audio and the ground truth.
- Phase-L2: The MSE between the phase spectra of the synthesized binaural audio and the ground truth after applying the STFT, which indicates the accuracy of the synthesized audio’s ITD.
- Amplitude-L2: The MSE between the amplitude spectra of the synthesized binaural audio and the ground truth after applying the STFT, which measures the accuracy of the ILD in the synthesized audio (a computation sketch of these three objective metrics follows this list).
- MOS: Overall naturalness and clarity of the audio.
- Spatialization MOS: The spatial perception of the audio.
- Similarity MOS: The similarity between the synthesized binaural audio and the ground truth.
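For reference, the three objective scores can be computed along the following lines. This is a minimal sketch assuming two-channel PyTorch tensors and a placeholder STFT configuration (window and hop lengths are illustrative, not necessarily those used in the paper); no phase unwrapping is applied.

```python
import torch

def binaural_metrics(pred: torch.Tensor, target: torch.Tensor,
                     n_fft: int = 512, hop_length: int = 128):
    """Wave-L2, Phase-L2, and Amplitude-L2 errors between synthesized and
    ground-truth binaural audio; pred/target have shape (2, samples)."""
    # Wave-L2: MSE directly on the time-domain waveforms.
    wave_l2 = torch.mean((pred - target) ** 2)

    window = torch.hann_window(n_fft)
    def stft(x: torch.Tensor) -> torch.Tensor:
        return torch.stft(x, n_fft=n_fft, hop_length=hop_length,
                          window=window, return_complex=True)

    spec_pred, spec_target = stft(pred), stft(target)

    # Amplitude-L2: MSE on STFT magnitudes (reflects ILD accuracy).
    amp_l2 = torch.mean((spec_pred.abs() - spec_target.abs()) ** 2)

    # Phase-L2: MSE on STFT phase angles (reflects ITD accuracy); no unwrapping here.
    phase_l2 = torch.mean((torch.angle(spec_pred) - torch.angle(spec_target)) ** 2)

    return wave_l2.item(), phase_l2.item(), amp_l2.item()


if __name__ == "__main__":
    pred, target = torch.randn(2, 48000), torch.randn(2, 48000)  # 1 s at 48 kHz, stand-in signals
    print(binaural_metrics(pred, target))
```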
5. Results and Analysis
5.1. Performance Comparison of Binaural Audio Synthesis Models
5.1.1. Quantitative Evaluation
5.1.2. Perceptual Evaluation
5.2. Ablation Study
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
VR | Virtual reality |
AR | Augmented reality |
ITD | Interaural time difference |
ILD | Interaural level difference |
DSP | Digital signal processing |
HRIRs | Head-related impulse responses |
BRIRs | Binaural room impulse responses |
HRTFs | Head-related transfer functions |
CP | Common portion |
DP | Differential portion |
POSA | POS-ORI self-attention module |
GCFM | Gated-Conv fusion module |
DS | Downsampling operation |
US | Upsampling operation |
SA | Self-attention mechanism |
CA | Cross-attention mechanism |
Pos | Position |
Ori | Orientation |
MSE | Mean squared error |
STFT | Short-time Fourier transform operation |
References
- Rumsey, F. Spatial Audio; Routledge: New York, NY, USA, 2012. [Google Scholar] [CrossRef]
- Hendrix, C.; Barfield, W. The Sense of Presence within Auditory Virtual Environments. Presence Teleoperators Virtual Environ. 1996, 5, 290–301. [Google Scholar] [CrossRef]
- Hawley, M.L.; Litovsky, R.Y.; Culling, J.F. The benefit of binaural hearing in a cocktail party: Effect of location and type of interferer. J. Acoust. Soc. Am. 2004, 115, 833–843. [Google Scholar] [CrossRef] [PubMed]
- Asano, F.; Suzuki, Y.; Sone, T. Role of spectral cues in median plane localization. J. Acoust. Soc. Am. 1990, 88, 159–168. [Google Scholar] [CrossRef] [PubMed]
- Wright, D.; Hebrank, J.H.; Wilson, B. Pinna reflections as cues for localization. J. Acoust. Soc. Am. 1974, 56, 957–962. [Google Scholar] [CrossRef]
- Sunder, K.; He, J.; Tan, E.; Gan, W. Natural Sound Rendering for Headphones: Integration of signal processing techniques. IEEE Signal Process. Mag. 2015, 32, 100–113. [Google Scholar] [CrossRef]
- Zhang, W.; Samarasinghe, P.N.; Chen, H.; Abhayapala, T.D. Surround by Sound: A Review of Spatial Audio Recording and Reproduction. Appl. Sci. 2017, 7, 532. [Google Scholar] [CrossRef]
- Zotkin, D.N.; Duraiswami, R.; Davis, L.S. Rendering localized spatial audio in a virtual auditory space. IEEE Trans. Multimed. 2004, 6, 553–564. [Google Scholar] [CrossRef]
- Ben-Hur, Z.; Alon, D.L.; Mehra, R.; Rafaely, B. Efficient Representation and Sparse Sampling of Head-Related Transfer Functions Using Phase-Correction Based on Ear Alignment. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 2249–2262. [Google Scholar] [CrossRef]
- Richard, A.; Markovic, D.; Gebru, I.D.; Krenn, S.; Butler, G.A.; Torre, F.; Sheikh, Y. Neural Synthesis of Binaural Speech From Mono Audio. In Proceedings of the International Conference on Learning Representations (ICLR 2021), Vienna, Austria, 3–7 May 2021. [Google Scholar]
- Leng, Y.; Chen, Z.; Guo, J.; Liu, H.; Chen, J.; Tan, X.; Mandic, D.; He, L.; Li, X.; Qin, T.; et al. A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis. Adv. Neural Inf. Process. Syst. 2022, 35, 23689–23700. [Google Scholar]
- Xu, S.; Zhang, Z.; Wang, M. Channel and Temporal-Frequency Attention UNet for Monaural Speech Enhancement. EURASIP J. Audio Speech Music. Process. 2023, 1, 1687–4722. [Google Scholar] [CrossRef]
- Hinton, G.; Osindero, S.; Teh, Y. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef] [PubMed]
- Heymann, J.; Drude, L.; Haeb-Umbach, R. Neural network based spectral mask estimation for acoustic beamforming. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 196–200. [Google Scholar] [CrossRef]
- Ren, X.; Chen, L.; Zheng, X.; Xu, C.; Zhang, X.; Zhang, C.; Guo, L.; Yu, B. A Neural Beamforming Network for B-Format 3D Speech Enhancement and Recognition. In Proceedings of the IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), Gold Coast, Australia, 25–28 October 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Zhang, Z.; Xu, S.; Zhuang, X.; Qian, Y.; Wang, M. Dual branch deep interactive UNet for monaural noisy-reverberant speech enhancement. Appl. Acoust. 2023, 212, 109574. [Google Scholar] [CrossRef]
- Gao, R.; Feris, R.; Grauman, K. Learning to Separate Object Sounds by Watching Unlabeled Video. In Computer Vision—ECCV 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 36–54. [Google Scholar] [CrossRef]
- Gebru, I.D.; Marković, D.; Richard, A.; Krenn, S.; Butler, G.A.; De la Torre, F.; Sheikh, Y. Implicit HRTF Modeling Using Temporal Convolutional Networks. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3385–3389. [Google Scholar] [CrossRef]
- Morgado, P.; Vasconcelos, N.; Langlois, T.; Wang, O. Self-Supervised Generation of Spatial Audio for 360 Video. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Available online: https://api.semanticscholar.org/CorpusID:52177577 (accessed on 9 March 2025).
- Lu, Y.; Lee, H.; Tseng, H.; Yang, M. Self-Supervised Audio Spatialization with Correspondence Classifier. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3347–3351. [Google Scholar] [CrossRef]
- Huang, W.; Markovic, D.; Richard, A.; Gebru, I.D.; Menon, A. End-to-End Binaural Speech Synthesis. arXiv 2022, arXiv:2207.03697. [Google Scholar] [CrossRef]
- Gao, R.; Grauman, K. 2.5D Visual Sound. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
- Zhou, H.; Xu, X.; Lin, D.; Wang, X.; Liu, Z. Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation. In Computer Vision—ECCV 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 4705–4714. [Google Scholar] [CrossRef]
- Xu, X.; Zhou, H.; Liu, Z.; Dai, B.; Wang, X.; Lin, D. Visually Informed Binaural Audio Generation without Binaural Audios. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15480–15489. [Google Scholar] [CrossRef]
- Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
- Lee, J.; Lee, K. Neural Fourier Shift for Binaural Speech Rendering. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
- Rayleigh, M.A.L. On Our Perception of the Direction of a Source of Sound. J. R. Music. Assoc. 2020, 2, 75–84. [Google Scholar] [CrossRef]
- Wightman, F.L.; Kistler, D.J. The dominant role of low-frequency interaural time differences in sound localization. J. Acoust. Soc. Am. 1992, 91, 1648–1661. [Google Scholar] [CrossRef]
- Antonello, N.; De Sena, E.; Moonen, M.; Naylor, P.A.; van Waterschoot, T. Room Impulse Response Interpolation Using a Sparse Spatio-Temporal Representation of the Sound Field. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 1929–1941. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar] [CrossRef]
- Kim, J.; Chang, J. Attention Wave-U-Net for Acoustic Echo Cancellation. Interspeech 2020, 3969–3973. [Google Scholar] [CrossRef]
- Guimarães, H.R.; Nagano, H.; Silva, D.W. Monaural speech enhancement through deep wave-U-net. Expert Syst. Appl. 2020, 158, 113582. [Google Scholar] [CrossRef]
- Nair, A.; Koishida, K. Cascaded Time + Time-Frequency Unet For Speech Enhancement: Jointly Addressing Clipping, Codec Distortions, And Gaps. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7153–7157. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762v7. [Google Scholar] [CrossRef]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016, arXiv:1409.0473. [Google Scholar] [CrossRef]
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 4171–4186. Available online: https://aclanthology.org/N19-1423/ (accessed on 9 March 2025).
- Sutskever, I.; Vinyals, O.; Le, Q. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215. [Google Scholar] [CrossRef]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019. [Google Scholar]
- Altmann, S.L. Rotations, Quaternions, and Double Groups; Dover Publications: Mineola, NY, USA, 2005. [Google Scholar]
- Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and Checkerboard Artifacts. Distill 2016, 1, e3. [Google Scholar] [CrossRef]
- Brown, C.P.; Duda, R.O. A structural model for binaural sound synthesis. IEEE Trans. Speech Audio Process. 1998, 6, 476–488. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2017, arXiv:1608.03983. [Google Scholar] [CrossRef]
- Levkovitch, A. Zero-Shot Mono-to-Binaural Speech Synthesis. arXiv 2024, arXiv:2412.08356. [Google Scholar] [CrossRef]
Layer | Input Size | Output Size |
---|---|---|
DS0 | ||
DS1 | ||
DS2 | ||
DS3 | ||
DS4 | ||
DS5 | ||
US0 | ||
US1 | ||
US2 | ||
US3 |
Layer | Input Size | Output Size |
---|---|---|
Conv1 | ||
Inter | ||
Concat | ||
Conv1(after connect) |
Model | Year | Wave-L2 ↓ | Phase-L2 ↓ | Amplitude-L2 ↓ | #Param ↓ | MACs ↓ |
---|---|---|---|---|---|---|
DSP | - | 0.485 | 1.388 | 0.058 | - | - |
WaveNet [25] | 2016 | 0.179 | 0.968 | 0.037 | 4.65 M | 22.34 G |
WarpNet [10] | 2021 | 0.167 | 0.807 | 0.048 | 8.59 M | 4.27 G |
WarpNet* [11] | 2022 | 0.157 | 0.838 | 0.038 | ||
BinauralGrad [11] | 2022 | 0.128 | 0.837 | 0.030 | 13.82 M | 15.32 G |
NFS [26] | 2023 | 0.172 | 0.999 | 0.035 | 0.55 M | 357.58 M |
ZERO-SHOT [44] | 2024 | 0.440 | 1.508 | 0.053 | - | - |
Ours | 2025 | 0.147 | 0.789 | 0.036 | 6.74 M | 626.14 M |
Model | Year | MOS ↑ | Spatialization ↑ | Similarity ↑ |
---|---|---|---|---|
GroundTruth | - | 4.47 ± 0.147 | 4.31 ± 0.172 | - |
 | | - | - | - |
DSP | - | 3.48 ± 0.388 | 3.75 ± 0.192 | 3.64 ± 0.142 |
 | | 3.23 ± 0.231 | 3.56 ± 0.167 | - |
WaveNet | 2016 | 4.03 ± 0.263 | 3.78 ± 0.293 | 4.14 ± 0.264 |
 | | 3.94 ± 0.25 | 3.67 ± 0.21 | - |
WarpNet | 2021 | 4.01 ± 0.285 | 3.84 ± 0.228 | 4.04 ± 0.198 |
 | | 4.00 ± 0.266 † | 4.05 ± 0.233 | - |
BinauralGrad | 2022 | 4.04 ± 0.298 † | 4.13 ± 0.292 † | 4.19 ± 0.174 † |
 | | 3.22 ± 0.189 | 3.7 ± 0.194 | - |
NFS | 2023 | 3.13 ± 0.434 | 3.56 ± 0.23 | 3.93 ± 0.348 |
 | | 3.99 ± 0.315 | 4.27 ± 0.255 | - |
Ours | 2024 | 4.25 ± 0.287 | 4.17 ± 0.223 | 4.26 ± 0.169 |
 | | 4.15 ± 0.219 | 4.15 ± 0.215 † | - |
Model | Wave-L2 ↓ | Phase-L2 ↓ | Amplitude-L2 ↓ |
---|---|---|---|
Ours | 0.147 | 0.789 | 0.036 |
TW | 0.415 | 1.047 | 0.060 |
POSA | 0.199 | 0.846 | 0.043 |
GCFM | 0.157 | 0.813 | 0.040 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).