Acoustic Metadata Design on Object-Based Audio Using Estimated 3D-Position from Visual Image Toward Depth-Directional Sound Image Localization
Abstract
1. Introduction
2. Related Works
3. Conventional Production Method for Object-Based Audio
4. Proposed Method
- Step 1: Estimation of the object coordinates from the visual image
- Step 2: Conversion of the estimated coordinates
- Step 3: Design of acoustic metadata from the converted coordinates
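
The second step above can be illustrated with a simple coordinate mapping. The following is a minimal sketch under stated assumptions, not the authors' conversion: the names `RoomPosition` and `to_room_coords`, the normalised range [-1, 1] on each axis, the axis orientation, the linear depth mapping, and the `max_depth_m` parameter are all hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RoomPosition:
    """Normalised position; each axis lies in [-1, 1] (assumed convention)."""
    x: float  # left (-1) to right (+1)
    y: float  # back (-1) to front (+1), used here for depth
    z: float  # bottom (-1) to top (+1)


def clamp(v, lo=-1.0, hi=1.0):
    """Limit v to the closed interval [lo, hi]."""
    return max(lo, min(hi, v))


def to_room_coords(px, py, depth_m, width_px, height_px, max_depth_m):
    """Map an image position (pixels) and an estimated depth (metres)
    to normalised coordinates. The linear mapping is an assumption."""
    x = 2.0 * px / width_px - 1.0
    z = 1.0 - 2.0 * py / height_px          # image rows grow downwards
    y = 1.0 - 2.0 * depth_m / max_depth_m   # depth 0 maps to the front (+1)
    return RoomPosition(clamp(x), clamp(y), clamp(z))


# Usage: the centre of a 1920 x 1080 frame at zero depth
centre = to_room_coords(960, 540, 0.0, 1920, 1080, max_depth_m=5.0)
```

With this convention, an object at the image centre and zero estimated depth lands at the front centre of the normalised space; estimates outside the assumed range are clamped rather than rejected.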

4.1. Step 1: Estimation of the Object Coordinates from the Visual Image
4.2. Step 2: Conversion of the Estimated Coordinates
4.3. Step 3: Design of Acoustic Metadata from the Converted Coordinates
5. Objective Evaluation Experiment
5.1. Experimental Conditions for Objective Evaluation
5.2. Experimental Results for Objective Evaluation
6. Subjective Evaluation Experiment
6.1. Experimental Conditions for Subjective Evaluation
6.2. Experimental Results for Subjective Evaluation
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References

| Setting | Explanation |
|---|---|
| samplePos | Sample position at which the setting is applied |
| pos | Sound image position |
| active | Object active state |
| snap | Force the sound image to localize to the nearest loudspeaker |
| elevation | Presence or absence of panning in the vertical direction |
| zones | Specify the loudspeaker zone used to render the sound image |
| size | Sound image size |
| decorr | Sound spreading |
| importance | Object importance |
| gain | Object volume |
| rampLength | Time required for a setting change to complete |
| trimBypass | Downmix setting for different formats |
| dialog | Dialog flag |
| music | Music flag |
| screenFactor | Pulls the lateral sound image position toward the screen |
| depthFactor | Depth factor for binaural rendering |
| headTrackMode | Head-tracking mode for headphone playback |
| binauralRenderMode | Binaural rendering mode |
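
The settings listed above can be pictured as per-object events keyed by sample position. The sketch below is illustrative only: the `ObjectEvent` class, the helper names, the default values, and the choice of one event per video frame are assumptions, and only a few of the settings from the table are modelled. The 30 fps and 48,000 Hz defaults are taken from the experimental conditions reported in this article.

```python
from dataclasses import dataclass, asdict


@dataclass
class ObjectEvent:
    """A subset of the settings in the table; defaults are placeholders."""
    samplePos: int          # sample position at which the setting applies
    pos: tuple              # sound image position (x, y, z)
    active: bool = True     # object active state
    rampLength: int = 0     # time for the change to complete (assumed in samples)


def frame_to_sample(frame_index, fps=30, fs=48000):
    """Convert a video frame index to an audio sample position."""
    return round(frame_index * fs / fps)


def build_events(positions, fps=30, fs=48000):
    """Create one event per frame from a list of (x, y, z) positions."""
    events = [
        ObjectEvent(samplePos=frame_to_sample(i, fps, fs), pos=tuple(p))
        for i, p in enumerate(positions)
    ]
    return [asdict(e) for e in events]


# Usage: two consecutive frames of a hypothetical object trajectory
events = build_events([(0.0, 1.0, 0.0), (0.1, 0.9, 0.0)])
```

At 30 fps and 48,000 Hz, consecutive frames are 1600 samples apart, so a position estimated for each video frame maps to a `samplePos` that advances in steps of 1600.
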

| Object Movement Patterns | Start | End |
|---|---|---|
| Horizontal | | |
| Vertical | | |
| Depth | | |
| Horizontal + Vertical | | |
| Horizontal + Depth | | |
| Vertical + Depth | | |
| Horizontal + Vertical + Depth | | |

| Condition | Value |
|---|---|
| Sampling frequency of sound sources | 48,000 Hz |
| Quantization of sound sources | 24 bits |
| File type of visual images | mov |
| Codec of visual images | H.264/AVC |
| Bitrate of visual images | 1.14 Mb/s |
| Bit depth of visual images | 8 bits |
| Ambient noise level | 35.1 dB |
| Sound pressure level at the listening point | 75.2 dB |
| Frame rate of visual images | 30 fps |
| Resolution of visual images | 1920 × 1080 pixels |
| Monitor width and height | 1.32 m, 0.71 m |
| Resolution of monitor | 3840 × 2160 pixels |
| Playback space width and height | 2.82 m, 2.62 m |
| Number of participants | 10 (1 woman, 9 men) |

| Device | Type |
|---|---|
| Loudspeaker | YAMAHA (Hamamatsu, Japan), VXS5 |
| Power amplifier | YAMAHA (Hamamatsu, Japan), XMV8280 |
| D/A converter | RME (Haimhausen, Germany), M-32 DA |
| Audio interface | Avid (Burlington, VT, USA), Pro Tools\|MTRX |
| Monitor | Panasonic (Kadoma, Japan), TH-60DX850 |

| Comparison Target | p-Value (Unadjusted) | p-Value (Adjusted) | Significance |
|---|---|---|---|
| Cond. AA | 0.0010 | 0.0039 | Significant (**) |
| Cond. AB | 0.0042 | 0.0167 | Significant (*) |
| Cond. AC | 0.3458 | 1.0000 | Not significant |
| Cond. AE | 0.0003 | 0.0012 | Significant (**) |

| Comparison Target | p-Value (Unadjusted) | p-Value (Adjusted) | Significance |
|---|---|---|---|
| Cond. AA | 0.0020 | 0.0079 | Significant (**) |
| Cond. AB | 0.0078 | 0.0312 | Significant (*) |
| Cond. AD | 0.5000 | 1.0000 | Not significant |
| Cond. AE | 0.0078 | 0.0312 | Significant (*) |
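
The adjusted values in the two tables above are numerically consistent with a Bonferroni correction over four comparisons, that is, adjusted p = min(1, 4p). This is an inference from the tabulated numbers, not something the tables themselves state, and the function name `bonferroni` is introduced here for illustration. Small last-digit differences (for example 0.0079 in the table versus 0.0080 from the rounded input) would be explained by the unadjusted values having been rounded for display.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Unadjusted p-values as printed in the second table
unadjusted = [0.0020, 0.0078, 0.5000, 0.0078]
adjusted = bonferroni(unadjusted)

for p, q in zip(unadjusted, adjusted):
    print(f"{p:.4f} -> {q:.4f}")
```
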

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kato, S.; Nakayama, M.; Nishiura, T.; Soeta, Y. Acoustic Metadata Design on Object-Based Audio Using Estimated 3D-Position from Visual Image Toward Depth-Directional Sound Image Localization. Acoustics 2026, 8, 3. https://doi.org/10.3390/acoustics8010003

