Acoustic Metadata Design on Object-Based Audio Using Estimated 3D-Position from Visual Image Toward Depth-Directional Sound Image Localization

Kato, Subaru; Nakayama, Masato; Nishiura, Takanobu; Soeta, Yoshiharu

doi:10.3390/acoustics8010003

¹ Graduate School of Information Science and Engineering, Ritsumeikan University, Osaka 567-8570, Japan

² College of Information Science and Engineering, Ritsumeikan University, Osaka 567-8570, Japan

³ Molecular Biosystems Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Osaka 563-8577, Japan

* Author to whom correspondence should be addressed.

Acoustics 2026, 8(1), 3; https://doi.org/10.3390/acoustics8010003

Submission received: 14 November 2025 / Revised: 14 January 2026 / Accepted: 22 January 2026 / Published: 23 January 2026

Abstract

Multichannel audio is a sound field reproduction technology that uses multiple loudspeakers. Object-based audio is a playback method for multichannel audio that enables the construction of sound images at specified positions using coordinates within the playback space. However, the sound image positions must be manually specified by audio content creators, which increases the production workload, especially for works containing many sound images or feature films. We have previously proposed a method to reduce the workload of content creators by constructing sound images based on object positions in visual images. However, a significant challenge remains since depth localization of the sound image is not accurate enough. This paper aims to improve localization accuracy by changing the range of sound image movement along the depth direction. To confirm the localization accuracy of sound images constructed using the proposed method, we conducted a subjective evaluation experiment. The experiment identified the optimal movement range by presenting participants with visual images synchronized with sound images moving across varying spatial scales. Consequently, we were able to identify the range of sound image movement in the depth direction necessary for presenting sound images with high consistency with the visual images.

Keywords: object-based audio; multichannel audio; acoustic metadata; sound image localization; audio-visual consistency; 3D-position estimation

1. Introduction

In recent years, with advances in video technologies, highly immersive audio technologies for reproducing realistic sound fields have attracted increasing attention [1]. Multichannel audio is a key technology for presenting highly immersive sound to listeners [2,3,4]. Multichannel audio reproduces sound arriving from various directions by arranging multiple loudspeakers around the listener. Object-based audio is a playback method for multichannel audio [5,6,7,8,9]. In object-based audio, sound image positions can be specified by using coordinates within the playback space. The sound image position is described in the acoustic metadata along with temporal information [10]. By treating these coordinates as time-series data, object-based audio enables flexible specification of sound image positioning and movement over time. Due to this characteristic, object-based audio has recently gained attention in film production as a technology for creating sound spaces that are highly consistent with the visual images [11].

However, film production using object-based audio has a drawback: content creators must manually specify the sound image positions to match the objects within the visual images. Designing the acoustic metadata therefore places a significant burden on content creators, particularly for productions with many objects or for feature films.

To address these issues, various methods have been proposed ranging from specialized recording techniques to automatic generation of object-based audio from existing multichannel audio content. However, many approaches depend on specialized hardware, including RGB-D and panoramic cameras [12,13,14,15,16,17,18,19,20], which greatly limits their applicability. As a result, integration into filmmaking workflows remains difficult, and adoption in real-world production environments faces substantial barriers.

Therefore, we propose a method that reduces the creators’ workload by designing acoustic metadata based on object coordinates estimated from visual images [21,22]. Our method requires no additional hardware, relying only on standard video and audio sources. As a result, it avoids the equipment-specific implementation barriers noted in prior work. It is also highly compatible with existing production workflows, facilitating integration.

In our previous study, we proposed a method that estimates the positions of moving objects in a visual image using the frame subtraction method [23] and uses the obtained positional information to construct sound images in object-based audio [21]. However, object position estimation based on the frame subtraction method can only determine positions in the horizontal and vertical directions on a monitor placed in front of the viewer. Consequently, the sound image position cannot be controlled in the depth direction, leading to the problem of low localization accuracy along the depth axis.

To address this, we introduced object coordinate estimation using Dist-YOLO [24] for sound image position control in the depth direction [22]. Dist-YOLO is a machine learning model capable of estimating three-dimensional object coordinates, including depth, from two-dimensional visual images. The object coordinates estimated by this model are mapped from the screen coordinate system to the coordinate system of the object-based audio reproduction space. This conversion is based on a linear mapping of Vector Base Amplitude Panning (VBAP) [25]. This mapping enables three-dimensional sound image control, including the depth dimension. In this method, the sound image is constructed by normalizing the estimated time-series depth data and setting a movement range where the farthest point is the monitor position and the nearest point is the listening position (the viewer’s head). However, evaluations revealed that this approach still suffers from lower sound image localization accuracy in the depth direction compared to the horizontal and vertical directions. This paper aims to further improve sound image localization accuracy in the depth direction. In our conventional method, the range of sound image movement was set from the position of the monitor in front of the viewer to the viewer’s head. In contrast, this paper proposes a method that restricts the nearest point of the sound image to a position closer to the monitor to improve localization accuracy. Specifically, we introduce a new coefficient that controls the movement range into the formula defining the sound image’s depth position.

In general, the ventriloquism effect [26,27,28], a well-known phenomenon in audio-visual integration, is also effective along the depth direction [29]. However, this effect diminishes as the physical distance between the visual and auditory stimuli increases. Therefore, an improvement in localization accuracy is expected by limiting the sound image movement to the vicinity of the monitor and suppressing the spatial discrepancy with the visual image. The proposed method considers only a single visible object within the camera’s field of view. It does not handle multiple objects, objects that enter or leave the frame, or occlusions. Finally, through evaluation experiments, we verified the consistency between the visual image and the sound image and identified the movement range that yields the highest localization accuracy.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the conventional production method for object-based audio. Section 4 describes the proposed method. Section 5 presents the objective evaluation experiment, and Section 6 presents the subjective evaluation experiment. Finally, Section 7 concludes the paper.

2. Related Works

Multichannel audio technology enables the reproduction of highly immersive sound fields. However, managing the large number of audio signals complicates the production workflow compared with traditional stereo audio [3]. In multichannel audio, a wide range of approaches has been investigated throughout the workflow from production to reproduction to enhance interactivity and presence [12,15,16,17,18,19,20]. This section focuses on object-based audio and reviews prior work on content-production methods that leverage this technology.

Coleman et al. (2018) [12] proposed an end-to-end pipeline for object-based audio. They introduced Objectification, a process that automatically estimates source-position metadata by integrating visual and auditory information. The system acquires environmental signals using RGB-D cameras [13] and binaural microphones [14]. Applying blind source separation and beamforming to these signals enables the extraction of specific sources as discrete objects that can be rendered dynamically as a function of listener position.

Arteaga et al. (2021) [15] proposed a deep-learning method for extracting individual audio objects from existing multichannel mixes. The key feature is multichannel-based learning, which uses pre-rendered multichannel material as training data rather than isolated object stems. This strategy enables efficient object extraction in complex acoustic scenes, such as film production, where many objects are intricately mixed.

Shi et al. (2024) [16] proposed a low-cost, efficient system for generating spatial audio from panoramic (360°) video to streamline VR content production. The method estimates three-dimensional sound source positions by applying object detection and depth estimation to video frames, while a scene-classification module selects appropriate background sounds. Using the inferred spatial information, the system upmixes existing stereo audio to a 5.1.4 ch. format via Vector Base Amplitude Panning (VBAP) [25]. This approach substantially reduces production time relative to conventional workflows.

However, these methods rely on specialized equipment such as RGB-D cameras, panoramic cameras, and binaural microphones, which are not standard in traditional film production. As a result, integrating them into established workflows remains challenging, and barriers to adoption remain significant in actual film production.

3. Conventional Production Method for Object-Based Audio

In recent years, object-based audio has been widely adopted in film production. Channel-based audio has traditionally relied on assigning sound to specific loudspeakers [2,3,4]. In contrast, object-based audio treats sound as an object, designing metadata for each object, such as positional information, movement information, and loudness. This feature allows content creators to freely control the positioning and movement of sound images, enabling them to construct their intended sound space with precision.

Figure 1 shows the workflow of film production using object-based audio. Content creators manually design the acoustic metadata by defining the positions and movements of sound images synchronized with the visual images.

The acoustic metadata is standardized as the Audio Definition Model in Rec. ITU-R BS.2076, which is based on the XML format [10,30,31,32,33,34]. In this paper, the Dolby Atmos Master File (DAMF) is used as the acoustic metadata format [35]. Table 1 shows the object setting parameters included in the DAMF metadata [36].

The acoustic metadata is designed using a Digital Audio Workstation (DAW) [37,38,39]. Figure 2 illustrates the 3D Panner, an acoustic metadata design tool in Pro Tools [37]. The position of each sound image can be adjusted by moving the red sphere shown in Figure 2. Through the individual control of sound images, the creators’ intended acoustic space can be precisely reproduced.

However, the control of each individual sound image remains a manual process. As a result, the production workload increases in large-scale projects, such as feature films or works involving many sound images. Furthermore, acquiring the specialized knowledge required for object-based audio mixing is difficult, as it often demands certification and training through dedicated curricula [40].

Therefore, the proposed method automates the generation of positional information, a part of the acoustic metadata design process that has conventionally imposed a significant burden on creators, based on object positions estimated from visual images. This approach aims to support acoustic metadata design and reduce the creators’ production workload.

4. Proposed Method

Figure 3 shows an overview of the proposed method. In the proposed method, sound images are constructed using the object positions estimated from the visual images. The proposed method consists of the following three steps:

Step 1: Estimation of the object coordinates from the visual image
Step 2: Conversion of the estimated coordinates
Step 3: Design of acoustic metadata from the converted coordinates

Figure 3. Overview of the proposed method.

4.1. Step 1: Estimation of the Object Coordinates from the Visual Image

In Step 1, the 3D coordinates $(x'_{l}, y'_{l}, z'_{l})$ of an object in the $l$-th visual image frame are estimated. Dist-YOLO is employed for this estimation [24]. Dist-YOLO is a machine learning model that estimates object positions along with depth information for each visual image frame. As shown in Figure 4, this model outputs a bounding box for each detected object. In this study, we calculate the center of the bounding box and designate it as the object’s coordinates $x'_{l}$ and $y'_{l}$. These coordinates are calculated using the following equations:

$$x'_{l} = \frac{1}{2}\left(x'_{l,\mathrm{top}} + x'_{l,\mathrm{bottom}}\right), \tag{1}$$

$$y'_{l} = \frac{1}{2}\left(y'_{l,\mathrm{top}} + y'_{l,\mathrm{bottom}}\right), \tag{2}$$

where $(x'_{l,\mathrm{top}}, y'_{l,\mathrm{top}})$ are the top-left coordinates of the bounding box in the $l$-th visual image frame, and $(x'_{l,\mathrm{bottom}}, y'_{l,\mathrm{bottom}})$ are the bottom-right coordinates of the bounding box in the $l$-th visual image frame. The 3D coordinates $(x'_{l}, y'_{l}, z'_{l})$ are formed by combining $x'_{l}$ and $y'_{l}$ with the depth value $z'_{l}$, which Dist-YOLO also estimates. This method is designed for scenes containing a single object. For this reason, we do not define object classes, and all detections are handled as the same object regardless of the assigned category.
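Equations (1) and (2) can be illustrated with a minimal Python sketch. The `BoundingBox` container and the function names are hypothetical and are not part of Dist-YOLO; the sketch only assumes that a detector supplies the top-left and bottom-right corners and a depth estimate for each frame.

```python
from typing import NamedTuple, Tuple


class BoundingBox(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical container)."""
    x_top: float      # x of the top-left corner
    y_top: float      # y of the top-left corner
    x_bottom: float   # x of the bottom-right corner
    y_bottom: float   # y of the bottom-right corner


def bbox_center(box: BoundingBox) -> Tuple[float, float]:
    """Return (x', y') as the box center, following Equations (1) and (2)."""
    x_center = 0.5 * (box.x_top + box.x_bottom)
    y_center = 0.5 * (box.y_top + box.y_bottom)
    return x_center, y_center


def object_coordinates(box: BoundingBox, depth: float) -> Tuple[float, float, float]:
    """Combine the box center with the per-frame depth estimate z'."""
    x_center, y_center = bbox_center(box)
    return x_center, y_center, depth
```

For example, a box with corners (100, 200) and (300, 400) has its center at (200, 300) pixels.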

4.2. Step 2: Conversion of the Estimated Coordinates

In Step 2, the estimated object coordinates are transformed into a coordinate system suitable for object-based audio. As shown in Figure 3, the playback space is defined by a 3D Cartesian coordinate system, with the listener at the origin. The x-, y-, and z-axes represent the horizontal, vertical, and depth directions, respectively. The playback space coordinates $x_{l}$, $y_{l}$, and $z_{l}$ are normalized to the range $[-1, 1]$. In this study, the playback environment assumes that the monitor is positioned in front of the listener at $z = 1$. The coordinates $(x'_{l}, y'_{l}, z'_{l})$ estimated in Step 1 are converted to the normalized playback space coordinates $(x_{l}, y_{l}, z_{l})$ using the following equations:

$$x_{l} = \frac{W_{m}}{W_{s}} \cdot \frac{2 x'_{l} - (N_{x} - 1)}{N_{x} - 1}, \tag{3}$$

$$y_{l} = \frac{H_{m}}{H_{s}} \cdot \frac{2 y'_{l} - (N_{y} - 1)}{N_{y} - 1}, \tag{4}$$

$$z_{l} = d + (1 - d)\,\frac{z'_{l} - \min_{k}(z'_{k})}{\max_{k}(z'_{k}) - \min_{k}(z'_{k})}, \tag{5}$$

where $l, k = (1, 2, \dots, L)$ represents the visual image frame index, and $L$ is the total number of visual image frames. $N_{x}$ [pixel] and $N_{y}$ [pixel] are the number of visual image pixels in the horizontal and vertical directions. $W_{m}$ [m] and $H_{m}$ [m] represent the width and height of the monitor. $W_{s}$ [m] and $H_{s}$ [m] represent the width and height of the playback space. This conversion maps the sound image to the same position as the corresponding object on the monitor in the $xy$-plane. Along the $xz$-plane, the system positions the sound image between the monitor and the listener. Furthermore, $d \in [0, 1]$ in Equation (5) is a coefficient that constrains the range of sound image movement in the depth direction. The optimal value of $d$ will be determined in a subsequent subjective evaluation experiment.
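The conversion in Equations (3) to (5) can be sketched in Python as follows. The function and parameter names are hypothetical. The handling of a sequence whose depth never changes (where Equation (5) would divide by zero) is an assumption of this sketch and is not specified in the text. The direction of the pixel y-axis is likewise left exactly as Equation (4) is written.

```python
from typing import List, Sequence, Tuple


def convert_xy(x_px: float, y_px: float,
               n_x: int, n_y: int,
               w_m: float, h_m: float,
               w_s: float, h_s: float) -> Tuple[float, float]:
    """Map pixel coordinates to playback-space x and y (Equations (3), (4))."""
    x = (w_m / w_s) * (2.0 * x_px - (n_x - 1)) / (n_x - 1)
    y = (h_m / h_s) * (2.0 * y_px - (n_y - 1)) / (n_y - 1)
    return x, y


def convert_depth(z_est: Sequence[float], d: float) -> List[float]:
    """Min-max normalize the depth series into [d, 1] (Equation (5))."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must lie in [0, 1]")
    z_min, z_max = min(z_est), max(z_est)
    span = z_max - z_min
    if span == 0.0:
        # Assumption: with no depth change, keep the image at the monitor (z = 1).
        return [1.0 for _ in z_est]
    return [d + (1.0 - d) * (z - z_min) / span for z in z_est]
```

With `d = 0` the sound image travels the full range between the listener (`z = 0`) and the monitor (`z = 1`); with `d = 1` it stays on the monitor plane. For instance, `convert_depth([2.0, 4.0, 6.0], 0.5)` gives `[0.5, 0.75, 1.0]`.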

4.3. Step 3: Design of Acoustic Metadata from the Converted Coordinates

In Step 3, acoustic metadata is designed from the converted coordinates. The time resolution of the acoustic metadata is determined by the visual image frame rate and the sampling frequency of the sound signal. The time resolution $T$ [sample/frame] is given by the following equation:

$$T = \frac{F_{s}}{F_{v}}, \tag{6}$$

where $F_{v}$ [frame/s] is the frame rate of the visual image and $F_{s}$ [sample/s] is the sampling frequency of the sound signal. The time information $T_{l}$ [sample] is calculated using the following equation:

$$T_{l} = (l - 1)\,T. \tag{7}$$

In this method, we use the .atmos format, which is part of the Dolby Atmos Master File (DAMF) specification, as the acoustic metadata [35,36]. Figure 5 shows an example of how coordinate and temporal information is described in the DAMF format. In Figure 5, sampleRate corresponds to the sampling frequency of the sound signal $F_{s}$, samplePos corresponds to the time information for the $l$-th frame $T_{l}$, and pos corresponds to the coordinate information for the $l$-th frame $(x_{l}, y_{l}, z_{l})$. Here, the converted coordinates are defined such that the x-, y-, and z-axes correspond to the horizontal, vertical, and depth directions, respectively. In contrast, in the .atmos acoustic metadata format, the x-axis is used for the horizontal direction, the y-axis for the depth direction, and the z-axis for the vertical direction. Therefore, the coordinates are described in the format pos: $(x_{l}, z_{l}, y_{l})$.
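The timing rule in Equations (6) and (7) and the axis swap described above can be sketched in Python as follows. Only the field names `sampleRate`, `samplePos`, and `pos` come from the text. The surrounding dictionary layout, the `events` key, and the rounding of sample positions to integers are assumptions of this sketch, so the output is not a valid .atmos file.

```python
from typing import Dict, List, Sequence, Tuple


def samples_per_frame(f_s: float, f_v: float) -> float:
    """Time resolution T = F_s / F_v in samples per frame (Equation (6))."""
    return f_s / f_v


def build_events(coords: Sequence[Tuple[float, float, float]],
                 f_s: int, f_v: float) -> Dict:
    """Pair each frame's converted (x, y, z) with its sample position.

    coords[l - 1] holds (x_l, y_l, z_l) with y vertical and z depth.
    """
    t = samples_per_frame(f_s, f_v)
    events: List[Dict] = []
    for l, (x, y, z) in enumerate(coords, start=1):
        events.append({
            # Equation (7): T_l = (l - 1) T; rounding is an assumption here.
            "samplePos": round((l - 1) * t),
            # Swap to the (horizontal, depth, vertical) order: (x_l, z_l, y_l).
            "pos": [x, z, y],
        })
    # "events" is a hypothetical key used only to group the entries.
    return {"sampleRate": f_s, "events": events}
```

As an illustration, with an assumed 48 kHz sound signal and 30 fps video, consecutive frames are spaced 1600 samples apart, so the second frame is written at `samplePos` 1600.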

5. Objective Evaluation Experiment

We conducted an objective evaluation to confirm the accuracy of object position estimation in the horizontal and vertical directions, as estimation errors directly affect the precision of acoustic metadata design. This evaluation is important because positional inaccuracies can cause inconsistencies between the visual images and the corresponding sound images. The object positions estimated by Dist-YOLO in Step 1 of the proposed method contain estimation errors. Therefore, this experiment was designed to evaluate the impact of these errors on the accuracy of sound image localization.

5.1. Experimental Conditions for Objective Evaluation

We created 3D CG animations using Unity to provide controlled visual images [41]. Each visual image featured a single object (a sphere) moving within the field of view. Seven different object movement patterns were used. These patterns were created by combining three primary movement directions: horizontal, vertical, and depth, as illustrated in Figure 6. Table 2 lists the object movement patterns; the starting and ending coordinates of the object’s movement in Unity are denoted as Start and End, respectively. The camera was placed at $(0.0, 0.0, -10.0)$ with a perspective projection and a field of view of 27 degrees. The object in the visual images moved at an average speed of 643 pixels/s. The visual image resolution was $1920 \times 1080$ pixels, and the frame rate was 30 fps. As shown in Figure 7, the estimation errors were evaluated as the angular differences, θ and φ, between the ground truth and estimated object positions from the listening point’s perspective. The ground truth object positions were obtained from Unity, whereas the estimated positions were the output of Dist-YOLO.

5.2. Experimental Results for Objective Evaluation

Figure 8 shows estimation errors along the horizontal and vertical directions for each movement pattern. Objects moving horizontally show larger errors in the horizontal direction, whereas objects moving vertically show larger errors in the vertical direction. Conditions involving depth directional movement also yield larger errors than the other conditions, likely because scale changes induced by depth directional movement increase Dist-YOLO’s estimation error. For evaluation purposes, we considered the ventriloquism effect [26]. The ventriloquism effect refers to the perceptual phenomenon in which listeners tolerate spatial misalignment between visual and auditory stimuli within certain angular thresholds [28]. When visual and auditory stimuli are presented simultaneously, spatial misalignments up to approximately 11 degrees horizontally and 19 degrees vertically are perceived as consistent [27]. This effect is important for the proposed method, as it defines the acceptable range of positional estimation errors.

As indicated in Figure 8, the estimation errors measured in this evaluation fell within the acceptable range of 11 degrees horizontally and 19 degrees vertically for all movement patterns.
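The comparison against the ventriloquism thresholds can be sketched in Python as follows. The text defines θ and φ only through Figure 7, so the azimuth and elevation decomposition used here (listener at the origin, z pointing toward the monitor, y vertical) is an assumption, and all function names are hypothetical. The thresholds of 11 and 19 degrees are the values quoted above.

```python
import math
from typing import Tuple

HORIZONTAL_TOLERANCE_DEG = 11.0  # horizontal threshold quoted in the text
VERTICAL_TOLERANCE_DEG = 19.0    # vertical threshold quoted in the text

Point = Tuple[float, float, float]


def direction_angles(x: float, y: float, z: float) -> Tuple[float, float]:
    """Azimuth and elevation (degrees) of a point seen from the origin."""
    azimuth = math.degrees(math.atan2(x, z))
    elevation = math.degrees(math.atan2(y, math.hypot(x, z)))
    return azimuth, elevation


def angular_errors(truth: Point, estimate: Point) -> Tuple[float, float]:
    """Absolute horizontal and vertical angular differences (theta, phi).

    No wrap-around handling is included, which is adequate only for
    positions in front of the listener.
    """
    az_t, el_t = direction_angles(*truth)
    az_e, el_e = direction_angles(*estimate)
    return abs(az_e - az_t), abs(el_e - el_t)


def within_tolerance(theta: float, phi: float) -> bool:
    """True if both errors fall inside the quoted tolerance range."""
    return theta <= HORIZONTAL_TOLERANCE_DEG and phi <= VERTICAL_TOLERANCE_DEG
```

A frame would count as acceptable in this sketch when `within_tolerance(*angular_errors(truth, estimate))` is true.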

6. Subjective Evaluation Experiment

We confirmed the effectiveness of the proposed method through a subjective evaluation experiment of sound image localization accuracy. In this experiment, we simultaneously presented participants with visual images and sound images constructed using the proposed method, and evaluated the consistency between the visual images and sound images. Specifically, this experiment focused on sound image localization accuracy in the depth direction. We identified the condition that achieved the highest consistency with the visual images by varying the range of sound image movement for sound images constructed from the same visual images.

6.1. Experimental Conditions for Subjective Evaluation

In this experiment, sound images were constructed using Dolby Atmos, an object-based audio format [8]. Figure 9 and Figure 10 show the loudspeaker and monitor arrangements used in the experiment [42]. Table 3 shows the conditions for the visual images and audio signals, and Figure 11 shows the spatial relationship between the participants and the monitor. Table 4 shows the equipment used in the experiment. Figure 12 shows examples of start and end visual image frames used in this experiment. These visual images feature a car or a helicopter moving in a straight line within the field of view. These were created using Blender [43]. Sound images were constructed using the proposed method, with sound effects (car driving and helicopter flying sounds) from the BBC Sound Effects corresponding to each visual image [44]. Five evaluation conditions with different ranges of sound image movement in the depth direction were implemented. These were set by varying the value of the coefficient $d$ in Equation (5). The five evaluation conditions, Cond. AA to Cond. AE, correspond to different ranges of sound image movement, as shown in Figure 11. The baseline condition, Cond. AA, is defined by the absence of sound image movement in the depth direction, with movement restricted to the $xy$-plane of the monitor. This configuration conforms to the conventional method proposed by Kato et al. (2024) [21]. In this study, it serves as a reference for evaluating the effect of the presence and range of movement along the depth direction on sound image localization accuracy.

Ten participants (one woman, nine men) took part in the experiment [45,46]. All had normal hearing, confirmed during their annual medical checkups and by audiometric screening, and no hearing-related disorders that could affect the results were observed. Participants rated the consistency between the visual images and sound images for each condition on a four-point scale: “Strongly consistent”, “Slightly consistent”, “Slightly inconsistent”, and “Strongly inconsistent”. In this experiment, consistency was defined as the correspondence between the movement of a visible object (e.g., a car or helicopter) and the movement of the perceived auditory image. This definition explicitly excluded judgments based on any incongruity between the visual object and the timbre of the presented sound. Before the experiment, all participants attended a briefing in which the evaluation criteria were explained; they were instructed that the primary focus of their judgments was the consistency of the movement. The participants did not undergo any prior training for this experiment.

The experimental procedure was conducted as follows: first, five conditions (Cond. AA to Cond. AE) were presented in a randomized order once for a visual image of a car, followed by the same procedure for a visual image of a helicopter. A 30-s interval was provided after each condition, during which participants recorded their responses. Responses were collected using Google Forms, and the sessions were conducted individually for each participant. The total duration of the experiment was approximately 10 min per person.

6.2. Experimental Results for Subjective Evaluation

Figure 13 shows the Mean Opinion Score (MOS) for the consistency between the perceived sound image and the visual image. The vertical axis shows MOS on a four-point scale (4 = Strongly consistent, 3 = Slightly consistent, 2 = Slightly inconsistent, 1 = Strongly inconsistent). The horizontal axis lists the experimental conditions, Cond. AA through Cond. AE. Among the conditions, Cond. AD yielded the highest mean (3.7), followed by Cond. AC (3.5), Cond. AE (2.8), Cond. AB (2.3), and the baseline condition, Cond. AA (1.9).

To test for differences across conditions, we applied the nonparametric Friedman test. The test indicated a significant effect of condition, $\chi^{2}(4) = 31.27$, $p < 0.001$. We then conducted Wilcoxon signed-rank tests comparing the top-rated condition, Cond. AD, with each of the other conditions; p-values were Bonferroni-corrected for four comparisons ($m = 4$). The results are summarized in Table 5. Post hoc analyses showed that Cond. AD yielded significantly higher consistency than the conventional baseline (Cond. AA) as well as Cond. AB and Cond. AE. The marked improvement over Cond. AA suggests that introducing depth-wise motion of the sound image is a key factor in enhancing consistency with the visual image. Conversely, the comparisons with Cond. AB and Cond. AE indicate that consistency tends to decrease when the amount of depth movement is too restricted or overly exaggerated.
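The analysis pipeline described above can be outlined with a standard-library Python sketch. It computes a tie-corrected Friedman statistic, a chi-square tail probability that is valid only for even degrees of freedom, and a Bonferroni adjustment. All function names are hypothetical, and this is not the authors' analysis code. The Wilcoxon signed-rank test itself is omitted; in practice a statistics package (for example `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon`) would normally be used.

```python
import math
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(ratings: Sequence[Sequence[float]]) -> float:
    """Tie-corrected Friedman chi-square; ratings[p][c] = participant p, condition c.

    Raises ZeroDivisionError if every participant rates all conditions equally.
    """
    n, k = len(ratings), len(ratings[0])
    rank_sums = [0.0] * k
    tie_term = 0.0
    for row in ratings:
        row = list(row)
        for c, r in enumerate(average_ranks(row)):
            rank_sums[c] += r
        for v in set(row):
            t = row.count(v)
            tie_term += t ** 3 - t
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * k * (k * k - 1))
    return chi2 / correction


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail chi-square probability, closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def bonferroni(p_values: Sequence[float], m: Optional[int] = None) -> List[float]:
    """Multiply each p-value by the number of comparisons m, capped at 1."""
    m = len(p_values) if m is None else m
    return [min(1.0, p * m) for p in p_values]
```

Five conditions give $5 - 1 = 4$ degrees of freedom, and `chi2_sf_even_df(31.27, 4)` evaluates $e^{-15.635}(1 + 15.635)$, which is far below 0.001 and agrees with the reported $p < 0.001$.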

Using Cond. AC, which did not differ significantly from Cond. AD, as the reference, we conducted the same post hoc Wilcoxon signed-rank tests (Bonferroni correction; $m = 4$). The results, summarized in Table 6, show that Cond. AC yielded significantly higher consistency than Cond. AA, Cond. AB, and Cond. AE. Taken together with the analyses for Cond. AD, these findings indicate that the sound image achieves the highest level of consistency with the visual image under either Cond. AC or Cond. AD. Furthermore, the visual images of both cars and helicopters yielded comparable results, with no significant differences observed between the visual images. We attribute the improvement under Cond. AC and Cond. AD to limiting the movement of the sound image to the vicinity of the monitor, which suppresses the spatial discrepancy with the visual information and is thought to enhance the ventriloquism effect and improve sound image localization accuracy. In particular, setting the parameter $d$ in Equation (5) to values between $0.3$ and $0.5$ produces a sound image whose perceived auditory movement closely aligns with the visual movement.

7. Conclusions

In this paper, we proposed a method for designing acoustic metadata for object-based audio. The sound image positions were determined by estimating the coordinates of objects in the visual images. Then, acoustic metadata was designed based on the estimated coordinates, the frame rate of the visual images, and the sampling frequency of the sound signal. The objective evaluation confirmed that the object position estimation errors fell within the acceptable range of the ventriloquism effect. The subjective evaluation identified the range of sound image movement in the depth direction necessary for constructing sound images that are highly consistent with the visual images.

In the proposed method, control of the sound image was driven by coordinate information in the acoustic metadata, so distance was effectively cued only by sound pressure level. However, human judgments of auditory distance rely on multiple cues, including spectral variations and the Direct-to-Reverberant Ratio [47,48,49,50]. We anticipate that incorporating these additional cues will further improve sound image localization accuracy. In future work, we also plan to predict and compensate for object coordinates when the object moves out of the field of view in the visual images, based on the object’s past movement information.

Author Contributions

Conceptualization, M.N. and T.N.; methodology, S.K., M.N. and T.N.; software, S.K.; validation, S.K., M.N., T.N. and Y.S.; formal analysis, S.K.; investigation, S.K.; resources, T.N.; data curation, S.K.; writing—original draft preparation, S.K.; writing—review and editing, S.K., M.N., T.N. and Y.S.; visualization, S.K.; supervision, M.N., T.N. and Y.S.; project administration, T.N.; funding acquisition, M.N., T.N. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by Ritsumeikan University R-GIRO, ARC and RARA, and JSPS KAKENHI Grant Numbers JP23K28115 and JP25H01158.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of this study; in the collection, analyses, or interpretation of the data; in the writing of the manuscript; or in the decision to publish the results.

References

Sohn, Y.; Cho, M.; Paik, J. Design of 8K Broadcasting System based on MMT over Heterogeneous Networks. KSII Trans. Internet Inf. Syst. 2017, 11, 4077–4091. [Google Scholar] [CrossRef]
Rec. ITU-R BS.1909-1; Performance Requirements for an Advanced Sound System for Use with or Without Accompanying Picture. International Telecommunication Union: Geneva, Switzerland, 2023.
Rec. ITU-R BS.775-3; Multichannel Stereophonic Sound System with and Without Accompanying Picture. International Telecommunication Union: Geneva, Switzerland, 2012.
Hamasaki, K.; Nishiguchi, T.; Okumura, R.; Nakayama, Y.; Ando, A. A 22.2 multichannel sound system for ultrahigh-definition TV (UHDTV). SMPTE Motion Imaging J. 2008, 117, 40–49. [Google Scholar] [CrossRef]
Herre, J.; Hilpert, J.; Kuntz, A.; Plogsties, J. MPEG-H 3D Audio—The New Standard for Coding of Immersive Spatial Audio. IEEE J. Sel. Top. Signal Process. 2015, 9, 770–779. [Google Scholar] [CrossRef]
Mann, M.; Churnside, A.; Bonney, A.; Melchior, F. Object-Based Audio Applied to Football Broadcasts. In Proceedings of the 2013 ACM International Workshop on Immersive Media Experiences, Barcelona, Spain, 22 October 2013; pp. 13–16. [Google Scholar]
Sadia, S.; Carbon, C.-C. Looking for the Edge of the World: How 3D Immersive Audio Produces a Shift from an Internalised Inner Voice to Unsymbolised Affect-Driven Ways of Thinking and Heightened Sensory Awareness. Behav. Sci. 2023, 13, 858. [Google Scholar] [CrossRef] [PubMed]
Dolby Laboratories. Dolby Atmos. Available online: https://www.dolby.com/technologies/dolby-atmos/ (accessed on 29 October 2025).
DTS, Inc. DTS:X. Available online: https://dts.com/dts-x/ (accessed on 29 October 2025).
Rec. ITU-R BS.2076-3; Audio Definition Model. International Telecommunication Union: Geneva, Switzerland, 2025.
Bleidt, R.; Borsum, A.; Fuchs, H.; Weiss, S.M. Object-based audio: Opportunities for improved listening experience and increased listener involvement. SMPTE Motion Imaging J. 2015, 124, 1–13.
Coleman, P.; Franck, A.; Francombe, J.; Liu, Q.; de Campos, T.; Hughes, R.J.; Menzies, D.; Simón Gálvez, M.F.; Tang, Y.; Woodcock, J.; et al. An audio-visual system for object-based audio: From recording to listening. IEEE Trans. Multimed. 2018, 20, 1919–1931.
Smisek, J.; Jancosek, M.; Pajdla, T. 3D with Kinect. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 1154–1160.
Neumann. KU 100 Dummy Head Microphone. Available online: https://www.neumann.com/en-us/products/microphones/ku-100 (accessed on 14 November 2025).
Arteaga, D.; Pons, J. Multichannel-based Learning for Audio Object Extraction. In Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 206–210.
Shi, J.; Wu, Q.; Zhang, D.; Ye, L. Enhancing Immersion in Virtual Reality: Cost-Efficient Spatial Audio Generation for Panoramic Videos. In Proceedings of the 2024 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), Orlando, FL, USA, 16–21 March 2024; pp. 1224–1225.
Roebel, A.; Pons, J.; Liuni, M.; Lagrange, M. On automatic drum transcription using non-negative matrix deconvolution and itakura saito divergence. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 414–418.
Mitsufuji, Y.; Liuni, M.; Baker, A.; Roebel, A. Online non-negative tensor deconvolution for source detection in 3DTV audio. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 3082–3086.
Weninger, F.; Schuller, B.; Wöllmer, M.; Rigoll, G. Localization of non-linguistic events in spontaneous speech by non-negative matrix factorization and long short-term memory. In Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, 22–27 May 2011; pp. 5840–5843.
Nikunen, J.; Virtanen, T.; Vilermo, M. Multichannel audio upmixing based on non-negative tensor factorization representation. In Proceedings of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 16–19 October 2011; pp. 33–36.
Kato, S.; Iwai, K.; Nishiura, T.; Soeta, Y. Construction Method of Acoustic Metadata on an Object-Based Audio Utilizing Coordinate Estimation of a Moving Sound Source. In Proceedings of the 2024 IEEE 13th Global Conference on Consumer Electronics, Kitakyushu, Japan, 29 October–1 November 2024; pp. 411–414.
Kato, S.; Nakayama, M.; Nishiura, T.; Soeta, Y. Design of Acoustic Metadata on Object-Based Audio Utilizing Estimated 3D-Position of Sound Source in Video. In Proceedings of the 2025 IEEE 14th Global Conference on Consumer Electronics, Osaka, Japan, 23–26 September 2025; pp. 1407–1410.
Benezeth, Y.; Jodoin, P.M.; Emile, B.; Laurent, H.; Rosenberger, C. Review and Evaluation of Commonly-Implemented Background Subtraction Algorithms. In Proceedings of the 2008 19th International Conference on Pattern Recognition, Tampa, FL, USA, 8–11 December 2008; pp. 1–4.
Vajgl, M.; Hurtik, P.; Nejezchleba, T. Dist-YOLO: Fast Object Detection with Distance Estimation. Appl. Sci. 2022, 12, 1354.
Pulkki, V. Virtual sound source positioning using vector base amplitude panning. J. Audio Eng. Soc. 1997, 45, 456–466.
Alais, D.; Burr, D. The ventriloquist effect results from near-optimal bimodal integration. Curr. Biol. 2004, 14, 257–262.
Hendrickx, E.; Paquier, M.; Koehl, V.; Palacino, J. Ventriloquism effect with sound stimuli varying in both azimuth and elevation. J. Acoust. Soc. Am. 2015, 138, 3686–3697.
Komiyama, S. Subjective evaluation of angular displacement between picture and sound directions for HDTV sound systems. J. Audio Eng. Soc. 1989, 37, 210–214.
Hládek, L.; Le Dantec, C.C.; Kopčo, N.; Seitz, A. Ventriloquism effect and aftereffect in the distance dimension. Proc. Meet. Acoust. 2013, 19, 050042.
Rec. ITU-R BS.2088-1; Long-Form File Format for the International Exchange of Audio Programme Materials with Metadata. International Telecommunication Union: Geneva, Switzerland, 2019.
Rec. ITU-R BS.2094-2; Common Definitions for the Audio Definition Model. International Telecommunication Union: Geneva, Switzerland, 2025.
EBU Tech 3285; Specification of the Broadcast Wave Format (BWF)—A Format for Audio Data Files in Broadcasting. European Broadcasting Union: Grand-Saconnex, Switzerland, 2011.
EBU Tech 3285 Supplement 6; Specification of the Broadcast Wave Format (BWF)—Supplement 6: Dolby Metadata, <dbmd> chunk. European Broadcasting Union: Grand-Saconnex, Switzerland, 2009.
EBU Tech 3306; MBWF/RF64: An Extended File Format for Audio—A BWF-Compatible Multichannel File Format Enabling File Sizes to Exceed 4 Gbyte. European Broadcasting Union: Grand-Saconnex, Switzerland, 2009.
Dolby Laboratories. Dolby Atmos Master ADM Profile. Available online: https://professionalsupport.dolby.com/s/article/Dolby-Atmos-ADM-Profile-specification?language=en_US (accessed on 29 October 2025).
Dolby Laboratories. Overview of Dolby Atmos Master File Formats. Available online: https://professionalsupport.dolby.com/s/article/Overview-of-Dolby-Atmos-Master-File-Formats?language=en_US (accessed on 29 October 2025).
Avid Technology, Inc. Pro Tools. Available online: https://www.avid.com/pro-tools (accessed on 29 October 2025).
Apple Inc. Logic Pro for Mac. Available online: https://www.apple.com/logic-pro/ (accessed on 29 October 2025).
Steinberg Media Technologies GmbH. Nuendo. Available online: https://www.steinberg.net/nuendo/ (accessed on 29 October 2025).
Avid Technology, Inc. PT210D Course. Available online: https://www.avid.com/courses/pt210d-pro-tools-dolby-atmos-production (accessed on 29 October 2025).
Unity Technologies. Unity Real-Time Development Platform. Available online: https://unity.com/ (accessed on 29 October 2025).
Dolby Laboratories. 7.1.4 Overhead Speaker Setup. Available online: https://www.dolby.com/about/support/guide/speaker-setup-guides/7.1.4-overhead-speaker-setup-guide (accessed on 29 October 2025).
Blender Foundation. Blender—The Free and Open Source 3D Creation Software. Available online: https://www.blender.org/ (accessed on 29 October 2025).
BBC. BBC Sound Effects. Available online: https://sound-effects.bbcrewind.co.uk/ (accessed on 29 October 2025).
Blau, M.; Budnik, A.; Fallahi, M.; Steffens, H.; Ewert, S.D.; Van de Par, S. Toward realistic binaural auralizations–perceptual comparison between measurement and simulation-based auralizations and the real room for a classroom scenario. Acta Acustica 2021, 5, 8.
Pawlak, A.; Lee, H.; Mäkivirta, A.; Lund, T. Spatial Analysis and Synthesis Methods: Subjective and Objective Evaluations Using Various Microphone Arrays in the Auralization of a Critical Listening Room. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 3986–4001.
Roman, N.; Wang, D.; Brown, G.J. Speech segregation based on sound localization. J. Acoust. Soc. Am. 2003, 114, 2236–2252.
Lu, Y.C.; Cooke, M. Binaural Estimation of Sound Source Distance via the Direct-to-Reverberant Energy Ratio for Static and Moving Sources. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1793–1805.
Madmoni, L.; Tibor, S.; Nelken, I.; Rafaely, B. The Effect of Partial Time-Frequency Masking of the Direct Sound on the Perception of Reverberant Speech. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 2037–2047.
Prodi, N.; Pellegatti, M.; Visentin, C. Effects of type of early reflection, clarity of speech, reverberation and diffuse noise on the spatial perception of a speech source and its intelligibility. J. Acoust. Soc. Am. 2022, 151, 3522–3534.

Figure 1. Workflow for film production using object-based audio.

Figure 2. Example of 3D Panner in a DAW (Pro Tools).

Figure 4. Example of object position estimation using Dist-YOLO.

Figure 5. Example of coordinate and time information in the .atmos acoustic metadata format.

Figure 6. Example of a visual image used in the experiment.

Figure 7. Definition of estimation errors: (a) Horizontal direction. (b) Vertical direction.

Figure 8. Estimation errors for each object movement pattern: (a) Horizontal direction. (b) Vertical direction.

Figure 9. Loudspeaker arrangement used in the experiment: (a) Middle layer. (b) Top layer.

Figure 10. Monitor arrangement used in the experiment: (a) Top view. (b) Side view.

Figure 11. Sound image movement ranges for each evaluation condition.

Figure 12. Examples of visual image frames used in the experiment: (a) Start frame of car visual image. (b) End frame of car visual image. (c) Start frame of helicopter visual image. (d) End frame of helicopter visual image.

Figure 13. Consistency results between the visual images and sound images.

Table 1. Object setting parameters included in the DAMF metadata.

Setting	Explanation
samplePos	Sample position at which the setting is applied
pos	Sound image position
active	Object active state
snap	Force the sound image to localize to the nearest loudspeaker
elevation	Presence or absence of panning in the vertical direction
zones	Specify the loudspeaker zone used to render the sound image
size	Sound image size
decorr	Sound spreading
importance	Object importance
gain	Object volume
rampLength	Time required for a setting change to complete
trimBypass	Downmix setting for different formats
dialog	Dialog flag
music	Music flag
screenFactor	Pulls the lateral sound image position toward the screen
depthFactor	Depth factor for binaural rendering
headTrackMode	Head-tracking mode for headphone playback
binauralRenderMode	Binaural rendering mode
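
As a rough illustration of how the settings in Table 1 could be held in code, the sketch below models one object event as a Python dataclass. Only a subset of the settings is included, and the class name `ObjectEvent`, the field types, and the default values are assumptions made for this example; they are not taken from the DAMF specification.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass
class ObjectEvent:
    """One metadata event for an audio object, named after the Table 1 settings.

    Types and defaults are illustrative assumptions, not the DAMF specification.
    """
    samplePos: int                    # sample position at which the setting is applied
    pos: Tuple[float, float, float]   # sound image position (x, y, z)
    active: bool = True               # object active state
    snap: bool = False                # localize to the nearest loudspeaker
    elevation: bool = True            # panning in the vertical direction
    size: float = 0.0                 # sound image size
    gain: float = 1.0                 # object volume (assumed linear scale)
    rampLength: int = 0               # time for a setting change to complete


# Hypothetical event: the sound image is placed at (0.0, 1.0, 0.0) at sample 1600.
event = ObjectEvent(samplePos=1600, pos=(0.0, 1.0, 0.0))
record = asdict(event)  # plain dict, ready to be serialized by a writer of choice
```

Keeping the field names identical to the setting names makes it straightforward to map such records onto a metadata file, although the actual file layout has to follow the format documentation.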

Table 2. Object position for each movement pattern within Unity.

Object Movement Patterns	Start	End
Horizontal	$(- 7.5, 0.0, 10.0)$	$(7.5, 0.0, 10.0)$
Vertical	$(0.0, - 4.0, 10.0)$	$(0.0, 4.0, 10.0)$
Depth	$(0.0, 0.0, 10.0)$	$(0.0, 0.0, - 5.0)$
Horizontal + Vertical	$(- 7.5, - 4.0, 10.0)$	$(7.5, 4.0, 10.0)$
Horizontal + Depth	$(- 7.5, 0.0, 10.0)$	$(1.5, 0.0, - 5.0)$
Vertical + Depth	$(0.0, - 4.0, 10.0)$	$(0.0, 0.6, - 5.0)$
Horizontal + Vertical + Depth	$(- 7.5, - 4.0, 10.0)$	$(1.5, 0.6, - 5.0)$
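
The start and end positions in Table 2 can be turned into a per-frame trajectory once a motion model is chosen. The sketch below assumes straight-line motion at constant speed, which is an assumption of this example and not stated in the table; the function name `interpolate_path` and the frame count are likewise hypothetical.

```python
def interpolate_path(start, end, n_frames):
    """Return n_frames points moving linearly from start to end (inclusive)."""
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    path = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        path.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return path


# Start and end coordinates copied from Table 2 (two of the seven patterns).
PATTERNS = {
    "Depth": ((0.0, 0.0, 10.0), (0.0, 0.0, -5.0)),
    "Horizontal + Vertical + Depth": ((-7.5, -4.0, 10.0), (1.5, 0.6, -5.0)),
}

depth_path = interpolate_path(*PATTERNS["Depth"], n_frames=4)
full_path = interpolate_path(*PATTERNS["Horizontal + Vertical + Depth"], n_frames=31)
```

For the depth pattern only the third coordinate changes, so the first two coordinates stay at zero along the whole path.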

Table 3. Experimental conditions of visual images, sound signals, and participants.

Sampling frequency of sound sources $(F_{s})$	48,000 Hz
Quantization of sound sources	24 bits
File type of visual images	mov
Codec of visual images	H.264/AVC
Bitrate of visual images	1.14 Mb/s
Bit depth of visual images	8 bits
Ambient noise level	$L_{A}$ = 35.1 dB
Sound pressure level at the listening point	$L_{A}$ = 75.2 dB
Frame rate of visual images $(F_{v})$	30 fps
Resolution of visual images $(N_{x}, N_{y})$	1920 × 1080 pixels
Monitor width and height $(W_{m}, H_{m})$	1.32 m, 0.71 m
Resolution of monitor	3840 × 2160 pixels
Playback space width and height $(W_{s}, H_{s})$	2.82 m, 2.62 m
Number of participants	10 (1 woman, 9 men)
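
With the values of Table 3, one video frame spans 48,000 / 30 = 1600 audio samples. The sketch below uses this ratio to convert a frame index into a sample position such as the `samplePos` setting of Table 1; whether the authors compute sample positions this way is an assumption of this example, and the function name is hypothetical.

```python
FS_HZ = 48_000   # sampling frequency of sound sources, F_s (Table 3)
FV_FPS = 30      # frame rate of visual images, F_v (Table 3)


def frame_to_sample_pos(frame_index, fs_hz=FS_HZ, fv_fps=FV_FPS):
    """Return the audio sample index aligned with the start of a video frame."""
    return round(frame_index * fs_hz / fv_fps)


samples_per_frame = FS_HZ // FV_FPS          # 1600 samples per video frame
one_second = frame_to_sample_pos(FV_FPS)     # frame 30 starts at sample 48000
```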

Table 4. Equipment used in the experiment.

Device	Type
Loudspeaker	YAMAHA (Hamamatsu, Japan), VXS5
Power amplifier	YAMAHA (Hamamatsu, Japan), XMV8280
D/A converter	RME (Haimhausen, Germany), M-32 DA
Audio interface	Avid (Burlington, VT, USA), Pro Tools | MTRX
Monitor	Panasonic (Kadoma, Japan), TH-60DX850

Table 5. Comparison of sound image consistency between Cond. AD and the other conditions.

Comparison Target	p-Value (Unadjusted)	p-Value (Adjusted)	Significance
Cond. AA	0.0010	0.0039	Significant (**)
Cond. AB	0.0042	0.0167	Significant (*)
Cond. AC	0.3458	1.0000	Not significant
Cond. AE	0.0003	0.0012	Significant (**)

*: p < 0.05, **: p < 0.01.

Table 6. Comparison of sound image consistency between Cond. AC and the other conditions.

Comparison Target	p-Value (Unadjusted)	p-Value (Adjusted)	Significance
Cond. AA	0.0020	0.0079	Significant (**)
Cond. AB	0.0078	0.0312	Significant (*)
Cond. AD	0.5000	1.0000	Not significant
Cond. AE	0.0078	0.0312	Significant (*)

*: p < 0.05, **: p < 0.01.
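
The adjusted p-values in Tables 5 and 6 are numerically consistent with multiplying each unadjusted value by the number of comparisons (four) and capping the result at 1, which is what a Bonferroni correction does. This reading is inferred from the tabulated numbers and is an assumption of this example. The sketch below reproduces the arithmetic; small differences in the last digit are expected because the tabulated unadjusted values are already rounded.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons and cap at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Values copied from Table 5 (comparisons against Cond. AD).
table5_unadjusted = [0.0010, 0.0042, 0.3458, 0.0003]
table5_reported = [0.0039, 0.0167, 1.0000, 0.0012]

# Values copied from Table 6 (comparisons against Cond. AC).
table6_unadjusted = [0.0020, 0.0078, 0.5000, 0.0078]
table6_reported = [0.0079, 0.0312, 1.0000, 0.0312]

table5_adjusted = bonferroni(table5_unadjusted)
table6_adjusted = bonferroni(table6_unadjusted)

# Largest gap between recomputed and reported values across both tables.
max_gap = max(
    abs(a - r)
    for a, r in zip(table5_adjusted + table6_adjusted,
                    table5_reported + table6_reported)
)
```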

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kato, S.; Nakayama, M.; Nishiura, T.; Soeta, Y. Acoustic Metadata Design on Object-Based Audio Using Estimated 3D-Position from Visual Image Toward Depth-Directional Sound Image Localization. Acoustics 2026, 8, 3. https://doi.org/10.3390/acoustics8010003

