3-D Sound Image Reproduction Method Based on Spherical Harmonic Expansion for 22.2 Multichannel Audio

Kenta Iwai; Hiromu Suzuki; Takanobu Nishiura

doi:10.3390/app12041994

,

and

College of Information Science and Engineering, Ritsumeikan University, Kusatsu 525-8577, Japan

^*

Authors to whom correspondence should be addressed.

Appl. Sci.2022, 12(4), 1994;https://doi.org/10.3390/app12041994

This article belongs to the Special Issue Audio and Acoustic Signal Processing

Version Notes

Order Reprints

Review Reports

Abstract

In this paper, we propose a three-dimensional (3-D) sound image reproduction method based on spherical harmonic (SH) expansion for 22.2 multichannel audio. 22.2 multichannel audio is a 3-D sound field reproduction system that has been developed for ultra-high definition television (UHDTV). This system can reproduce 3-D sound images by simultaneously driving 22 loudspeakers and two sub-woofers. To control the 3-D sound image, vector base amplitude panning (VBAP) is conventionally used. VBAP can control the direction of 3-D sound image by weighting the input signal and emitting it from three loudspeakers. However, VBAP cannot control the distance of the 3-D sound image because it calculates the weight by only considering the image’s direction. To solve this problem, we propose a novel 3-D sound image reconstruction method based on SH expansion. The proposed method can control both the direction and distance of the 3-D sound image by controlling the sound directivity on the basis of spherical harmonics (SHs) and mode matching. The directivity of the 3-D sound image is obtained in the SH domain. In addition, the distance of the 3-D sound image is represented by the mode strength. The signal obtained by the proposed method is then emitted from loudspeakers and the 3-D sound image can be reproduced accurately with consideration of not only the direction but also the distance. A number of experimental results show that the proposed method can control both the direction and distance of 3-D sound images.

Keywords:

22.2 multichannel audio; spherical harmonic expansion; 3-D sound image reproduction; mode matching

1. Introduction

Three-dimensional (3-D) sound field reproduction systems have become increasingly popular as video technology has advanced. The 3-D sound field reproduction systems are classified with a psychoacoustics-based system and physical acoustics-based system. The binaural system and transaural system are traditional psychoacoustics-based systems [1]. These systems represent the sound pressure at a listener’s ears by using a head-related transfer function, which represents the reflection and diffraction by a user’s head and torso. In other words, the psychoacoustics-based system represents the direction of the sound image. On the other hand, physical acoustics-based systems, such as wave field synthesis [1], are based on the Kirchhoff–Helmholtz integral and reproduce the sound field by using multiple loudspeakers. In [2], the researchers proposed the sound field reproduction system by using dodecahedron loudspeaker array and achieved the reproducing sound field outside of the loudspeaker array. In [3], higher-order Ambisonics (HOA) is used to reproduce a 2-D sound field in the surrounding area by a circular loudspeaker array and cylindrical loudspeaker array. These researches depict the effectiveness of using loudspeaker arrays to achieve accurate sound field reproduction. In this paper, we focus on a 22.2 multichannel audio [4] as multiple loudspeakers.

22.2 multichannel audio [4] is a 3-D sound field reproduction systems, which has been developed for ultra-high definition television (UHDTV). Figure 1 and Figure 2 and Table 1 show the loudspeaker arrangement of 22.2 multichannel audio, the labels and installation intervals of loudspeakers, and the requirement of the loudspeaker arrangement, respectively. This system can be divided into three layers: upper, middle, and lower. It consists of nine loudspeakers in the upper layer, ten in the middle layer, and three in the lower layer along with two sub-woofers called low frequency effects (LFEs). One of the practical uses of the 22.2 multichannel audio is the theater bar for home use (https://www.nhk.or.jp/strl/open2018/tenji/t2_e.html) (accessed date 20 January 2022). The theater bar is the home reproduction system of 22.2 multichannel audio with a line–array loudspeaker. Recently, the theater bar as a consumer product has been studied in Japan (https://www.jas-audio.or.jp/journal_contents/journal202111_post16264) (accessed date 20 January 2022). According to the ITU-R standards [5], 22.2 multichannel audio can achieve the following effects in sound field.

Figure 1. Loudspeaker arrangement of 22.2 multichannel audio.

Figure 2. Labels and installation intervals of loudspeakers.

Table 1. Requirement of loudspeaker arrangement in a 22.2 multichannel audio.

1.: The arrival of sound from all directions surrounding a listening position;
2.: High quality 3-D sound impression beyond the 5.1 multichannel audio;
3.: High accuracy adjustment of the position between sound and video images.

To achieve these effects, 3-D sound field reproduction systems are generally required to control signals emitted from each loudspeaker. To reproduce the 3-D sound image or field, physical acoustics model-based methods have been studied, for example, in [6,7,8]. Physical acoustics model-based methods represent the target sound field on the basis of the Kirchhoff–Helmholtz integral equation. In other words, these methods represent the sound field as the physical quantity. In addition, a number of systems adopt the method that represents the arrival direction of the sound [4,9]. The simplest system is the two channels stereophonic system, which is based on the perception of the direction of arrival [10]. Many sound field reproduction systems of this type utilize the panning method to control a 3-D sound image.

For 22.2 multichannel audio, the conventional panning method, vector base amplitude panning (VBAP) [11,12], can control the direction of the 3-D sound image by vector synthesis. VBAP divides the reproduction space into a triangular area consisting of three loudspeakers and calculates the gains for the respective loudspeakers. However, a VBAP-based system cannot control the distance of 3-D sound images because VBAP considers only the direction of the 3-D sound image. In [12], the sound intensity is considered to obtain the gain vector for three selected loudspeakers to represent the sound intensity of the sound image with the assumption of locating the real loudspeaker at the position of the sound image. However, both VBAC cannot represent the various directivity pattern because these methods only use a direction vector from the loudspeaker to the sound image. From hereafter, we focus on the original VBAP [11] to simplify the discussion in this paper.

To solve the problem for the original VBAP, we propose a novel 3-D sound image reproduction method based on spherical harmonic (SH) expansion [13]. The proposed method can control both the direction and distance of the 3-D sound image by controlling the sound directivity on the basis of SHs and mode matching [14]. The directivity of the 3-D sound image is obtained in the SH domain and the distance of the 3-D sound image is represented by the mode strength. The signal obtained by the proposed method is then emitted from the loudspeakers and the 3-D sound image can be reproduced accurately. Through two experiments, we evaluate the accuracy of reproduced 3-D sound images between VBAP and the proposed method.

This paper is organized as follows. Section 2 explains the principle of the VBAP as the conventional panning method. In Section 3, the proposed panning method based on SH expansion is explained. A number of experimental results are shown in Section 4. Finally, Section 5 shows the conclusions of this study.

2. Conventional 3-D Sound Image Reproduction Based on VBAP

VBAP is a 3-D amplitude panning method based on vector synthesis that localizes any 3-D sound image by using three loudspeakers. By using the position of a sound image and three loudspeakers, the sound image can be reproduced at the desired position. Figure 3 shows an overview of VBAP. In 22.2 multichannel audio, the three loudspeakers, with the exception of the LFEs, are used to reproduce the 3-D sound image by VBAP. The 3-D sound image is reproduced by the following procedure.

Figure 3. Overview of vector base amplitude panning (VBAP).

1.: Obtaining the 3-D sound source position vector $p$ .
Panning requires the position vector of the 3-D sound image $p = {[p_{x} p_{y} p_{z}]}^{T}$ to generate signals for reproduction of the sound image. The vector $p$ is acquired automatically or manually from media such as video.
2.: Calculation of the gain vector $g$ .
VBAP calculates the gain vector $g = {[g_{1} g_{2} g_{3}]}^{T}$ to control the sound image by using $p$ and each unit vector $l_{1} = {[l_{1 x} l_{1 y} l_{1 z}]}^{T}$ , $l_{2} = {[l_{2 x} l_{2 y} l_{2 z}]}^{T}$ , $l_{3} = {[l_{3 x} l_{3 y} l_{3 z}]}^{T}$ from the listening point to the three loudspeakers. In VBAP, it is assumed that the 3-D sound image is on the median point among the three loudspeakers. Hence, the position of the 3-D sound image is represented as:

$\begin{matrix} p & = Lg, \end{matrix}$

(1)

where $L = [l_{1} l_{2} l_{3}]$ is the matrix of unit vectors. From Equation (1), the gain vector $g$ is obtained by:

$\begin{matrix} g & = L^{- 1} p . \end{matrix}$

(2)
3.: Normalize the gain vector $\bar{g}$ .
To prevent excessive sound pressure, the gain vector $g$ should be normalized by the $L_{2}$ norm $∥ g ∥$ as:

$\begin{matrix} \bar{g} & = \frac{g}{∥ g ∥}, \end{matrix}$

(3)

where $\bar{g} = [{\bar{g}}_{1} {\bar{g}}_{2} {\bar{g}}_{3}]$ is the normalized gain vector.
4.: Generation of the input signals $y_{i} (t)$ of the three loudspeakers.
The input signals $y_{i} (t)$ are generated by the object signal $x (t)$ and the calculated gains ${\bar{g}}_{1}$ , ${\bar{g}}_{2}$ , and ${\bar{g}}_{3}$ as:

$\begin{matrix} y_{i} (t) & = {\bar{g}}_{i} x (t), \end{matrix}$

(4)

where t is the time index and $i \in {1, 2, 3}$ is the loudspeaker index.

From these procedures, VBAP can reproduce the 3-D sound image by using the three loudspeakers.

However, VBAP has an issue regarding accurate 3-D sound image reproduction in that it cannot represent the distance of a 3-D sound image. This is because the gain vector

g

is calculated by using the radial unit vector. In other words, the gain vector

g

only considers the direction of the sound image. To solve this problem, we propose a novel 3-D sound image reproduction method that controls both the directivity and distance of the target sound image on the basis of SH expansion and mode matching.

3. Proposed 3-D Sound Image Reproduction Based on Spherical Harmonic Expansion

In this section, we propose a novel 3-D sound reproduction method for 22.2 multichannel audio based on SH expansion [13]. The proposed method controls the direction and distance of a 3-D sound image by controlling sound directivity on the basis of mode matching [14]. Mode matching generates weighting factors that match the reproduced directivity pattern with the target directivity pattern. In addition, the mode strength used in the mode matching includes the distance of the 3-D sound image. Here, in the spatial sound field, the directivity pattern consists of various types of basic directivity, such as monopole and dipole. On the basis of this fact, SH expansion [15] analyzes the strength of each directivity pattern like a Fourier series expansion. Examples of Fourier series expansion and SH expansion are shown in Figure 4.

Figure 4. Examples of Fourier series expansion and spherical harmonic expansion (SH) expansion.

The method proposed in [2] has achieved interactive directivity control outside the dodecahedron loudspeaker array using SH expansion, as shown in Figure 5a. The proposed method reproduces the 3-D sound image inside of the 22.2 loudspeakers, as shown in Figure 5b. The proposed method generates a directivity pattern including a peak at the target sound image position. Furthermore, it calculates weighting factors by mode matching in the SH domain. Hence, we can control the amplitude by transforming the obtained weighting factors with consideration of the distance of the 3-D sound image. Here, more than three loudspeakers except for the LFEs are used to reproduce the 3-D sound image. Section 3.1 and Section 3.2 explain SH expansion and the proposed method based on it, respectively.

Figure 5. Difference between the position of the target sound image on each system.

3.1. Spherical Harmonic Expansion

The SH function

Y_{n}^{m} (θ, ϕ)

is a solution of the 3-D wave equation in the spherical coordinate system shown in Figure 6 [15]. It is defined as:

\begin{matrix} Y_{n}^{m} (θ, ϕ) & = \sqrt{\frac{(2 n + 1)}{4 π} \frac{(n - m)!}{(n + m)!}} P_{n}^{m} (cos θ) e^{j m ϕ}, \end{matrix}

(5)

where

n (0 \leq n \leq \infty)

is the order of the SH function,

m (- n \leq m \leq n)

is the degree of the SH function,

θ (0 \leq θ \leq π)

is the elevation angle,

ϕ (0 \leq ϕ \leq 2 π)

is the azimuth angle, and

P_{n}^{m} (\cdot)

is the Legendre function. Figure 7 shows the shapes of the SH function at the order of

n = 2

and degree of

m = 2

in Cartesian coordinates. The color of the SH function means phase

∠ Y_{n}^{m} (θ, ϕ)

: yellow and blue show positive and negative values, respectively. This function can be applied to an orthogonal function expansion to determine the arrival direction of a plane wave. Therefore, any directivity pattern

D (θ, ϕ)

can be expanded using the SH function

Y_{n}^{m} (θ, ϕ)

and coefficient

A_{n m}

in the SH domain as:

\begin{matrix} D (θ, ϕ) & = \sum_{n = 0}^{\infty} \sum_{m = - n}^{n} A_{n m} Y_{n}^{m} (θ, ϕ) . \end{matrix}

(6)

Figure 6. Spherical coordinate system.

Figure 7. Shapes of SH function at order

n = 2

and degree

m = 2

.

Here,

A_{n m}

indicates the strength of each basic directivity. In the SH domain, it is possible to analyze what percentage of the desired directivity pattern

D (θ, ϕ)

includes basic directivity, such as monopole and dipole. From Equation (6),

A_{n m}

can be calculated as:

\begin{matrix} A_{n m} & = \int_{0}^{π} \int_{0}^{2 π} D (θ, ϕ) Y_{n}^{m} {(θ, ϕ)}^{*} sin θ d θ d ϕ, \end{matrix}

(7)

where * is the complex conjugate.

3.2. Algorithm of Proposed Method

We explain the procedures of the proposed method to generate an accurate 3-D sound image.

1.: Generation of a target directivity pattern $D^{Tar} (θ, ϕ)$ .
The proposed method requires a target directivity pattern $D^{Tar} (θ, ϕ)$ including a peak at the target sound image position. This directivity pattern affects the clarity of the 3-D sound image at the target position. In this paper, a provisional directivity pattern is generated by multiple signal classification (MUSIC) [16]. MUSIC estimates the direction of arrival of a sound source and generates the sharp spatial spectrum to the direction of the sound source. In Step 1, we generate the spatial spectrum $P (θ, ϕ)$ for the target sound image at the position of $(r^{Tar}, θ^{Tar}, ϕ^{Tar})$ by using MUSIC. Then, the generated spatial spectrum $P (θ, ϕ)$ is normalized to the target directivity pattern $D^{Tar} (θ, ϕ)$ as:

$\begin{matrix} D^{Tar} (θ, ϕ) & = \frac{P (θ, ϕ) - P (θ^{Min}, ϕ^{Min})}{P (θ^{Tar}, ϕ^{Tar}) - P (θ^{Min}, ϕ^{Min})}, \end{matrix}$

(8)

where $(θ^{Min}, ϕ^{Min})$ is the position with the smallest spatial spectrum. An example of the target directivity pattern $D^{Tar} (θ, ϕ)$ is shown in Figure 8.

Figure 8. Example of directivity pattern $D^{tar} (θ, ϕ)$ obtained by multiple signal classification (MUSIC).
2.: Calculation of the target directivity pattern in the SH domain $A_{n m}^{tar}$ .
We calculate the target directivity pattern in the SH domain $A_{n m}^{Tar}$ using Equations (5) and (7) as:

$\begin{matrix} A_{n m}^{Tar} & = \int_{0}^{π} \int_{0}^{2 π} D^{Tar} (θ, ϕ) Y_{n}^{m} {(θ, ϕ)}^{*} sin θ d θ d ϕ . \end{matrix}$

(9)
3.: Calculation of the weighting factor $w_{i} (k)$ .
We calculate the weighting factor $w_{i} (k)$ for ith loudspeaker at the position of $(r_{i}^{LS}, θ_{i}^{LS}, ϕ_{i}^{LS})$ to reproduce the target directivity pattern $A_{n m}^{Tar}$ in the real sound field. The weighting factor $w_{i} (k)$ is obtained by:

$\begin{matrix} w_{i} (k) & = \sum_{n = 0}^{\infty} \sum_{m = - n}^{n} w_{n m, i} (k) Y_{n}^{m} (θ_{i}^{LS}, ϕ_{i}^{LS}), \end{matrix}$

(10)

where k is the wave number and $i \in \{1, 2, \dots, 22\}$ is the loudspeaker index. On the basis of the mode matching, the target directivity pattern $A_{n m}^{Tar}$ has a relationship between the weighting factor $w_{n m, i} (k)$ in the SH domain as:

$\begin{matrix} w_{n m, i} (k) & = \frac{A_{n m}^{Tar}}{b_{n}^{S} (k, r_{i}^{LS})}, \end{matrix}$

(11)

$\begin{matrix} b_{n}^{S} (k, r_{i}^{LS}) & ≜ \frac{4 π}{{(r_{i}^{LS})}^{2} k} \frac{j_{n} (k r^{Tar})}{{j_{n}}^{^{'}} (k r_{i}^{LS})}, \end{matrix}$

(12)

where $j_{n} (\cdot)$ and $j_{n}^{^{'}} (\cdot)$ are the spherical Bessel function and its derivative, respectively, and $b_{n}^{S} (\cdot)$ is the mode strength [17]. The mode strength $b_{n}^{S} (\cdot)$ theoretically represents the radial strength of the directivity. Substituting Equation (11) into Equation (10), the weighting factor $w_{i} (k)$ can be obtained as follows:

$\begin{matrix} w_{i} (k) & = \sum_{n = 0}^{\infty} \sum_{m = - n}^{n} \frac{A_{n m}^{Tar}}{b_{n}^{S} (k, r_{i}^{LS})} Y_{n}^{m} (θ_{i}^{LS}, ϕ_{i}^{LS}) . \end{matrix}$

(13)
4.: Generation of the input signals for 22 loudspeakers.
The weight factor $w_{i} (k)$ in the spatial domain can be used as the frequency domain filter, that is,

$\begin{matrix} W_{i} (ω) & = w_{i} (\frac{ω}{c}), \end{matrix}$

(14)

$\begin{matrix} k & = \frac{ω}{c} . \end{matrix}$

(15)

where $ω$ is the angular frequency and c is the speed of sound. Then, the input signal for ith loudspeaker $y_{i} (t)$ is obtained by:

$\begin{matrix} y_{i} (t) & = IDTFT [W_{i} (ω) X (ω)], \end{matrix}$

(16)

$\begin{matrix} X (ω) & = DTFT [x (t)], \end{matrix}$

(17)

where $x (t)$ is the sound source, $IDTFT [\cdot]$ and $DTFT [\cdot]$ represent the inverse discrete time Fourier transform and discrete time Fourier transform operators, respectively.

By inputting

y_{i} (t)

to each ith loudspeaker, the 3-D sound image can be reproduced with consideration of not only the direction but also the distance of the sound image. For the 3-D sound image reproduction, the maximum order of the SH expansion is generally limited as:

\begin{matrix} D (θ, ϕ) & = \sum_{n = 0}^{N} \sum_{m = - n}^{n} A_{n m} Y_{n}^{m} (θ, ϕ), \end{matrix}

(18)

\begin{matrix} N & = ⌈ k_{\max} R ⌉, \end{matrix}

(19)

where

k_{\max} = ω_{\max} / c

is the maximum value of the wave number, R is the radius of the reproduced sound field, and

⌈ \cdot ⌉

is the ceiling function. Here,

k_{\max}

represents the largest value of the wave number for the target sound. Hence, the order n in Equation (11) is also limited to N.

Regarding the computational complexity, the proposed method requires the calculation of the directivity pattern of the sound image, coefficients of SH expansion, and mode strength for obtaining the filter coefficients. On the other hand, the conventional VBAP requires only vector calculation. Hence, the proposed method has higher computational complexity than that of the conventional method.

4. Evaluation Experiment

We conducted a number of experiments to evaluate the effectiveness of the proposed method. In the following experiments, we recorded the outputs of the 22.2 multichannel audio using the signal generated by VBAP and the proposed SH expansion-based method. From hereafter, we use the notation “conventional method” for VBAP and “proposed method” for the proposed SH expansion-based method.

The experimental environment and equipment are shown in Table 2. The loudspeakers were placed at the positions shown in Table 3. Here, the listening point was the center of the surrounded area by the loudspeakers, 1.2 m in height from the floor. In addition, each loudspeaker was 1.9 m away from the listening point.In these experiments, we used band-limited white noise with a duration of 3.0 s as the sound source for the conventional and proposed methods. The frequency band of the band-limited white noise was set to 0–8000 Hz. This is because the signal with high frequency components should be represented with the higher-order spherical harmonic expansion, however, it is difficult to obtain the coefficients for SH expansion. Furthermore, in accordance with the highest frequency of the sound source, the maximum order of the SH expansion N in Equation (18) was set to 5 for all experiments. The positions of each target sound image are shown in Table 4. Here, (a) and (b), (c) and (d), (e) and (f), and (g) and (h) are, respectively, the same direction and different distance of each target sound image. The emitted sound was recorded by the dummy head microphone placed at the position shown in Table 5.

Table 2. Experimental environment and equipment.

Table 3. Loudspeaker arrangement of 22.2 multichannel audio used in the experiment.

Table 4. Position of target sound image

(r^{Tar}, θ^{Tar}, ϕ^{Tar})

.

Table 5. Position of the dummy head microphone

(r, θ, ϕ)

.

4.1. Experiment 1: Evaluation of Sound Image Localization Accuracy

In this experiment, the sound image localization accuracy was evaluated on the basis of [18]. The evaluation method [18] utilizes the head-related transfer function (HRTF) database and the direction of the sound image

(\hat{θ}, \hat{ϕ})

can be estimated accurately. However, the distance between the recorded position and sound image cannot be obtained directly by this evaluation method. Hence, we estimated the position of the sound image

(\hat{r}, \hat{θ}, \hat{ϕ})

by using the image’s estimated direction and the recorded sounds at five positions shown in Table 5. The evaluation procedure is shown as follows:

1.: Calculation of inter-aural level difference (ILD) and inter-aural phase difference (IPD)
The ILD and IPD were calculated for each recorded sound at positions A to E shown in Table 5 [19]. Here, the ILD and IPD are known as factors for the sound localization of humans. The ILD and IPD can be calculated by:

$\begin{matrix} {ILD}_{Ω} (ω) & = 20 {log}_{10} |\frac{C_{Ω} (ω)}{P_{Ω} (ω)}|, \end{matrix}$

(20)

$\begin{matrix} {IPD}_{Ω} (ω) & = {tan}^{- 1} (\frac{Im (C_{Ω} (ω))}{Re (P_{Ω} (ω))}), \end{matrix}$

(21)

where $C_{Ω} (ω)$ is the cross spectrum between the signals obtained at the left and right ears of the dummy head microphone, $P_{Ω} (ω)$ is the power spectrum of the signal obtained at the left ear of the dummy head microphone, and $Re (\cdot)$ and $Im (\cdot)$ represent the real and imaginary parts of the complex value, respectively. $Ω \in \{A, B, C, D, E\}$ is the index for the recording position.
In addition, ${ILD}_{HRTF} (ω, θ, ϕ)$ and ${IPD}_{HRTF} (ω, θ, ϕ)$ were calculated by using HRTF in the CIPIC database [20]. ${ILD}_{HRTF} (ω, θ, ϕ)$ and ${IPD}_{HRTF} (ω, θ, ϕ)$ represent the value for which the sound image perfectly localizes at the desired position.
2.: Calculation of the differences of the ILD and IPD between recorded sound and HRTF database.
The differences of the ILD and IPD between the recorded sound and HRTF database were calculated. If the differences are small, it can be said that the sound image localizes at the direction $(θ, ϕ)$ . The difference between the inter-aural time difference (ITD) and IPD are defined as:

$\begin{matrix} E_{Ω, ILD} (ω, θ, ϕ) & = |{ILD}_{Ω} (ω) - {ILD}_{HRTF} (ω, θ, ϕ)|, \end{matrix}$

(22)

$\begin{matrix} E_{Ω, IPD} (ω, θ, ϕ) & = |{IPD}_{Ω} (ω) - {IPD}_{HRTF} (ω, θ, ϕ)| . \end{matrix}$

(23)

Then, $E_{Ω, ILD} (ω, θ, ϕ)$ and $E_{Ω, IPD} (ω, θ, ϕ)$ were combined as the following form.

$\begin{matrix} E_{Ω} (ω, θ, ϕ) & = β (ω) E_{Ω, IPD} (ω, θ, ϕ) + (1 - β (ω)) E_{Ω, ILD} (ω, θ, ϕ), \end{matrix}$

(24)

$\begin{matrix} β (ω) & = \{\begin{matrix} 1 & (ω \leq ω_{L}) \\ 1 - \frac{ω - ω_{L}}{ω_{H} - ω_{L}} & (ω_{L} < ω < ω_{H}) \\ 0 & (ω \geq ω_{H}) \end{matrix}, \end{matrix}$

(25)

where $β (ω)$ is the weighting function that controls the ratio between $E_{Ω, IPD} (ω, θ, ϕ)$ and $E_{Ω, ILD} (ω, θ, ϕ)$ in Equation (24). $β (ω)$ has a characteristic shown in Figure 9. The reason for using $β (ω)$ is that the IPD affects the sound localization of humans below 1500 Hz and is dominant above 1500 Hz [21]. Hence, we set $ω_{L} = 6283$ rad/s and $ω_{H}$ = 12,566 rad/s with consideration of the crossover. These values are related to 1000 and 2000 Hz in terms of frequency. From hereafter, $E_{Ω} (ω, θ, ϕ)$ is called the error function.

Figure 9. Weighting function $β (ω)$ and $1 - β (ω)$ .
3.: Estimation of the direction of reconstructed 3-D sound image.
The direction of the reconstructed 3-D sound image can be estimated by obtaining $({\hat{θ}}_{Ω}, {\hat{ϕ}}_{Ω})$ of which the error function $E_{Ω} (ω, θ, ϕ)$ is the smallest value. Here, the directions of the sound image were estimated for each recorded sound as:

$\begin{matrix} ({\hat{θ}}_{Ω}, {\hat{ϕ}}_{Ω}) & = \underset{\begin{matrix} 0 \leq θ \leq 180^{\circ} \\ 0 \leq ϕ \leq 360^{\circ} \end{matrix}}{\arg \min} |\int_{0}^{ω_{\max}} E_{Ω} (ω, θ, ϕ) d ω|, \end{matrix}$

(26)

where $ω_{\max}$ is the highest angular frequency of the recorded sound. In this experiment, the frequency of the sound is up to 8000 Hz and $ω_{\max} = 50265$ rad/s.
4.: Estimation of the position of the reconstructed 3-D sound image.
Finally, the position of the reconstructed 3-D sound image was estimated by using the estimated direction of the sound image $({\hat{θ}}_{Ω}, {\hat{ϕ}}_{Ω})$ . As shown in Figure 10, the position was estimated by drawing a straight line from each recording position to the estimated direction. Then, the center position of the surrounding area by the five lines was treated as the estimated position of the sound image $(\hat{r}, \hat{θ}, \hat{ϕ})$ .

Figure 10. Estimation of the position of the reconstructed 3-D sound image $(\hat{r}, \hat{θ}, \hat{ϕ})$ .

From the four procedures, the position of the 3-D sound image was estimated and the accuracy of the sound image localization for the proposed method was evaluated by conducting two trials of Experiment 1.

Moreover, the error between the positions of the reconstructed sound image and that of the target sound was evaluated. The error

D_{Err}

is defined as:

\begin{matrix} D_{Err} & = \sqrt{{(Δ X)}^{2} + {(Δ Y)}^{2} + {(Δ Z)}^{2}}, \end{matrix}

(27)

\begin{matrix} Δ X & = \hat{r} sin \hat{θ} cos \hat{ϕ} - r^{Tar} sin θ^{Tar} cos ϕ^{Tar}, \end{matrix}

(28)

\begin{matrix} Δ Y & = \hat{r} sin \hat{θ} sin \hat{ϕ} - r^{Tar} sin θ^{Tar} sin ϕ^{Tar}, \end{matrix}

(29)

\begin{matrix} Δ Z & = \hat{r} cos \hat{θ} - r^{Tar} cos θ^{Tar} . \end{matrix}

(30)

By Equation (27), the error in terms of the Euclidean distance can be evaluated. The error

D_{Err}

was calculated for each trial. Then, the average of the errors was evaluated.

Figure 11 shows the estimated position of the reconstructed sound image for each condition (a)–(h) on the first trial and Figure 12 shows the error

D_{Err}

for each position. In Figure 11, the closer the red and black markers, the higher the accuracy of the sound localization. From Figure 11 and Figure 12, the proposed method can reconstruct the sound image close to the target position. The distance error

D_{Err}

of the proposed method is 0.17 m smaller than that of the conventional method on average. These results show the effectiveness of the proposed method, which includes not only the elevation angle

θ

and the azimuth angle

ϕ

but also the distance r. Here, the error

D_{Err}

of the proposed method is about 0.87 m at positions (a), (c), (e), and (g) at which the distance of the target sound image is 1.0 m, although the error

D_{Err}

of the conventional method is about 1.02 m. However, the error

D_{Err}

of the proposed method is about 1.30 m at positions (b), (d), (f), and (h) at which the distance of the target sound image is 1.5 m, although the error

D_{Err}

of the conventional method is about 1.49 m. These results indicate that the sound pressure and phase largely affect the sound localization in the case that the recording position, that is, the listening position, is close to the position of the sound image. Hence, the improvement of sound localization is more significant at positions (a), (c), (e), and (g) than that at positions (b), (d), (f), and (h).

Figure 11. Estimated position of 3-D sound image

(\hat{r}, \hat{θ}, \hat{ϕ})

.

Figure 12. Error between the position of the estimated position and that of the target sound image

D_{err}

.

Focusing on the direction of the sound image, the error

D_{Err}

of the proposed method is about 0.90 m at positions (a)–(d), although the error

D_{Err}

of the conventional method is about 1.13 m. In other words, the proposed method accurately reconstructs the 3-D sound image in front of the listening position. However, the error

D_{Err}

of the proposed method is about 1.27 m at positions (e)–(h), although the error

D_{Err}

of the conventional method is about 1.39 m. In other words, the accuracy of the reconstruction of the 3-D sound image behind the listening position degrades for both the conventional and proposed methods. This is because the dummy head microphone represents a human auditory system, that is, the accuracy of the sound localization behind the listening point is lower than that in front of the listening point. However, the error

D_{Err}

of the proposed method is lower than that of the conventional method in all conditions except for position (h), where it is inferred that the reverberation of the room affects the sound localization. From these results, it can be said that the proposed method can reproduce the 3-D sound image well in terms of distance and direction compared with the conventional method. Additionally, the error of the distance is still large because the proposed method does not consider the room reverberation [22]. Hence, we will improve the proposed method to overcome this problem.

4.2. Experiment 2: Clarity of Reconstructed Sound Image for Given Directivity in Proposed Method

In this experiment, the clarity of the reconstructed sound image for the given directivity in the proposed method was evaluated. The directivities were generated by four different methods; (i) giving only the unit pulse at the target position, (ii) MUSIC, (iii) minimum variance (MV) method [23], and (iv) delay-and-sum (DS) method [24]. These directivities are shown in Figure 13. In this experiment, the directivity patterns were obtained for each position shown in Table 4. Hence, the 32 directivity patterns were used in the experiment. Then, the weight

w_{i} (k)

was calculated for each directivity pattern. After that, the 3-D sound image was reconstructed by 22.2 multichannel audio and the emitted sound was recorded by the dummy head microphone. Using the recorded sound obtained at the left and right ears of the dummy head microphone, the inter-aural cross correlation (IACC) [25] was calculated as the evaluation metric. The IACC is obtained by:

\begin{matrix} IACC & = max_{- 1 \leq τ \leq 1} IACF (τ), \end{matrix}

(31)

\begin{matrix} IACF (τ) & = \frac{\sum_{t = 0}^{t_{\max} - 1} s_{R} (t) s_{L} (t + τ)}{\sqrt{\sum_{t = 0}^{t_{\max} - 1} s_{R}^{2} (t) \sum_{t = 0}^{t_{\max} - 1} s_{L}^{2} (t)}}, \end{matrix}

(32)

where

s_{L} (t)

,

s_{R} (t)

are the recorded sounds at the left and right ears,

t_{\max}

is the length of the signal,

τ

is the time index related to the ITD, and

IACF (τ)

is the inter-aural cross function (IACF) for

τ

. The larger the IACC, the higher the clarity of the reconstructed sound image. In this experiment, we conducted the three trials and the IACC was calculated as the average of three.

Figure 13. Generated directivities.

Figure 14 shows the IACCs for each position (a)–(h). In Figure 14, “Unit pulse” represents the IACC for the sound image by using the unit pulse at the target position, “MUSIC,” “MV,” and “DS” represent the IACC for the sound images by using MUSIC, the MV method, and DS, respectively. From Figure 14, the IACCs for Unit pulse, MUSIC, MV, and DS are 0.33, 0.37, 0.36, and 0.35, respectively. This result indicates that the directivity with not only the peak but also the sidelobe affects the high clarity of the sound image in the proposed method. In addition, the clarities of the sound image in front of the listening point are higher than those behind the listening point. This is because the many loudspeakers are placed at the front of the 22.2 multichannel audio and it is easy to reproduce the clear sound image to the front side of the 22.2 multichannel audio. From these results, it can be said that the proposed method is effective for reproducing the 3-D sound image in the 22.2 multichannel audio by using the given directivity pattern with the sidelobe.

Figure 14. Inter-aural cross correlation (IACC) for each generated directivity.

5. Conclusions

In this paper, we proposed a novel 3-D sound image reconstruction method for 22.2 multichannel audio based on SH expansion. The proposed method can consider the directivity of the sound image and calculate the filter coefficients by using the mode strength. Hence, it can consider not only the direction but also the distance of the sound image. The experimental results showed that the proposed method can reproduce the 3-D sound image closer to the target position compared with VBAP in terms of direction and distance.

In future, we will develop the proposed method for sounds with higher frequency components above 8000 Hz. This is because the order of the SH expansion depends on the highest frequency of the sound, and the higher the order, the more difficult the SH expansion of the directivity. According to [22], it is effective to consider the reverberation of the room for reproducing a sound image. Hence, we will consider the reverberation of a room in the proposed method to improve the accuracy of sound localization.Then, we will develop the proposed method to reproduce the 3-D sound images located both inside and outside the 22.2 multichannel audio simultaneously. Thereafter, we will investigate the effectiveness of the proposed method for moving sound images.

Author Contributions

T.N. and H.S. conceived the proposed method. H.S. developed the method and conducted the experiments. K.I. wrote this manuscript and modified the figures and expressions of the equations. All authors discussed the results and contributed to the final manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by JSPS KAKENHI, grant number 19H04142, 21H03488, and 21K18372, and the Ritsumeikan Global Innovation Research Organization (R-GIRO).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Roginska, A.; Geluso, P. Immersive Sound: The Art and Science of Binaural and Multi-Channel Audio, 1st ed.; Routledge: London, UK, 2017. [Google Scholar]
Bando, K.; Haneda, Y. Interactive directivity control using dodecahedron loudspeaker array. J. Signal Process. 2016, 20, 209–212. [Google Scholar] [CrossRef][Green Version]
Okamoto, T. 2D multizone sound field systhesis with interior-exterior Ambisonics. In Proceedings of the 2021 IEEE Workshop Applications of Signal Processing to Audio and Acoustics, Virtual Event, 18–21 October 2021; pp. 276–280. [Google Scholar]
Hamasaki, K.; Nishiguchi, T.; Okumura, R.; Nakayama, Y.; Ando, A. A 22.2 Multichannel Sound System for Ultrahigh-Definition TV (UHDTV). SMPTE Motion Imaging J. 2008, 117, 40–49. [Google Scholar] [CrossRef]
Institutional Telecommunication Union Radiocommunication Sector (ITU-R) Recommendation BS. 2051-2. Advanced Sound System for Programme Production; 2018; Available online: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwiIuf-D8_71AhU363MBHQs-CSoQFnoECAMQAQ&url=https%3A%2F%2Fwww.itu.int%2Fdms_pubrec%2Fitu-r%2Frec%2Fbs%2FR-REC-BS.2051-2-201807-I!!PDF-E.pdf&usg=AOvVaw07oyUacd5OYXmikUvzCfj2 (accessed on 20 January 2022).
Camras, M. Approach to recreating a sound field. J. Acoust. Soc. Am. 1968, 43, 172–178. [Google Scholar] [CrossRef]
Berkhout, A.J.; de Vries, D.; Vogel, P. Acoustic control by wave field synthesis. J. Acoust. Soc. Am. 1993, 93, 2764–2778. [Google Scholar] [CrossRef]
Omoto, A.; Ise, S.; Ikeda, Y.; Ueno, K.; Enomoto, S.; Kobayashi, M. Sound field reproduction and sharing system based on the boundary surface control principle. Acoust. Sci. Technol. 2015, 36, 1–11. [Google Scholar] [CrossRef][Green Version]
Institutional Telecommunication Union Radiocommunication Sector (ITU-R) Recommendation BS.775-2. Multichannel Stereophonic Sound System with and without Accompanying Picture; 2006; Available online: https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwiYg_y58_71AhXY5nMBHeGGBtEQFnoECAUQAQ&url=https%3A%2F%2Fwww.itu.int%2Fdms_pub%2Fitu-r%2Fopb%2Frec%2FR-REC-LS-2007-E02-PDF-E.pdf&usg=AOvVaw3YzFGDUR8es0kBuKwOa9m6 (accessed on 20 January 2022).
Rumsey, F. Spatial Audio; Focal Press: Waltham, MA, USA, 2001. [Google Scholar]
Pulkki, V. Virtual Sound Source Positioning Using Vector Base Amplitude Panning. J. Audio Eng. Soc. 1997, 45, 456–466. [Google Scholar]
Ando, A.; Hamasaki, K. Sound intensity based three-dimensional panning. In Proceedings of the Audio Engineering Society 126th Convention, Munich, Germany, 7–10 May 2009. [Google Scholar]
Suzuki, H.; Iwai, K.; Nishiura, T. 3-D sound image panning based on spherical harmonics expansion for 22.2 multichannel audio. In Proceedings of the INTER-NOISE 2020, E-Congress, Seoul, Korea, 23–26 August 2020; pp. 4170–4180. [Google Scholar]
Meyer, J.; Elko, G. A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2002), Orlando, FL, USA, 13–17 May 2002; pp. II–1781–II–1784. [Google Scholar]
Müller, C. Spherical Harmonics; Springer: Heidelberg/Berlin, Germany, 2006. [Google Scholar]
Schmidt, R. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280. [Google Scholar] [CrossRef]
Williams, E. Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography; Springer: Heidelberg/Berlin, Germany, 1999. [Google Scholar]
Nakashima, H.; Chisaki, Y.; Usagawa, T.; Ebata, M. Frequency domain binaural model based on interaural phase and level differences. Acoust. Sci. Technol. 2003, 24, 172–178. [Google Scholar] [CrossRef][Green Version]
Roman, N.; Wang, D.; Brown, G.J. Sound field reproduction and sharing system based on the boundary surface control principle. J. Acoust. Soc. Am. 2003, 114, 2236–2252. [Google Scholar] [CrossRef] [PubMed]
Algazi, V.R.; Duda, R.O.; Thompson, D.M.; Avendano, C. The CIPIC HRTF Database. In Proceedings of the IEEE Workshop Applications of Signal Processing to Audio and Electroacoustics, Mohonk Mountain House, New Paltz, NY, USA, 21–24 October 2001; pp. 99–102. [Google Scholar]
Woodworth, R. Experimental Psychology; Holt, Rinehart and Winston: Ballwin, MO, USA, 1938. [Google Scholar]
Zheng, K.; Otsuka, M.; Nishiura, T. 3-D Sound image localization in reproduction of 22.2 multichannel audio based on room impulse response generation with vector composition. In Proceedings of the International Congress on Acoustics (ICA 2019), Aachen, Germany, 9–13 September 2019; pp. 5274–5281. [Google Scholar]
Trees, H.L.V. Optimum Array Processing; Wiley: Hoboken, NJ, USA, 2002. [Google Scholar]
Johnson, D.H.; Dudgeon, D.E. Array Signal Processing; Prentice Hall: Hoboken, NJ, USA, 1993. [Google Scholar]
Omologo, M.; Svaizer, P. Acoustic event localization using a crosspower-spectrum phase based technique. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 1994), Adelaide, SA, Australia, 19–22 April 1994; Volume 2, pp. II/273–II/276. [Google Scholar]

Figure 1. Loudspeaker arrangement of 22.2 multichannel audio.

Figure 2. Labels and installation intervals of loudspeakers.

Figure 3. Overview of vector base amplitude panning (VBAP).

Figure 4. Examples of Fourier series expansion and spherical harmonic expansion (SH) expansion.

Figure 5. Difference between the position of the target sound image on each system.

Figure 6. Spherical coordinate system.

Figure 7. Shapes of SH function at order

n = 2

and degree

m = 2

.

Figure 8. Example of directivity pattern

D^{tar} (θ, ϕ)

obtained by multiple signal classification (MUSIC).

Figure 9. Weighting function

β (ω)

and

1 - β (ω)

.

Figure 10. Estimation of the position of the reconstructed 3-D sound image

(\hat{r}, \hat{θ}, \hat{ϕ})

.

Figure 11. Estimated position of 3-D sound image

(\hat{r}, \hat{θ}, \hat{ϕ})

.

Figure 12. Error between the position of the estimated position and that of the target sound image

D_{err}

.

Figure 13. Generated directivities.

Figure 14. Inter-aural cross correlation (IACC) for each generated directivity.

Table 1. Requirement of loudspeaker arrangement in a 22.2 multichannel audio.

Layer	Channel No.	Channel Name	Setting Range
Layer	Channel No.	Channel Name	Azimuth ( $ϕ$ [degs.])	Elevation ( $θ$ [degs.])
Middle	1	Front left (FL)	135 ≤ $ϕ$ ≤ 150	85 ≤ $θ$ ≤ 90
	2	Front right (FR)	30 ≤ $ϕ$ ≤ 45	85 ≤ $θ$ ≤ 90
	3	Front center (FC)	90	85 ≤ $θ$ ≤ 90
Lower	4	Low frequency effects-1	120 ≤ $ϕ$ ≤ 180	105 ≤ $θ$ ≤ 120
Middle	5	Back left (BL)	200 ≤ $ϕ$ ≤ 225	75 ≤ $θ$ ≤ 90
	6	Back right (BR)	315 ≤ $ϕ$ ≤ 340	75 ≤ $θ$ ≤ 90
	7	Front left center (FLc)	112.5 ≤ $ϕ$ ≤ 120	85 ≤ $θ$ ≤ 90
	8	Front right center (FRc)	60 ≤ $ϕ$ ≤ 67.5	85 ≤ $θ$ ≤ 90
	9	Back center (BC)	270	75 ≤ $θ$ ≤ 90
Lower	10	Low frequency effects-2	0 ≤ $ϕ$ ≤ 60	105 ≤ $θ$ ≤ 120
Middle	11	Side left (SiL)	180	75 ≤ $θ$ ≤ 90
Middle	12	Side right (SiR)	0	75 ≤ $θ$ ≤ 90
Upper	13	Top front left (TpFL)	135 ≤ $ϕ$ ≤ 150	45 ≤ $θ$ ≤ 60
	14	Top front right (TpFR)	30 ≤ $ϕ$ ≤ 45	45 ≤ $θ$ ≤ 60
	15	Top front center (TpFC)	90	45 ≤ $θ$ ≤ 60
	16	Top center (TpC)	N/A	0
	17	Top back left (TpBL)	200 ≤ $ϕ$ ≤ 225	45 ≤ $θ$ ≤ 60
	18	Top back right (TpBR)	315 ≤ $ϕ$ ≤ 340	45 ≤ $θ$ ≤ 60
	19	Top side left (TpSiL)	180	45 ≤ $θ$ ≤ 60
	20	Top side right (TpSiR)	0	45 ≤ $θ$ ≤ 60
	21	Top back center (TpBC)	270	45 ≤ $θ$ ≤ 60
Lower	22	Bottom front center (BtFC)	90	105 ≤ $θ$ ≤ 120
	23	Bottom front left (BtFL)	135 ≤ $ϕ$ ≤ 150	105 ≤ $θ$ ≤ 120
	24	Bottom front right (BtFR)	30 ≤ $ϕ$ ≤ 45	105 ≤ $θ$ ≤ 120

Table 2. Experimental environment and equipment.

Environment	Experiment room ( $T_{60} = 300$ ms)
Ambient noise level	37.0 dBA
Sound pressure level	70.0 dB at the listening point
Dummy head	3Dio, Free Space Pro II
Loudspeaker	YAMAHA, VXS5
Loudspeaker (LFE)	YAMAHA, VXS10S
Loudspeaker amplifier	YAMAHA, XMV8280
Analog-to-digital converter	RME, Fireface UFX
Digital-to-analog converter	RME, M-32 DA

Table 3. Loudspeaker arrangement of 22.2 multichannel audio used in the experiment.

Layer	Channel No.	Channel Name	Setting Position
Layer	Channel No.	Channel Name	Azimuth ( $ϕ$ [degs.])	Elevation ( $θ$ [degs.])
Middle	1	Front left (FL)	150	90
	2	Front right (FR)	30	90
	3	Front center (FC)	90	90
Lower	4	Low frequency effects-1	150	118.3
Middle	5	Back left (BL)	210	90
	6	Back right (BR)	330	90
	7	Front left center (FLc)	120	90
	8	Front right center (FRc)	120	90
	9	Back center (BC)	270	90
Lower	10	Low frequency effects-2	30	118.3
Middle	11	Side left (SiL)	180	90
Middle	12	Side right (SiR)	0	90
Upper	13	Top front left (TpFL)	150	52
	14	Top front right (TpFR)	30	52
	15	Top front center (TpFC)	90	52
	16	Top center (TpC)	-	0
	17	Top back left (TpBL)	210	52
	18	Top back right (TpBR)	330	52
	19	Top side left (TpSiL)	180	52
	20	Top side right (TpSiR)	0	52
	21	Top back center (TpBC)	270	52
Lower	22	Bottom front center (BtFC)	90	118.3
	23	Bottom front left (BtFL)	150	118.3
	24	Bottom front right (BtFR)	30	118.3

Table 4. Position of target sound image

(r^{Tar}, θ^{Tar}, ϕ^{Tar})

.

Table 4. Position of target sound image

(r^{Tar}, θ^{Tar}, ϕ^{Tar})

.

(a)	$(1.0, 45^{\circ}, 45^{\circ})$
(b)	$(1.5, 45^{\circ}, 45^{\circ})$
(c)	$(1.0, 45^{\circ}, 135^{\circ})$
(d)	$(1.5, 45^{\circ}, 135^{\circ})$
(e)	$(1.0, 45^{\circ}, 225^{\circ})$
(f)	$(1.5, 45^{\circ}, 225^{\circ})$
(g)	$(1.0, 45^{\circ}, 315^{\circ})$
(h)	$(1.5, 45^{\circ}, 315^{\circ})$

Table 5. Position of the dummy head microphone

(r, θ, ϕ)

.

Table 5. Position of the dummy head microphone

(r, θ, ϕ)

.

A	$(0.5, 90^{\circ}, 180^{\circ})$
B	$(0.5, 90^{\circ}, 225^{\circ})$
C	$(0.5, 90^{\circ}, 270^{\circ})$
D	$(0.5, 90^{\circ}, 315^{\circ})$
E	$(0.5, 0^{\circ}, 90^{\circ})$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

3-D Sound Image Reproduction Method Based on Spherical Harmonic Expansion for 22.2 Multichannel Audio

Abstract

1. Introduction

2. Conventional 3-D Sound Image Reproduction Based on VBAP

3. Proposed 3-D Sound Image Reproduction Based on Spherical Harmonic Expansion

3.1. Spherical Harmonic Expansion

3.2. Algorithm of Proposed Method

4. Evaluation Experiment

4.1. Experiment 1: Evaluation of Sound Image Localization Accuracy

4.2. Experiment 2: Clarity of Reconstructed Sound Image for Given Directivity in Proposed Method

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics