#### 3.2. Azimuth-Frequency Analysis Using Time Delay

A stereo signal recorded using a stereo omni-directional microphone array, ${x}_{L}(n)$ and ${x}_{R}(n)$, can be represented as a delayed and attenuated version of the desired signal, $s(n)$, as [21]:

where ${v}_{L}(n)$ and ${v}_{R}(n)$ are the ambient noise signals recorded by the left and right microphones, respectively. In addition, ${a}_{L}$ and ${a}_{R}$ are the respective attenuation factors, and $\tau $ is the relative time delay measured between the left and right microphones. Note that Equation (3) is based on the far-field model [18,19], because the spacing between the stereo omni-directional microphones is small. Moreover, we can assume ${a}_{L}={a}_{R}\approx 1$ [22]. Applying an $N$-point STFT to Equation (3) provides the following relationship:

$$X=d\,S(k)+V\qquad (4)$$

where ${X}^{T}=\left[\begin{array}{cc}{X}_{L}(k)& {X}_{R}(k)\end{array}\right]$ and ${V}^{T}=\left[\begin{array}{cc}{V}_{L}(k)& {V}_{R}(k)\end{array}\right].$ In addition, $S(k)$ is the $k$-th spectral component of $s(n)$, and $d$ is a steering vector given by

where $\tau $ can be determined by the speed of sound $c$, the spacing between the microphones $l$, and the direction of the source $\theta $, as $\tau =({f}_{s}/c)\,l\sin \theta $, where ${f}_{s}$ is the sampling rate. Thus, we have the following equation:

If $\theta $ is known, we can separate $S(k)$ and $d$ from Equation (4). Then, we can modify $d$ by replacing $\theta $ with another value to improve the obtained stereophonic images. This is because the listener cannot perceive the actual direction of $S(k)$ when the two stereo microphones are placed very close together. In practice, it is difficult to separate the steering vector $d$ and the source $S(k)$, and it is even more difficult to do so under ambient noise conditions and/or with multiple sound sources [23]. Therefore, instead of separating the sound source and its steering vector, in this paper we apply a panning law to the recorded signal $X$ using the estimated direction. To estimate the source direction, we consider the time delay $\tau $ in Equation (3) using the stereo signal ${x}_{L}(n)$ and ${x}_{R}(n)$, where we have assumed that ${v}_{L}(n)$ and ${v}_{R}(n)$ are negligible under high signal-to-noise ratio (SNR) conditions. In other words, the time delay is estimated as $\widehat{\tau}=\mathrm{arg}{\mathrm{min}}_{\tau}\left|{x}_{L}(n)-{x}_{R}(n-\tau )\right|$. We can then extend this concept to the frequency domain as:

In this paper, $\tau $ in Equation (7) can be considered as a function of the direction $\theta $. Therefore, the right-hand side of Equation (7), which is a function of the frequency index $k$ and the direction $\theta $, is referred to as an A-F plane and is defined as [18]:

We can estimate the direction $\widehat{\theta}(k)$ so that $AF(k,\theta )$ is minimized at the $k$-th frequency bin. However, when $AF(k,\theta )$ is used for estimating $\widehat{\theta}(k)$, many local minima exist. To mitigate this problem, a smoothing window is applied to $AF(k,\theta )$ prior to estimating $\widehat{\theta}(k)$, such that:

where $B(k)$ corresponds to a critical bandwidth of the auditory filter [2]. For example, $B(k)=6$ (150 Hz) when $k=43$ (1 kHz). Thus, the direction at each frequency bin is estimated so that $A{F}_{s}(k,\theta )$ is minimized, such that:

$$\widehat{\theta}(k)=\mathrm{arg}\,{\mathrm{min}}_{\theta}\,A{F}_{s}(k,\theta )$$

Figure 4 illustrates an A-F plane, $AF(k,\theta )$, and a smoothed A-F plane, $A{F}_{s}(k,\theta )$, computed for a stereo signal recorded with a stereo microphone array in an anechoic room, where a white noise source is positioned at an angle of 15° and 1.5 m from the center of the microphone array. In the figure, a 2048-point STFT is applied to each frame of white noise, and $\theta $ is varied from −90° to 90° in 1° steps. In addition, the distance between the two microphones is $l$ = 3 cm and ${f}_{s}$ = 48 kHz. As shown in the figure, the direction of the white noise is easily estimated at low frequencies, but there are multiple minima at mid-to-high frequencies. As shown in Figure 4c, the estimated direction of the white noise is 15°, which is identical to the direction at which the white noise source was located for recording.
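
The A-F analysis described above can be sketched numerically for an idealized, noise-free case. The snippet below is only an illustration under stated assumptions, not the authors' implementation: it assumes $c$ = 343 m/s, assumes the right channel lags the left by $\tau$ samples, compares the channels as $|{X}_{L}(k)-{e}^{j2\pi k\tau (\theta )/N}{X}_{R}(k)|$, and replaces the critical-band window $B(k)$ with a fixed 6-bin moving average. The function names (`delay_samples`, `af_plane`, `smooth_over_frequency`) are hypothetical.

```python
import numpy as np

C = 343.0        # speed of sound in m/s (assumed; not stated in the text)
FS = 48_000.0    # sampling rate f_s in Hz
L_MIC = 0.03     # microphone spacing l in m
N = 2048         # STFT length


def delay_samples(theta_deg):
    """tau = (f_s / c) * l * sin(theta), in samples (about 1.09 at 15 degrees)."""
    return (FS / C) * L_MIC * np.sin(np.deg2rad(theta_deg))


def af_plane(X_L, X_R, theta_deg):
    """Return |X_L(k) - exp(+j*2*pi*k*tau(theta)/N) * X_R(k)| with shape (bins, angles).

    Assumption: the right channel lags the left by tau samples.
    """
    k = np.arange(X_L.size)[:, None]                 # (K, 1) bin indices
    tau = delay_samples(theta_deg)[None, :]          # (1, T) candidate delays
    steer = np.exp(1j * 2.0 * np.pi * k * tau / N)   # (K, T) phase compensation
    return np.abs(X_L[:, None] - steer * X_R[:, None])


def smooth_over_frequency(af, width=6):
    """Moving average along the frequency axis (fixed width, unlike B(k))."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, af
    )


# Synthetic white-noise frame; the right spectrum is the left one delayed
# by the tau that corresponds to a source at 15 degrees.
rng = np.random.default_rng(0)
theta_grid = np.arange(-90, 91)                      # candidate angles in degrees
X_L = np.fft.rfft(rng.standard_normal(N))
k = np.arange(X_L.size)
tau_true = delay_samples(theta_grid)[theta_grid == 15][0]
X_R = X_L * np.exp(-1j * 2.0 * np.pi * k * tau_true / N)

af = af_plane(X_L, X_R, theta_grid)
af_s = smooth_over_frequency(af)
theta_hat = theta_grid[np.argmin(af_s, axis=1)]      # one estimate per bin
```

In this noise-free setting every bin recovers the 15° direction. With measured signals the raw plane is less well behaved, which is the motivation for smoothing across frequency before taking the minimum.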

Next, we repeat the experiment above by recording white noise in a reverberant room whose reverberation time ($R{T}_{60}$) is measured as 230 ms; the A-F planes and estimated direction are shown in Figure 5. Comparing Figure 5a with Figure 4a, the A-F plane in the reverberant room is more blurred than that in the anechoic room. This is because the reverberation muddles the direction of the sound source, making it seem as though multiple sound sources are being recorded by the stereo microphones. Owing to the smoothing window, the smoothed A-F plane shown in Figure 5b becomes similar to that in Figure 4b. Therefore, as shown in Figure 5c, the direction of the white noise can be estimated correctly, especially at mid-to-high frequencies, while there are some errors at low frequencies. Since it is known that stereophonic images are mostly affected by mid-to-high frequencies, the quality of stereophonic images is not significantly affected by such errors at low frequencies [24].

**Figure 4.**
A-F planes and estimated direction for white noise in an anechoic room: (**a**) $AF(k,\theta )$; (**b**) $A{F}_{s}(k,\theta )$; (**c**) estimated direction using $A{F}_{s}(k,\theta ).$


#### 3.3. Frequency-Dependent Amplitude Panning

This subsection describes how the estimated direction in each frequency bin is used for stereophonic image enhancement. Figure 6 illustrates the concept of the process described in this subsection. As shown in Figure 6a, a sound source is located at an angle of $\theta $. However, because of the close spacing between the stereo microphones, the source may be perceived at a much smaller angle, i.e., ${\theta}_{p}\ll \theta .$ Thus, we have to increase the perceived angle by applying frequency-dependent amplitude panning such that ${\theta}_{0}\approx \theta \gg {\theta}_{p}.$

**Figure 5.**
A-F planes and estimated direction for white noise in a reverberant room with $R{T}_{60}$ = 230 ms: (**a**) $AF(k,\theta )$; (**b**) $A{F}_{s}(k,\theta )$; (**c**) estimated direction using $A{F}_{s}(k,\theta ).$


**Figure 6.**
Illustrations of stereophonic image enhancement: (**a**) Original sound source; (**b**) perceived sound source without any enhancement technique; and (**c**) perceived sound source after applying the proposed method.


Many panning methods have been reported [19,25,26]. Among them, the stereophonic law of sines [19] has been popularly used to reproduce a source using two loudspeakers, and it is realized as:

$$\frac{\sin \theta }{\sin {\theta}_{0}}=\frac{{g}_{L}-{g}_{R}}{{g}_{L}+{g}_{R}}\qquad (10)$$

where ${\theta}_{0}$ is the physical angle between the stereo loudspeakers and $\theta $ is the desired angle at which the sound source should be perceived. Thus, ${g}_{L}$ and ${g}_{R}$ become the respective scale factors by which the sound source is multiplied according to the desired angle, as:

$${y}_{L}(n)={g}_{L}\,s(n)\qquad (11)$$

and

$${y}_{R}(n)={g}_{R}\,s(n)\qquad (12)$$

where $s(n)$ is the sound source, and ${y}_{L}(n)$ and ${y}_{R}(n)$ are the panned signals of the left and right channels, respectively.
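
As a minimal sketch of how the law of sines yields a pair of gains, the function below solves $\sin \theta /\sin {\theta}_{0}=({g}_{L}-{g}_{R})/({g}_{L}+{g}_{R})$ for ${g}_{L}$ and ${g}_{R}$. The law fixes only the ratio of the gains, so a constant-power normalization, ${g}_{L}^{2}+{g}_{R}^{2}=1$, is assumed here; the sign convention (positive angles pan toward the left channel), the clamping of angles beyond ${\theta}_{0}$, and the function name `sine_law_gains` are likewise assumptions made for illustration.

```python
import math


def sine_law_gains(theta_deg, theta0_deg):
    """Gains (g_L, g_R) satisfying the stereophonic law of sines.

    Assumptions: positive theta pans toward the left channel, and the gains
    are normalized so that g_L**2 + g_R**2 == 1 (constant power).
    """
    r = math.sin(math.radians(theta_deg)) / math.sin(math.radians(theta0_deg))
    r = max(-1.0, min(1.0, r))        # clamp angles outside the loudspeaker span
    g_l, g_r = 1.0 + r, 1.0 - r       # any pair with (g_l - g_r)/(g_l + g_r) == r
    norm = math.hypot(g_l, g_r)       # constant-power normalization
    return g_l / norm, g_r / norm


# A source panned to the centre gets equal gains; one panned to theta0
# is reproduced by a single loudspeaker.
centre = sine_law_gains(0.0, 30.0)
hard_left = sine_law_gains(30.0, 30.0)
```

The panned channels then follow by scaling the source samples with the two gains, as in Equations (11) and (12).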

In this paper, we extend the stereophonic law of sines so that it is applied in the frequency domain. For the direction estimated at the $k$-th frequency bin, $\widehat{\theta}(k)$, as described in Section 3.2, the frequency-dependent scale factors, ${g}_{L}(k)$ and ${g}_{R}(k)$, are obtained using the following equation:

$$\frac{\sin \widehat{\theta}(k)}{\sin {\theta}_{0}}=\frac{{g}_{L}(k)-{g}_{R}(k)}{{g}_{L}(k)+{g}_{R}(k)}\qquad (13)$$

where ${\theta}_{0}$ is again the physical angle between the stereo loudspeakers, as described in Equation (10). As in Equations (11) and (12), the scale factors from Equation (13) are applied to the $k$-th spectral magnitude of the sound source as:

$${Y}_{L}(k)={g}_{L}(k)\,S(k)\qquad (14)$$

and

$${Y}_{R}(k)={g}_{R}(k)\,S(k)\qquad (15)$$

Here, ${Y}_{L}(k)$ and ${Y}_{R}(k)$ denote the panned left and right spectra. While $S(k)$ should be separated from $\mathbf{X}$ according to Equation (4), the spectral magnitude of the sound source is approximated as the mid signal of the recorded sound, that is, $S(k)\approx ({X}_{L}(k)+{X}_{R}(k))/2$. Finally, by applying an inverse STFT followed by the overlap-add method, the output signal with an enhanced stereophonic image is obtained.
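
The per-bin panning step can be sketched for a single STFT frame as follows. This is an illustrative sketch rather than the authors' code: it assumes a loudspeaker angle of ${\theta}_{0}$ = 30°, constant-power normalization of the gains, positive angles panning toward the left channel, and clamping of estimates that fall outside $\pm {\theta}_{0}$. The function name `pan_frame` is hypothetical, and the inverse STFT with overlap-add is left out.

```python
import numpy as np


def pan_frame(X_L, X_R, theta_hat_deg, theta0_deg=30.0):
    """Apply frequency-dependent sine-law panning to one STFT frame.

    X_L, X_R      : complex spectra of the left and right channels, shape (K,)
    theta_hat_deg : estimated direction per bin in degrees, shape (K,)
    theta0_deg    : loudspeaker angle in degrees (30 is an assumed default)
    """
    S = 0.5 * (X_L + X_R)                               # mid signal approximates S(k)
    r = np.sin(np.deg2rad(theta_hat_deg)) / np.sin(np.deg2rad(theta0_deg))
    r = np.clip(r, -1.0, 1.0)                           # keep gains non-negative
    norm = np.sqrt(2.0 * (1.0 + r**2))                  # sqrt((1 + r)^2 + (1 - r)^2)
    g_L = (1.0 + r) / norm
    g_R = (1.0 - r) / norm
    return g_L * S, g_R * S


# Example with a random frame and a constant direction estimate of 15 degrees.
rng = np.random.default_rng(1)
X_L = rng.standard_normal(1025) + 1j * rng.standard_normal(1025)
X_R = rng.standard_normal(1025) + 1j * rng.standard_normal(1025)
Y_L, Y_R = pan_frame(X_L, X_R, np.full(1025, 15.0))
```

Because the gains are normalized per bin, the summed power of the two output channels equals the power of the mid signal in every bin, so the panning changes the image position without changing the overall level.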