Frequency-Dependent Amplitude Panning for the Stereophonic Image Enhancement of Audio Recorded Using Two Closely Spaced Microphones

In this paper, we propose a new frequency-dependent amplitude panning method for stereophonic image enhancement applied to a sound source recorded using two closely spaced omni-directional microphones. The ability to detect the direction of such a sound source is limited due to weak spatial information, such as the inter-channel time difference (ICTD) and inter-channel level difference (ICLD). Moreover, when sound sources are recorded in a convolutive or a real room environment, the detection of sources is affected by reverberation effects. Thus, the proposed method first tries to estimate the source direction depending on the frequency using azimuth-frequency analysis. Then, a frequency-dependent amplitude panning technique is proposed to enhance the stereophonic image by modifying the stereophonic law of sines. To demonstrate the effectiveness of the proposed method, we compare its performance with that of a conventional method based on the beamforming technique in terms of directivity pattern, perceived direction, and quality degradation under three different recording conditions (anechoic, convolutive, and real reverberant). The comparison shows that the proposed method gives us better stereophonic images in a stereo loudspeaker reproduction than the conventional method without any annoying effects.


Introduction
Stereo loudspeaker reproduction is widely used to provide a more natural listening experience because the relative positions of objects and events in the horizontal plane can be distinguished [1]. In fact, a stereo audio system can deliver a more immersive illusion than a mono system, because spatial information (e.g., inter-channel time difference (ICTD) and inter-channel level difference (ICLD) [2]) helps listeners perceive a horizontal direction [3]. In addition, according to duplex theory [2], the ICTD and ICLD are dominant for horizontal sound localization at low frequencies (below 1-2 kHz) and high frequencies (above 1-2 kHz), respectively.
Stereo audio recording techniques can be classified into three different categories depending on the placement and characteristics of microphones: coincident, near-coincident, and spaced recording techniques [4,5]. Coincident recording techniques such as the XY and mid-side (MS) techniques [5] place stereo microphones as close together as possible at different angles to capture a stereophonic image, where the stereophonic image refers to the perceived spatial locations of the sound sources in recording and reproduction. Thus, a good stereophonic image means that the location of the sound source can be clearly perceived, while a poor one means that the location of the source is difficult to perceive [6]. In addition, near-coincident recording techniques such as the Office de Radiodiffusion Télévision Française (ORTF) and Nederlandse Omroep Stichting (NOS) techniques [4,7] place microphones slightly apart. In the ORTF technique, the microphone spacing is similar to the human ear spacing, while the spacing for the NOS technique is approximately 30 cm. In general, both coincident and near-coincident recording techniques utilize directional microphones to realize good directional characteristics [4]. However, with spaced recording techniques, including AB techniques [4,8], stereophonic images can be obtained by the ICTD between stereo microphones, because omni-directional microphones are often used in such techniques [4,9].
To date, numerous portable video and audio capture devices have been released to the market. These devices usually capture stereo audio as well as high-quality video. Because most portable devices are limited in size, coincident or near-coincident recording techniques are most appropriate for such devices. However, when the audio signals captured by these recording techniques are reproduced, the stereophonic images are often perceived as insufficient [10][11][12][13]. This is because the body of a portable device equipped with directional microphones acts as a reflecting wall during recording, which is referred to as shadowed directivity [10]. As an alternative, a spaced recording technique can be applied to capture stereo audio from such portable devices [11][12][13]. In this case, the width or length of the portable device body, such as a smart phone or digital camera, is approximately 10 cm; thus, the allowable distance between two microphones is less than 3 cm. This hardware limitation makes it difficult to match the perceived azimuth angle of the reproduced sound source to that of the original sound source when a spaced recording technique is applied [8].
To mitigate this, a stereophonic image enhancement method using head-related transfer functions (HRTFs) was proposed in [11]. This method produced better stereophonic images than the original stereo signals, but it had a somewhat limited sweet spot, and it was difficult to deliver reliable stereo quality to a listener located beyond this sweet spot. If the original stereo signals were nearly monaural, it was difficult to enhance the stereophonic image properly, even though the illusion of wider stereo loudspeaker spacing was created. Umayahara et al. proposed a stereophonic image control method that linearly interpolated the spectra of the left and right channels in the frequency domain [12]. This method used the same interpolation factor for all frequencies; thus, its performance for enhancing a stereophonic image might have been limited when the direction of the input stereo signals was dependent on the frequency [14]. In another study, a delay-and-sum (DS) beamformer was utilized to convert AB stereo signals into XY stereo signals [13]. However, in real reverberant recording environments, the DS beamformer changed the direction of stereophonic images due to the reverberant effects [15]. This was because the reverberation time changed according to the frequency, and the DS beamforming weights could not be adapted to this reverberation time change [14]. These results suggest that frequency-dependent amplitude panning for stereophonic image enhancement is necessary for real reverberant environments.
Accordingly, we propose frequency-dependent amplitude panning for stereophonic image enhancement when two omni-directional microphones are closely spaced, as deployed in portable devices. In [16,17], frequency-dependent amplitude panning was also used to enhance the stereophonic image for portable devices equipped with closely spaced stereo microphones, where the ratio of spectral magnitudes between the left- and right-channel signals was used for the panning. On the other hand, the proposed method first introduces azimuth-frequency (A-F) analysis [18] to estimate the direction of the input audio according to frequency. In other words, we first apply a short-time Fourier transform (STFT) to the input stereo audio signal, and then project the spectral component of each frequency bin on an azimuth plane that is generated by converting the time difference between two microphones into their level difference. Next, the direction at each frequency bin is assigned as the azimuth at which the projected magnitude is minimized. Finally, a frequency-dependent amplitude panning technique is proposed to enhance the stereophonic image by modifying the stereophonic law of sines [19].
To evaluate the performance of the proposed method, three different recording environments are considered: anechoic, convolutive, and real reverberant. First, the directivity pattern of the proposed method is compared with that of a conventional method based on the DS beamformer [13] in the three recording environments. Second, the directional accuracy of the stereo audio processed by the proposed method is compared to that of the audio processed by the conventional method by measuring the directions perceived by listeners for each horizontal direction of the sources. Finally, a degraded mean opinion score (DMOS) assessment [20] is carried out to evaluate the audio quality degradation after the proposed method has been applied.
The remainder of this paper is organized as follows: a conventional method based on the DS beamformer [13] is described in Section 2. Section 3 describes the proposed stereophonic image enhancement method based on A-F analysis and a frequency-dependent amplitude panning technique. Section 4 then evaluates the performance of the proposed method applied to audio signals recorded in anechoic, convolutive, and real reverberant environments. Finally, Section 5 concludes this paper.

Conventional Stereophonic Image Enhancement
In this section, we describe a conventional stereophonic image enhancement method applied to closely spaced omni-directional microphones based on a DS beamforming technique [13,21]. Figure 1 shows a block diagram of the conventional method. As shown in the figure, the DS beamformer of the conventional method compensates for the delay between the two channels, where the delay, $n_d$, is determined depending on the distance, $l$, between the two microphones. After that, a free-field response filter, $h(n)$ [13], is applied to the beamformed signals to obtain the enhanced stereo signals, $y_L(n)$ and $y_R(n)$, respectively, by:

$$y_R(n) = h(n) * \left( x_L(n) - x_R(n - n_d) \right)$$

where $*$ indicates the linear convolution operator, and $x_L(n)$ and $x_R(n)$ are the stereo audio sequences obtained by the omni-directional microphones. The directivity patterns for the beamformed signal of the left channel at 2.5 kHz and 4 kHz are depicted in Figure 2a,b, respectively. Note that the directivity patterns for the right channel are exactly the opposite of those for the left channel. As illustrated in the figure, the conventional method provides different directional responses depending on the frequency. That is, it has a cardioid and a super-cardioid directivity pattern at 2.5 and 4 kHz, respectively. It is expected that the stereophonic image of the audio signal at 2.5 kHz should be enhanced, but that at 4 kHz could be distorted due to the negative rear lobe of the super-cardioid pattern [13]. To remedy this problem, a Wiener filter has been applied to the beamformed signal to reduce the negative rear-lobe effects [13]. Nevertheless, it was reported that the DS beamformer with a Wiener filter could not change the direction of a stereophonic image when audio recording was performed in a reverberant environment [13]. This is because the reverberation time varies with frequency and the DS beamformer cannot adapt to such reverberation time changes [14]. Therefore, we propose a method that operates differently depending on the frequency, referred to as the frequency-dependent amplitude panning for stereophonic image enhancement (FDAP-SIE) method, and compare the performance of our proposed method with that of the DS beamformer described in this section.
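The delay compensation and subtraction described in this section can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of [13]: the free-field response filter $h(n)$ is omitted because its coefficients are not given here, the helper names `inter_mic_delay` and `delay_and_subtract` are hypothetical, the delay is rounded to a whole number of samples, and a speed of sound of 343 m/s is assumed.

```python
import numpy as np

FS = 48_000    # sampling rate in Hz (value used in the experiments)
L_MIC = 0.03   # microphone spacing in metres (3 cm)
C = 343.0      # speed of sound in m/s (assumed value)


def inter_mic_delay(fs=FS, spacing=L_MIC, c=C):
    """Largest inter-microphone delay, rounded to whole samples."""
    return int(round(fs * spacing / c))


def delay_and_subtract(x_l, x_r, n_d):
    """Return x_L(n) - x_R(n - n_d); the filter h(n) is deliberately left out."""
    x_l = np.asarray(x_l, dtype=float)
    x_r = np.asarray(x_r, dtype=float)
    # Delay the right channel by n_d samples, padding the start with zeros.
    x_r_delayed = np.concatenate([np.zeros(n_d), x_r[:len(x_r) - n_d]])
    return x_l - x_r_delayed
```

With these assumed values the delay is four samples, and a signal that reaches the left microphone exactly $n_d$ samples after the right one is cancelled, which corresponds to the null of the resulting directivity pattern.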

Overview
In this section, we propose an FDAP-SIE method applied to audio recording with two closely spaced omni-directional microphones and illustrate the block diagram of the proposed method in Figure 3. First, the left- and right-channel input signals, respectively designated $x_L(n)$ and $x_R(n)$, are each segmented into a sequence of frames of 2048 samples by applying a Hanning window, and each frame overlaps the previous frame by 1024 samples. Then, a 2048-point STFT is applied to each segment to obtain $X_L(k)$ and $X_R(k)$. Next, A-F analysis is carried out to estimate the direction of the sound sources in each frequency bin. After that, frequency-dependent amplitude panning is applied to $X_L(k)$ and $X_R(k)$ according to the estimated direction for the k-th frequency bin. Finally, an inverse STFT followed by an overlap-add method is applied to obtain the enhanced stereophonic signal.
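The framing, STFT, and overlap-add steps above can be sketched as follows. The sketch assumes a periodic Hann window applied only at analysis, which sums to one at 50% overlap so that plain overlap-add reconstructs the interior of the signal; the exact window variant used in the paper is not stated, and the function names are illustrative.

```python
import numpy as np

FRAME = 2048  # frame length in samples (from the text)
HOP = 1024    # hop size, i.e. 1024-sample overlap (from the text)


def periodic_hann(n):
    """Periodic Hann window; w[i] + w[i + n/2] == 1 (assumed variant)."""
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n) / n)


def stft(x, frame=FRAME, hop=HOP):
    """Windowed frames -> one-sided spectra, shape (n_frames, frame/2 + 1)."""
    w = periodic_hann(frame)
    n_frames = 1 + (len(x) - frame) // hop
    return np.stack([np.fft.rfft(w * x[i * hop:i * hop + frame])
                     for i in range(n_frames)])


def istft(X, frame=FRAME, hop=HOP):
    """Inverse FFT of each frame followed by overlap-add."""
    n_frames = X.shape[0]
    y = np.zeros(hop * (n_frames - 1) + frame)
    for i in range(n_frames):
        y[i * hop:i * hop + frame] += np.fft.irfft(X[i], n=frame)
    return y
```

In the proposed method, the A-F analysis and panning would operate on the spectra between `stft` and `istft`; with no processing in between, every sample covered by two frames is recovered unchanged.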

Azimuth-Frequency Analysis Using Time Delay
A stereo signal recorded using a stereo omni-directional microphone array, $x_L(n)$ and $x_R(n)$, can be represented as a delayed and attenuated version of the desired signal, $s(n)$, as in Equation (3) [21], where $v_L(n)$ and $v_R(n)$ are the ambient noise recorded by the left and right microphones, respectively. In addition, $a_L$ and $a_R$ are the respective attenuation factors, and $\tau$ is the relative time delay measured between the left and right microphones. Note here that Equation (3) is designed using the far-field model [18,19], because the spacing between the stereo omni-directional microphones is small. Moreover, we can assume $a_L = a_R \approx 1$ [22]. Applying an $N$-point STFT to Equation (3) provides the frequency-domain relationship of Equation (4), in which $S(k)$ is the k-th spectral component of $s(n)$ and $\mathbf{d}$ is a steering vector that depends on $\tau$. Here, $\tau$ can be determined by the speed of sound $c$, the spacing between the microphones $l$, and the direction of the source $\theta$, as $\tau = (f_s / c)\, l \sin\theta$, where $f_s$ is the sampling rate. Thus, the steering vector $\mathbf{d}$ becomes a function of $\theta$.
If $\theta$ is known, we can separate $S(k)$ and $\mathbf{d}$ from Equation (4). Then, we can modify $\mathbf{d}$ by replacing $\theta$ with another value to improve the obtained stereophonic images. This is because the listener cannot perceive the actual direction of $S(k)$ when two stereo microphones are placed very close together. In practice, it is difficult to separate the direction $\mathbf{d}$ and source $S(k)$, and it is even more difficult to do so under ambient noise conditions and/or with multiple sound sources [23]. Therefore, instead of separating the sound source and its steering vector in this paper, we apply a panning law to the recorded signal $\mathbf{X}$ with the estimated direction.
To estimate the source direction, we consider the time delay $\tau$ in Equation (3) using the stereo signals $x_L(n)$ and $x_R(n)$, where we have assumed that $v_L(n)$ and $v_R(n)$ are negligible under high signal-to-noise ratio (SNR) conditions. In other words, the time delay is estimated as $\tau = \arg\min_{\tau} |x_L(n) - x_R(n - \tau)|$. We can then extend this concept to the frequency domain, as in Equation (7). In this paper, $\tau$ in Equation (7) can be considered as a function of the direction $\theta$. Therefore, the right-hand side of Equation (7), which is a function of the frequency bin $k$ and the direction $\theta$, is referred to as an A-F plane, $AF(k, \theta)$ [18].
We can estimate the direction $\theta(k)$ so that $AF(k, \theta)$ is minimized at the k-th frequency bin. However, when $AF(k, \theta)$ is used directly for estimating $\theta(k)$, many local minima exist. To mitigate this problem, a smoothing window is applied to $AF(k, \theta)$ prior to estimating $\theta(k)$: the smoothed plane $AF_s(k, \theta)$ is formed from the values $AF(m, \theta)$ at the bins $m$ in a window around $k$ whose width $B(k)$ corresponds to a critical bandwidth of the auditory filter [2]. For example, $B(k) = 6$ (150 Hz) when $k = 43$ (1 kHz). Thus, the direction at each frequency bin is estimated so that $AF_s(k, \theta)$ is minimized, such that $\theta(k) = \arg\min_{\theta} AF_s(k, \theta)$.
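For reference, the quantities used in this subsection can be written out in one possible form. The following is a sketch under assumptions rather than a reproduction of the numbered equations: the sign convention (the left channel lagging the right by $\tau$ samples) is chosen only so that it agrees with the time-domain estimate $\tau = \arg\min_{\tau} |x_L(n) - x_R(n - \tau)|$ given above, $a_L = a_R = 1$ is used in the steering vector, and the smoothing is written as a plain average over a window $\mathcal{W}(k)$ of $B(k)$ bins centred on $k$, which is an assumed form of the smoothing window.

```latex
% Assumed convention: x_L lags x_R by \tau samples; a_L = a_R = 1 in \mathbf{d}.
\begin{align}
  x_L(n) &= a_L\, s(n - \tau) + v_L(n), \qquad x_R(n) = a_R\, s(n) + v_R(n), \\
  \mathbf{X} &= \mathbf{d}\, S(k) + \mathbf{V}, \qquad
  \mathbf{X} = \begin{bmatrix} X_L(k) \\ X_R(k) \end{bmatrix}, \quad
  \mathbf{V} = \begin{bmatrix} V_L(k) \\ V_R(k) \end{bmatrix}, \quad
  \mathbf{d} = \begin{bmatrix} e^{-j 2\pi k \tau / N} \\ 1 \end{bmatrix}, \\
  \tau(\theta) &= \frac{f_s}{c}\, l \sin\theta, \\
  AF(k, \theta) &= \left| X_L(k) - e^{-j 2\pi k \tau(\theta) / N}\, X_R(k) \right|, \\
  AF_s(k, \theta) &= \frac{1}{|\mathcal{W}(k)|} \sum_{m \in \mathcal{W}(k)} AF(m, \theta), \\
  \theta(k) &= \arg\min_{\theta}\, AF_s(k, \theta).
\end{align}
```

Under this convention, a noise-free source at angle $\theta$ gives $X_L(k) = e^{-j 2\pi k \tau(\theta)/N} X_R(k)$, so $AF(k, \theta)$ vanishes at the true direction for every bin $k$.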
Figure 4 illustrates an A-F plane, $AF(k, \theta)$, and a smoothed A-F plane, $AF_s(k, \theta)$, computed for a stereo signal that is recorded from a stereo microphone array in an anechoic room, where a white noise source is angled at 15° and placed 1.5 m from the center of the microphone array. In the figure, a 2048-point STFT is applied to each frame of white noise, and $\theta$ is changed from −90° to 90° in 1° steps. In addition, the distance between the two microphones is $l$ = 3 cm and $f_s$ = 48 kHz. As shown in the figure, the direction of the white noise is easily estimated at low frequencies, but there are multiple minima at mid-to-high frequencies. As shown in Figure 4c, the estimated direction of the white noise is 15°, which is identical to the direction at which the white noise is located for recording.
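The A-F analysis for the configuration just described (2048-point STFT, $l$ = 3 cm, $f_s$ = 48 kHz, $\theta$ from −90° to 90° in 1° steps) can be sketched as below. Several points are assumptions rather than details taken from the paper: the sign of the phase term, a speed of sound of 343 m/s, a plain moving average as the smoothing window, and all function names.

```python
import numpy as np

FS = 48_000    # sampling rate in Hz (from the text)
L_MIC = 0.03   # microphone spacing in metres (from the text)
C = 343.0      # speed of sound in m/s (assumed value)
N = 2048       # STFT size (from the text)


def delay_samples(theta_deg):
    """tau = (fs / c) * l * sin(theta), expressed in samples."""
    return FS / C * L_MIC * np.sin(np.deg2rad(theta_deg))


def af_plane(X_l, X_r, thetas_deg):
    """AF(k, theta) = |X_L(k) - exp(-j 2 pi k tau(theta) / N) X_R(k)|.

    X_l and X_r are one-sided spectra (bins 0 .. N/2); the sign of the
    phase term is an assumed convention.
    """
    k = np.arange(len(X_l))[:, None]
    tau = delay_samples(np.asarray(thetas_deg))[None, :]
    phase = np.exp(-2j * np.pi * k * tau / N)
    return np.abs(X_l[:, None] - phase * X_r[:, None])


def smooth_af(af, bandwidth_bins):
    """Average AF over about bandwidth_bins[k] neighbouring bins around k."""
    n_bins = af.shape[0]
    out = np.empty_like(af)
    for k in range(n_bins):
        half = max(int(bandwidth_bins[k]) // 2, 0)
        lo, hi = max(k - half, 0), min(k + half + 1, n_bins)
        out[k] = af[lo:hi].mean(axis=0)
    return out


def estimate_direction(af_s, thetas_deg):
    """theta(k) = argmin over theta of the smoothed plane, per bin."""
    return np.asarray(thetas_deg)[np.argmin(af_s, axis=1)]
```

With these values the largest possible delay is about 4.2 samples, and one bin spans 48000/2048 ≈ 23.4 Hz, so the example $B(k) = 6$ at $k = 43$ corresponds to roughly 140 Hz around 1 kHz. A real implementation would derive `bandwidth_bins` from a critical-bandwidth model, which is not reproduced here.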
Next, we repeat the experiment above by recording white noise in a reverberant room whose reverberation time ($RT_{60}$) is measured as 230 ms, and the A-F planes and estimated direction are shown in Figure 5. Comparing Figure 5a with Figure 4a, the A-F plane in the reverberant room is more blurred than that in the anechoic room. This is because the reverberation muddles the direction of the sound source, making it seem as though multiple sound sources are being recorded by the stereo microphones. Owing to the smoothing window, the smoothed A-F plane shown in Figure 5b becomes similar to that in Figure 4b. Therefore, as shown in Figure 5c, the direction of white noise can be estimated correctly, especially at mid-to-high frequencies, while there are some errors at low frequencies. Since it is known that stereophonic images are mostly affected by mid-to-high frequencies, the quality of stereophonic images is not significantly affected by such errors at low frequencies [24].

Frequency-Dependent Amplitude Panning
This subsection describes how the estimated direction in each frequency bin is used for stereophonic image enhancement. Figure 6 illustrates the concept of the process described in this subsection. As shown in Figure 6a, a sound source is located at an angle of $\theta$. However, the close spacing between the stereo microphones could mean that it is perceived as being at a smaller angle, i.e., $\theta_p \ll \theta$. Thus, we have to increase the perceived angle by applying frequency-dependent amplitude panning such that $\theta_0 \approx \theta \gg \theta_p$. Many panning methods have been reported [19,25,26]. Among them, the stereophonic law of sines [19] has been popularly used to reproduce a source using two loudspeakers. It determines the scale factors $g_L$ and $g_R$ from $\theta_0$ and $\theta$, where $\theta_0$ is the physical angle between stereo loudspeakers and $\theta$ is the desired angle at which the sound source should be located in terms of perception. Thus, $g_L$ and $g_R$ become the respective scale factors that are multiplied with the sound source according to the desired angle, as:

$$y_L(n) = g_L\, s(n)$$

and

$$y_R(n) = g_R\, s(n), \quad (13)$$

where $s(n)$ is the sound source, and $y_L(n)$ and $y_R(n)$ are respectively the panned signals of the left and right channel.
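As an illustration of the panning gains discussed above, the sketch below uses the commonly quoted form of the stereophonic law of sines, $\sin\theta / \sin\theta_0 = (g_R - g_L)/(g_R + g_L)$. Several choices are assumptions and not taken from the paper: positive angles pan toward the right loudspeaker, $\theta_0$ is read as the 30° offset of each loudspeaker from the median plane, the gains are normalised to unit power, and the ratio is clipped when $|\theta| > \theta_0$.

```python
import numpy as np


def sine_law_gains(theta_deg, theta0_deg=30.0):
    """Gains (g_L, g_R) with (g_R - g_L) / (g_R + g_L) = sin(theta) / sin(theta0).

    Positive theta pans to the right (assumed sign convention). The ratio is
    clipped to [-1, 1] so that angles beyond the loudspeakers map onto a
    single loudspeaker (assumed behaviour).
    """
    r = np.sin(np.deg2rad(theta_deg)) / np.sin(np.deg2rad(theta0_deg))
    r = np.clip(r, -1.0, 1.0)
    g_l, g_r = 1.0 - r, 1.0 + r
    norm = np.sqrt(g_l ** 2 + g_r ** 2)  # unit-power normalisation (assumed)
    return g_l / norm, g_r / norm
```

For a desired angle of 0° both gains are equal, and at $\theta = \theta_0$ the whole signal is sent to the right loudspeaker.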
In this paper, we extend the stereophonic law of sines so that it is applied in the frequency domain. For a given direction at the k-th frequency bin, $\theta(k)$, as described in Section 3.2, the frequency-dependent scale factors, $g_L(k)$ and $g_R(k)$, are obtained from a frequency-dependent form of the law of sines, where $\theta_0$ is again the physical angle between stereo loudspeakers, as described in Equation (10). As in Equations (11) and (12), the scale factors from Equation (13) are multiplied by the k-th spectral magnitude of the sound source as:

$$Y_L(k) = g_L(k)\, S(k)$$

and

$$Y_R(k) = g_R(k)\, S(k).$$
Here, while $S(k)$ should be separated from $\mathbf{X}$ according to Equation (4), the spectral magnitude of the sound source is approximated as the mid signal of the recorded sound, that is, $S(k) \approx (X_L(k) + X_R(k))/2$. Finally, by applying an inverse STFT followed by the overlap-add method, the output signal with an enhanced stereophonic image is obtained.
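The per-bin panning of the mid signal can be sketched for a single frame as follows. The gain rule is the same assumed form of the law of sines as a stand-in for the paper's modified law (positive angles to the right, $\theta_0$ = 30°, unit-power normalisation, clipping beyond the loudspeakers), and `fdap_frame` is a hypothetical name.

```python
import numpy as np


def fdap_frame(X_l, X_r, theta_k_deg, theta0_deg=30.0):
    """Pan the mid signal S(k) ~ (X_L(k) + X_R(k)) / 2 using per-bin angles.

    theta_k_deg holds the estimated direction theta(k) for every bin.
    The gain rule is an assumed stand-in for the modified law of sines.
    """
    X_l = np.asarray(X_l, dtype=complex)
    X_r = np.asarray(X_r, dtype=complex)
    theta = np.asarray(theta_k_deg, dtype=float)
    r = np.sin(np.deg2rad(theta)) / np.sin(np.deg2rad(theta0_deg))
    r = np.clip(r, -1.0, 1.0)
    g_l, g_r = 1.0 - r, 1.0 + r
    norm = np.sqrt(g_l ** 2 + g_r ** 2)
    S = 0.5 * (X_l + X_r)           # mid-signal approximation of the source
    return (g_l / norm) * S, (g_r / norm) * S
```

Because the gains are normalised per bin, $|Y_L(k)|^2 + |Y_R(k)|^2 = |S(k)|^2$ in this sketch, so the panning redistributes energy between the channels without changing the level of each bin.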

Performance Evaluation
To demonstrate the effectiveness of the proposed method, three different recording environments were considered: anechoic, convolutive, and real reverberant. Figure 7 illustrates the configuration for the room impulse response (RIR) filter design. The dimensions of the room were 4.5 m × 7.5 m × 2.5 m, and a stereo microphone array with 3-cm spacing was located in the room at the coordinates denoted in the figure. To simulate the convolutive environment, an RIR filter was designed based on the image method [27], and the response of the left channel is shown in Figure 8. As shown in the figure, the $RT_{60}$ of this RIR was measured as 230 ms. The performance of the proposed method was evaluated in terms of three different measurements. First, the directivity pattern of the proposed method was compared with that of a conventional method based on a DS beamformer [13] in the three recording environments. Second, the accuracy of direction estimates for the stereo audio processed by the proposed method was compared with that processed by the conventional method by measuring the directions perceived by listeners for each horizontal direction of the sources. Third, a DMOS assessment [20] was carried out to evaluate the audio quality degradation after the proposed method had been applied.
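The text does not state how $RT_{60}$ was obtained from the impulse response. One common approach, shown here only as an assumed illustration, is Schroeder backward integration followed by a line fit to the energy decay curve; the fit range of −5 dB to −25 dB and the function name are choices made for this sketch.

```python
import numpy as np


def rt60_schroeder(h, fs, lo_db=-5.0, hi_db=-25.0):
    """Estimate RT60 from an impulse response via Schroeder integration.

    The energy decay curve is the backward-integrated squared response;
    a straight line is fitted between lo_db and hi_db and extrapolated
    to a 60 dB decay.
    """
    h = np.asarray(h, dtype=float)
    energy = np.cumsum(h[::-1] ** 2)[::-1]          # backward integration
    edc_db = 10.0 * np.log10(energy / energy[0])    # normalised decay in dB
    idx = np.where((edc_db <= lo_db) & (edc_db >= hi_db))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # decay rate in dB/s
    return -60.0 / slope
```

Applied to a synthetic exponentially decaying response whose level drops by 60 dB in 230 ms, this estimate returns approximately 0.23 s.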

Directivity Pattern Performance
Figure 9 shows an experimental setup for evaluating the performance of the directivity patterns. A stereo microphone array was placed with 3-cm spacing, and one loudspeaker was located at 60° from the center of the microphone array at a distance of 1.5 m. White noise at a sampling rate of 48 kHz was played through the loudspeaker and recorded by the microphone array. The recorded signal was processed by both the DS beamformer and the proposed method. After that, the recorded and processed white noise signals were all played through stereo loudspeakers that were configured according to International Telecommunication Union Radiocommunication Sector (ITU-R) Recommendation BS.775-1 [28]. Then, a dummy head [29] was rotated from 0° to 350° in 10° steps to measure the directivity patterns. Figure 10 compares the directivity patterns of the original source with those obtained by the DS beamformer and the proposed method in three different environments (anechoic, convolutive, and real reverberant room). As shown in Figure 10a, the directivity for the original signal was towards 0° in the anechoic environment, although the actual source direction was set to 60°. However, the directivities of the signals processed by the conventional and proposed methods were approximately 30°, which was the same as the angle of the loudspeakers relative to the dummy head. Consequently, we concluded that the proposed and conventional methods significantly enhanced the originally recorded signal. Next, when comparing the directivity in the convolutive and real reverberant environments, it was clear that the proposed method could locate the sound source at approximately 30°, while the DS beamformer failed to do so. This was because the simulated and real reverberation limited the stereophonic image enhancement of the DS beamformer. However, the proposed method was not affected by the reverberation, due to the frequency-dependent direction estimation and panning.

Perceived Direction Performance
Figure 11 shows the experimental setup for evaluating listeners' directional perception. To record stereo signals, a stereo microphone array was placed with 3-cm spacing, and a sound source was played through one loudspeaker placed at a distance of 1.5 m from the center of the microphone array. For the evaluation, each listener sat in an anechoic room of dimensions 2130 mm × 3370 mm × 3000 mm, in which two loudspeakers had been placed as shown in the figure. Note that all the loudspeakers were Genelec 6010A models. In this experiment, we prepared five audio clips that were excerpted from the sound quality assessment material (SQAM) [30]; Table 1 describes the genre and musician of each audio clip. Note that since all audio clips were sampled at 44.1 kHz, we upsampled the audio clips to 48 kHz to ensure a consistent experimental environment. Then, we recorded the five audio clips in three different environments where the loudspeaker was rotated from 0° to 90° in 15° steps towards the right, resulting in seven different directions. After that, the recorded signals were processed by the DS beamformer and the proposed method (some audio samples can be found at [31]). After the processed clips were played at a sound pressure level (SPL) of 90 dB, eight participants (four males and four females) with no auditory disorders were asked to indicate their perceived directions for the original and processed signals. Note that the participants were allowed head movement. Figure 12 compares the perceived azimuths averaged over five audio clips and eight participants for each source direction in three different recording environments, where the dashed line indicates the target direction, and the vertical bar on each bar chart corresponds to the standard deviation. As shown in the figure, the originally recorded signals were all perceived at around 0°-10° for all environments, even though the actual source angles were above 15°. By applying the DS beamformer to enhance the image, the perceived angles increased. However, errors between the actual angle (dashed straight line) and the perceived angle increased as the actual angle increased, especially in anechoic and real reverberant environments. The proposed method provided smaller perceived errors than the DS beamformer, which implies that it could enhance stereophonic images for all recording environments compared to the conventional method.
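The averaging behind a comparison such as Figure 12 can be sketched as follows. The array layout (directions × clips × participants), the use of the sample standard deviation, and the function name are assumptions made for illustration; no measured data are reproduced here.

```python
import numpy as np


def summarize_perceived(perceived_deg, target_deg):
    """Mean, standard deviation, and absolute error per source direction.

    perceived_deg has shape (directions, clips, participants); the mean and
    sample standard deviation are taken over all clips and participants.
    """
    perceived = np.asarray(perceived_deg, dtype=float)
    flat = perceived.reshape(perceived.shape[0], -1)
    mean = flat.mean(axis=1)
    std = flat.std(axis=1, ddof=1)
    error = np.abs(mean - np.asarray(target_deg, dtype=float))
    return mean, std, error
```

For the setup above, `target_deg` would hold the seven source directions from 0° to 90° in 15° steps, and `perceived_deg` would have shape (7, 5, 8).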

Audio Quality Degradation
To evaluate the quality degradation of audio signals processed by the proposed method, we performed a DMOS assessment test according to ITU Telecommunication Standardization Sector (ITU-T) Recommendation P.800 [20]. The experimental conditions such as audio clips, participants, and listening room were identical to those of the experiment described in Section 4.2. Each participant listened to a pair of audio clips composed of an original version and a version processed by either the DS beamformer or the proposed method. Then, each participant was asked to rate the degree of quality degradation from five to one. Table 2 describes the scores and their meanings for the DMOS assessment; the lowest score corresponds to "Degradation is very annoying". Table 3 compares the results of the DMOS assessment between the conventional DS beamformer-based method and the proposed method in three different recording environments. We conducted a statistical analysis and indicated the 95% confidence intervals (CIs) as numbers in parentheses in Table 3. As shown in the table, the proposed method provided average DMOS scores of approximately four for all environments, which implied that there were no annoying effects [32]. However, there was significant quality degradation in the conventional method, especially in the real reverberant environment. The statistical analysis indicated that the quality degradation of the audio signals enhanced by the proposed method was less than that of the signals enhanced by the DS beamformer.
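A mean DMOS with a 95% confidence interval, as reported in Table 3, can be computed along the following lines. The paper does not state how the intervals were derived, so the normal approximation (factor 1.96) used here is an assumption, and the ratings in the usage note are hypothetical.

```python
import numpy as np


def mean_with_ci95(scores):
    """Return (mean, half-width) of a 95% CI using a normal approximation."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = 1.96 * scores.std(ddof=1) / np.sqrt(scores.size)
    return mean, half_width
```

For example, the hypothetical ratings `[4, 4, 5, 3, 4, 4, 5, 3]` give a mean of 4.0 with a half-width of about 0.52. For small panels, a Student's t quantile would give a somewhat wider interval.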

Conclusions
In this paper, we proposed a frequency-dependent stereophonic image enhancement method that could be applied to two closely spaced omni-directional microphones available for portable audio recording devices. First, the A-F plane was obtained from the spectral magnitudes of stereo audio signals. Next, the direction at each frequency bin was estimated as the azimuth at which the A-F plane was minimized. Finally, a frequency-dependent amplitude panning technique based on the stereophonic law of sines was proposed to enhance the stereophonic image. The performance of the proposed method was evaluated in three different recording environments: anechoic, convolutive, and real reverberant. First, the directivity pattern of the proposed method was compared to that of a conventional method based on a DS beamformer. Second, the directional accuracy of the stereo audio processed by the proposed method was compared to that processed by a conventional method by measuring listeners' perceived directions. Finally, a DMOS assessment test was carried out to evaluate quality degradation after the proposed method had been applied. Consequently, it was revealed that the proposed method gave better directivity, higher directional accuracy, and less quality degradation than the conventional method. These results suggest that, compared to the conventional method, the proposed method improves performance owing to its frequency-dependent processing.
We have only experimented with a single source throughout this study, so we are planning to examine what happens when multiple sources are present. One possible approach will be to detect multiple directions from the A-F analysis and propose an appropriate panning method that can treat multiple angles.

Figure 1. Block diagram of a conventional stereophonic image enhancement method.

Figure 3. Block diagram of the proposed frequency-dependent amplitude panning for stereophonic image enhancement.

Figure 6. Illustrations of stereophonic image enhancement: (a) original sound source; (b) perceived sound source without any enhancement technique; and (c) perceived sound source after applying the proposed method.

Figure 7. Experimental setup for simulating the room impulse response to simulate a reverberant environment.

Figure 8. Simulated room impulse response of the left channel based on the image method, where RT 60 was measured as 230 ms.

Figure 9. Experimental setup for evaluating the directivity patterns.

Figure 10. Comparison of directivity pattern of the original source with those obtained by the DS beamformer and the proposed method in three different environments: (a) anechoic; (b) convolutive; and (c) real reverberant room.

Figure 11. Experimental setup for evaluating the perceived directions.

Figure 12. Comparison of the perceived azimuth depending on the direction of the source in three different environments: (a) anechoic; (b) convolutive; and (c) real reverberant room.

Table 1. Detailed information on the five audio clips used for the evaluation of perceived direction.

Table 2. Score and description of degraded mean opinion score (DMOS) assessment.

Table 3. Comparison of DMOS assessment results of the conventional and proposed methods for three different recording environments, where the numbers in parentheses indicate 95% CIs.