1. Introduction
Many people participate in large-scale events, such as international conferences and exhibitions, and interact with various people at the venue. If we can analyze these interactive activities, various applications can be realized based on the analysis results. For example, it is possible to identify the booths and times at which dialogue activities are most likely to occur, or to identify a key person who forms dialogue groups and facilitates them. Based on the number of people who participated in each dialogue, it is also possible to estimate which dialogues covered topics of interest to many people.
The analysis of people's dialogue activities requires the ability to monitor them. Research on monitoring dialogue activities is progressing thanks to work on the position of each speaker (sound source localization [1]), the orientation of each speaker's head (head orientation estimation [2,3,4]), the utterance period (Voice Activity Detection [5]), the content of the utterance (Automatic Speech Recognition [6]), the identity of the speaker (Speaker Identification [7]), sound source separation [8], etc. With the development of these elemental technologies for acoustic and speech signal processing, it is becoming possible to analyze actual dialogue activities in the real world. The elemental technologies are related to each other: if the speaker's position and head direction are known, it becomes easier to separate the spoken sounds, and if specific utterances can be separated from other utterances, background noise, and environmental sounds, utterance intervals can be detected more easily. Furthermore, by linking with downstream applications such as automatic speech recognition and speaker identification, it is possible to recognize the content of the separated speech and identify the speaker. The high-level analysis function of a dialogue activity monitoring system is realized by clarifying the relationships between speakers, that is, who is speaking with whom, and therefore requires the cooperation of many elemental technologies. Analysis from a higher-order meta-perspective, such as analyzing multiple conversations simultaneously, requires the collaboration of even more of them.
Although many of the functions [1,9] expected of dialogue activity monitoring systems are attractive, these functions have not been sufficiently developed. We have so far been able to estimate the speaker's position [1], but we have not yet integrated other elemental technologies such as estimation of the speaker's head direction. This is because our development must satisfy a requirement that conventional methods have not considered: covering a wide observation range. In the analysis of real dialogue, the positions of the speakers are not determined in advance, so a wide observation range is necessary to deal with the problem of not knowing when and where a dialogue will begin. In a simulated dialogue, it is possible to instruct the speakers to start a dialogue within the observation range, but the speech activities are then performed according to a script, which poses different problems from analyzing actual dialogue activities.
We believe that it is necessary to develop a real-time system that assumes that microphones are distributed over a wide area and collect sound all the time. Based on this belief, we proposed a dialogue monitoring method [1] in which microphone arrays are distributed at the vertices of a repeating regular hexagon with a fixed side length. The system has a distributed arrangement of 4-channel microphone arrays and a structure that does not require time synchronization between the microphone arrays. Because this structure allows an unlimited number of microphone arrays to be added, the observation range can be expanded simply by adding arrays. This system makes it possible to monitor the positions of multiple sound sources simultaneously. In other words, it can capture the actual dialogue of the speakers without the need to alert them to the existence of the sound collection system.
In this work, we extend the speaker position estimation system we previously developed, which can observe a wide area, by adding a function for estimating the speaker's head direction. Specifically, we focus on head orientation estimation based on a hexagonal distributed arrangement of microphones and propose a head orientation estimation method that can be integrated into our sound source localization system [1]. We also propose a method that handles multiple frequency bands for head orientation estimation and experimentally demonstrate effective band combinations for the estimation.
The remaining sections are organized as follows. Section 2 reviews the head orientation estimation problem and conventional estimation methods. Section 3 details the proposed method, including its theoretical framework. Section 4 details the experimental design and implementation specifics. Section 5 presents the experimental results, with a focus on performance metrics and a comparative analysis against conventional approaches. Finally, Section 6 concludes the paper with a summary of contributions and future work.
2. Methods of Estimating Head Orientation
The objective of head orientation estimation is to find the direction in which the energy radiated from a sound source is maximum. When microphones are placed in all directions, we can find the head orientation by selecting the microphone with the maximum observed energy among all microphones. However, the achievable angular resolution is limited, and the number of microphones $M$ is also limited. When the angular resolution is $\Delta\theta$, we have to discriminate $J = 360^{\circ}/\Delta\theta$ kinds of directions. When $M \geq J$, it is easy to estimate accurately. In this paper, we describe head orientation estimation under the condition of $M < J$ (i.e., $M = 6$ and $J = 24$).
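To make this setting concrete, the following sketch (our illustration with hypothetical variable names, not code from the paper) shows the naive loudest-microphone baseline and why it breaks down when $M < J$: with six microphones it can only ever report one of six directions.

```python
import numpy as np

# Problem setting from the text: J = 24 candidate orientations
# (15-degree resolution) but only M = 6 microphones.
J = 24
M = 6
candidate_deg = np.arange(J) * 360.0 / J   # 0, 15, ..., 345 degrees
mic_deg = np.arange(M) * 360.0 / M         # 0, 60, ..., 300 degrees

def naive_orientation(frame_energy):
    """Naive baseline: report the direction of the loudest microphone.

    frame_energy: array of shape (M,) with the energy observed at each
    microphone. With M < J this returns one of only M directions, so the
    effective resolution is 60 degrees here, far coarser than 15 degrees.
    """
    return float(mic_deg[int(np.argmax(frame_energy))])
```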
Figure 1 shows the alignment of the directional sound source and six microphone arrays. The black solid lines, which form a hexagon, show the alignment of the microphones. The gray dashed lines indicate that the directional sound source is located at the coordinate origin for convenience. As in this figure, microphone arrays, rather than single microphones, are often used for head orientation estimation. Although the proposed method is based on six microphones, we compare our method with both microphone-based and microphone-array-based methods.
Conventional head orientation estimation methods include the Oriented Global Coherence Field (OGCF) [10] method and the RAdiation Pattern Matching (RAPM) [11] method. OGCF is a head orientation estimation method that uses distributed microphone arrays without prior knowledge and is known to be among the most accurate estimation methods based on microphone arrays. RAPM, on the other hand, uses distributed microphones, like the proposed method. Because RAPM uses the radiation pattern of the sound source as prior knowledge, it achieves high estimation performance without using microphone arrays. We compare the estimation performance of these two methods and the proposed method and demonstrate the effectiveness of the proposed method. In the following subsections, we explain the conventional methods OGCF and RAPM.
2.1. Oriented Global Coherence Field (OGCF) Method
Oriented Global Coherence Field (OGCF) [10] is an acoustic processing method based on a coherence measure derived from cross-power spectrum phase analysis. Although the cross-power spectrum is typically used for speaker localization, the coherence measure is used here for head orientation estimation with microphone arrays.
We describe classifying $J$ kinds of orientations by using $M$ microphone arrays. The OGCF method estimates the head orientation by maximizing the OGCF value $\mathrm{OGCF}(t, \theta_j)$ with respect to the index $j$ of the hypothesis orientation $\theta_j$. Once $\mathrm{OGCF}(t, \theta_j)$ is calculated, the head orientation is estimated as $\hat{\theta} = \theta_{\hat{j}}$ with $\hat{j} = \arg\max_{j} \mathrm{OGCF}(t, \theta_j)$. Let $\theta_j$ ($j = 1, \ldots, J$) be $J$ equispaced orientations on a circle $C$ centered at the sound source location $S$ with radius $r$, and let $p_m$ ($m = 1, \ldots, M$) be $M$ equispaced points on the circle $C$, where $p_m$ is the location of the $m$-th microphone array and $\theta_j$ is an orientation hypothesized at the point of the sound source $S$.
The OGCF of orientation $\theta_j$ at the point of the sound source $S$ is defined as
$$\mathrm{OGCF}(t, \theta_j) = \sum_{m=1}^{M} w_{j,m} \, \mathrm{GCF}(t, p_m), \qquad \mathrm{GCF}(t, p_m) = \frac{1}{|G|} \sum_{(i,k) \in G} C_{ik}\bigl(t, \tau_{ik}(p_m)\bigr),$$
where $G$ is a set that contains all possible microphone element pairs on a microphone array, and $C_{ik}(t, \tau)$ is the Crosspower Spectrum Phase (CSP [12]) coefficient at time $t$ and lag $\tau$. The lag $\tau_{ik}(p_m)$ is theoretically derived from the distance between $S$ and $p_m$. $\varphi_{j,m}$ is the angle between $\theta_j$ and the direction from $S$ to $p_m$. $w_{j,m}$ is a weight computed from a Gaussian function:
$$w_{j,m} = \exp\!\left(-\frac{\varphi_{j,m}^{2}}{2\sigma^{2}}\right).$$
As a result, the weights $w_{j,m}$ related to the $j$-th orientation emphasize the contributions of the GCFs at points $p_m$ closer to the orientation $\theta_j$ and suppress the contributions of points in the opposite direction. The estimation performance of the OGCF method basically depends on this Gaussian weighting (Equation (4)). When we use Equation (4), we implicitly assume that $M$ is larger than $J$, because unless at least one microphone array is located near the extension line of the head orientation, no GCF contributes to the OGCF.
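As a concrete reading of the above, the following sketch (our illustration; the CSP/GCF computation itself is omitted, and the Gaussian width sigma_deg is an assumed parameter, not a value from the paper) scores each hypothesis orientation by a Gaussian-weighted sum of precomputed GCF values.

```python
import numpy as np

def ogcf_scores(gcf, mic_angles_deg, cand_angles_deg, sigma_deg=30.0):
    """Gaussian-weighted sum of GCF values for each hypothesis orientation.

    gcf:             shape (M,), GCF value at each array point p_m
    mic_angles_deg:  shape (M,), direction of p_m as seen from the source S
    cand_angles_deg: shape (J,), hypothesis orientations theta_j
    Returns shape (J,); the head orientation estimate is the argmax.
    """
    # phi[j, m]: angle between theta_j and the direction to p_m,
    # wrapped into [-180, 180) degrees.
    phi = (cand_angles_deg[:, None] - mic_angles_deg[None, :] + 180.0) % 360.0 - 180.0
    w = np.exp(-(phi ** 2) / (2.0 * sigma_deg ** 2))  # weights w_{j,m}
    return w @ gcf

# Usage: estimate = cand_angles_deg[np.argmax(ogcf_scores(gcf, mic_deg, cand_deg))]
```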
2.2. Radiation Pattern Matching (RAPM) Method
Radiation Pattern Matching (RAPM) [11] is an acoustic processing method based on the similarity between a predefined radiation pattern of the sound source and the observed pattern captured by $M$ distributed microphones. These patterns are $M$-dimensional vectors and consist of the short-time-frame energy $e_m(i)$ at time index $i$ with frame length $N$, defined as
$$e_m(i) = \sum_{n = iN}^{(i+1)N - 1} x_m^{2}(n),$$
where $x_m(n)$ is the time series observed by one microphone element.
We describe classifying $J$ orientations by using $M$ microphones. RAPM estimates the head orientation by maximizing the similarity function $F(\theta_j)$ with respect to the index $j$ of the hypothesis orientation $\theta_j$. Once $F(\theta_j)$ is calculated, the head orientation is estimated as $\hat{\theta} = \theta_{\hat{j}}$ with $\hat{j} = \arg\max_{j} F(\theta_j)$. The similarity function is based on cosine similarity and is defined as
$$F(\theta_j) = \sum_{b \in B} \frac{\mathbf{a}_b(\theta_j)^{T}\, \hat{\mathbf{e}}_b}{\|\mathbf{a}_b(\theta_j)\| \, \|\hat{\mathbf{e}}_b\|},$$
where $B$ is a set of 1/3-octave bands $b$, $\|\cdot\|$ represents the L2 norm, and the superscript $T$ represents the vector transpose. For each band $b$, the predefined radiation pattern and the observed pattern are
$$\mathbf{a}_b(\theta_j) = \bigl[D(b, \theta_j - \phi_1), \ldots, D(b, \theta_j - \phi_M)\bigr]^{T}, \qquad \hat{\mathbf{e}}_b = \bigl[\hat{E}_1(b), \ldots, \hat{E}_M(b)\bigr]^{T}.$$
$\hat{E}_m(b)$ is the short-time power spectral density of each microphone $m$ in the 1/3-octave band $b$, defined as
$$\hat{E}_m(b) = \sum_{f \in F_b} P_m(f),$$
where $P_m(f)$ is the power spectral density calculated from the observed signal captured by the $m$-th microphone, and $F_b$ is a set containing all frequency bins of the 1/3-octave band $b$. $\phi_m$ is the azimuth of the $m$-th microphone as seen from the sound source, and $D(b, \theta)$ is the speech directivity towards microphone $m$ at azimuth $\theta$.
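The following sketch (our illustration; the array shapes and variable names are assumptions, and the computation of the band powers and directivity templates is omitted) shows the RAPM matching step as a sum of per-band cosine similarities.

```python
import numpy as np

def rapm_estimate(obs_power, templates, eps=1e-12):
    """RAPM-style matching: sum of per-band cosine similarities.

    obs_power: shape (B, M), observed band powers E_m(b).
    templates: shape (J, B, M), directivity templates a_b(theta_j).
    Returns the index j of the best-matching hypothesis orientation.
    """
    obs_n = obs_power / (np.linalg.norm(obs_power, axis=-1, keepdims=True) + eps)
    tpl_n = templates / (np.linalg.norm(templates, axis=-1, keepdims=True) + eps)
    # Cosine similarity per (orientation j, band b), summed over bands b.
    sim = np.einsum('jbm,bm->jb', tpl_n, obs_n).sum(axis=-1)
    return int(np.argmax(sim))
```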
3. Head Orientation Estimation Method Using Multi-Frequency Bands (Proposed Method)
The information that the proposed method uses to estimate head orientation is how energy diffuses in space after the sound is radiated. With a directional sound source, the diffused energy is biased toward the front, so if we know the orientation in which the strongest energy is radiated, we can determine the orientation of the source. Therefore, by distributing microphones around the sound source and observing the energy, the orientation of the source can be estimated from the position of the microphone at which the maximum energy is observed. For example, if we want to distinguish 72 different directions, we can easily estimate the orientation by using 72 microphones. However, arranging such a large number of microphones is not realistic, so the number of microphones must be reduced. With fewer microphones, the observing microphone is no longer necessarily placed exactly in front of the sound source, and if no microphone is in front of the source, the orientation must be estimated from nearby microphones. Previous research has generally been evaluated on tasks that distinguish at most four to eight orientations; the history of sound source orientation estimation has thus focused on simple tasks in which the number of orientations to be distinguished is small and a relatively large number of microphones can be used. We aim to achieve high angular resolution under the condition that the number of orientations to be distinguished is 24 while the number of microphones is only 6, that is, to consider sound source orientation estimation when the number of microphones distributed around the source is clearly smaller than the number of orientations to be discriminated. To cope with the resulting reduction in the information obtained from each microphone, we exploit the radiation characteristics of the sound source.
The reason RAPM and other methods were unable to achieve high angular resolution is that they did not use band-splitting processing. In general, the radiation characteristics of a sound source become sharper as the frequency increases, emitting strong energy toward the front, and duller as the frequency decreases, emitting energy uniformly in all directions. Since the key clue for orientation estimation is the bias in the spread of energy, higher estimation accuracy can be expected when higher frequency bands are used. On the other hand, when estimating the strongest radiation direction of a signal whose energy varies with frequency, such as voice, the signal-to-noise ratio also differs across frequency bands because of noise and reverberation. We therefore expected that estimation accuracy could be improved by combining bands with high sound source energy, bands with little fluctuation, and bands less affected by noise and reverberation. The problem with conventional methods, which process a single continuous frequency range, is that it is difficult to select only the frequency bands that are effective for estimation. We therefore propose a method that uses multiple bands and integrates the per-band similarities afterwards, which makes it possible to achieve higher estimation accuracy at a higher resolution than before.
Our method is similar to RAPM [11] in that it also uses a radiation pattern. One difference between the methods is the cost function: ours is the Euclidean distance based on the power spectrum. The proposed method therefore estimates the head orientation by minimizing the Euclidean distance with respect to the hypothesis orientation $\theta_j$, that is, $\hat{\theta} = \theta_{\hat{j}}$ with $\hat{j} = \arg\min_{j} C(\theta_j)$. The cost function of the proposed method is defined as
$$C(\theta_j) = \sum_{b \in B} \bigl\| \bar{\mathbf{a}}_b(\theta_j) - \bar{\mathbf{p}}_b \bigr\|,$$
where $\bar{\mathbf{a}}_b(\theta_j)$ and $\bar{\mathbf{p}}_b$ are the radiation-pattern vector and the observed power vector in band $b$ after mean subtraction, i.e., each vector minus the mean of its $M$ elements.
The cost function of RAPM is designed to measure a similarity between radiation pattern and observed pattern. As RAPM focuses on shape of energy patterns, the cost function is based on cosine similarity. One of the advantages is that the cost is independent of observed energy. However, this is one of the disadvantages since RAPM completely ignores amount of energy.
The cost function of the proposed method is designed to measure the similarity between the radiation pattern and the observed pattern based on both the shape of the energy patterns and the amount of energy. First, we use power instead of energy as the feature vector. Then we use the Euclidean distance to account for the amount of power. If we simply used the Euclidean distance, a range mismatch between the radiation vector and the observed vector would deteriorate the cost even when the shapes of the vectors are similar. To avoid this problem, we apply mean subtraction to the vectors before measuring the Euclidean distance.
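A minimal sketch of this criterion, assuming the same band-power and template representation as in the RAPM sketch above (our illustration, not the authors' implementation):

```python
import numpy as np

def proposed_cost(obs_power, templates):
    """Proposed criterion: per-band mean subtraction, then Euclidean distance.

    obs_power: shape (B, M), observed power per 1/3-octave band.
    templates: shape (J, B, M), radiation-pattern templates.
    Returns shape (J,) costs; the estimate is the theta_j with minimum cost.
    """
    # Mean subtraction over the M microphones normalizes the gain
    # difference between template and observation in each band.
    obs_c = obs_power - obs_power.mean(axis=-1, keepdims=True)
    tpl_c = templates - templates.mean(axis=-1, keepdims=True)
    # Per-band Euclidean distance, fused by summation across bands.
    return np.linalg.norm(tpl_c - obs_c[None], axis=-1).sum(axis=-1)

# Usage: j_hat = int(np.argmin(proposed_cost(obs_power, templates)))
```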
3.1. Technique of Integrating Multiple Frequency Bands
In both the proposed method and RAPM, a set of 1/3-octave bands is used. The set B usually contains continuous bands from the lowest band to the highest. However, the contribution to head orientation estimation differs among frequency bands, so the bands included in the set B need not be continuous 1/3-octave bands.
The radiation pattern depends on the frequency band: in some bands it is sharp, in others it is dull. We therefore allow both continuous and non-continuous bands in the set B. To find the best estimation performance, we try all possible combinations of the bands, as in the sketch below.
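A minimal sketch of this exhaustive combination search (our illustration; the error-evaluation callback is application-specific and assumed):

```python
from itertools import combinations

def best_band_subset(band_ids, error_of_subset):
    """Exhaustive search over all non-empty band subsets.

    band_ids:        e.g. [0, 1, 2, 3] for four 1/3-octave bands
                     (15 non-empty subsets in total).
    error_of_subset: callable mapping a subset to its estimation error,
                     e.g. the mean absolute error on evaluation data.
    """
    subsets = [s for r in range(1, len(band_ids) + 1)
               for s in combinations(band_ids, r)]
    return min(subsets, key=error_of_subset)
```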
3.2. Contribution of the Proposed Method
Our method is positioned as a solution similar in approach to RAPM in that it uses the radiation characteristics of the sound source as a template prepared in advance. There are two main differences. First, although both methods estimate head orientation from the match between observed and template patterns, they differ in the matching criterion: the proposed method measures the similarity after normalizing the gain difference between the template and the observation, whereas RAPM has no gain normalization mechanism.
Second, the proposed method uses a multi-band matching procedure, whereas RAPM uses a single-band matching procedure. Although the original RAPM band could be interpreted as a concatenation of several bands, it is assumed to be a continuous region.
In contrast, the proposed method divides the spectrum into multiple bands, calculates the similarity in each band with gain normalization, and then fuses the per-band similarities into the final similarity. This makes it possible to fuse multiple discontinuous bands. In other words, the proposed method clearly differs from RAPM in that it performs template matching in multiple bands and post-fuses the per-band similarities, aiming to improve estimation accuracy.
In the following experimental section, we verify the effects of per-band gain normalization, per-band template matching, and fusion of the similarities of multiple bands.
5. Results and Discussion
We performed head orientation estimation on the audio files in the evaluation database using three methods, namely OGCF, RAPM, and the proposed method, where the proposed method here uses the four 1/3-octave bands as a single continuous set. All methods use the same frame length and a frame shift of 512 sample points.
Figure 3, Figure 4 and Figure 5 show boxplots of the distribution of estimation errors. The horizontal axis represents the speaker's orientation, i.e., the correct direction. The vertical axis represents the estimation error. The orange bar represents the second quartile (median), and the top and bottom of the black box represent the third and first quartiles. Note that in some boxplots all the estimation results take the same value, so the box does not spread; it collapses and appears as a single orange line.
Figure 3 shows that the distribution of the estimation errors tends to differ depending on the sound source direction. The second quartile varies considerably with the direction of the sound source, and there is more variability than in RAPM (Figure 4) and the proposed method (Figure 5). This is thought to be due to the effect of Equation (4): in this experiment there were $J = 24$ variations in the sound source direction, but the number of microphones was only six ($M < J$).
The distributions of the estimation errors for the RAPM method (Figure 4) and the proposed method (Figure 5) are narrower than that for the OGCF method. The second quartile stays near zero, and stable results independent of the direction are obtained. RAPM and the proposed method are superior to OGCF, but no significant differences are observed between RAPM and the proposed method.
Next, the error distributions of OGCF, RAPM, and the proposed method are compared from another viewpoint. The differences among these methods are shown in Table 1. We show the error distributions as histograms in Figure 6. The horizontal axis represents the estimation error, and the vertical axis represents the frequency. Since the 2400 audio files consist of 427,104 frames, the total number of estimations is 427,104. We also compared derivative methods that apply multiple frequency bands to the proposed method. The multi-band proposed method uses the three-band combination that is the best according to the mean errors and mean absolute errors among all possible combinations in Table 2.
We also show the original version of RAPM. Although we applied multiple frequency bands to RAPM, no positive effect was obtained (see Table 2); therefore, Figure 6 shows the original RAPM. In addition, we show a single-band version of the proposed method that uses the four 1/3-octave bands as one continuous set, without the multiple frequency band technique; this version serves as the direct counterpart of the original RAPM.
Blue, orange, green, and red bars represent the results of OGCF, the original RAPM, the single-band proposed method, and the multi-band proposed method, respectively.
The estimation errors of OGCF are distributed around several distinct values, whereas the errors of RAPM are concentrated in a narrower range around the mode (that is, the most frequent bin), indicating that RAPM is superior to OGCF in estimating head orientation. The single-band proposed method shows a distribution similar to that of the original RAPM, which suggests that the estimation accuracies of the two are almost equivalent.
The error distribution of the multi-band proposed method is the best, as it is the most concentrated around zero error. By applying the multiple frequency band technique, the estimation becomes more accurate.
Evaluation of Processing Based on Multi-Frequency Bands
The frequency bands used for this evaluation are the four 1/3-octave bands introduced above. There are a total of 15 combinations in the multi-frequency band processing, and we compared all of them: 4 single-band combinations, 6 two-band combinations, 4 three-band combinations, and 1 four-band combination. The mean absolute error is used for the evaluation, because comparison using the mean error is inappropriate: positive and negative estimation errors cancel each other out, making the mean error misleadingly small.
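The following sketch (our illustration; wrapping errors into [-180, 180) degrees is an assumption about the error convention) shows the evaluation metric and why the absolute value matters:

```python
import numpy as np

def wrapped_error_deg(est_deg, true_deg):
    """Signed angular error wrapped into [-180, 180) degrees."""
    return (np.asarray(est_deg) - np.asarray(true_deg) + 180.0) % 360.0 - 180.0

def mean_absolute_error_deg(est_deg, true_deg):
    # Taking the absolute value before averaging prevents positive and
    # negative errors from cancelling, which would make the plain mean
    # error misleadingly small.
    return float(np.mean(np.abs(wrapped_error_deg(est_deg, true_deg))))
```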
Table 2 shows the comparison results. For reference, the mean error is also listed in Table 2.
It was shown that the multiple frequency band technique had no effect on RAPM. On the other hand, when the technique is applied to the proposed method, it is experimentally shown that the most accurate combination consists of three of the four frequency bands.
The proposed method and RAPM are both based on pattern matching between radiation characteristic templates and observed patterns, so the accuracy of orientation estimation is determined by how precisely the similarity of the patterns can be evaluated. The essential difference between the proposed method and RAPM is the criterion used during matching. The proposed method applies mean subtraction to the template and the observed patterns, so it realizes pattern matching against the radiation characteristics that is not affected by the gain of the observed pattern. Since mean subtraction is applied to each band, precise matching is possible for each band.
Next, further improvement in estimation accuracy can be expected by solving the problem of selecting a combination of bands effective for orientation estimation. As this paper does not propose a method for finding the optimal combination, we tried all combinations to find one. Through this experiment, we showed that dividing the spectrum into several bands and selecting appropriate bands contributes to improving head orientation estimation, although the optimal band division and selection may vary depending on the sound source. In this paper, four 1/3-octave bands are selected as the multiple frequency bands because the energy of the human speech signal is concentrated at these frequencies. We believe that methods for automatically dividing the spectrum and selecting the optimal bands according to the characteristics of the sound source will be needed in the future.
On the other hand, RAPM has no mechanism to normalize the gain difference between the radiation characteristic template and the observed sound. Therefore, even if the band is divided, the matching does not necessarily become more precise than before division; we believe this is why the band-division process was not sufficiently effective for RAPM and why its estimation accuracy was inferior to that of the proposed method.
6. Conclusions
We proposed a head orientation estimation method based on minimizing the Euclidean distance between mean-subtracted power spectra, together with a multiple frequency band technique. The mean absolute error achieved by our method with the multiple frequency band technique is sufficiently small for the dialogue activity monitoring system that we aim to realize. Even without the technique, the proposed method reduces the mean absolute error compared to the original RAPM method, and applying the technique with the best three-band combination reduces the error further. We experimentally demonstrated that the multiple frequency band technique is effective for the proposed method but not for RAPM.
Future work includes an evaluation of head orientation estimation when the sound source is not at the center of the six microphones, and the integration of sound source localization and head orientation estimation. There is also an urgent need to support multi-tracking of each speaker in the dialogue activity monitoring system. Improvements in the elemental technologies are necessary as well; measures against noise and reverberation are particularly important.
The proposed method directly evaluates the energy pattern of the multi-channel acoustic signal observed by the six microphones and is therefore likely to be affected by noise and reverberation. Although it is important to evaluate robustness against noise and reverberation, in this study we evaluated head orientation estimation in the specific environment assumed by the application. However, under conditions where Gaussian noise is constantly observed at all microphones, experiments using four or eight distributed microphones have confirmed that changes in the SNR have no significant effect on estimation accuracy, as described in [17]. In other words, the proposed method is robust to stationary noise. Extending our method to the case where directional sound sources interfere is a topic for future research.
By fusing other sensor data, such as visual data, further improvement in the accuracy of head orientation estimation can be expected. Audio signals have the advantage of not being affected by occlusion, and camera images have the advantage of not being affected by acoustic noise; estimation using audio signals and estimation using camera images are thus complementary. We therefore believe that improving estimation accuracy using audio signals remains essential.