
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license.

This paper addresses the problem of three-dimensional speaker orientation estimation in a smart-room environment equipped with microphone arrays. A Bayesian approach is proposed to jointly track the location and orientation of an active speaker. The main motivation is that the knowledge of the speaker orientation may yield an increased localization performance and

In recent years, significant research efforts have been focused on developing human-computer interfaces in intelligent environments that aim to support human tasks and activities. Knowledge of the position and the orientation of the speakers present in a room constitutes valuable information that allows for a better understanding of user activities and human interactions in those environments, such as the analysis of group dynamics or behaviors, identifying the active speaker among those present or determining who is talking to whom. In general, it can be expected that knowledge about the orientation of human speakers would permit the improvement of speech technologies that are commonly deployed in smart-rooms. For instance, an enhanced microphone network management strategy for microphone selection can be developed based on both speaker position and orientation cues.

Very few methods have been proposed to solve the problem of speaker localization and speaker orientation estimation from acoustic signals. They differ mainly in how they approach the problem and can be coarsely classified into two groups. The first group treats localization and orientation estimation as two separate and independent problems, working as a two-step algorithm: first, the speaker is located, and then, the head orientation is estimated [

The second group of approaches [

No previous work has been found that tackles the task of three-dimensional (3D) speaker orientation estimation with microphone arrays. This can be attributed to the fact that most smart environments have the microphones placed in nearly the same plane in order to maximize the localization performance in the

In this paper, a Bayesian approach is proposed to jointly track the location and orientation of a speaker. The main motivation is that the knowledge of the speaker orientation may yield an increased localization performance and

The remainder of this paper is organized as follows. In Section 2, the head rotation representation is described. Section 3 introduces the speaker localization and orientation estimation algorithms as a two-step approach. Section 4 presents an alternative two-step algorithm employing a PF at each step. Section 5 describes the joint PF. Sections 6 and 7 show the experiments and results. Finally, Section 8 gives the conclusions.

The parametrization of the head rotation in this work is based on the decomposition into Euler angles (

Rotate by angle ψ (pan) around the z axis.

Rotate by angle θ (tilt) around the once-rotated y axis.

Rotate by angle φ (roll) around the twice-rotated x axis.

The rotation matrix,

The Euler angles thus define the overall rotation as the composition of the basic rotations R_{z}, R_{y} and R_{x} described above.
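As an illustration, the composition of the three basic rotations can be sketched as follows (a minimal sketch, assuming the Z-Y-X axis order suggested by the text and angles in radians; the function and argument names are hypothetical):

```python
import numpy as np

def euler_to_rotation(pan, tilt, roll):
    """Compose a rotation matrix from Euler angles (radians).

    Assumes the Z-Y-X convention sketched in the text: rotate about z
    by `pan`, then about the rotated y axis by `tilt`, then about the
    rotated x axis by `roll`. The axis order is an illustrative choice.
    """
    cz, sz = np.cos(pan), np.sin(pan)
    cy, sy = np.cos(tilt), np.sin(tilt)
    cx, sx = np.cos(roll), np.sin(roll)
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    return Rz @ Ry @ Rx  # composed rotation, an orthonormal 3x3 matrix
```

A 90° pan, for example, maps the x axis onto the y axis, which is a quick sanity check of the convention.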

The two-step algorithm to estimate the location and the orientation of speakers is based on the work presented in [

In a multi-microphone environment, one of the observable cues with positional information most commonly used in acoustic localization algorithms is the time difference of arrival (TDOA) of the signal between microphone pairs. Consider a smart-room provided with a set of microphones, each at a known position in ℝ³ space. Then, the time difference of arrival, τ_{p,i,j}, of a hypothetical acoustic source located at p between microphones m_{i} and m_{j}
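Under the usual free-field propagation assumption, the theoretical TDOA follows from the difference of source-to-microphone distances divided by the speed of sound; a minimal sketch (the speed-of-sound value and the function name are illustrative assumptions):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumed value)

def theoretical_tdoa(p, mic_i, mic_j, c=SPEED_OF_SOUND):
    """TDOA (seconds) of a source at position `p` between microphones
    `mic_i` and `mic_j`; all arguments are 3D positions in meters."""
    p, mic_i, mic_j = map(np.asarray, (p, mic_i, mic_j))
    # Positive when the source is farther from mic_i than from mic_j.
    return (np.linalg.norm(p - mic_i) - np.linalg.norm(p - mic_j)) / c
```

A source equidistant from both microphones yields a zero TDOA, as expected.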

The cross-correlation function is well known as a measure of the similarity between signals for any given time displacement, and ideally, it should exhibit a prominent peak in correspondence to the delay between the pair of signals [. In this work, the cross-correlation of each microphone pair, (i, j), is PHAT-weighted and band-limited between frequencies f_{1} and f_{2}, with a unitary value for frequencies f_{1} ≤ |f| ≤ f_{2}, and zero otherwise.

The estimation of the TDOA for each microphone pair is computed as follows:
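A common way to implement this estimation is to locate the peak of the PHAT-weighted generalized cross-correlation; a sketch of that standard computation (the function name and the `max_delay` search-range handling are assumptions, and the sign convention may differ from the paper's):

```python
import numpy as np

def gcc_phat(sig_i, sig_j, fs, max_delay=None):
    """Estimate the TDOA between two signals via the PHAT-weighted
    generalized cross-correlation; returns the delay in seconds."""
    n = len(sig_i) + len(sig_j)          # zero-pad to avoid circular effects
    X = np.fft.rfft(sig_i, n) * np.conj(np.fft.rfft(sig_j, n))
    X /= np.maximum(np.abs(X), 1e-12)    # PHAT weighting: keep phase only
    cc = np.fft.irfft(X, n)
    max_shift = n // 2 if max_delay is None else int(max_delay * fs)
    # Rearrange so index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

With this convention, a signal at microphone j delayed by D samples with respect to microphone i yields an estimate of −D/fs.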

The contributions of each microphone pair can be combined to derive a single estimation of the source position. However, in the general case, the availability of multiple TDOA estimations leads to a minimization of an over-determined and non-linear error function. A very efficient approach is the SRP-PHAT or global coherence field introduced in [

The basic operation of the SRP-PHAT algorithm consists of exploring the three-dimensional (3D) space, searching for the maximum of the global contribution of the PHAT-weighted generalized cross-correlations (GCC-PHAT) of all the microphone pairs. The 3D room space is quantized into a set of positions with a typical separation of 5–10 cm. The theoretical TDOAs, τ_{p,i,j}, from each exploration position to each microphone pair are precomputed and stored.

The set of GCC-PHAT functions are combined to create a spatial likelihood function (SLF)

The estimated acoustic source location is the position of the quantized space that maximizes the contribution of the GCC-PHAT of all microphone pairs:
The TDOA of each microphone pair, τ_{p̂,i,j}, is then estimated using the obtained location.
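The exploration described above can be sketched as a brute-force grid search (the data layout for the precomputed GCC-PHAT functions, indexed by integer lag, is a hypothetical choice for illustration):

```python
import numpy as np
from itertools import combinations

def srp_phat_search(grid, mics, gcc, fs, c=343.0):
    """Pick the grid position maximizing the summed GCC-PHAT values.

    grid : (Q, 3) candidate positions; mics : (M, 3) microphone
    positions; gcc[(i, j)] : PHAT-weighted cross-correlation of pair
    (i, j), indexable by integer lag (negative lags wrap, as in an
    inverse-FFT layout). This data layout is an illustrative assumption.
    """
    best_q, best_score = 0, -np.inf
    for q, p in enumerate(grid):
        score = 0.0
        for i, j in combinations(range(len(mics)), 2):
            # Theoretical TDOA from candidate position to the pair.
            tau = (np.linalg.norm(p - mics[i]) - np.linalg.norm(p - mics[j])) / c
            score += gcc[(i, j)][int(round(tau * fs))]
        if score > best_score:
            best_q, best_score = q, score
    return grid[best_q]
```

In practice, the per-position TDOAs would be precomputed once, as the text describes, rather than recomputed inside the loop.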

The orientational cues used in this work are based on GCC-PHAT averaged peak (GCC-PHAT-A), described in [_{p̂,ij} is the delay in samples and

Basically, the GCC-PHAT-A measure reduces to the sum of the energy of the band-filtered PHAT-weighted cross-correlation around the estimated TDOA, and essentially, it measures the proportion of the signal between frequencies f_{1} and f_{2} that contributes to the main peak in the localization. It is also important to note that this measure is commensurable across all microphone pairs, independently of microphone gains, due to the PHAT weighting, and, therefore, constitutes a valuable orientational feature.

In order to estimate the orientation of a speaker based on the GCC-PHAT-based orientational measures, a simple vectorial method is employed, similar to that described in [_{ij}_{ij}

where c_{min} and c_{max} denote the minimum and maximum values of the orientational measures across microphone pairs, used to normalize each measure, c_{ij}, to the [0, 1] range.

The sum of the vectors formed by all the orientational measures of each microphone pair is considered the estimated head direction, _{sum}

The estimated head orientation angle, _{sum}
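The vectorial method can be sketched as follows (a minimal 2D sketch; representing each microphone pair by its midpoint and the exact normalization are assumptions for illustration):

```python
import numpy as np

def orientation_from_measures(speaker_pos, pair_midpoints, measures):
    """Vector-sum head-orientation estimate in the horizontal plane.

    Each microphone pair contributes a unit vector pointing from the
    speaker towards the pair (its midpoint, here), scaled by the pair's
    normalized GCC-PHAT-A measure; the angle of the summed vector,
    d_sum, is the estimated pan angle.
    """
    d_sum = np.zeros(2)
    for mid, m in zip(pair_midpoints, measures):
        v = np.asarray(mid[:2], float) - np.asarray(speaker_pos[:2], float)
        n = np.linalg.norm(v)
        if n > 0:
            d_sum += m * v / n
    return np.arctan2(d_sum[1], d_sum[0])  # estimated pan angle (radians)
```

Pairs the speaker faces receive larger measures and therefore dominate the sum, pulling the estimate towards the facing direction.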

In this section, a two-step approach to estimate the location and orientation of the speaker is proposed, employing a particle filter in each stage, which is introduced here to enable a fair comparison with the joint particle filter approach.

The concept of tracking can be defined as the recursive estimation of the hidden state of a target based on the partial observations at every time instant. Assuming that the evolution of the state sequence is defined by a first-order Markov process, the dynamics of the state can be described by the transition equation x_{k} = f_{k}(x_{k−1}, v_{k−1}), where f_{k} is a possibly non-linear function of the previous state, x_{k−1}, and an independent and identically distributed (i.i.d.) process noise, v_{k−1}. At every time instant, k, a measurement, z_{k}, related to the state, x_{k}, becomes available through the observation equation z_{k} = h_{k}(x_{k}, n_{k}), where h_{k} is a possibly non-linear function and n_{k} is an i.i.d. measurement noise.

Tracking aims to estimate x_{k} based on the set of all available measurements, z_{1:k}, up to time k; that is, to recursively construct the posterior pdf, p(x_{k}|z_{1:k}). This is done in two stages: prediction and update.

In the prediction step, the prior pdf, p(x_{k}|z_{1:k−1}), is obtained making use of the Chapman-Kolmogorov equation and the transition density, p(x_{k}|x_{k−1}), which is derived from the transition equation.

In the update stage, the new measurement, z_{k}, is used to update the prior pdf via Bayes' rule, yielding the posterior, p(x_{k}|z_{1:k}).

Particle filters (PF) [

Let {x_{k}^{i}, w_{k}^{i}}, i = 1, …, N_{s}, denote a set of N_{s} particles and associated weights characterizing the posterior pdf, p(x_{k}|z_{1:k}).

Considering that the samples,

In the literature regarding other domains, some techniques aim at constructing efficient importance density functions through Markov Chain Monte Carlo methods [

A common problem with the PF is the degeneracy phenomenon, where, after a few iterations, all the weight concentrates in just one particle, and the rest of the particles have an almost zero contribution to the approximation of the posterior. A measure of the degeneracy of the PF is the effective sample size, which can be estimated as N̂_{eff} = 1/∑_{i}(w_{k}^{i})² and is upper-bounded by the number of particles, N_{s}.

The best estimation of the state at time k is commonly obtained as the weighted mean of the particles, x̂_{k} = ∑_{i} w_{k}^{i} x_{k}^{i}.
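A generic sampling-importance-resampling (SIR) step matching the description above can be sketched as follows (the resampling threshold and the multinomial resampling scheme are common choices, not necessarily the paper's):

```python
import numpy as np

def sir_step(particles, weights, transition, likelihood, z, rng,
             resample_ratio=0.5):
    """One SIR particle-filter step: predict, weight, resample.

    `transition(particles, rng)` draws from p(x_k | x_{k-1}) used as
    the importance density, and `likelihood(particles, z)` evaluates
    p(z_k | x_k); both are user-supplied models.
    """
    particles = transition(particles, rng)        # prediction
    weights = weights * likelihood(particles, z)  # weight update
    weights /= np.sum(weights)
    # Effective sample size: resample when degeneracy is detected.
    n_eff = 1.0 / np.sum(weights ** 2)
    n = len(weights)
    if n_eff < resample_ratio * n:
        idx = rng.choice(n, size=n, p=weights)    # multinomial resampling
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    estimate = np.average(particles, axis=0, weights=weights)
    return particles, weights, estimate
```

With a Gaussian likelihood around a static 1D target, the particle cloud collapses onto the target after a few iterations.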

The design parameters of the PF are the state model, the dynamical model and the observational model, which are defined in the following sections.

A common approach is to characterize the human movement dynamics as a Langevin process. The state is composed of the position of the speaker, p_{k} = [x_{k} y_{k} z_{k}]^{T}, and its velocity, ṗ_{k} = [ẋ_{k} ẏ_{k} ż_{k}]^{T}.

For the sake of simplicity, consider the Langevin process in the x dimension (the y and z dimensions are treated analogously), with the process noise, v_{k−1}, characterized as a zero-mean Gaussian noise variable with covariance matrix Σ_{k−1}:
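One common discretization of the Langevin model is sketched below (the rate and steady-state velocity parameter values are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def langevin_step(pos, vel, dt, rng, beta=10.0, v_bar=1.0):
    """One step of a per-dimension Langevin motion model (a sketch;
    beta [1/s] and v_bar [m/s] are illustrative parameter values).

    a = exp(-beta*dt) damps the velocity and b = v_bar*sqrt(1 - a**2)
    injects zero-mean Gaussian noise, so the stationary velocity
    standard deviation equals v_bar.
    """
    a = np.exp(-beta * dt)
    b = v_bar * np.sqrt(1.0 - a ** 2)
    vel = a * vel + b * rng.standard_normal(np.shape(vel))
    pos = pos + dt * vel
    return pos, vel
```

Simulating many steps confirms that the velocity process settles around the target speed v_bar.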

The particle filter approach requires the definition of the likelihood function,
_{k}

In this work, the localization likelihood, p_{k,loc}, is derived from a spatial likelihood function evaluated at the particle position, x_{k}.

The likelihood function, _{k,loc}_{k}

The state vector of the particle filter used to estimate the orientation consists only of the pan angle and the dynamical model as follows:

The state head direction vector in 3D space is v_{k} = [cos ψ_{k} sin ψ_{k} 0]^{T}.

The orientation likelihood is obtained from the GCC-PHAT averaged peak features described in Section 3.2. A vector, v_{n}, is created from the estimated speaker's position, p̂_{k}, towards each microphone pair, n, assigning its magnitude, |v_{n}|, the normalized orientational measure of the microphone pair, as defined in Section 3.2.2. The orientation observation is formed by the resulting sum vector, d_{sum}, of the vectors v_{n}, and the orientation likelihood, p_{k,ori}, is defined as follows:

The scalar product of the two unitary vectors is scaled into the range [0, 1] to better resemble a likelihood function. The exponent, _{sum}_{sum}
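A minimal sketch of this scaling (the exponent value is an illustrative assumption, since the paper's exact value is not shown here):

```python
import numpy as np

def orientation_likelihood(v_particle, d_sum, gamma=2.0):
    """Likelihood of a particle's head-direction hypothesis.

    Scales the scalar product of the two unit vectors from [-1, 1]
    into [0, 1] and raises it to an exponent that sharpens the
    function around the observed direction (gamma is an assumed value).
    """
    v = np.asarray(v_particle, float)
    d = np.asarray(d_sum, float)
    v = v / np.linalg.norm(v)
    d = d / np.linalg.norm(d)
    return ((1.0 + np.dot(v, d)) / 2.0) ** gamma
```

Aligned directions score 1, opposite directions score 0, and intermediate angles fall off smoothly in between.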

In this work, a particle filter approach is proposed to jointly track the location and orientation of a speaker. The main motivation is that the knowledge of the speaker orientation may yield an increased localization performance and

The state of the particles is composed of the position of the center of the speaker's head, p_{k} = [x_{k} y_{k} z_{k}]^{T}, and its velocity, ṗ_{k} = [ẋ_{k} ẏ_{k} ż_{k}]^{T}.

The estimation of the position of the speaker's mouth _{k}

The state head direction vector in the 3D space, v_{k}, is defined by the pan and tilt angles, ψ_{k} and θ_{k}.

Similarly to Section 4.2.1, a

The random variable driving the horizontal dynamics is characterized by different standard deviations along the forward (σ_{forward}) and sideways (σ_{sideway}) directions of the speaker's motion, while the vertical component of v_{k} is driven by a noise with standard deviation σ_{z}.

In this work, the horizontal orientation angle of the speaker is assumed to be dependent on his velocity: the faster the person moves, the more probable it is that the person is looking in his moving direction. This is modeled by predicting the next state head direction as the weighted sum of the current state head direction vector in the horizontal plane and the moving direction, with a weight, λ_{d}, that grows with the speed up to a maximum speed, v_{max}, plus a zero-mean Gaussian angular noise with standard deviation σ_{ψ} applied to the components of v_{k−1}.
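This velocity-dependent heading prediction can be sketched as follows (the maximum speed value and the linear blending weight are illustrative assumptions):

```python
import numpy as np

def predict_heading(prev_dir, velocity, v_max=1.5):
    """Blend the previous horizontal heading with the motion direction.

    The faster the speaker moves, the more the predicted heading is
    pulled towards the moving direction; v_max is an assumed speed at
    which the motion direction dominates completely.
    """
    speed = np.linalg.norm(np.asarray(velocity[:2], float))
    lam = min(speed / v_max, 1.0)          # blending weight in [0, 1]
    d = np.asarray(prev_dir[:2], float)
    if speed > 0:
        d = (1.0 - lam) * d + lam * np.asarray(velocity[:2], float) / speed
    n = np.linalg.norm(d)
    return d / n if n > 0 else d           # unit horizontal heading
```

A stationary speaker keeps the previous heading, while a fast-moving one is predicted to face the direction of motion; in the filter, Gaussian angular noise would be added on top of this prediction.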

Finally, the next state

The dynamics of the tilt angle, θ_{k}, are defined analogously, with a zero-mean Gaussian noise of standard deviation σ_{θ} driving the vertical (z) component of v_{k}.

The observation likelihood, p(z_{k}|x_{k}), is defined as the product of the localization likelihood, p_{k,loc}, and the orientation likelihood, p_{k,ori}, evaluated at the particle state, x_{k}.

The joint PF tracker performance will be compared with the two two-step algorithms introduced in Sections 3.2.2 and 4 in the task of estimating the position and orientation of the speaker's head. Since the two-step approaches are only able to estimate the horizontal orientation angle, the pitch and roll hypotheses are set to zero for all time frames. The comparison with the two-step PF approach confirms that the performance increase obtained by the joint method is due to the joint dynamic and observation models and not to the filtering itself.

The performance of the proposed head orientation estimation algorithm was evaluated using a purposely recorded database collected in the smart-room at the Universitat Politècnica de Catalunya. It is a meeting room equipped with several multimodal sensors, such as microphone arrays, table-top microphones and fixed or pan-tilt-zoom video cameras. The room dimensions are 3,966 × 5,245 × 4,000 mm, which correspond to the

The database is composed of a single-person dataset, involving the recording of multi-microphone audio, multi-camera video and IMU data for seven people moving freely in a smart room while speaking most of the time, and a multi-person dataset, consisting of the recording of a group discussion with four participants. Only the single-person dataset will be considered in this work, since it is oriented toward the person tracking task, while the multi-person dataset is oriented toward the group analysis task. A sample of the database is shown in

The ground truth provided by the database consists of annotations of the center of the head and the Euler rotation angles of every participant. The center of the head was obtained automatically by means of a multi-camera video PF tracker, and the Euler orientation angles were acquired by an inertial measurement unit (IMU) equipped with accelerometers, magnetometers and gyroscopes in the three axes.

The metrics proposed in [

In [

This is the accuracy of the tracker when it comes to keeping correct correspondences over time, estimating the number of people, recovering tracks,

A-MOTA = 1 − (∑_{k}(m_{k} + fp_{k} + mme_{k})) / (∑_{k} g_{k}), where m_{k}, fp_{k} and mme_{k} are the numbers of misses, false positives and mismatches at time k, and g_{k} is the number of ground truth objects at time k.

This is the precision of the tracker when it comes to determining the exact position of a tracked person in the room.

MOTP = (∑_{i,k} d_{i,k}) / (∑_{k} c_{k}), where d_{i,k} is the distance between matched hypothesis i and its ground truth position at time k, and c_{k} is the number of matches at time k.

This is the precision of the tracker when it comes to determining the exact orientation of a tracked person in the room. It is the Euclidean angle error for matched

MHOTP = (∑_{i,k} ε_{i,k}) / (∑_{k} c_{k}), where ε_{i,k} is the angle between the estimated and the ground truth head direction vectors of matched hypothesis i at time k.

Experiments were conducted over the cited database to compare the performance of the joint PF tracker and the two two-step approaches. A tight relationship between the tracking accuracy (MOTA) and the precisions (MOTP and MHOTP) has been observed in the three algorithms: a localization and orientation hypothesis is output only when the confidence of the algorithm is above a threshold, so a high precision can be achieved at the expense of tracking accuracy, and

The overall precision of the estimation of 3D direction of the head in relation to the tracking accuracy is shown in

A PF approach for joint head position and 3D orientation estimation has been presented in this article. Experiments conducted over a purposely recorded database with Euler angle and head center annotations for seven different people in a smart room showed an increased performance of the joint PF approach in relation to two two-step algorithms that first estimate the position and then the orientation of the speaker. Both two-step approaches have a very similar angle estimation error, with a small increase in localization precision (MOTP) for the two-step PF. The proposed joint algorithm outperforms both two-step algorithms in terms of localization precision and orientation angle precision (MHOTP), confirming the superiority of the joint approach. Furthermore, by means of the definition of a joint dynamical model, part of the elevation angle of the head is inferred by the algorithm. Future work will be devoted to extending the joint PF to track multiple speakers and to studying the fusion with video approaches with a focus on 3D orientation estimation.

This work has been partially funded by the Spanish project SARAI (TEC2010-21040-C02-01).

The authors declare no conflict of interest.

Euler angles, basic head rotations.

Smart-room sensor setup used in this database, with five cameras (Cam1–Cam5) and six T-shaped microphone clusters (T-Cluster 1–6).

Single person dataset snapshots, with superposed head position and rotation annotations.

Curve of all possible tracking accuracy (acoustic multiple object tracking accuracy (A-MOTA)), localization tracking precision (multiple object tracking precision (MOTP)) (a) and 3D orientation angle precision (multiple head orientation tracking precision (MHOTP)) (b) results, employing a sweep threshold parameter on the algorithm confidence.

Curve of all possible tracking accuracy (A-MOTA), horizontal orientation angle precision (MHOTP_{ψ}) (a) and tilt orientation angle precision (MHOTP_{θ}) (b) results, employing a sweep threshold parameter on the algorithm confidence.

Tracking performance of the joint and two-step approaches for an A-MOTA working point of 10%. PF, particle filter.

System | MOTP | MHOTP | MHOTP_{ψ} | MHOTP_{θ}
---|---|---|---|---
2-Step | 133.94 mm | 17.76° | 11.27° | 10.63°
2-Step PF | 125.58 mm | 17.84° | 11.53° | 10.53°
Joint PF | 95.30 mm | 16.06° | 10.38° | 9.39°

Tracking performance of the joint and two-step approaches for an A-MOTA working point of 75%.

System | MOTP | MHOTP | MHOTP_{ψ} | MHOTP_{θ}
---|---|---|---|---
2-Step | 140.86 mm | 28.04° | 22.64° | 11.54°
2-Step PF | 136.62 mm | 26.25° | 21.01° | 11.58°
Joint PF | 122.67 mm | 25.08° | 19.60° | 11.30°