Audiovisual Tracking of Multiple Speakers in Smart Spaces

Sanabria-Macias, Frank; Marron-Romera, Marta; Macias-Guarasa, Javier

doi:10.3390/s23156969

Open AccessArticle

Audiovisual Tracking of Multiple Speakers in Smart Spaces

by

Frank Sanabria-Macias

,

Marta Marron-Romera

and

Javier Macias-Guarasa

^*

Universidad de Alcalá, Department of Electronics, Engineering School, Campus Universitario, 28805 Alcalá de Henares, Spain

^*

Author to whom correspondence should be addressed.

Sensors 2023, 23(15), 6969; https://doi.org/10.3390/s23156969

Submission received: 29 May 2023 / Revised: 1 August 2023 / Accepted: 2 August 2023 / Published: 5 August 2023

(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)

Download

Browse Figures

Versions Notes

Abstract

:

This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to

50.3 %

average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to

69.7 %

average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to

18.1 %

average relative improvement in the MOT task for the CAV3D dataset (3D comparison).

Keywords:

smart spaces; audiovisual tracking; speaker localization; particle filter; multi-pose face observation model; probabilistic SRP-PHAT

1. Introduction

Smart spaces are environments equipped with a set of monitoring sensors, communication, and computing systems. The primary objective of smart spaces is understanding people’s behavior within them and their interactions, to improve human–machine interfaces. In this context, one of the required core low-level information is the presence, position, orientation (pose), and voice activity of users within the space, as these features play a significant role in high-level behavior and interaction understanding between the users and the environment.

Many of the approaches proposed in the literature for people localization within a monitored scene use information from a single type of sensor (video cameras [1,2,3], microphone arrays [4,5,6], infrared beacons [7,8,9], and others). Among them, the most used ones are video cameras and microphone arrays. In applications such as “smart” video conferencing, human–machine interfacing [10], automatic scene analysis [11], automatic camera tracking [12], and far-field speech recognition [13] it would be possible to locate and monitor speakers or audio sources within the space using the audio or video information. Additionally, providing accurate speaker positions could facilitate other tasks such as speech recognition [14].

Smart spaces are usually closed environments where the reverberation phenomenon is present, complicating the localization task [15] by audio means. Microphones are usually located in fixed positions on walls or ceilings. Therefore, the distance between the speech sources and the available microphones will lower the signal-to-noise ratio, making it difficult to locate the source [16]. Usually, video cameras are included in this kind of application to increase tracking accuracy as they provide additional information.

Nevertheless, video camera-only-based tracking systems also have their shortcomings. The uncertainties inherent to the acquisition system, the lighting conditions (brightness, shadows, contrast), and noise (sensor or optics characteristics) [17] decrease their accuracy in people pose extraction in real scenarios. Another problem in the vision-based systems is the total or partial occlusion of the subjects to be tracked [18] due to the light’s directional nature. In this context, audio signals are not strictly directional, especially at low frequencies. Hence, they are the perfect complement to the visual tracking process when targets are occluded or out of the camera’s field of view (FoV).

From the discussion above, it is clear that audio and video sources provide complementary information for people tracking in smart spaces. A combination of the best features from both sensors can thus improve the accuracy and robustness of the pursuit tracking process. This combined use of audio and video information in the tracking task is referred to as audiovisual tracking, in which the mouth is the element to track, as this is the area of maximum radiated acoustic power when speaking.

2. Previous Works

Given the increased availability of easy-to-deploy audio and video sensors and the improvements in computing facilities, in recent years, there has been a relevant growth in the number of proposals for multiple speakers tracking (Multiple-Object Tracking, MOT) in smart spaces, combining audio and video information [17,19,20,21,22,23,24,25,26,27,28,29,30,31,32]. Qian et al. in [17] conducted an extensive review of state of the art in audiovisual speaker tracking. In their review, different literature proposals were classified according to various aspects like the space used to perform the tracking (image plane, ground-level or three-dimensional—3D), the configuration of the sensor location (co-located or distributed), the number of sensors, the tracking method, among others. This review highlighted the main research lines and open problems in the audiovisual speaker tracking field.

Along the review, one of the most relevant research lines focuses on tracking a variable and an unknown number of speakers [19,27,33], as this is a still not fully solved problem today. Another line focuses on finding a better visual appearance model to track multiple speakers in indoor environments [34]. At the same time, other proposals are centered on audiovisual tracking in compact configurations (co-located camera and microphone array) for applications such as human-robot interaction [17,20,21,35].

In any case, a clear tendency in the literature proposals is the predominance of probabilistic tracking models such as Bayesian Filters (BFs, i.e., mainly Kalman filters and particle filters) [17,20,29,31,32,34,36,37]. The reasoning behind the use of BFs is that they provide a natural and robust framework to combine different sources of information, in this case, diverse sensing systems.

The reduction of the number of sensors is another important tendency in the literature. In recent proposals, most of the approaches address the tracking task only with only one camera, and one microphone array [17,20,27]. Many of these proposed systems track speakers in the image plane (2D) [32,38]. However, proposals addressing the localization in the full 3D space can also be found in [17,20,24,27,29,31], thus imposing additional challenges analyzed below.

Finally, in the MOT context, it is always necessary to handle situations, like targets occlusions or closely located, where care must be taken in the association between measurements and targets [20]. These situations could be very challenging when audio or visual observations fail or give low-confidence likelihood, thus presenting an interesting set of proposals in the literature, as analyzed in the following sections.

Another general aspect is the recent rise of deep learning techniques in audiovisual speaker tracking. The most commonly used approach is based on an observation model that keeps the Bayesian scheme. There are many examples of face detectors based on deep learning [25,26,27,28,29,31], although Siamese networks have also been used to generate measures of particle similarity to previous reference images of each target [32,39], and fusion models based on the attention mechanism [32]. Fewer proposals we found with end-to-end trained audiovisual solutions as in [24,40] for object tracking in which visual and auditory inputs are fused by an added fusion layer.

This paper presents a complete contribution to robust audiovisual multiple-speaker 3D tracking in smart spaces. The proposed approach is thoroughly analyzed and compared using the best available datasets published for this purpose. We also compare our method against, to the best of our knowledge, the best proposals in the state-of-the-art, emphasizing our focus on signal recognition and pattern analysis. This distinguishes our work from recent proposals that primarily rely on deep learning techniques.

2.1. Visual Tracking

An important issue to analyze within the visual part of the tracking is how to locate the mouth position in the image plane. Within a smart space, speakers could be far from sensors, at distances up to several meters, where detecting the mouth is challenging, as its size can be just a few pixels on the image. Given its small size, usually, the face is first located, as it is easier to detect, and then an estimation of the mouth location within the face area is made. This strategy is used in several proposals [17] and gives a chance to facial feature estimation with pattern recognition methods.

Typically, audiovisual speaker tracking literature uses color histogram-based likelihood estimations to locate faces in images [19,26,41]. However, this technique has proved to be not very accurate [17,32]. More accurate techniques for face location within the image are based on the use of trained detectors like the classic Viola and Jones (VJ) [42], or more recent ones based on deep learning [17].

Regarding the estimation of the mouth position within the face area, most face detectors used in audiovisual tracking are independent of the face pose, as in [17,20]. This approach has limited accuracy for mouth position estimation, as it depends on the face pose relative to the camera. To partially overcome this issue, in [17,20] the mouth localization is solved considering the aspect ratio of the face detection bounding box (BB).

Location in a 3D space using a single camera poses an additional challenge in the visual tracking part, as it requires estimating the target depth based on the scene projection on the 2D plane. From the 2D mouth position estimation, Qian et al. in [20] proposed a 3D projection assuming the shoulders width in 3D as a known parameter. In [17], as faces are detected in 2D with the face detection MXNet [43] network, the 3D mouth position is derived assuming that the face detection BB is related to the distance between the target and the camera.

A third issue with the mouth localization is its integration into the probabilistic tracking framework. In some works, like [17], when a face is located in 2D, a BB is projected to the 3D space, assuming a predefined probability distribution around its 3D points. As a more realistic proposal, in [44] a likelihood function for the face location is proposed, built with the internal information extracted from a modified VJ detector.

Most proposals [19,20,33] were evaluated in the AV16.3 database [45] where people’s faces were always visible on some of the available cameras (people do not turn their backs to all the cameras simultaneously). In [17], the more complex evaluation dataset CAV3D was presented, using a co-located audiovisual sensor setup, where faces are not always visible (sometimes people turn their backs to the camera), including strange poses and varied dynamic behaviors. This new condition implies that a face detector will not provide localization information as long as the camera does not see the faces, and thus another mechanism is required to allow the localization task to provide accurate estimations, generating another relevant challenge to be solved in the visual tracking task.

Some solutions implement a generative color histogram-based algorithm when the faces are not detected [17]. However, when complex contexts arise, with part of the target’s movement happening outside of the camera’s FoV (such as in CAV3D), the audio modality is the only one available, and no color-based approach can be used.

2.2. Audio Tracking

In audio localization, the most common approach is to compute the Generalized Cross-Correlation function (GCC) between pairs of microphones to generate an acoustic activation map with Steered Response Power (SRP) strategies, usually combined with the PHAT transform [20,46,47,48,49,50,51,52,53,54].The audio likelihood model thus obtained is then associated with the SRP acoustic map. The spatial resolution of these methods strongly depends on the array geometry, and for small microphone arrays (short distance between microphones, compared to the search space area), SRP presents a wide active response (low resolution), mainly in radial distance from the source [17].

Another problem in audio localization is room reverberation, which generates multiple peaks in GCC. To alleviate this effect, the approach in [55] proposed to model GCC as a Gaussian mixture. This strategy allows associating only one GCC Gaussian (or peak) with the source.

For multiple-speaker tracking (MOT), the association between GCC peaks and speakers must be addressed, taking also into account that when more than one speaker is in the same direction relative to the microphone array position, the GCC peaks related to the different speakers are mixed. To address this problem, ref. [55] proposes to use an a priori distribution from a Kalman filter to assign each peak to one of the speakers in each pair of microphones, where the assignment is made independently for each pair of microphones.

In [29], the authors propose a phase-aware VoiceFilter and a separation before the localization method. They separate the speech from different speakers by first using VoiceFilter with phase estimation and then applying a localization algorithm. However, the method needs clean speech samples for each speaker in the training phase, thus limiting its applicability. In [32], a novel acoustic map based on a spatial–temporal Global Coherence Field (stGCF) map is proposed, which utilizes a camera model to establish a mapping relationship between the audio and video localization spaces.

2.3. Contributions

The main contributions of our proposal (named GAVT) are: (i) providing an audiovisual localization approach in which the visual strategy is based on exploiting the knowledge of a trained face detector, modified to generate a likelihood model. This approach will thus not depend on a standard face detection process; (ii) using a pose-dependent strategy that improves the 2D mouth location estimation, thanks to the likelihood model extension with a new in-plane rotation and an image evaluation exploration; (iii) proposing a mechanism to handle MOT tasks for visual observations dependent on the new likelihood dispersion; and finally (iv) presenting a novel mechanism to avoid target interference in audio tracking that selects the more adequate GCC peak for each target, based on the joint distribution of all pairs of microphones.

The evaluation will be done on the AV16.3 and CAV3D datasets, both in single and multiple-speaker scenarios, to provide realistic and comparative quantitative and qualitative results and contributions within all the challenges exposed in this state-of-the-art revision.

The remainder of this paper is structured as follows: Section 3 describes the notation used and the definition of the multi-speaker tracking problem. Section 4 describes the general scheme of our proposal, while Section 5 and Section 6 detail the proposed video and audio observation models, respectively. The experimental setup and results obtained are described correspondingly in Section 7 and Section 8. Finally, Section 9 presents the paper conclusions.

3. Notation and Problem Statement

Real scalar values are represented by lowercase letters (e.g.,

α, c

). Vectors are represented by lowercase bold letters (e.g.,

x

). Matrices represented by uppercase bold letters (e.g.,

M

). Uppercase letters are reserved to define vector and set sizes (e.g., vector

y = {(y_{1}, \dots, y_{M})}^{T}

is of size M). Calligraphic fonts are reserved to represent sets (e.g.,

R

for real or

G

for generic sets). The

l_{p}

norm (

p > 0

) of a vector is depicted as

{∥ . ∥}_{p}

, e.g.,

{∥ x ∥}_{p} = {(| x_{1} |^{p} + \dots + {| x_{N} |}^{p})}^{1 / p}

, where

| . |

is reserved to represent absolute values of scalars or the module operation for complex values. The

l_{2}

norm

{∥ . ∥}_{2}

(Euclidean distance) will be written by default as

∥ . ∥

for simplicity. The discrete Fourier transform of a discrete signal

x [n]

is represented with the complex function

X [ω]

, with

X^{*} [ω]

being the complex-conjugate of

X [ω]

. Along the paper, we will use the

^{a}

,

^{v}

, and

^{a v}

superscripts to refer to elements belonging to the audio, video, and audiovisual modalities, respectively. Moreover, the tilde (^~) in

\tilde{X}

refers to the projection of X in a different space.

Within the notation context defined above, let us consider an indoor environment with a set of

N_{M}

microphones

M = {m_{1}, m_{2}, \dots, m_{N_{M}}}

, where

m_{ν}

is a known three-dimensional vector

m_{ν} = {(m_{ν x}, m_{ν y}, m_{ν z})}^{T}

denoting the position of the

ν^{t h}

microphone from the reference coordinate origin. For processing purposes, the microphones are grouped in pairs, as elements in a set

Q = {π_{1}, π_{2}, \dots, π_{N_{Q}}}

, where

π_{j} = (m_{j_{1}}, m_{j_{2}})

is composed of two three-dimensional vectors

m_{j_{1}}

and

m_{j_{2}}

, (

m_{j_{1}}, m_{j_{2}} \in M

, with

m_{j_{1}} \neq m_{j_{2}}

) that describe the spatial location of the microphones in the pair j. If all microphone pairs are allowed, then

N_{Q} = N_{M} \cdot (N_{M} - 1) / 2

.

Given this setup, let us assume that there is a set of

N_{S}

acoustic sources

S = {r_{1}, r_{2}, \dots, r_{N_{S}}}

, where

r_{i}

is a known three-dimensional vector

r_{i} = {(r_{i x}, r_{i y}, r_{i z})}^{T}

, emitting

N_{S}

acoustic signals

x_{i} (t)

, which are received by each microphone

m_{ν}

obtaining

N_{M}

time signal

s_{ν} (t)

according to the propagation model in Equation (1):

s_{ν} (t) = \sum_{i = 1}^{N_{S}} [h_{i, ν} (t) * x_{i} (t) + η_{i, ν} (t)],

(1)

with

h_{i, ν} (t)

being the Room Impulse Response (RIR) between the acoustic source position

r_{i}

and the

ν^{t h}

microphone, * the convolution operator, and

η_{i, ν} (t)

a signal that models all the audio signal adverse effects not included in

h_{i, ν} (t)

(noise, interference, etc.).

In an anechoic (free-field) condition the signals received by each microphone are just a delayed and attenuated version of the acoustic source signal, as shown in Equation (2):

s_{ν} (t) = \sum_{i = 1}^{N_{S}} [α_{i, ν} \cdot x (t - τ_{i, ν} (r_{i}))],

(2)

where

τ_{i, ν} (r_{i}) = c^{- 1} ∥ r_{i} - m_{ν} ∥

is the propagation delay between

m_{ν}

and

r_{i}

,

τ_{i, ν} (r_{i})

,

α_{i, ν} = \frac{1}{4 π c τ_{i, ν} (r_{i})}

is a distance-related attenuation assuming spherical propagation [56], and c is the sound propagation velocity in air.

The environment is also equipped with a set of

N_{C}

cameras

C = {c_{1}, c_{2}, \dots, c_{N_{C}}}

, giving the corresponding

I_{θ} (t)

images, where

c_{θ}

is a known three-dimensional vector

c_{θ} = {(c_{θ x}, c_{θ y}, c_{θ z})}^{T}

, denoting the position of the

θ^{t h}

camera from the reference coordinate origin. The

K_{θ}

intrinsic calibration matrix for each camera is also available, with

θ = 1 \dots N_{C}

.

In the geometrical discussions, we will refer to the 3D space as that in which every point is defined by its 3D coordinates

{(x, y, z)}^{T}

. The visual observation space will be defined in the 2D plane (image plane) as

{(u, v, s)}^{T}

, where

(u, v)

are the pixel coordinates in the image plane, and s refers to the size of the explored image plane.

The proposal in this work will be exploiting the formulation of Bayesian Filters (BFs), which are techniques that estimate the posterior probability density function (PDF) of a system state whose dynamics are statistically modeled along time, given a set of observations or measurements [57], also statistically modeled.

In these models, the state (

x_{n}

) and observation (

z_{n}

) vectors can be mathematically obtained along time, being

n

the corresponding time instant. The state vector

x_{n}

characterizes the properties of the system to be estimated (e.g., position, velocity, physical dimensions), and the observation vector

z_{n}

considers the measurements from the system behavior with all sensors in the environment.

In the BF framework, the estimation is made in two steps: prediction and update (also referred to as correction [57]). In the prediction stage, a prior PDF of the state vector is computed given its previous value

p (x_{n - 1} | Z_{n - 1})

and using the state model

p (x_{n} | x_{n - 1})

, as shown in Equation (3). In the update stage, a new posterior PDF is computed

p (x_{n} | Z_{n})

(see Equation (4)) from the prior obtained in the prediction stage, and by including there the current observation vector information

z_{n}

, through the observation model

p (z_{n} | x_{n}

) and its likelihood

p (z_{n} | Z_{n - 1}

).

p (x_{n} | Z_{n - 1}) = \int p (x_{n} | x_{n - 1}) p (x_{n - 1} | Z_{n - 1}) d x_{n - 1}

(3)

p (x_{n} | Z_{n}) = \frac{p (z_{n} | x_{n}) \cdot p (x_{n} | Z_{n - 1})}{p (z_{n} | Z_{n - 1})},

(4)

where

Z_{n} = {z_{1}, z_{2}, \dots, z_{n}}

.

Particle filters (PFs) are a particular class of BFs that approximate the state distribution with a set of weighted samples

{x_{n}^{i}, w_{n}^{i}

/

i = 1, \dots, N_{P}}

, called particles, that characterize each estimation hypothesis, as shown in Equation (5):

p (x_{n} | Z_{n}) \approx \sum_{i = 1}^{N_{P}} w_{n}^{i} \cdot δ (x_{n} - x_{n}^{i}),

(5)

where

N_{P}

is the number of particles, and

w_{n}^{i}

are the weights characterizing the probability of every given particle

x_{n}

or state value hypothesis.

The update stage is then carried out by applying Importance Sampling [57] (IS, and its derivation Sequential Importance Resampling, SIR), which is a statistical technique to estimate the properties of a posterior distribution

p (x_{n} | Z_{n})

, when its samples are generated from another sampled one, the a prior PDF in this case

p (x_{n} | Z_{n - 1})

.

In this work, PFs are used for tracking the mouth position of multiple speakers in a smart space using audio and video information, so that the observation vector

z_{n}

will be composed of observations from the audio modality

z_{n}^{a}

and from the video modality

z_{n}^{v}

, so that

z_{n} = {(z_{n}^{a}, z_{n}^{v})}^{T}

.

4. Audiovisual Tracking: The `GAVT` System Proposal

The tracking task will be thus carried out using only one video camera and one small microphone array in both co-located and distributed scenarios. It is assumed that the sensors are calibrated, and the audio and video signals are synchronized. It is also considered that there are a constant (and known) number of speakers (targets) in the space and that speakers do not stop talking for long periods. The problem complexity is thus concentrated in the multimodal 3D localization and tracking task in a multiple-speaker context, within a probabilistic approach without a reidentification process to solve the possible association problems that may appear in such context. Within this framework, and following the notation and the problem statement previously described, in this section we are presenting the global system proposed for addressing the audiovisual tracking task.

4.1. General Architecture

In our proposal, each target speaker or objective

o

, will be characterized at any time instant n by a state vector

x_{n, o} = {(p_{n, o}, {\dot{p}}_{n, o})}^{T}

, where

p_{n, o} = {(x_{n, o}, y_{n, o}, z_{n, o})}^{T}

is the speaker 3D position, and

{\dot{p}}_{n, o} = {({\dot{x}}_{n, o}, {\dot{y}}_{n, o}, {\dot{z}}_{n, o})}^{T}

its velocity.

A separated particle filter is used for tracking each target

o

, with the SIR algorithm, using a set of

N_{P}

particles (

N_{P}

is assumed to be fixed for all targets and time instants). Each particle i (

i = 1 \dots N_{P}

) at any time instant n will be characterized by a state vector

x_{n, o}^{i} = {(p_{n, o}^{i}, {\dot{p}}_{n, o}^{i})}^{T}

, where

p_{n, o}^{i} = {(x_{n, o}^{i}, y_{n, o}^{i}, z_{n, o}^{i})}^{T}

is the particle 3D position, and

{\dot{p}}_{n, o}^{i} = {({\dot{x}}_{n, o}^{i}, {\dot{y}}_{n, o}^{i}, {\dot{z}}_{n, o}^{i})}^{T}

its velocity. Therefore, the set of

N_{P}

particles will be

P_{n, o} = {x_{n, o}^{1}, x_{n, o}^{2}, \dots x_{n, o}^{N_{P}}}

.

Following the standard PF framework, each particle set for a target

o

from a previous time instant

P_{n - 1, o} = {x_{n - 1}^{1}_{, o}, x_{n - 1}^{2}_{, o}, \dots {x_{n - 1}^{N_{P}}}_{, o}}

, will be propagated according its corresponding state model

p (x_{n, o} | x_{n - 1}_{, o})

, giving the predicted particle set

P_{n | n - 1, o} = {x_{n}^{1}_{| n - 1, o}, x_{n}^{2}_{| n - 1, o}, \dots {x_{n}^{N_{P}}}_{| n - 1, o}}

, on a frame by frame basis. This set represents the sampled version of the prior distribution

p (x_{n, o} | Z_{n - 1}_{, o})

for each tracked target o, and thus completes the prediction stage of the standard BF framework in Equation (3).

Then, during the update stage at time n, audio and video data

z_{n}^{a v} = {(z_{n}^{a}, z_{n}^{v})}^{T}

are used to evaluate, for every target o and predicted particle i, the audiovisual likelihood

l^{a v} (p_{n, o}^{i})

, combining the likelihoods calculated from the audio and video modalities,

l^{a} (p_{n, o}^{i})

and

l^{v} (p_{n, o}^{i})

, respectively, so that

l^{a v} (p_{n, o}^{i}) = f (l^{a} (p_{n, o}^{i}), l^{v} (p_{n, o}^{i}))

.

Likelihood calculations are carried out for each predicted particle using the corresponding target observation model

p (z_{n}_{, o} | x_{n, o})

.

For every

o

target, the particles’ weights

{w_{n, o}^{1}, w_{n, o}^{2}, \dots w_{n, o}^{N_{P}}}

are obtained, conforming the sampled version of the likelihood

p (z_{n}_{, o} | Z_{n - 1}_{, o})

at the PF standard framework correction step in Equation (4).

After the update process, the particle set is resampled with the multinomial resampling method proposed in [58]. This way, particles with low weights are eliminated and those with high weight are replicated, keeping constant the number of particles

N_{P}

used to characterize the state hypothesis of each estimation of the speaker

o

position at that time instant.

As a final result, we obtain the new particle set

P_{n, o} = {x_{n, o}^{1}, x_{n, o}^{2}, \dots x_{n, o}^{N_{P}}}

, that constitutes the sampled version of the posterior PDF

p (x_{n, o} | Z_{n}_{, o})

. This final new particle set will be the one used as the prior distribution in the next time step.

The general scheme described above is graphically presented in Figure 1.

4.2. Prediction

The state model used at each iteration of the PF to propagate the particles is the Langevin motion model [59], commonly used in the acoustic speaker tracking literature [60], as shown in Equation (6):

x_{n, o}^{i} = A x_{n - 1, o}^{i} + Q u,

(6)

where

A

and

Q

are the transition and noise state matrices, respectively,

u \sim N (0, Σ)

is the noise related to the process or state, for which a normal distribution is assumed, with zero mean

0

and

Σ

covariance matrix.

Matrix

A

corresponds to a first-order motion behavior, and it is described in Equation (7):

A = [\begin{matrix} 1 & 0 & 0 & ϵ Δ_{T} & 0 & 0 \\ 0 & 1 & 0 & 0 & ϵ Δ_{T} & 0 \\ 0 & 0 & 1 & 0 & 0 & ϵ Δ_{T} \\ 0 & 0 & 0 & ϵ & 0 & 0 \\ 0 & 0 & 0 & 0 & ϵ & 0 \\ 0 & 0 & 0 & 0 & 0 & ϵ \end{matrix}],

(7)

Moreover, the coefficients of

Q

matrix are given by Equation (8):

Q = d i a g (ζ Δ_{T}, ζ Δ_{T}, ζ Δ_{T}, ζ, ζ, ζ),

(8)

where

Δ_{T}

is the time interval (in seconds) between frames

n

and

n - 1

,

ϵ = e^{- β Δ_{T}}

and

ζ = \bar{v} \sqrt{1 - ϵ^{2}}

are the process’ noise model parameters, and

d i a g (\cdot)

is a diagonal matrix with the diagonal values being its arguments. The control parameters used in the proposal are a steady-state velocity

\bar{v}

and its velocity rate of change

β

, following the formulation in [59].

4.3. Update and Position Estimation

To fuse the audiovisual information generated from the audio and video sources, we assume independence between these modalities, so that they fulfill Equation (9):

p (z_{n, o}^{a v} | x_{n, o}) = p (z_{n, o}^{a} | x_{n, o}) \cdot p (z_{n, o}^{a} | x_{n, o})

(9)

In practice, this means that the audiovisual likelihoods are obtained by computing the product of the likelihoods from both modalities, as shown in Equation (10):

p (z_{n, o}^{a v} | x_{n, o}) \sim l^{a v} (p_{n, o}^{i}) = f (l^{a} (p_{n, o}^{i}), l^{v} (p_{n, o}^{i})) = l^{a} (p_{n, o}^{i}) \cdot l^{v} (p_{n, o}^{i})

(10)

Then, following the sampled background of PFs presented in Equation (5), weights are updated by their likelihood values as shown in Equation (11):

w_{n, o}^{i} = l^{a v} (p_{n, o}^{i})

(11)

Finally, the most probable state for each

o

target (

{\hat{x}}_{n, o}

), thus the deterministic instantaneous value for

p (x_{n, o} | Z_{n}_{, o})

, is estimated evaluating Equation (12) [58].

{\hat{x}}_{n, o} = \sum_{i = 1}^{N_{P}} w_{n, o}^{i} \cdot x_{n, o}^{i}

(12)

5. Video Observation Model

The visual likelihood

l^{v} ({\tilde{p}}_{n, o}^{i})

of the video observation model in the real 3D coordinate system, consists of three blocks, as shown in Figure 2, and explained below:

The first block is an appearance-based observation model based on the VJ likelihood $l^{V J} ({\tilde{p}}_{n, o}^{i})$ . Using the probabilistic version of the VJ detector, this appearance-based algorithm accurately estimates the face and mouth location, taking into account different face poses, and generating a likelihood $l^{V J} ({\tilde{p}}_{n, o}^{i})$ to them, as no tracking-by-detection is performed.
The second block is based on a color histogram (color-based likelihood) $l^{c o l} ({\tilde{p}}_{n, o}^{i})$ : The color histogram model is intended to be used when the first block fails to detect the face. These cases may appear with poses not correctly handled by the VJ model, for example, with the head tilted too far forward, or due to a person’s rapid movement, which blurs facial features and may cause a poor response from the VJ model.
The third block is a foreground versus background segmentation (Fg./Bg. Segmentation) that generates a foreground likelihood (Fg. Likelihood) $l^{f g} ({\tilde{p}}_{n, o}^{i})$ , which is used to validate the proposals from both previous models and restrict the video observation hypothesis dispersion, providing a $l^{f g} ({\tilde{p}}_{n, o}^{i})$ likelihood, when the other two components fail.

Thus, the general processing sequence of the video observation model

l^{v} ({\tilde{p}}_{n, o}^{i})

, shown in Figure 2, is as follows:

The first step is to apply the coordinate transformation from the World Cartesian coordinates System (WCS) to the Face Observation Space (FOS) one. There, the tilde in ${\tilde{p}}_{n, o}^{i}$ refers to the projection of $p_{n, o}^{i}$ in this FOS. Thus, after projection, the algorithm determines which observation hypotheses are in the camera’s FoV.
Then, foreground versus background segmentation (Fg./Bg. Segmentation) is applied. For each target $o$ , the FOS is explored with VJ, looking for face detection, so in positive case $C_{o}^{V J} = 1$ .
For targets within the FoV that have no observation response from the VJ model (i.e., $C_{o}^{V J} = 0$ ), the color-based model (light red box in Figure 2) is applied if possible. If this is the case, $C_{o}^{c o l} = 1$ .
If neither the VJ nor the color models can be applied, the foreground likelihood will be assigned.
In any case, a mechanism to prevent a specific person’s visual hypothesis from being confused with observations from another target, an occlusion detector is included (in the VJ and Color modules). Thus, a procedure to restrict the likelihood analysis from hypotheses declared as occluded (Occlusions Correction block) is applied.

Following the observational model, a global visual likelihood function or video observation model

l^{v} ({\tilde{p}}_{n, o}^{i})

is finally defined as in Equation (13), where a confidence level is assigned according to each model component.

p (z_{n, o}^{v} | x_{n, o}^{i}) \sim l^{v} ({\tilde{p}}_{n, o}^{i}) = \{\begin{matrix} l^{V J} ({\tilde{p}}_{n, o}^{i}) & i f C_{o}^{V J} \\ l^{c o l} ({\tilde{p}}_{n, o}^{i}) & i f C_{o}^{V J} = 0 and C_{o}^{c o l} = 1 \\ l^{f g} ({\tilde{p}}_{n, o}^{i}) & o t h e r w i s e \end{matrix}

(13)

If the measurement found with the appearance model

p (z_{n, o}^{v} | x_{n, o}^{i})

is reliable,

C_{o}^{V J} = 1

and the likelihood of related hypotheses are weighted according to the similarity Gaussian function

l^{V J} ({\tilde{p}}_{n, o}^{i})

. If the VJ estimation is not reliable but the color model does show significant similarity to the reference model (

C_{o}^{c o l} = 1

), the related hypotheses are weighted according to their similarity to this reference model as

l^{c o l} ({\tilde{p}}_{n, o}^{i})

. Finally, if both models do not provide confident enough measurements, the hypotheses are weighted according to the background subtraction model

l^{f g} ({\tilde{p}}_{n, o}^{i})

.

In the following subsections, the different blocks of the video observation model are further described and analyzed.

5.1. From World Coordinate System to Face Observation Space (WCS to FOS)

The person (its mouth) position hypotheses in the WCS

{p_{n, o}^{i}} = {{(x_{n, o}^{i}, y_{n, o}^{i}, z_{n, o}^{i})}^{T}}

are projected to the camera image plane using the pin-hole model. Thus, a collection of 2D points

(u_{n, o}^{i}, v_{n, o}^{i})

and distances to the camera

d_{n, o}^{i}

are obtained, to be validated through the video observation model.

Therefore, these 2D points represent hypotheses of the speakers’ mouth position in the image plane (in the FOS), assumed to come from some 2D BB face hypothesis

(u_{n, o}^{i}, v_{n, o}^{i})

, around the mouth.

It is then interesting to obtain further information about that face hypothesis that may give robustness to the validation process commented. This information is going to be the face hypothesis 2D BB size (

S_{n, o}^{i}

). To determine it, the face height of a person

h_{r}^{3 D}

is assumed to be constant and known in the 3D WCS. Thus, it is projected to the FOS through the distance

d_{n, o}^{i}

, as shown in Equation (14):

S_{n, o}^{i} = \frac{h_{r}^{3 D} \cdot f_{c}}{d_{n, o}^{i}},

(14)

where

f_{c}

is the camera focal length.

Once the speakers position hypothesis

{p_{n, o}^{i}}

are projected from the WCS to the FOS, each particle represents the speaker mouth 2D location as

{{\tilde{p}}_{n, o}^{i}} = {{(u_{n, o}^{i}, v_{n, o}^{i}, S_{n, o}^{i})}^{T}}

.

Figure 3 shows a schematic view of the projection mechanism here described.

5.2. Appearance-Based Multi-Pose Video Observation Model

For the sake of clarity, the core of the video observation model based on the VJ likelihood, and its characteristics will be explained first. Then the rest of the processes will be detailed.

5.2.1. Viola and Jones Likelihood Model

The VJ likelihood evaluation is made using the probabilistic VJ model described in [44]. This model consists of modifying the standard VJ face detector to obtain face likelihood values. Given an image position

{\tilde{p}}_{n, o}^{i}

, the model applies a cascade of face-trained classifiers returning a likelihood value

Ω

as in Equation (15):

Ω = \frac{κ}{M} \frac{\sum_{m = 1}^{κ} (H_{m} - θ_{m})}{\sum_{m = κ}^{M} m}

(15)

where

κ

is the number of stages that the image patch passes through the cascade of classifiers, M is the total number of stages,

H_{m}

is the weight output by the stage m, and

θ_{m}

is its threshold.

The likelihood model is applied with three different templates to evaluate different possible face poses in yaw rotations. One template handles frontal faces, and the other two handle left and right profile faces, respectively. Therefore, within the 2D BB, the mouth position where the proposed visual likelihood model is applied is different for each template, as shown in Figure 4.

The approach is flexible enough to allow for other poses to be considered, by, for example, extending the previous templates with in-plane rotations (roll). In this case, the image can be rotated with a given angle (

α

) in the opposite direction to allow reusing the already trained poses.

In this work, the image and the particles are rotated in both directions (clockwise and counterclockwise). For each direction, the models for the three face pose classifiers (front, right, and left profiles) are applied.

Thus, for each position

{\tilde{p}}_{n, o}^{i}

in the image, nine response values are obtained, as shown in Equation (16):

Ω ({\tilde{p}}_{n, o}^{i}) = [\begin{matrix} Ω^{F_{0}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{R_{0}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{L_{0}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{F_{- α}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{R_{- α}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{L_{- α}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{F_{+ α}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{R_{+ α}} ({\tilde{p}}_{n, o}^{i}) \\ Ω^{L_{+ α}} ({\tilde{p}}_{n, o}^{i}) \end{matrix}],

(16)

where F, R, and L refer to frontal, right, and left profiles, respectively, and their subindexes refer to the angle rotation.

Figure 5 shows the templates for in-plane face rotations: three on the left with clockwise rotation, and three on the right with counterclockwise ones, using

α = 15^{\circ}

.

The characteristics of the model response are explored in the FOS. Figure 6 shows the responses of the nine templates of the VJ likelihood model for three different face poses. The different templates better respond to faces with poses close to those with similar profiles and rotations. The likelihood responses have higher values around the mouth position for each pose, and more than one template can generate a positive response to the face when the pose is similar.

Figure 7 shows that the model response to a face is shaped like a blob in the FOS. This blob-shape-like behavior resembles that of Gaussian behavior. The width of the significant response levels in the

(u, v)

plane is small, a few pixels in diameter. The width in response to the template size S is much larger, several tens of pixels.

5.2.2. Exploration Mechanism of the FOS

One exploration alternative could be to evaluate the likelihood of the particle’s projection on the FOS. However, the peak’s width of the response on the image plane

(u, v)

is very narrow. This characteristic is right from the point of view of location accuracy, but it increases the possibility that no particle will hit the same region where a peak appears in the FOS. On the other hand, exploring the FOS domain in the whole area occupied by the particles may imply a high computational cost.

To ensure that the region with a peak response is found while keeping computational complexity low, we plan to take advantage of the model’s face response’s redundancy in dimension S. Our proposal consists of exploring the volume occupied by particles in the FOS using three slices in the S dimension. With the response in these slices, we can estimate a Gaussian shape of the likelihood in the FOS, to finally, weight all particles with the estimated Gaussian.

The three slices or scanning planes are defined, one at the average face size, one

γ

times larger, and one

γ

times smaller than the average size:

[\bar{S} / γ, \bar{S}, \bar{S} \cdot γ]

. The average face size of particles is computed as

{\bar{S}}_{o} = \frac{1}{N_{P}} \sum_{i = 1}^{N_{P}} S_{o}^{i}

. In

(u, v)

, the exploration is restricted to the rectangular area defined by the maximum and minimum particle position values in each dimension

[u_{m i n} : u_{m a x}, v_{m i n} : v_{m a x}]

. Figure 8 shows the process to weight the particles.

5.2.3. Gaussian Approximation

For each pose, the VJ likelihood is evaluated in the three slices. Next, a threshold

θ_{V J}

is applied to remove points outside the Gaussian pursuit, and the mean

μ_{n, o}^{\tilde{p}} = (μ_{u}, μ_{v}, μ_{s})

and covariance matrix

Σ_{n, o}^{\tilde{p}}

of the remaining points are computed.

Next, instead of adding all pose outputs, like in [2,23], we select the best pose (

p o s e_{b e s t}

) as that with the highest response value (described in Equation (17)), proven to reduce the number of false positives in preliminary experiments.

p o s e_{n, o}^{b e s t} = \underset{p o s e}{arg max} {Ω^{p o s e} ({\tilde{p}}_{n, o}^{i})}, w i t h p o s e \in {F_{0} R_{0} L_{0} F_{- α} R_{- α} L_{- α} F_{+ α} R_{+ α} L_{+ α}}

(17)

Using the centroid’s value

μ_{n, o}^{\tilde{p}}

, the corresponding pose face BB is projected to the Fg./Bg. image. Finally, poses whose BB have less than

60 %

of intersection with the foreground area are eliminated, as

60 %

is the approximated area percentage covered by a face in the templates.

From the remaining poses, that with the greatest number of particles is selected on the set of not eliminated poses, and the observation confidence of the appearance-based model is set to one:

C_{o}^{V J} = 1

. In the case all the poses were eliminated, it will be considered that there is no visual observation with the appearance-based model, setting

C_{o}^{V J} = 0

and emptying the associated

μ_{n, o}^{\tilde{p}}

and

Σ_{n, o}^{\tilde{p}}

.

Finally, some adjustment to the covariance matrix is needed. To avoid orientation artifacts in the generated Gaussian model, covariance terms are not considered, setting

σ_{u v, o} = σ_{u s, o} = σ_{v s, o} = 0

. Also, the horizontal and vertical dispersion values are equalized, taking the minimum of both values

σ_{u u, o} = σ_{v v, o} = m i n (σ_{u u, o}, σ_{v v, o})

. To prevent a zero variance in the S dimension, caused by the case where only one slice had points over the threshold, its minimum value is set to

σ_{s s, o}^{m i n}

, so that

σ_{s s, o} \geq σ_{s s}^{m i n}

.

The right image in Figure 8, shows the particles (red), their mean value (black cross), and the standard deviation of the estimated Gaussian as a 3D surface (green).

5.2.4. Occlusion Detection with the VJ Model

After estimating which of the measurements are associated with each target, it is necessary to make sure that multiple targets are not related to the same measurement.

The following procedure is used to avoid this situation:

For each pair of targets $(o, o^{'})$ , the overlap between the exploration spaces (FoVs) of both targets is calculated $ρ = ({[u_{m i n} : u_{m a x}, v_{m i n} : v_{m a x}]}_{o} \cap {[u_{m i n} : u_{m a x}, v_{m i n} : v_{m a x}]}_{o^{'}})$ .
If there is overlap ( $ρ \neq \emptyset$ ), the global possible measurements’ dispersion $σ_{u u, o}$ for each target $(o)$ in the image plane $(u, v)$ is calculated.
If any of the targets have high dispersion, it may imply that some of the assigned measurements are from other targets. The threshold value to consider that a target has a high measurement dispersion is calculated as a fraction of the average face size of the particle set ${\bar{S}}_{o}$ .
If the dispersion of any target exceeds this threshold and is not occluded, the Gaussian related to that target observation is recalculated after removing the assigned measurements within the $ρ$ overlapping region. Then, the Euclidean distance between the two centroids of the observations reassigned to each target within the image plane is calculated as $d_{u v} = ∥ (μ_{n, o}^{\tilde{p}}, μ_{n, o^{'}}^{\tilde{p}}) ∥$ .
If this $d_{u v}$ distance is less than two times the dispersion in the image plane ( $2 \cdot σ_{u u, o}$ ) the decision of which measurement is associated with which target is based on another definition of the distance $d_{u v}$ is defined, using the representation of the predicted particles in the image plane as $d_{u v} = ∥ (μ_{n, o}^{\tilde{p}}, {\bar{p}}_{n, o}) ∥$ . Then again, the observations will be assigned to the target $(o)$ from which the shortest distance $d_{u v}$ is found, and the other target $(o^{'})$ is declared as occluded, and the related particles are at a shorter distance than the threshold.

The pseudocode of this process is presented in Algorithm 1.

Figure 9 shows an occlusion situation. In the left graph, the image and the related locations (in the image plane) from the particles representing two targets are shown in red and green colors. The center graph shows a top view of the particle positions in the WCS.

Finally, the right-hand graph shows the same particles’ related location in 2D together with the standard dispersion

σ_{u u, o}

, through its representing circle (in magenta and cyan). There, it can be observed that part of the particles representing target number two (green) are being assigned in the observation space to target one (red). It can be thus noticed the particles dispersion, and therefore the occlusion situation.

Algorithm 1 Occlusions detection in VJ likelihood block.

if every target pair $o, o^{'}$ in FoV then
$ρ \leftarrow ({[u_{m i n} : u_{m a x}, v_{m i n} : v_{m a x}]}_{o} \cap {[u_{m i n} : u_{m a x}, v_{m i n} : v_{m a x}]}_{o^{'}})$ % intersection factor
if $ρ \neq \emptyset$ then
find intersection region, $({[u_{m i n}, u_{m a x}, v_{m i n}, v_{m a x}]}_{i n t e r})$
for $o^{*} \in {o, o^{'}}$ do
if $σ_{u u, o *} > {\bar{S}}_{o *}$ and $f a c e I s O c c l u d e d (o *) = = 0$ then
${[u_{m i n} : u_{m a x}, v_{m i n} : v_{m a x}]}_{o *} \leftarrow {[u_{m i n} : u_{m a x}, v_{m i n} : v_{m a x}]}_{o *} - ρ$
end if
end for
if $d_{u v} = ∥ (μ_{n, o}^{\tilde{p}}, μ_{n, o^{'}}^{\tilde{p}}) ∥ < σ_{u u, o}$ then
if $d_{u v} = ∥ (μ_{n, o^{'}}^{\tilde{p}}, {\bar{p}}_{n, o}) ∥ < d_{u v} = ∥ (μ_{n, o}^{\tilde{p}}, {\bar{p}}_{n, o}) ∥$ then
$f a c e I s O c c l u d e d (o) \leftarrow 1$ % The target with prediction farthest from the measure is occluded
for each particle do
if $d_{u v} = ∥ (μ_{n, o^{'}}^{\tilde{p}}, {\tilde{p}}_{n, o}^{i}) ∥ < 2 σ_{u u, o^{'}}$ then
$o c c I d x (i) \leftarrow 1$ % particles are set as occluded
end if
end for
else
$f a c e I s O c c l u d e d (o^{'}) \leftarrow 1$
for each particle do
if $d_{u v} (μ_{n, o}^{\tilde{p}}, {\tilde{p}}_{n, o^{'}}^{i})) < 2 σ_{u u, o}$ then
$o c c I d x (i) \leftarrow 1$
end if
end for
end if
end if
end if
end if

5.2.5. VJ Likelihood Assignment

The likelihood values assigned to the particles are associated with the Gaussian with mean

μ_{n, o}^{\tilde{p}}

, covariance matrix

Σ_{n, o}^{\tilde{p}}

and amplitude equal to the maximum observed there

Ω^{p o s e_{b e s t}}

(with

p o s e_{n, o}^{b e s t}

as defined in Equation (17)). Likelihood values below the threshold

θ_{V J}

are set to zero. Equation (18) describes such likelihood.

l^{V J} ({\tilde{p}}_{n, o}^{i}) = \{\begin{matrix} Ω^{p o s e_{b e s t}} N ({\tilde{p}}_{n, o}^{i} | μ_{n, o}^{\tilde{p}}, Σ_{n, o}^{\tilde{p}}) & if Ω^{p o s e_{b e s t}} N ({\tilde{p}}_{n, o}^{i} | μ_{n, o}^{\tilde{p}}, Σ_{n, o}^{\tilde{p}}) \geq θ_{V J} \\ 0 & otherwise \end{matrix}

(18)

5.3. Head Color-Based Likelihood Block

As explained above, in the FOS coordinates, each particle is assigned a BB in the image, corresponding to the face in a frontal pose. To tackle face occlusions, the spatiogram of the 2D BB mouth estimation

B B_{n, o}^{i}

is calculated as in [17], and replicated in Equation (19). The likelihood will be proportional to the Bhattacharyya coefficient between the histogram associated with each particle

i

and the spatiogram of the reference face patch image

B B_{n, o}^{f a}

:

S (B B_{n, o}^{i}, B B_{n, o}^{f a}) = \sum_{b = 1}^{B} \sqrt{r_{i}^{b} r_{f a}^{b}} [8 π | Σ_{i}^{b} Σ_{f a}^{b} |^{\frac{1}{4}} N (μ_{i}^{b} | μ_{f a}^{b}, 2 (Σ_{i}^{b} + Σ_{f a}^{b}))],

(19)

were

r_{i}^{b}, μ_{i}^{b}, Σ_{i}^{b}

and

r_{f a}^{b}, μ_{f a}^{b}, Σ_{f a}^{b}

are the histogram count, the spatial mean and covariance matrices in the color bin b of particle

i

and in the reference face patch image

f a

, respectively.

The face is considered to be located by the color model (setting

C_{o}^{c o l} = 1

) if the maximum similarity value exceeds a threshold

θ_{c o l}

. In this case, the 75th percentile of the lower likelihoods is discarded, to reduce the dispersion of particles due to the higher dispersion of the color model. Otherwise, we assume that the face is not detected by this color-based strategy (setting

C_{o}^{c o l} = 0

).

Equation (20) describes the final color likelihood proposed

l^{C o l} ({\tilde{p}}_{n, o}^{i})

.

l^{C o l} ({\tilde{p}}_{n, o}^{i}) = \{\begin{matrix} S (B B_{n, o}^{i}, B B_{n, o}^{f a}) & if m a x (S (B B_{n, o}^{i}, B B_{n, o}^{f a})) \geq θ_{c o l} \\ 0 & otherwise \end{matrix}

(20)

The color reference histogram is initialized in the face BB of the first ground-truth frame, considering a frontal pose. This reference model is updated in every iteration when the VJ likelihood delivers a confident value. In this case, the face BB corresponding to the best-detected pose is used.

Like in the VJ likelihood block, with the color model, particles from an occluded target o, very close to another target

o^{'}

, can be captured by the significant likelihood region of

o^{'}

. To avoid this situation, the o particles that are very close to

o^{'}

are labeled as occluded. The limit value of closeness is set equal to the diagonal of the average face size of the

o^{'}

target

\sqrt{2} {\bar{S}}_{o^{'}}

. This procedure is carried out before evaluating the color-based likelihood.

5.4. Correction on Occlusions

The occlusion correction stage is aimed at avoiding hypotheses of one person’s position (target) obtaining a high likelihood if mistaken for another target. Simultaneously, this correction mechanism should not penalize hypotheses in an occluded region, thus allowing the mechanism to keep track of persons passing behind one another in the monitored environment.

The correction globally works as described in Algorithm 2. If a target is occluded and all its related position hypotheses are in an occluded area, all their likelihood values

l^{v} ({\tilde{p}}_{n, o}^{i})

are set to

1 / N_{P}

. Otherwise, if the target has some position hypotheses in an occluded area and some others in a non-occluded one, the likelihood of those located in the occluded area (

o c c I d x (i) = = 1

) is set to the average of those in a non-occluded one (

o c c I d x (i) \neq 1

).

Algorithm 2 Occlusions correction pseudocode.

for every target $o$ do
if target $o$ is occluded then
if all position hypotheses ${\tilde{p}}_{n, o}^{i}$ are occluded then
$l^{v} ({\tilde{p}}_{n, o}^{i}) = 1 / N_{P} \forall i$
else
$l^{v} ({\tilde{p}}_{n, o}^{i} / o c c I d x (i) = = 1) = m e a n (l^{v} ({\tilde{p}}_{n, o}^{i} / o c c I d x (i) \neq 1)) \forall i$
end if
end if
end for

5.5. Foreground vs. Background Segmentation (Fg. vs. Bg.)

The foreground vs. background segmentation procedure starts by subtracting a reference frame (with the environment background, without people) from the given one, in grayscale. After the subtraction, a threshold

θ_{f g}

is applied to the resulting difference image, obtaining a binary image

I_{θ, f g}

.

Hypotheses in the foreground or outside the camera’s FoV receive a uniformly distributed likelihood value (

U ({\tilde{p}}_{n, o}^{i})

). In contrast, those within the FoV but in the background, receive a zero weight, as stated in Equation (21):

l^{f g} ({\tilde{p}}_{n, o}^{i}) = \{\begin{matrix} \sim U ({\tilde{p}}_{n, o}^{i}) & if I_{θ, f g} ({\tilde{p}}_{n, o}^{i}) = = 1 \lor \notin FoV \\ \sim 0 & otherwise \end{matrix}\}

(21)

6. Audio Observation Model

In this work, the audio observation model is based on a probabilistic version of SRP-PHAT, proposed in [55]. This model exploits a probabilistic interpretation of the Generalized Cross-Correlation with PHAT transform (GCC-PHAT) between the signals of each pair of microphones. With probabilistic GCC-PHAT, it is possible to associate only one correlation peak to each target. For each time step, the procedure works as follows:

The GCC-PHAT for every pair of microphones, and its associated Gaussian model are obtained.
SRP-PHAT is computed for every particle position.
A Gaussian selection procedure chooses which Gaussian is associated with each target.
Finally, the probabilistic SRP-PHAT value is associated with the likelihood of the targets.

Figure 10 shows the general scheme of the audio observation model.

6.1. GCC-PHAT and Gaussian Model

The GCC-PHAT (to ease the notation, we will skip the explicit mention to PHAT when referring to the

G C C^{P H A T}

and

S R P^{P H A T}

functions in the equations) is computed for the signals arriving at each microphone pair

π_{j}

(composed of microphones

m_{j_{1}}

and

m_{j_{2}}

) around the evaluated video frame, as presented in Equation (22):

G C C_{π_{j}} (τ) = \sum_{k = 0}^{N_{f} - 1} Ψ_{j_{1} j_{2}} [k] S_{j_{1}} [k] S_{j_{2}}^{*} [k] e^{j \frac{2 π k f_{s}}{N_{f}} τ}

(22)

where

N_{f}

is the number of discrete frequencies used in the Fourier analysis of the discretized signals captured by the

j_{1}

and

j_{2}

microphones (sampled versions of the

s_{j_{1}} (t)

and

s_{j_{2}} (t)

signals);

S_{j_{1}} [k]

and

S_{j_{2}} [k]

are the frequency spectra of these signals; and

Ψ_{j_{1} j_{2}} [k] = \frac{1}{| S_{j_{1}} [k] S_{j_{2}}^{*} [k] |}

is the PHAT filter.

τ

is the lag variable of the correlation function, associated with the time difference of arrival of the audio signal to the pairs of microphones.

The second step is to model the GCC-PHAT of each microphone pair

π_{j}

as a Gaussian Mixture Model (GMM). The main assumption here is that each GCC-PHAT peak is caused by a different acoustic source at a given position, generated by the direct propagation path, by a reverberant echo, or by other noise sources. With this consideration, each peak in the GCC-PHAT function is associated with a Gaussian function in the GMM model described by Equation (23):

{\hat{G C C}}_{π_{j}} (τ) \approx \sum_{h = 0}^{N_{J} - 1} ω_{π_{j}}^{h} N (τ | μ_{π_{j}}^{h}, σ_{π_{j}}^{h}),

(23)

where

N_{J}

is the number of peaks detected in the GCC-PHAT function, and

μ_{π_{j}}^{h}

,

σ_{π_{j}}^{h}

and

ω_{π_{j}}^{h}

represent the mean, standard deviation and weights of the

h^{t h}

component of the mixture.

The correlation values are first normalized (making their sum equal to one) and their negative values are set to zero. The GMM parameters

{μ_{π_{j}}^{h}, σ_{π_{j}}^{h}, ω_{π_{j}}^{h}}_{h = 1}^{N_{J}}

are estimated according to the procedure described in [55].

6.2. SRP-PHAT and Gaussian Selection

Once the GMM model is available, the traditional SRP-PHAT formulation can be applied to the position of each particle. Then, the SRP-PHAT value for a given target o is calculated as the average

\hat{S R P}

over all the particles associated with o,

{\bar{S R P}}_{n, o}

.

Gaussian selection is applied sequentially to every target, starting from the one with the highest

{\bar{S R P}}_{n, o}

value. For each target

o

and microphone pair

π_{j}

, the set of maximum SRP-PHAT delays

{τ_{π_{j}}^{p_{n, o}^{i m a x}}}

is evaluated in each Gaussian h, and that with the highest value is selected, as shown in Equation (24).

p_{n, o}^{i m a x} = \underset{p_{n, o}^{i}}{arg max} (\hat{S R P} (p_{n, o}^{i}))

(24)

After the selection, the selected Gaussian is subtracted from the mixture, and the Gaussian selection process continues with all the other targets, in

{\bar{S R P}}_{n, o}

value decreasing order.

Figure 11 represents the Gaussian selection process in a frame extracted from sequence seq18-2p-0101 in the AV16.3 dataset, where there were two close speakers. The graphic on the left shows in black the

G C C_{π_{16}} (τ)

function (for the

π_{16}

microphone pair) and the calculated Gaussian mixture

{\hat{G C C}}_{π_{16}} (τ)

in blue, along with the projections of the particles with maximum

\hat{S R P} (p_{n, o}^{i})

for both targets. The graphic on the right highlights the selected Gaussians for each target. In both graphics, target one selected Gaussian appears in red, and target two appears in green.

From Figure 11 it can be observed that both targets obtain close TDoA projection for their particles with maximum

\hat{S R P} (p_{n, o}^{i})

value, sharing the same Gaussian. In this case, target 2 obtained the highest value, so it was assigned the associated Gaussian.

6.3. Single Gaussian SRP-PHAT Model

The final step to generate a likelihood value from the acoustic information consists of simplifying the SRP-PHAT GMM model, by considering just one Gaussian for each pair of microphones, that with the highest weight value, as shown in Equation (25).

\hat{S R P} (p_{n, o}^{i}) \approx \sum_{j = 1}^{N_{Q}} ω_{π_{j}}^{*} N (τ | μ_{π_{j}}^{*}, σ_{π_{j}}^{*})

(25)

where

ω_{π_{j}}^{*}

,

μ_{π_{j}}^{*}

and

σ_{π_{j}}^{*}

are the parameters associated with the Gaussian component with the highest weight value in Equation (23).

Because of reverberation and low SNR conditions, some speech segments may exhibit low SRP-PHAT values, degrading the quality of the acoustic power maps. To avoid this degradation, we consider the maximum SRP-PHAT value as an indicator of confidence, so that a threshold

θ_{a}

is applied to limit the influence of such segments. Finally, the likelihood from the audio information is calculated as described in Equation (26):

p (z_{n, o}^{a} | x_{n, o}^{i}) \propto l^{a} (p_{n, o}^{i}) = \{\begin{matrix} \sim \hat{S R P} (p_{n, o}^{i}) & i f max (\hat{S R P} (p_{n, o}^{i})) \geq θ_{a} \\ \sim U (p_{n, o}^{i}) & o t h e r w i s e \end{matrix}

(26)

7. Experimental Setup

The tracking system has been evaluated in three modalities. The first one uses only audio information, the second one uses only video information, and the final one combines the two sources of information in an audiovisual modality.

7.1. Datasets

The databases used for system evaluation are the well-known AV16.3 [45], and CAV3D [17] in the state of the art of interest. Both databases are fully labeled, providing the mouth ground-truth location. Synchronization information between the audio and video streams is also available.

7.1.1. AV16.3

AV16.3 was recorded in the Smart Meeting Room of the IDIAP research institute [61], in Switzerland, which consists of a 8.2 m × 3.6 m × 2.4 m rectangular room, containing a centrally located 4.8 m × 1.2 m rectangular table, on top of which there are located two circular microphone arrays of radius 0.1 m, each of them composed of eight microphones. The centers of the two arrays are separated 0.8 m, and their coordinates origin are in the middle point between the two arrays. The room is also equipped with three video cameras providing different-angle views of the room. The database contains audio and video data taken by the three video cameras and the two circular microphone arrays. The cameras have a frame rate of 25 f.p.s. (40 ms period) while the audio has been recorded at 16 kHz. The dataset contains sequences grouped in two contexts, a Single Objective Tracking (SOT) one, and a Multiple Objective Tracking (MOT) one.

7.1.2. CAV3D

CAV3D was recorded in the Bruno Kessler Center in Information and Communication Technology (FBK-ICT) in Italy, consisting of a 4.77 m × 5.95 m × 4.5 m rectangular room. Sensing was conducted with a monocular color camera co-located with an 8-element circular microphone array placed in the room’s center. Video was recorded at 15 f.p.s. (≈66.7 ms period). The audio sampling rate was 96 kHz. The dataset contains 20 sequences grouped in three contexts, a Single Objective Tracking (SOT) one, another with a single active speaker and a second interfering person (not speaking) (SOT2), and a Multiple Objective Tracking (MOT) one.

7.1.3. Sequences Selection

The experiments were carried out using the same sequences in AV16.3 and CAV3D evaluated in [17,22], two state-of-the-art proposals used for performance comparison. Table 1 shows the AV16.3 selected sequences: seq08, seq11, seq12 for SOT, and seq18, seq19, seq24, seq25 seq30 for MOT. For each sequence, the three cameras’ views were tested independently and combined with the first microphone array, giving a total of nine evaluation sequences in SOT and fifteen in MOT. Table 2 shows the CAV3D selected sequences, which are all the available ones for the SOT and MOT cases, in which we focused our experimental work.

7.2. Evaluation Metrics

The evaluation metrics used are the Track Loss Rate (TLR), and the Mean Absolute Error (MAE), as defined in [17]. The TLR is the percentage of frames with a track loss, where a target is considered to be lost if the error exceeds a given threshold. These metrics will be further specified below.

The MAE for 3D is expressed in m as in Equation (27):

ϵ_{3 D} = \frac{1}{N_{F} N_{S}} \sum_{i = 1}^{N_{S}} \sum_{n = 1}^{N_{F}} | | {\hat{p}}_{n, i} - r_{n, i} | |,

(27)

where

N_{F}

is the number of frames evaluated, and

{\hat{p}}_{n, i}

and

r_{n, i}

are, respectively, the estimated and real positions of source i in frame n.

To evaluate the TLR in 3D, a target is considered to be lost if the error with respect to the ground-truth is larger than

300 mm

. We also use a fine error metric defined as

ϵ_{3 D}^{'}

, where only the frames where tracking is successful are considered in Equation (27).

For 2D, the MAE in the image plane is expressed in pixels as in Equation (28):

ϵ_{2 D} = \frac{1}{N_{F}^{'} N_{S}^{'}} \sum_{i = 1}^{N_{S}^{'}} \sum_{n = 1}^{N_{F}^{'}} | | {\tilde{\hat{p}}}_{n, i} - {\tilde{r}}_{n, i} | |,

(28)

where

N_{S}^{'}

are

N_{F}^{'}

are the number of sources and frames, respectively, in which the source position is inside the camera’s FoV.

{\tilde{\hat{p}}}_{n, i}

and

{\tilde{r}}_{n, i}

are, respectively, the projection of

{\hat{p}}_{n, i}

and

r_{n, i}

in the image plane.

Moreover, for computing the TRL in 2D it is used a threshold of

1 / 30

of the image diagonal in pixels. As in the 3D case, a fine error metric

ϵ_{2 D}^{'}

is also defined, for this 2D TLR threshold.

When comparing different proposals or experimental conditions, we will also calculate the relative performance improvement in all the evaluated metrics, as follows:

Δ_{A l g 1}^{A l g 2} = 100 \frac{{Metric}_{A l g 1} - {Metric}_{A l g 2}}{{Metric}_{A l g 1}} [%]

(29)

where

A l g 1

and

A l g 2

refer to the algorithms or conditions we are comparing, and

{Metric}_{A l g}

refers to the considered Metric calculated using the corresponding

A l g

. Given that in all the proposed metrics the lower the better, a positive result for

Δ_{A l g 1}^{A l g 2}

implies that

A l g 2

is better than

A l g 1

.

7.3. System Configuration

In AV16.3, the audio signals were resampled up to 96 kHz. Also in this database, each image frame was scaled by 2 to adapt them to the VJ OpenCV [62] templates (

20 \times 20

pixels) when they are far away from the camera. Also, a lens distortion correction was applied.

For both AV16.3 and CAV3D data, the audio signal pre-processing starts with a pre-emphasis filter (

H (z) = 1 - 0.98 z^{- 1}

) to enhance high-frequency content. After filtering a segment of 8192 samples (

\approx 85.3 ms

), flattop weighted windows are applied to the signal, with a window shift value equal to the video frame rate (25 f.p.s. in AV16.3 and 15 f.p.s. in CAV3D). Thus, there is one audio segment associated with each video frame. Moreover, the Fourier filter size has been selected equal to the signal window size (8192).

Regarding the algorithm parameters needed in the proposal as described in Section 5 and Section 6, all of them were tuned on a small subset composed of additional sequences from the AV16.3 dataset, except for the Fg./Bg. Segmentation, in which the tuning was carried out using the extra sequence seq21 from CAV3D.

These are the final values in the experimental setup: The face size in 3D was set to a fixed value of

h_{r}^{3 D} = 17

cm. The appearance-based likelihood threshold

θ_{V J}

, was set to

0.5

. The

γ

face size change factor for slice exploration was set to

γ = 1.2

. The minimum dispersion in the S dimension was fixed to

σ_{s s}^{m i n} = 10

. The color spatiogram likelihood threshold was set to

θ_{c o l} = 0.6

. The Fg./Bg. Segmentation gray scale intensity threshold was set to

θ_{f g} = 80

for AV16.3 and

θ_{f g} = 30

for CAV3D. The acoustic power threshold

θ_{a}

was set to

0.8

. The model parameters

\bar{v} = 1

ms

^{- 1}

and

β = 10

s

^{- 1}

used in [60,63] have shown good results, and thus they are the ones we applied in this work. The PF algorithm used was the SIR with

N_{P} = 1000

particles per target.

8. Results and Discussion

In this section, different results, both quantitative and qualitative, are included to demonstrate the contributions of the GAVT global proposal and its processes in the audiovisual MOT objective.

As mentioned in Section 2, the AV16.3 and CAV3D datasets are used for the experimental work, due to their application in the most important state-of-the-art related works [17,20].

Therefore two different sections are here included to analyze such results: a first one in which the multimodality of the proposal is evaluated and discussed; and a second one in order to compare them with the best proposals in the state-of-the-art on the selected datasets [20].

In the tables within this section, we will provide values for TLR,

ϵ_{2 D}

,

ϵ_{2 D}^{'}

,

ϵ_{3 D}

,

ϵ_{3 D}^{'}

, segregated for SOT and MOT partitions, and their average value. In all cases, we explicitly include the average of the standard deviation of the metric (

\pm σ

), given that we carried out ten runs per sequence due to the probabilistic nature of the GAVT proposal. We will also include information on the modality being used, either audio only (A for short), video only (V for short), or audiovisual (AV for short). Finally, to quickly identify the experimental conditions of each table with results, in the first row we state the dataset used (AV16.3 or CAV3D), and if the metrics are in the image plane (2D) of in the three-dimensional space (3D), also including the modality being used. The comparison between modalities and algorithms will be done using the

Δ_{a l g 1}^{a l g 2}

metric defined in the previous section. In all cases, the best results across metrics will be highlighted with a green background in the corresponding cell.

8.1. Audiovisual Combination Improvements

We first present the contribution of audiovisual tracking versus individual audio-only and video-only modalities on the AV16.3 database sequences. We did not make this comparison in the CAV3D dataset as in that one there are several sequences in which the speakers leave the camera’s FoV for a certain time, so that the video-only modality could not be compared equally with the other two modalities within such dataset.

The results of our GAVT proposal using the audio-only (A), video-only (V) and audiovisual (AV) modalities are shown in Table 3 and Table 4 for the 2D and 3D metrics, respectively, as well as the relative performance improvements comparison from A to AV modalities (

Δ_{A}^{A V}

), and from V to AV (

Δ_{V}^{A V}

).

The obtained results clearly show that for both the SOT and MOT tasks, the audiovisual combination outperforms its monomodal counterparts. As expected, the visual modality is far better than the audio modality, but in all cases, the audiovisual combination contributes to improved results.

In the 2D case, the average MAE is strongly improved when combining the audio and video modalities, with similar improvements for the SOT and MOT tasks (

86.2 %

and

87 %

, respectively), with an average improvement of

86.7 %

. The improvements of the audiovisual modality as compared with the video-only one are still relevant, being

37.6 %

on average. When considering the fine MAE, the audiovisual modality does not improve the visual-only modality, with minor degradation of

- 2.0 %

on average, which is not significant (especially considering the non-linear characteristic of this metric), with errors below

3.35

pixels in all cases.

In the 3D case, the improvements of the audiovisual modality are consistent across all metrics, with MAE of 18 cm on average, and very relevant improvements of up to

50.03 %

as compared with the visual-only modality.

Also as expected, the results for the SOT task are better than those for the MOT task. For example, in the audiovisual modality, with SOT 2D MAE of

5.12

pixels vs. MOT 2D MAE of

7.4

. In the 3D case, the 3D MAE for SOT is 15 cm vs. 21 cm in the MOT task.

For the SOT task, the highest relative improvement in the comparison between audiovisual and audio-only modalities was found for sequence seq11 camera 2, with a

79 %

, and the lowest relative improvement happened for sequence seq08 camera 1 with a

25 %

. We will further discuss these two extreme cases.

The top part of Figure 12 shows the mean (dark line) and standard deviation (light color area) of the 3D error over time for the seq11 sequence camera 2. The bottom part of Figure 12 shows the top view of the tracked trajectories for each of the modalities.

Analyzing the errors in 2D (Table 3), where the tracked positions are reprojected to the camera image plane, we can see that the video-only modality is very accurate, with an error of

5.78

pixels, and close to the audiovisual solution with

5.12

pixels, while the audio-only modality presents a much higher error of

37.9

pixels.

From comparing 2D and 3D errors, we can observe that the video modality errors come principally from the estimation of depth (see the bottom part of Figure 12), so that we can interpret that the proposed multi-pose face model can accurately locate the mouth in the image plane domain.

From the results of sequence seq11 camera 2 in Figure 12, it can be observed that the audio modality tracker presents significant errors for most of the sequence. This error can be observed in the trajectories (see bottom left graphic in Figure 12) where the system overestimates the distance from the target to the microphone array, especially when the target is far from it. The video-only tracker presents a better estimation in the first part of the sequence but underestimates the distance at the end of the sequence (see bottom middle graphic in Figure 12) when the target is farthest from the camera. Observing the audiovisual trajectories (see bottom right graphic in Figure 12), both modalities compensate for each other errors, presenting an intermediate estimation that leads to improved results.

Figure 13 shows the detailed results for sequence seq08 camera 1, where it can be observed that the audio-only modality present low errors during the whole sequence (bottom left graphic), while the video-only modality fails on the estimation of depth with important errors (bottom middle graphic). Although the video-only modality presents more relevant errors than the audio-only one, the audiovisual combination improves again the monomodal tracking options.

This behavior of audiovisual integration usually works successfully even when the video performance is lower than the audio. As an example, the upper graphic of Figure 14 shows the 3D MAE variation along time for sequence seq08 camera 3, where the video modality shows higher errors than the audio-only modality, especially in the first half of the sequence. The audiovisual fusion is again able to obtain results that improve the monomodal versions. The lower graphic of Figure 14 shows another case in which the audio modality presents higher errors than the video modality in the central part of the sequence, and the audiovisual fusion compensates for such behavior.

For the MOT task, the audio modality significantly increases its errors, up to 68 cm on average, while the video modality obtained a result of 39 cm. This performance decrease in the audio tracking modality could be explained by targets being interchanged between them.

Regarding the detailed analysis for the MOT task, Figure 15 shows an example for sequence seq24 camera 3. The top part of Figure 15 shows the mean (dark line) and standard deviation (light color area) of the 3D error over time for the seq24 sequence camera 3. The bottom part of Figure 15 shows the top view of the tracked trajectories for each of the modalities. In this sequence, there were two speakers, so two sets of graphics are shown.

In this case, the audio-only modality (bottom left graphics of Figure 15) only shows good results in the initial part of the sequence. The video-only modality (bottom middle graphics of Figure 15) exhibits good results for speaker 2, but not so good for speaker one, but the combination of both modalities (bottom right graphics of Figure 15) can compensate the errors, achieving good results even for speaker 1.

In the MOT task, the average 3D errors are

40 %

higher compared to the SOT case. We observed that most large errors were associated with instances where one target was lost because of a missing face and the difficulty to recover it, similar to the difficulties encountered in the SOT task. Nevertheless, our occlusion detection and handling mechanisms proved effective in preventing particle interchanges, leading to good results. As an example, in sequence seq22 (see Figure 16), we can see the successful tracking of two targets walking in circular patterns in front of the camera.

We can also find examples of audiovisual integration not being able to successfully compensate for the impact of a modality performing badly. Figure 17 illustrates such a scenario, where the system correctly integrates the audio and video modalities for speaker 1 (top graphic), resulting in improved tracking accuracy. However, for speaker 2 (bottom graphic), the integration improves the results as compared with the video-only modality, but grows worse as compared with the audio-only modality toward the end of the sequence.

8.2. Comparison with the State-of-the-Art

In this section, we will compare our GAVT proposal with that of [17] (which we will refer to as AV3T) that represents state-of-the-art performance in the AV16.3 and CAV3D datasets within our experimental design.

Table 5 and Table 6 show the 2D and 3D performance metrics of the GAVT and AV3T systems on the AV16.3 dataset, for the audio-only (A) and video-only (V) modalities. The tables also include the relative performance improvements of GAVT as compared with AV3T (

Δ_{AV 3 T}^{GAVT}

).

From these results, it is clear that our audio-only method performs worse than that proposed by Qian since we did not use height information from the video (

- 51 %

performance improvement for the average 2D MAE and

- 38 %

for the average 3D MAE).

It’s also clear that our video-only tracker outperforms AV3T in the MAE metrics (

23 %

for both the average 2D and 3D MAEs). This result shows that our proposed pose-dependent visual appearance model works better to localize the mouth than using the generic face detection with a mouth position estimation based on the aspect ratio proposed in [17].

Table 7 and Table 8 show again, for the AV16.3 dataset, the 2D and 3D performance metrics for the audiovisual modality (AV) of the GAVT and the AV3T systems, as well as the improvements of our proposal against AV3T (

Δ_{AV 3 T}^{GAVT}

).

Our audiovisual method GAVT presents significant improvements as compared with AV3T, for both the 2D and 3D metrics, and for both the SOT and MOT tasks. More important error reductions were found in the 2D MAE metrics (up to almost

70 %

relative improvement in 2D MAE), while the average relative improvement in 3D MAE is

4.2 %

at the expense of a performance decrease of

- 9.4 %

for the 3D TLR.

The results on the CAV3D dataset for the full system with audio and video integration are presented in Table 9 and Table 10. The performance improvements we achieve are clear for all the 3D metrics and for both the SOT and MOT tasks, reaching an average of

11.5 %

in the 3D MAE. However, for the 2D metrics, we are only able to improve the average MAE metrics, up to

2.6 %

in 2D MAE, with improvements in MOT, but not in SOT. It is worth mentioning here that the most relevant metric for our purposes is the 3D MAE, as we are interested in precise 3D localization in the given environment.

It is worth mentioning that our method does not use a face detector and in a context where speakers do not look at the camera for a while, go out of the FoV, or there may be no visible face, particles may move away from the actual target (lose the target) and it is more difficult to recover the location of the speaker. In other words, without a face detector, which scans each frame of the entire image, it is more difficult to recover from a target loss. Despite that, our proposal successfully solves the audiovisual 3D localization task, improving the AV3T system in 3D MAE performance for both the AV16.3 and CAV3D datasets.

8.3. Limitations of the `GAVT` Proposal

We consider that the two databases, AV16.3 and CAV3D, cover a wide range of situations that can be encountered in an intelligent space (different environment sizes, sensor configurations, camera resolutions, speakers moving in and out of the camera’s FoV, etc.). However, there are characteristics of GAVT that may limit its performance in additional situations that may arise in a real-world context.

We can first refer to the presence of different head sizes, which will impact the estimation of the distance from cameras to the speaker: For example, smaller heads (such as children’s) will lead to an overestimation of this distance (given our assumption of an average adult head size).

We can also consider the effect of larger environments, which may first increase the number of possible speakers, thus increasing the difficulty of the task; and second increase the occurrence of situations in which any speaker will not be visible in the camera’s FoV, mainly if the remain silent for long periods of time. In these two situations, the algorithm will have no information to track for a long time, and the particles may drift in the direction from the speaker’s last known position. In these cases, a detection mechanism would be necessary, based on audio and/or video information, to reinsert particles once the speaker starts talking or appears in the camera’s FoV. Larger environments and a higher number of speakers will also require an increase in the particle number, which at its time, may lead to a relevant increase in computational complexity.

9. Conclusions and Future Work

In this paper, we have proposed a robust and precise system to track a known number of multiple speakers in a 3D smart space, combining audio and video information. The system uses particle filters with an ad hoc designed audiovisual probabilistic observation model. The visual likelihood model is based on VJ detectors with a pose-dependent strategy that improves the mouth location estimation in 2D and 3D. Additionally, we adopt a specific mechanism to handle MOT tasks and avoid target interference by exploiting the likelihood dispersion effect. The audio likelihood model uses a probabilistic version of SRP, adopting a refined peak selection strategy to avoid target interference, based on the joint distribution of all pairs of microphones. The final fusion model assumes statistical independence of both modalities, so that the audiovisual probability results from the product of audio and video probability density functions.

In the AV16.3 dataset, our audiovisual system proposal shows average relative improvements in 2D mouth localization, of

86.7 %

and

37.6 %

over the audio and visual counterparts, respectively. In 3D localization, the improvements were

66.1 %

and

50.3 %

. This demonstrates that the proposed audiovisual likelihood combination significantly improves the monomodal counterparts tracking results. When compared to the state-of-the-art, in 2D and 3D metrics, the proposed system presents improved results in the visual and audiovisual modalities for both the SOT and MOT tasks. In the AV16.3 dataset, the 2D average relative improvement is

23 %

for the visual modality and

69.7 %

for the audiovisual case, while the 3D improvements are

23 %

for the visual modality and

4.2 %

for the audiovisual one. For the audiovisual modality and the CAV3D dataset, the 2D average improvement is

2.6 %

, while the 3D improvements are

11.5 %

, rising to

18.1 %

in the more difficult MOT task. The better results in 2D show that the proposed pose-dependent face model gives a better adapted likelihood of finding the mouth inside the face, in the image plane, as compared with state-of-the-art proposals. The 3D results are consistently better.

The most important errors in the experiments here described are derived from bad depth estimations and target recovery with the visual likelihood model after a target is lost. As a global conclusion, the audiovisual system here proposed and described has been demonstrated to successfully handle occlusions for MOT tasks, and significantly improve state-of-the-art results in a challenging audiovisual tracking context like CAV3D.

In future work, we plan to focus on new alternatives to decrease the depth estimation errors, being the main error source. For this purpose, we plan to improve the head pose information. Rather than a discrete set of possible poses, we will evaluate a continuous estimation of the three head pose angles using recent deep learning-based head pose estimators. The challenge in this case will be the use of head pose estimation algorithms in low-resolution faces, present in the AV16.3 sequences. We will also combine the proposed likelihood model with a face detector, to solve target losses. To deal with long target losses, we will include a birth and death hypotheses mechanism that will also help to handle tracking a variable and an unknown number of speakers.

Author Contributions

Conceptualization, F.S.-M., M.M.-R. and J.M.-G.; Funding acquisition, M.M.-R. and J.M.-G.; Investigation, F.S.-M., M.M.-R. and J.M.-G.; Methodology, F.S.-M., M.M.-R. and J.M.-G.; Resources, M.M.-R. and J.M.-G.; Software, F.S.-M.; Writing—original and draft, F.S.-M.; Writing—review and editing, F.S.-M., M.M.-R. and J.M.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by the Spanish Ministry of Science and Innovation MICINN/AEI/10.13039/501100011033 under projects EYEFUL-UAH (PID2020-113118RB-C31) and ATHENA (PID2020-115995RB-I00), by CAM under project CONDORDIA (CM/JIN/2021-015), and by UAH under projects ARGOS+ (PIUAH21/IA-016) and METIS (PIUAH22/IA-037).

Institutional Review Board Statement

Ethical review and approval were waived for this study, as we have used two datasets that were made available by their authors to the scientific community.

Informed Consent Statement

We did not require users consent as we have used two datasets that were made available by their authors to the scientific community, so that we were not involved in the data acquisition processes.

Data Availability Statement

Publicly available datasets were used in this study. The AV16.3 data can be found at https://www.idiap.ch/en/dataset/av16-3 and the CAV3D data can be requested at https://speechtek.fbk.eu/cav3d-dataset.

Acknowledgments

The authors would like to thank IDIAP and FBK for providing the AV16.3 and CAV3D datasets.

Conflicts of Interest

The authors declare no conflict of interest.

References

Leotta, F.; Mecella, M. PLaTHEA: A marker-less people localization and tracking system for home automation. Softw. Pract. Exper. 2015, 45, 801–835. [Google Scholar] [CrossRef]
Sanabria-Macías, F.; Romera, M.M.; Macías-Guarasa, J.; Pizarro, D.; Turnes, J.N.; Reyes, E.J.M. Face tracking with a probabilistic Viola and Jones face detector. In Proceedings of the IECON 2019-45th Annual Conference of the IEEE Industrial Electronics Society, Lisbon, Portugal, 14–17 October 2019; Volume 1, pp. 5616–5621. [Google Scholar]
Byeon, M.; Lee, M.; Kim, K.; Choi, J.Y. Variational inference for 3-D localization and tracking of multiple targets using multiple cameras. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3260–3274. [Google Scholar] [CrossRef] [PubMed]
Tian, Y.; Chen, Z.; Yin, F. Distributed IMM-unscented Kalman filter for speaker tracking in microphone array networks. IEEE/ACM Trans. Audio Speech Lang. Process 2015, 23, 1637–1647. [Google Scholar] [CrossRef]
Su, D.; Vidal-Calleja, T.; Miro, J.V. Towards real-time 3D sound sources mapping with linear microphone arrays. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Marina Bay Sands, Singapore, 29 May–3 June 2017; pp. 1662–1668. [Google Scholar]
Grondin, F.; Michaud, F. Lightweight and optimized sound source localization and tracking methods for open and closed microphone array configurations. Rob. Auton. Syst. 2019, 113, 63–80. [Google Scholar] [CrossRef] [Green Version]
Seymer, P.; Wijesekera, D. Implementing fair wireless interfaces with off-the-shelf hardware in smart spaces. In Proceedings of the 2017 International Conference on Internet Computing (ICOMP), Las Vegas, NV, USA, 17–20 July 2017; pp. 79–85. [Google Scholar]
Yang, D.; Xu, B.; Rao, K.; Sheng, W. Passive infrared (PIR)-based indoor position tracking for smart homes using accessibility maps and a-star algorithm. Sensors 2018, 18, 332. [Google Scholar] [CrossRef] [Green Version]
Vaščák, J.; Savko, I. Radio beacons in indoor navigation. In Proceedings of the 2018 World Symposium on Digital Intelligence for Systems and Machines (DISA), Kosice, Slovakia, 23–25 August 2018; pp. 283–288. [Google Scholar]
Tsiami, A.; Filntisis, P.P.; Efthymiou, N.; Koutras, P.; Potamianos, G.; Maragos, P. Far-field audio-visual scene perception of multi-party human-robot interaction for children and adults. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6568–6572. [Google Scholar]
Gebru, I.D.; Alameda-Pineda, X.; Forbes, F.; Horaud, R. EM algorithms for weighted-data clustering with application to audio-visual scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2402–2415. [Google Scholar] [CrossRef] [Green Version]
Chu, P.L.; Feng, J.; Sai, K. Automatic Camera Selection for Video Conferencing. U.S. Patent 9,030,520, 12 May 2015. [Google Scholar]
Li, G.; Liang, S.; Nie, S.; Liu, W.; Yang, Z. Deep neural network-based generalized sidelobe canceller for dual-channel far-field speech recognition. Neural Netw. 2021, 141, 225–237. [Google Scholar] [CrossRef] [PubMed]
Subramanian, A.S.; Weng, C.; Watanabe, S.; Yu, M.; Yu, D. Deep learning based multi-source localization with source splitting and its effectiveness in multi-talker speech recognition. Comput. Speech Lang. 2022, 75, 101360. [Google Scholar] [CrossRef]
Tourbabin, V.; Rafaely, B. Speaker localization by humanoid robots in reverberant environments. In Proceedings of the 2014 IEEE 28th Convention of Electrical & Electronics Engineers in Israel (IEEEI), Eilat, Israel, 3–5 December 2014; pp. 1–5. [Google Scholar]
Lopatka, K.; Kotus, J.; Czyzewski, A. Detection, classification and localization of acoustic events in the presence of background noise for acoustic surveillance of hazardous situations. Multimed. Tools Appl. 2016, 75, 10407–10439. [Google Scholar] [CrossRef] [Green Version]
Qian, X.; Brutti, A.; Lanz, O.; Omologo, M.; Cavallaro, A. Multi-speaker tracking from an audio–visual sensing device. IEEE Trans. Multimed. 2019, 21, 2576–2588. [Google Scholar] [CrossRef]
Anuj, L.; Krishna, M.G. Multiple camera based multiple object tracking under occlusion: A survey. In Proceedings of the 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bengaluru, India, 24–25 February 2017; pp. 432–437. [Google Scholar]
Kılıç, V.; Barnard, M.; Wang, W.; Kittler, J. Audio constrained particle filter based visual tracking. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 3627–3631. [Google Scholar]
Qian, X.; Brutti, A.; Omologo, M.; Cavallaro, A. 3d audio-visual speaker tracking with an adaptive particle filter. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2896–2900. [Google Scholar]
Gebru, I.D.; Evers, C.; Naylor, P.A.; Horaud, R. Audio-visual tracking by density approximation in a sequential Bayesian filtering framework. In Proceedings of the 2017 Hands-free Speech Communications and Microphone Arrays (HSCMA), San Francisco, CA, USA, 1–3 March 2017; pp. 71–75. [Google Scholar]
Liu, H.; Li, Y.; Yang, B. 3D audio-visual speaker tracking with a two-layer particle filter. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1955–1959. [Google Scholar]
Sanabria-Macias, F.; Marron-Romera, M.; Macias-Guarasa, J. 3D Audiovisual Speaker Tracking with Distributed Sensors Configuration. In Proceedings of the 2020 European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, 18–22 January 2021; pp. 256–260. [Google Scholar]
Qian, X.; Madhavi, M.; Pan, Z.; Wang, J.; Li, H. Multi-target doa estimation with an audio-visual fusion mechanism. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2021; pp. 4280–4284. [Google Scholar]
Xiong, Z.; Liu, H.; Zhou, Y.; Luo, Z. Multi-speaker tracking by fusing audio and video information. In Proceedings of the 2021 IEEE Statistical Signal Processing Workshop (SSP), Virtual, 11–14 July 2021; pp. 321–325. [Google Scholar]
Liu, H.; Sun, Y.; Li, Y.; Yang, B. 3D audio-visual speaker tracking with a novel particle filter. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milano, Italy, 10–15 January 2021; pp. 7343–7348. [Google Scholar]
Qian, X.; Brutti, A.; Lanz, O.; Omologo, M.; Cavallaro, A. Audio-Visual Tracking of Concurrent Speakers. IEEE Trans. Multimed. 2022, 24, 942–954. [Google Scholar] [CrossRef]
Qian, X.; Liu, Q.; Wang, J.; Li, H. Three-dimensional Speaker Localization: Audio-refined Visual Scaling Factor Estimation. IEEE Signal Process. Lett. 2021, 28, 1405–1409. [Google Scholar] [CrossRef]
Zhao, J.; Wu, P.; Liu, X.; Goudarzi, S.; Liu, H.; Xu, Y.; Wang, W. Audio Visual Multi-Speaker Tracking with Improved GCF and PMBM Filter. Proc. Interspeech 2022, 2022, 3704–3708. [Google Scholar]
Qian, X.; Wang, Z.; Wang, J.; Guan, G.; Li, H. Audio-Visual Cross-Attention Network for Robotic Speaker Tracking. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 31, 550–562. [Google Scholar] [CrossRef]
Zhao, J.; Wu, P.; Liu, X.; Xu, Y.; Mihaylova, L.; Godsill, S.; Wang, W. Audio-visual tracking of multiple speakers via a pmbm filter. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 5068–5072. [Google Scholar]
Li, Y.; Liu, H.; Tang, H. Multi-modal perception attention network with self-supervised learning for audio-visual speaker tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 Feburary–1 March 2022; Volume 36, pp. 1456–1463. [Google Scholar]
Kılıç, V.; Barnard, M.; Wang, W.; Hilton, A.; Kittler, J. Audio informed visual speaker tracking with SMC-PHD filter. In Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME), Turin, Italy, 29 June–3 July 2015; pp. 1–6. [Google Scholar]
Barnard, M.; Koniusz, P.; Wang, W.; Kittler, J.; Naqvi, S.M.; Chambers, J. Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling. IEEE Trans. Multimed. 2014, 16, 864–880. [Google Scholar] [CrossRef]
Shi, Z.; Zhang, L.; Wang, D. Audio-Visual Sound Source Localization and Tracking Based on Mobile Robot for The Cocktail Party Problem. Appl. Sci. 2023, 13, 6056. [Google Scholar] [CrossRef]
Zhou, H.; Taj, M.; Cavallaro, A. Target detection and tracking with heterogeneous sensors. IEEE J. Sel. Top. Signal Process. 2008, 2, 503–513. [Google Scholar] [CrossRef]
Brutti, A.; Lanz, O. A joint particle filter to track the position and head orientation of people using audio visual cues. In Proceedings of the 2010 European Signal Processing Conference (EUSIPCO), Aalborg, Denmark, 23–27 August 2010; pp. 974–978. [Google Scholar]
Gebru, I.D.; Ba, S.; Li, X.; Horaud, R. Audio-visual speaker diarization based on spatiotemporal bayesian fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1086–1099. [Google Scholar] [CrossRef] [Green Version]
Li, Y.; Liu, H.; Yang, B.; Ding, R.; Chen, Y. Deep Metric Learning-Assisted 3D Audio-Visual Speaker Tracking via Two-Layer Particle Filter. Complexity 2020, 2020, 1–8. [Google Scholar] [CrossRef]
Wilson, J.; Lin, M.C. Avot: Audio-visual object tracking of multiple objects for robotics. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Virtual, 31 May–31 August 2020; pp. 10045–10051. [Google Scholar]
Ban, Y.; Alameda-Pineda, X.; Girin, L.; Horaud, R. Variational bayesian inference for audio-visual tracking of multiple speakers. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1761–1776. [Google Scholar] [CrossRef]
Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. I–511. [Google Scholar]
Wu, X.; He, R.; Sun, Z. A lightened CNN for deep face representation. arXiv 2015, arXiv:1511.02683. [Google Scholar]
Sanabria-Macías, F.; Maranón-Reyes, E.; Soto-Vega, P.; Marrón-Romera, M.; Macias-Guarasa, J.; Pizarro-Perez, D. Face likelihood functions for visual tracking in intelligent spaces. In Proceedings of the IECON 2013—39th Annual Conference of the IEEE Industrial Electronics Society, Vienna, Austria, 10–13 November 2013; pp. 7825–7830. [Google Scholar]
AV16.3: An Audio-Visual Corpus for Speaker Localization and Tracking. Available online: https://www.idiap.ch/en/dataset/av16-3 (accessed on 29 July 2023).
Awad-Alla, M.; Hamdy, A.; Tolbah, F.A.; Shahin, M.A.; Abdelaziz, M. A two-stage approach for passive sound source localization based on the SRP-PHAT algorithm. APSIPA Trans. Signal Inf. Process. 2020, 9, e8. [Google Scholar] [CrossRef] [Green Version]
Marti, A.; Cobos, M.; Lopez, J.J.; Escolano, J. A steered response power iterative method for high-accuracy acoustic source localization. J. Acoust. Soc. Am. 2013, 134, 2627–2630. [Google Scholar] [CrossRef] [PubMed]
Velasco, J.; Martín-Arguedas, C.J.; Macias-Guarasa, J.; Pizarro, D.; Mazo, M. Proposal and validation of an analytical generative model of SRP-PHAT power maps in reverberant scenarios. Signal Process. 2016, 119, 209–228. [Google Scholar] [CrossRef]
Diaz-Guerra, D.; Miguel, A.; Beltran, J.R. Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 300–311. [Google Scholar] [CrossRef]
Vera-Diaz, J.M.; Pizarro, D.; Macias-Guarasa, J. Acoustic source localization with deep generalized cross correlations. Signal Process. 2021, 187, 108169. [Google Scholar] [CrossRef]
DiBiase, J.H.; Silverman, H.F.; Brandstein, M.S. Robust localization in reverberant rooms. In Microphone Arrays; Springer: Berlin/Heidelberg, Germany, 2001; pp. 157–180. [Google Scholar]
Dmochowski, J.P.; Benesty, J.; Affes, S. A Generalized Steered Response Power Method for Computationally Viable Source Localization. IEEE Audio Speech Lang. Process. 2007, 15, 2510–2526. [Google Scholar] [CrossRef]
Do, H.; Silverman, H.F. SRP-PHAT methods of locating simultaneous multiple talkers using a frame of microphone array data. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, USA, 5–19 March 2010; pp. 125–128. [Google Scholar]
Cobos, M.; Marti, A.; Lopez, J.J. A Modified SRP-PHAT Functional for Robust Real-Time Sound Source Localization With Scalable Spatial Sampling. IEEE Signal Process. Lett. 2011, 18, 71–74. [Google Scholar] [CrossRef]
Oualil, Y.; Faubel, F.; Doss, M.M.; Klakow, D. A TDOA Gaussian mixture model for improving acoustic source tracking. In Proceedings of the 2012 European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 1339–1343. [Google Scholar]
Ziomek, L. Fundamentals of Acoustic Field Theory and Space-Time Signal Processing; CRC Press: Boca Raton, FL, USA, 2020. [Google Scholar]
Arulampalam, M.S.; Maskell, S.; Gordon, N.; Clapp, T. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 2002, 50, 174–188. [Google Scholar] [CrossRef] [Green Version]
Hol, J.D.; Schon, T.B.; Gustafsson, F. On resampling algorithms for particle filters. In Proceedings of the 2006 IEEE Nonlinear Statistical Signal Processing Workshop, Cambridge, UK, 13–15 September 2006; pp. 79–82. [Google Scholar]
Wu, K.; Khong, A.W. Acoustic source tracking in reverberant environment using regional steered response power measurement. In Proceedings of the 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Kaohsiung, Taiwan, 29 October–1 November 2013; pp. 1–6. [Google Scholar]
Zhong, X. Bayesian Framework for Multiple Acoustic Source Tracking. PhD Thesis, University of Edinburgh, Edinburgh, UK, 2010. [Google Scholar]
Lathoud, G.; Magimai-Doss, M. A sector-based, frequency-domain approach to detection and localization of multiple speakers. In Proceedings of the 2005 IEEE International Conference on Acoustics, Speech and Signal Processing, Philadelphia, PA, USA, 23 March 2005; Volume 3, pp. 265–268. [Google Scholar]
OpenCV Processing Library. Available online: http://opencv.org/ (accessed on 29 July 2023).
Ward, D.B.; Lehmann, E.A.; Williamson, R.C. Particle Filtering Algorithms for Tracking an Acoustic Source in a Reverberant Environment. IEEE Trans. Speech Audio Process. 2003, SAP-11, 826–836. [Google Scholar] [CrossRef]

Figure 1. Particle filter general scheme for multiple speakers audiovisual tracking.

Figure 2. Video observation model: VJ Likelihood (light blue) Color-based Likelihood (light red) and Fg. Likelihood (light green) blocks.

Figure 3. Projections of each mouth hypothesis

{p_{n, o}^{i}}

in the 3D WCS to its corresponding 2D BB in the FOS

{{\tilde{p}}_{n, o}^{i}}

.

Figure 3. Projections of each mouth hypothesis

{p_{n, o}^{i}}

in the 3D WCS to its corresponding 2D BB in the FOS

{{\tilde{p}}_{n, o}^{i}}

.

Figure 4. Face templates for the right profile, frontal, and left profile poses.

Figure 5. Face templates with in-plane rotations with

α = 15^{\circ}

(counterclockwise, the three on the left, and clockwise the three on the right).

Figure 5. Face templates with in-plane rotations with

α = 15^{\circ}

(counterclockwise, the three on the left, and clockwise the three on the right).

Figure 6. Face likelihood model response for different poses: right at the top row, frontal at the middle row, and left at the bottom row. From left to right in each row:

Ω^{F_{0}} ({\tilde{p}}_{n, o}^{i}) Ω^{R_{0}} ({\tilde{p}}_{n, o}^{i}) Ω^{L_{0}} ({\tilde{p}}_{n, o}^{i}) Ω^{F_{- α}} ({\tilde{p}}_{n, o}^{i}) Ω^{R_{- α}} ({\tilde{p}}_{n, o}^{i}) Ω^{L_{- α}} ({\tilde{p}}_{n, o}^{i}) Ω^{F_{+ α}} ({\tilde{p}}_{n, o}^{i}) Ω^{R_{+ α}} ({\tilde{p}}_{n, o}^{i}) Ω^{L_{+ α}} ({\tilde{p}}_{n, o}^{i})

.

Figure 6. Face likelihood model response for different poses: right at the top row, frontal at the middle row, and left at the bottom row. From left to right in each row:

Ω^{F_{0}} ({\tilde{p}}_{n, o}^{i}) Ω^{R_{0}} ({\tilde{p}}_{n, o}^{i}) Ω^{L_{0}} ({\tilde{p}}_{n, o}^{i}) Ω^{F_{- α}} ({\tilde{p}}_{n, o}^{i}) Ω^{R_{- α}} ({\tilde{p}}_{n, o}^{i}) Ω^{L_{- α}} ({\tilde{p}}_{n, o}^{i}) Ω^{F_{+ α}} ({\tilde{p}}_{n, o}^{i}) Ω^{R_{+ α}} ({\tilde{p}}_{n, o}^{i}) Ω^{L_{+ α}} ({\tilde{p}}_{n, o}^{i})

.

Figure 7. Face likelihoods response at different 2D BB sizes (

S_{n, o}^{i}

) in pixels. (Left): likelihood maps. (Right): 3D plot of the likelihoods in the FOS

(u_{n, o}^{i}, v_{n, o}^{i}, S_{n, o}^{i})

.

Figure 7. Face likelihoods response at different 2D BB sizes (

S_{n, o}^{i}

) in pixels. (Left): likelihood maps. (Right): 3D plot of the likelihoods in the FOS

(u_{n, o}^{i}, v_{n, o}^{i}, S_{n, o}^{i})

.

Figure 8. VJ Likelihood Model: Predicted particles (red dots) projected on the image (left), the three slices in the S dimension (middle) with the particles (red dots), and the estimated Gaussian (green blob) along with particles (red dots) in the FOS (right).

Figure 9. Occlusion situation. Image with particles’ projection (left). Top view of 3D position-related particles (middle). The same with standard deviation circles (right). For target one the particles are in red color, and the standard deviation circle in magenta. For target two the particles are in green color, and the standard deviation circle in cyan.

Figure 10. General scheme of the audio likelihood model.

Figure 11. Gaussian selection. In the left graphic:

G C C_{π_{16}} (τ)

in black, and

{\hat{G C C}}_{π_{16}} (τ)

in blue. In the right graphic: selected Gaussians for target 1 (red) and target 2 (green). The GMM model was generated with

N_{J} = 7

.

Figure 11. Gaussian selection. In the left graphic:

G C C_{π_{16}} (τ)

in black, and

{\hat{G C C}}_{π_{16}} (τ)

in blue. In the right graphic: selected Gaussians for target 1 (red) and target 2 (green). The GMM model was generated with

N_{J} = 7

.

Figure 12. Detailed results for seq11 camera 2: Mean and standard deviation of error over time (top graphic), and top view of the speaker trajectory (bottom graphics). Green: audio only, blue: video only, red: audiovisual, black: ground truth).

Figure 13. Detailed results for seq08 camera 1: Mean and standard deviation of error over time (top graphic), and top view of the speaker trajectory (bottom graphics). Green: audio only, blue: video only, red: audiovisual, black: ground truth). The dotted circle represents the microphone array.

Figure 14. Additional detailed results for sequences seq08 camera 3 and seq12 camera 1: Mean and standard deviation of error over time. Green: audio only, blue: video only, red: audiovisual, black: ground truth).

Figure 15. Detailed results for sequence seq24 camera 3: Mean and standard deviation of error over time for the tracking of two speakers (top graphics), and top view of the 3D trajectories of both speakers (bottom graphics). Green: audio only, blue: video only, red: audiovisual, black: ground truth. Speaker one graphics are above those of speaker two. The dotted circle represents the microphone array.

Figure 16. Top view of the 3D trajectories for the two speakers in the seq22 camera 5 (red: audiovisual, black: ground truth). The dotted circle represents the microphone array.

Figure 17. Detailed results for sequence seq30 camera 3: Mean and standard deviation of error over time. Top: speaker 1; Bottom: speaker 2. Green: audio only, blue: video only, red: audiovisual, black: ground truth).

Table 1. Brief descriptions of the sequences in the AV16.3 dataset.

Type	Sequence ID	Description	Time Length (MM:SS)
SOT	`seq08`	Moving speaker facing the array, walking backward and forward once	00:22.28
SOT	`seq11`	Moving speaker facing the array, doing random motions	00:30.76
SOT	`seq12`	Moving speaker facing the array, doing random motions	00:48.16
		Total SOT	01:41.20
MOT	`seq18`	Two moving speakers facing the array, speaking continuously, growing closer to each other, then further apart twice, the first time seated and the second time standing	00:57.00
MOT	`seq19`	Two standing speakers facing the array, speaking continuously, growing closer to each other, then further from each other	00:22.8
MOT	`seq24`	Two moving speakers facing the array, speaking continuously, walking backward and forward, each speaker starting from the opposite side and occluding the other one once	00:47.96
MOT	`seq25`	Two moving speakers greeting each other, discussing, then parting, without occluding each other	00:55.72
MOT	`seq30`	Two moving speakers, both speaking continuously, walking backward and forward once, one behind the other at a constant distance	00:22.04
		Total MOT	03:56.28
		TOTAL ALL	05:37.48

Table 2. Brief descriptions of the sequences in the CAV3D dataset.

Type	Sequence ID	Description	Time Length (MM:SS)
SOT	`seq06`	Female moving along a reduced area inside the room.	00:52.10
SOT	`seq07`	Male moving along a reduced area inside the room	00:58.97
SOT	`seq08`	Male moving along a reduced area inside the room	01:10.02
SOT	`seq09`	Male moving along a reduced area inside the room within noisy situations	00:50.43
SOT	`seq10`	Male moving along a reduced area inside the room within noisy situations	00:50.05
SOT	`seq11`	Male moving along a reduced area inside the room within clapping/noising and bending/sitting situations	01:10.61
SOT	`seq12`	Male moving along a reduced area inside the room within clapping/noising and bending/sitting situations	01:28.87
SOT	`seq13`	Male moving along the whole room	01:24.22
SOT	`seq20`	Male moving along a reduced area inside the room	00:46.46
		Total SOT	9:31.73
MOT	`seq22`	A female and a male moving along the room simultaneously speaking	00:39.55
MOT	`seq23`	A female and a male moving along the room simultaneously speaking	01:04.04
MOT	`seq24`	A female and a male moving along the room simultaneously speaking	01:09.46
MOT	`seq25`	Two females and a male moving along the room simultaneously speaking	01:02.46
MOT	`seq26`	Two females and a male moving along the room simultaneously speaking	00:36.48
		Total MOT	04:31.99
		TOTAL ALL	14:03.72

Table 3. Performance scores of GAVT on AV16.3. Average TLR [%] and MAE in 2D [pixels] per modality (A: audio-only, V: video-only, and AV: audiovisual), and summary comparison (

Δ_{A}^{A V}

: improvement from A to AV,

Δ_{V}^{A V}

: improvement from V to AV). The best results across metrics are highlighted with a green background.

Table 3. Performance scores of GAVT on AV16.3. Average TLR [%] and MAE in 2D [pixels] per modality (A: audio-only, V: video-only, and AV: audiovisual), and summary comparison (

Δ_{A}^{A V}

: improvement from A to AV,

Δ_{V}^{A V}

: improvement from V to AV). The best results across metrics are highlighted with a green background.

		`GAVT` on AV16.3 in 2D (A, V, AV and Comparison)
		A	V	AV	$Δ_{A}^{AV}$	$Δ_{V}^{AV}$
SOT	TLR	65.35 ± 3.50	6.42 ± 3.54	5.19 ± 2.55	92.1%	19.2%
	$ϵ_{2 D}$	37.09 ± 2.53	5.78 ± 2.11	5.12 ± 1.42	86.2%	11.4%
	$ϵ_{2 D}^{'}$	8.98 ± 0.41	3.35 ± 0.12	3.35 ± 0.10	62.6%	−0.1%
MOT	TLR	74.56 ± 6.87	20.19 ± 7.49	9.55 ± 7.26	87.2%	52.7%
	$ϵ_{2 D}$	57.01 ± 12.12	14.29 ± 7.08	7.40 ± 5.46	87.0%	48.2%
	$ϵ_{2 D}^{'}$	8.39 ± 0.80	3.15 ± 0.33	3.27 ± 0.32	61.0%	−4.0%
Avg	TLR	69.95 ± 5.18	13.31 ± 5.51	7.37 ± 4.91	89.5%	44.6%
	$ϵ_{2 D}$	47.05 ± 7.32	10.03 ± 4.60	6.26 ± 3.44	86.7%	37.6%
	$ϵ_{2 D}^{'}$	8.69 ± 0.60	3.25 ± 0.22	3.31 ± 0.21	61.9%	−2.0%

Table 4. Performance scores of GAVT on AV16.3. Average TLR [%] and MAE in 3D [m] per modality (A: audio-only, V: video-only, and AV: audiovisual), and summary comparison (

Δ_{A}^{A V}

: improvement from A to AV,

Δ_{V}^{A V}

: improvement from V to AV). The best results across metrics are highlighted with a green background.

Table 4. Performance scores of GAVT on AV16.3. Average TLR [%] and MAE in 3D [m] per modality (A: audio-only, V: video-only, and AV: audiovisual), and summary comparison (

Δ_{A}^{A V}

: improvement from A to AV,

Δ_{V}^{A V}

: improvement from V to AV). The best results across metrics are highlighted with a green background.

		`GAVT` on AV16.3 in 3D (A, V, AV and Comparison)
		A	V	AV	$Δ_{A}^{AV}$	$Δ_{V}^{AV}$
SOT	TLR	43.80 ± 2.78	46.76 ± 6.14	12.05 ± 3.84	72.5%	74.2%
	$ϵ_{3 D}$	0.37 ± 0.02	0.33 ± 0.04	0.15 ± 0.01	59.4%	54.6%
	$ϵ_{3 D}^{'}$	0.17 ± 0.01	0.15 ± 0.01	0.10 ± 0.01	38.8%	29.3%
MOT	TLR	59.88 ± 9.07	46.15 ± 9.58	19.78 ± 11.0	67.0%	57.1%
	$ϵ_{3 D}$	0.68 ± 0.15	0.39 ± 0.09	0.21 ± 0.08	69.7%	46.7%
	$ϵ_{3 D}^{'}$	0.16 ± 0.02	;0.15 ± 0.02	0.11 ± 0.01	26.1%	24.2%
Avg	TLR	51.84 ± 5.93	46.46 ± 7.86	15.92 ± 7.43	69.3%	65.7%
	$ϵ_{3 D}$	0.52 ± 0.09	0.36 ± 0.06	0.18 ± 0.12	66.1%	50.3%
	$ϵ_{3 D}^{'}$	0.16 ± 0.01	0.15 ± 0.02	0.11 ± 0.01	32.7%	26.7%

Table 5. Performance scores of AV3T and GAVT on AV16.3. Average TLR [%] and MAE in 2D [pixels] for the audio-only (A) and video-only (V) modalities, and summary improvements comparison of GAVT to AV3T (

Δ_{AV 3 T}^{GAVT}