Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates

This paper presents a novel approach for indoor acoustic source localization using microphone arrays, based on a Convolutional Neural Network (CNN). In the proposed solution, the CNN is designed to directly estimate the three-dimensional position of a single acoustic source using the raw audio signal as the input information, avoiding the use of hand-crafted audio features. Given the limited amount of available localization data, we propose a training strategy based on two steps. We first train our network using semi-synthetic data generated from close-talk speech recordings, in which we simulate the time delays and distortion suffered by the signal as it propagates from the source to the array of microphones. We then fine tune this network using a small amount of real data. Our experimental results, evaluated on a publicly available dataset recorded in a real room, show that this approach produces networks that significantly improve on existing localization methods based on SRP-PHAT strategies and also on those presented in very recent proposals based on Convolutional Recurrent Neural Networks (CRNN). In addition, our experiments show that the performance of our CNN method does not depend noticeably on the speaker's gender or on the size of the signal window being used.


Introduction
Research and development in advanced perceptual systems has grown notably during the last decades, and has experienced a tremendous rise in recent years due to the availability of increasingly sophisticated sensors, the use of computing nodes with ever higher computational power, and the advent of powerful algorithmic strategies based on deep learning (all of them actually entering the mass consumer market). The aim of perceptual systems is to automatically analyze complex and rich information taken from different sensors, in order to obtain refined information on the sensed environment and the activities being carried out within it. The scientific work in this area covers research topics from basic sensor technologies to signal processing and pattern recognition, and opens the path to systems able to analyze human activities, providing them with advanced interaction capabilities and services. In this context, localization of humans (the most interesting element for perceptual systems) is a fundamental task that needs to be addressed so that the systems can actually start to provide higher level information on the activities being carried out. Without precise localization, further advanced interaction between humans and their physical environment cannot be carried out successfully.
The scientific community has devoted a huge amount of effort to building robust and reliable indoor localization systems based on different sensors [1][2][3]. Non-invasive technologies are preferred in this context, so that no electronic or passive devices need to be carried by humans for localization. The two non-invasive technologies that have been mainly used in indoor localization are those based on video systems and acoustic sensors.
This paper focuses on audio-based localization, with no prior assumptions about the acoustic signal characteristics or the physical environment, apart from the fact that unknown wide-band audio sources (e.g. human voice) are captured by a set of microphone arrays placed in known positions. The main objective of the paper is to directly use the signals captured by the microphone arrays to automatically obtain the position of the acoustic source detected in the given environment.
Even though there are many proposals in this area, Acoustic Source Localization (ASL) is still a hot research topic. This paper proposes a convolutional neural network (CNN) architecture that is trained end-to-end to solve the acoustic localization problem. To our knowledge, this is the first work in the literature that does not provide the network with feature vectors extracted from the speech signals, but directly uses the speech signal. Avoiding hand-crafted features has been shown to increase the accuracy of classification and regression methods based on convolutional neural networks in other fields, such as computer vision [4,5].
Our proposal is evaluated using both semi-synthetic and real data, outperforming traditional solutions based on Steered Response Power (SRP) [6], that are still the basis of state-of-the-art systems [7][8][9][10].
The rest of the paper is organized as follows. Section 2 reviews the state of the art in acoustic source localization, with special emphasis on the use of deep learning approaches. Section 3 describes the CNN based proposal, with details on the training and fine tuning strategies. The experimental work is detailed in Section 4, and Section 5 summarizes the main conclusions and contributions of the paper and gives some ideas for future work.

State of the Art
Many approaches exist in the literature to address the acoustic source localization (ASL) problem. According to the classical literature reviews on this topic, these approaches can be broadly divided into three categories [11,12]: time delay based, beamforming based, and high-resolution spectral-estimation based methods. This taxonomy relies on the fact that ASL has traditionally been considered a signal processing problem based on the definition of a signal propagation model [11][12][13][14][15][16][17][18][19]. More recently, the range of proposals in the literature has also included strategies that exploit optimization techniques and mathematical properties of related measurements [20][21][22][23][24], as well as machine learning strategies [25][26][27] aimed at obtaining a direct mapping from specific features to source locations [28]. The latter is an area in which deep learning approaches are starting to be applied, and it is further described later in this section.
Time delay based methods (also referred to as indirect methods) compute the time differences of arrival (TDOAs) across various combinations of pairs of spatially separated microphones, usually using the Generalized Cross-Correlation (GCC) function [13]. In a second step, the TDOAs are combined with knowledge of the microphones' positions to generate a position estimation [11,29].
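As a minimal illustration of the quantity these methods estimate, the following sketch computes the ideal (noise-free, free-field) TDOA for one microphone pair from a known geometry. The function name `tdoa_seconds`, the example coordinates and the sound speed of 343 m/s are illustrative assumptions, not values taken from the cited works.

```python
import math

SOUND_SPEED = 343.0  # m/s, assumed value for air at about 20 degrees Celsius


def tdoa_seconds(source, mic_a, mic_b, c=SOUND_SPEED):
    """Ideal TDOA between two microphones for a point source in free field.

    Positive values mean the wavefront reaches mic_b before mic_a.
    """
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / c


# Hypothetical geometry in metres: two microphones 0.2 m apart on the x axis.
mic_a = (-0.1, 0.0, 0.0)
mic_b = (0.1, 0.0, 0.0)
source = (1.0, 2.0, 0.5)

delay = tdoa_seconds(source, mic_a, mic_b)
print(f"TDOA: {delay * 1e6:.1f} us ({delay * 16000:.2f} samples at 16 kHz)")
```

Since the magnitude of the TDOA is bounded by the microphone spacing divided by the sound speed, closely spaced pairs produce delays of only a few samples, which is one reason why sub-sample accuracy matters in practice.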
Beamforming based techniques [12,15,19,30] attempt to estimate the position of the source by optimizing a spatial statistic associated with each position, such as in the Steered Response Power (SRP) approach, in which the statistic is based on the signal power received when the microphone array is steered in the direction of a specific location. SRP-PHAT is a widely used algorithm for speaker localization based on beamforming that was first proposed in [6]. It combines the robustness of the SRP approach with the Phase Transform (PHAT) filtering, which increases the robustness of the algorithm to signal and room conditions, making it an ideal strategy for realistic speaker localization systems [16,17,32,33,34]. Other beamforming based methods, such as the Minimum Variance Distortionless Response (MVDR) [18], exhibit problems when facing reverberant environments, because they introduce a new trade-off between dereverberation and noise reduction.
With respect to spectral estimation based methods, the multiple signal classification algorithm (MUSIC) [35] has been widely used, but these methods tend, in general, to be less robust than beamforming methods [12], as they assume incoherent signals and are very sensitive to small modeling errors.
In the past few years, deep learning approaches [36] have taken the lead in different signal processing and machine learning fields, such as computer vision [37,38] and speech recognition [39][40][41], and, in general, in any area in which complex relationships between observed signals and the underlying processes generating them need to be discovered.
The idea of using neural networks for ASL is not new. Back in the early nineties and the first decade of the current century, works such as [25,42,43] proposed the use of neural network techniques in this area. However, an evaluation on realistic and extensive data sets was not viable at that time, and the proposals were somewhat limited in scope.
The main differences between the different proposals using neural networks for ASL reside in the architectures, input features, the network output (target), and the experimental setup (using real or simulated data).
Regarding the information given to the neural network, we can find several works using features physically related to the ASL problem. Some of the proposals use features derived from the GCC or related functions, which makes sense as these correlation functions are closely related to the TDOAs used in traditional methods to generate position estimations. The published works use either the GCC coefficients directly [50], features derived from them [45,55] or from the correlation matrix [47,49], or even these combined with others, such as cepstral coefficients [53]. Other works focus on exploiting binaural cues [44,46], features derived from convolving the spectrum with head related impulse responses [58], or even narrowband SRP values [56]. The latter approach goes one step further than correlation related values, as the SRP function actually integrates multiple GCC estimations in such a way that acoustic energy maps can be easily generated from it.
As opposed to the previously described works, which use refined features directly related to the localization problem, we can also find others using frequency domain features directly [48,52], in some cases generated from spectrograms of general time-frequency representations [51,54]. These approaches represent a step forward compared with the previous ones, as they give the network the responsibility of automatically learning the relationship between spectral cues and the location related information. The work in [57] combines both strategies, as it uses spectral features but calculates them in a cross-spectral fashion, that is, combining the values from all the available microphones in the so-called Cross Spectral Map (CSM).
In none of the referenced works do the authors try to make use of the raw acoustic signal directly, and we are interested in evaluating the capabilities of CNN architectures in directly exploiting this raw input information.
With respect to the estimation target, most of the works are oriented towards estimating the Direction of Arrival (DOA) of the acoustic sources [45,50,51,55,56], or DOA related measurements such as azimuth angle [44,46,48], elevation angle [58], or position bearing and range [53]. Some of the proposals pose the problem not as a direct estimation (regression) but as a classification problem among a predefined set of possible position related values [47][48][49]52,54] (azimuth, positions in a predefined grid, etc.). Works with a very different target try to estimate a clean acoustic source map [57] or learn time-frequency masks as a preprocessing stage prior to ASL [59].
In none of the referenced works do the authors try to directly estimate the coordinate values of the acoustic sources, and, again, we are interested in evaluating the capabilities of CNN architectures to directly generate this output information.
In this paper we therefore describe, for the first time in the literature to the best of our knowledge, a CNN architecture in which the raw acoustic signal is provided directly to the neural network, with the objective of directly estimating the three-dimensional position of an acoustic source in a given environment. This is the reason why we refer to this strategy as end-to-end, as it covers the full ASL problem. The proposal has been tested on both semi-synthetic and real data from a publicly available database.

Problem Statement
Our system obtains the position of an acoustic source from the audio signals recorded by an array of $M$ microphones. Given a reference coordinate origin, the source position is defined by the 3D coordinate vector $\mathbf{s} = (s_x, s_y, s_z)^\top$. The microphone positions are known and are defined by the coordinate vectors $\mathbf{m}_i = (m_{i,x}, m_{i,y}, m_{i,z})^\top$, with $i = 1, \ldots, M$. The audio signal captured by the $i$-th microphone is denoted by $x_i(t)$. This signal is discretized with a sampling frequency $f_s$ and is denoted by $x_i[n]$. We assume for simplicity that $x_i[n]$ is of finite length, with $N$ samples. This corresponds to a small window of audio with duration $w_s = N / f_s$, which is a design parameter in our system. We denote by $\mathbf{x}_i$ the vector containing all time samples of the signal:
$$\mathbf{x}_i = (x_i[0], \ldots, x_i[N-1])^\top \qquad (1)$$
The problem we seek to solve is to find the following regression function $f$:
$$\mathbf{s} = f(\mathbf{x}_1, \ldots, \mathbf{x}_M, \mathbf{m}_1, \ldots, \mathbf{m}_M) \qquad (2)$$
that obtains the speaker position given the signals recorded by the microphones. In classical simplified approaches, $f$ is found by assuming that the signals received by different microphones mainly differ by a delay that depends on the relative position of the source with respect to the microphones. However, this assumption breaks down in environments where the signal suffers from random noise and distortion, such as multi-path propagation or non-linear microphone response.
Due to the aforementioned effects, and the random nature of the audio signal, the regression function of equation (2) cannot be estimated analytically. We present in this paper a learning approach for directly obtaining $f$ using Deep Learning. We represent $f$ using a Convolutional Neural Network (CNN) which is learned end-to-end from the microphone signals. In our system we assume that the microphone positions are fixed. We thus drop the requirement of knowing the microphone positions from equation (2), as they will be implicitly learned by our network through the following regression problem:
$$\mathbf{s} = f_{net}(\mathbf{x}_1, \ldots, \mathbf{x}_M) \qquad (3)$$
where $f_{net}$ denotes the function that we represent using the CNN and whose topology is described next.

Network Topology
The topology of our neural network is shown in Figure 1. It is composed of five one-dimensional convolutional blocks and two fully connected blocks. Following equation (3), the network inputs are the set of windowed signals from the microphones and the network output is the estimated position of the acoustic source. Table 1 shows the size and number of convolutional filters in the proposed network. We use filters of size 7 (layers 1 and 2), size 5 (layers 3 and 4) and size 3 (layer 5). The number of filters is 96 in the first two convolutional layers and 128 in the rest. As seen in Figure 1, some of the layers are equipped with MaxPooling filters with the same pool size as their corresponding convolutional filters. The last two layers are fully connected layers: one hidden layer with 500 nodes, and the output layer. The activation functions of all layers are ReLUs, with the exception of the output layer. During training we include dropout with probability 0.5 in the fully connected layers to prevent overfitting.
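To make the size of the convolutional part concrete, the sketch below writes the layer specification from the text as data and counts the trainable parameters of each convolutional layer. It assumes that the input stacks the microphone signals as channels (one channel per microphone), that every filter has one bias term, and that 4 microphones are used; these are illustrative assumptions, and the helper names are hypothetical. The fully connected layers are left out because their size depends on the pooling layout and window length.

```python
# Convolutional layers as (kernel_size, n_filters), taken from the text above.
CONV_LAYERS = [(7, 96), (7, 96), (5, 128), (5, 128), (3, 128)]


def conv1d_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per filter of a 1-D convolutional layer."""
    return (kernel_size * in_channels + 1) * out_channels


def conv_param_counts(n_microphones):
    """Per-layer parameter counts, assuming one input channel per microphone."""
    counts = []
    in_channels = n_microphones
    for kernel_size, n_filters in CONV_LAYERS:
        counts.append(conv1d_params(kernel_size, in_channels, n_filters))
        in_channels = n_filters  # the next layer sees one channel per filter
    return counts


counts = conv_param_counts(n_microphones=4)
for index, count in enumerate(counts, start=1):
    print(f"conv{index}: {count} parameters")
print("convolutional total:", sum(counts))
```

Under these assumptions the first layer is by far the smallest, since it only sees as many channels as there are microphones.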

Training Strategy
The amount of real data available in our experimental setup (see Section 4) is, in general, too limited to train a CNN model. To cope with this problem we propose a training strategy comprising two steps:
Step 1. Training the network with semi-synthetic data: We use close-talk speech recordings and a set of randomly generated source positions to generate simulated versions of the signals captured by a set of microphones that share the same geometry as the environment used for the real data. Additional considerations on the acoustic behavior of the target environment (specific noise types, noise levels, etc.) are also taken into account to generate the data. This dataset can virtually be made as big as required to train the network.
Step 2. Fine tuning the network with real data: We train the network on a reduced subset of the database captured in the target physical environment using the weights obtained in Step 1 as initialization.

Semi-Synthetic Dataset Generation
In this step we extract audio signals from any available close-talk (anechoic) corpus, and use them to generate semi-synthetic data. There are many available datasets suitable for this task (freely or commercially distributed). Our semi-synthetic dataset can thus be made as big as required for training the CNN.
For this task, we randomly generate position vectors $\mathbf{q} = (q_x, q_y, q_z)^\top$ of the acoustic source using a uniform distribution that covers the physical space (room) that will be used.
The loss function we use to train the network is the mean squared error between the position estimated by the network ($\mathbf{s}_i$) and the target position vector ($\mathbf{q}_i$). It follows the expression:
$$\mathcal{L}(\Theta) = \frac{1}{N_t} \sum_{i=1}^{N_t} \lVert \mathbf{s}_i - \mathbf{q}_i \rVert^2 \qquad (4)$$
where $N_t$ is the number of training examples and $\Theta$ represents the weights of the network. Equation (4) is minimized with respect to the unknown weights using iterative optimization based on the Stochastic Gradient Descent (SGD) algorithm [60]. We finally obtain the target weights $\theta \in \Theta$ once a termination criterion is met in the optimization. More details about the training algorithm are given in Section 4.
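As a small illustration of this loss, the sketch below evaluates the mean squared error between a batch of estimated positions and their targets. The function name `mse_position_loss` and the example coordinates are hypothetical, and the normalization (an average over examples of the squared Euclidean distance) is an assumption; a different constant factor would not change the minimizer.

```python
def mse_position_loss(estimates, targets):
    """Mean over the batch of the squared Euclidean distance between positions."""
    if len(estimates) != len(targets) or not estimates:
        raise ValueError("need two non-empty batches of equal length")
    total = 0.0
    for estimate, target in zip(estimates, targets):
        total += sum((s_c - q_c) ** 2 for s_c, q_c in zip(estimate, target))
    return total / len(estimates)


# Hypothetical batch of two 3-D positions, in metres.
estimates = [(1.0, 2.0, 1.2), (0.5, 4.0, 1.0)]
targets = [(1.0, 2.0, 1.2), (0.5, 4.3, 1.4)]
print(mse_position_loss(estimates, targets))
```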
In order to realistically simulate the signals received by the microphones from a given source position, we have to consider two main issues:
• Signal propagation considerations: This is affected by the impulse response of the target room. Different alternatives can be used to simulate this effect, such as convolving the anechoic signals with real room impulse responses as in [47], which can be difficult to acquire for general positions in big environments, or using room response simulation methods such as the image method [61], used in [62] for this purpose.
• Acoustic noise conditions of the room and recording process conditions: These can be due to additional equipment (computers, fans, air conditioning systems, etc.) present in the room, and to problems in the signal acquisition setup. This can be addressed by assuming additive noise conditions, and selecting a noise type and acoustic effects that should preferably be estimated in the target room.
In our case, and regarding the first issue, we used an initial simple approach, just taking into account the propagation delay from the source position to each of the microphones, that depends on their relative position and the sound speed in the room.
We denote the number of samples by which we have to shift a signal to simulate the arrival delay at microphone $i$ by $N_{s_i} = f_s d_i / c$, where $f_s$ is the sampling frequency of the signal, $d_i$ is the Euclidean distance between the acoustic source and the $i$-th microphone, and $c$ is the speed of sound in air ($c = 343$ m/s in a room at 20 °C). In general, $N_{s_i}$ is not an integer. We thus require a way to simulate sub-sample shifts in the signal. In order to apply the delay $N_{s_i}$ to $\mathbf{x}_{pc}$ (the windowed signal of $N$ samples from the close-talk dataset) to obtain $\mathbf{x}_i$, we use the following transformation:
$$\mathbf{X}_{pc} = \mathcal{F}\{\mathbf{x}_{pc}\}, \qquad \mathbf{x}_i = A_i\, \mathcal{F}^{-1}\{\mathbf{X}_{pc} \odot \mathbf{D}_{s_i}\}$$
where we first transform $\mathbf{x}_{pc}$ into the frequency domain, $\mathbf{X}_{pc}$, using the Discrete Fourier Transform operator $\mathcal{F}$. We then change its phase according to $N_{s_i}$ by means of the phase vector $\mathbf{D}_{s_i}$ and transform the signal back into the time domain, $\mathbf{x}_i$, using the Inverse Discrete Fourier Transform operator $\mathcal{F}^{-1}$. $A_i$ is an amplitude factor applied to the signal that follows a uniform random distribution and is different for each microphone, preventing the network from being affected by amplitude differences between the signals captured by different microphones ($A_i \in [0.01, 0.03]$ in the experimental setup described in Section 4).
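The sub-sample delay can be sketched with the DFT shift theorem, as below. This is a minimal, standard-library illustration rather than the implementation used in the experiments: the phase vector is assumed to be a linear phase over signed frequency indices, only the real part of the inverse transform is kept, and the resulting shift is circular within the window. The function names and the example geometry are hypothetical.

```python
import cmath
import math

SOUND_SPEED = 343.0  # m/s, value quoted in the text for 20 degrees Celsius


def delay_in_samples(source, microphone, fs, c=SOUND_SPEED):
    """Propagation delay N_s = fs * d / c; in general not an integer."""
    return fs * math.dist(source, microphone) / c


def dft(x):
    """Plain O(N^2) Discrete Fourier Transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Plain O(N^2) inverse Discrete Fourier Transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def fractional_delay(x, n_shift, amplitude=1.0):
    """Circularly delay x by n_shift samples via a linear phase in the DFT domain."""
    n = len(x)
    shifted = []
    for k, value in enumerate(dft(x)):
        signed_k = k if k <= n // 2 else k - n  # signed frequency index (assumption)
        shifted.append(value * cmath.exp(-2j * math.pi * signed_k * n_shift / n))
    # For fractional shifts the Nyquist bin can leave a small imaginary residue,
    # so only the real part is kept.
    return [amplitude * sample.real for sample in idft(shifted)]


# Hypothetical example: 16 kHz signal, source 0.5 m away from the microphone.
fs = 16_000
n_shift = delay_in_samples((0.0, 0.0, 0.0), (0.5, 0.0, 0.0), fs)
window = [math.sin(2.0 * math.pi * 3 * t / 64) for t in range(64)]
delayed = fractional_delay(window, n_shift, amplitude=0.02)
print(f"delay: {n_shift:.2f} samples, first output sample: {delayed[0]:.5f}")
```

For an integer shift the function reduces to a circular rotation of the window, which gives a simple way to check it.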
Regarding the second issue, we simulate noise and disturbances in the signals arriving at the microphones so that the signal-to-noise ratio and the spectral content of the signals are as similar as possible to those found in the real data. In order to provide an example of the methodology we follow, we refer in this section to the particular case of the IDIAP room (see Section 4.1.1) that will be used in our real data experiments, and the Albayzin Phonetic Corpus (see Section 4.1.2) that will be used for synthetic data generation.
In the IDIAP room, a spectrogram based analysis showed that the recordings are contaminated with a tone at around 25 Hz which does not appear in anechoic conditions, probably due to room equipment or electrical noise generated in the recording hardware setup. We have determined that the frequency of this tone actually varies in a range between 20 Hz and 30 Hz. So, in the synthetic data generation process, we have contaminated the signals from the phonetic corpus with an additive tone of a random frequency in this range, and we have also added white Gaussian noise, following the expression:
$$x_i(t) \leftarrow x_i(t) + k_s \sin(2\pi f_0 t + \phi_0) + k_\eta\, \eta_{wgn}(t)$$
where $k_s$ is a scaling factor for the contaminating tone signal (similar to the tone amplitude found in the target room recordings, 0.1 in our case), $f_0 \in [20, 30]$ Hz, $\phi_0 \in [0, \pi]$ rad, $\eta_{wgn}$ is a white Gaussian noise signal, and $k_\eta$ is a noise scaling factor chosen to generate signals with an SNR similar to that found in the target room recordings. After this procedure is applied, the semi-synthetic signal data set is ready to be used in the neural network training procedure.
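The contamination step can be sketched as follows. The function name `contaminate` and the default `k_eta=0.01` are placeholders (the text selects the noise scale to match the SNR of the target room, and does not quote a value); the tone amplitude, frequency range and phase range are the ones given above, and drawing the tone parameters once per window is an assumption.

```python
import math
import random


def contaminate(signal, fs, k_s=0.1, k_eta=0.01, rng=random):
    """Add a low-frequency tone and white Gaussian noise to a windowed signal."""
    f0 = rng.uniform(20.0, 30.0)      # tone frequency in Hz, drawn once per window
    phi0 = rng.uniform(0.0, math.pi)  # tone phase in rad
    return [
        sample
        + k_s * math.sin(2.0 * math.pi * f0 * n / fs + phi0)
        + k_eta * rng.gauss(0.0, 1.0)  # unit-variance white Gaussian noise
        for n, sample in enumerate(signal)
    ]


# Hypothetical usage on a silent 80 ms window sampled at 16 kHz.
rng = random.Random(0)  # fixed seed only to make the sketch reproducible
noisy = contaminate([0.0] * 1280, fs=16_000, rng=rng)
print(len(noisy), max(abs(v) for v in noisy))
```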

Fine Tuning Procedure
The previous step takes care of reproducing simple acoustic characteristics of the testing room, such as the propagation effects and the presence of specific types and levels of additive noise, but there are other phenomena, like multi-path propagation and reverberation, which are more complex to simulate. In order to introduce these acoustic behaviors of the target physical environment, our proposal is to carry out a fine tuning procedure of the network model using a small amount of real data recorded in the target room. There are other methods, such as the one proposed in [49], where an unsupervised DNN is implemented for the adaptation of parameters to unknown data. However, we believe that the fine tuning process implemented here is adequate because, first, it is a supervised process, from which better performance is expected, and, second, only a few sequences of the test data set are used for fine tuning, saving the rest for the test phase.

Experimental Work
In this section we describe the datasets used in both steps of the training strategy described in Section 3.3, and the details associated with them. We then define the general conditions of the experimental setup and the error metrics used for comparing our proposal with other state-of-the-art methods, and finally present our experimental results, starting from the baseline performance we aim to improve.

IDIAP AV16.3 Corpus: for testing and fine tuning
We have evaluated our proposal using the audio recordings of the AV16.3 database [63], an audio-visual corpus recorded in the Smart Meeting Room of the IDIAP research institute, in Switzerland. We have also used the physical layout of this room for our semi-synthetic data generation process.
The IDIAP Smart Meeting Room is a 3.6 m × 8.2 m × 2.4 m rectangular room with a centrally located rectangular table measuring 4.8 m × 1.2 m. On the table's surface there are two circular microphone arrays of 0.1 m radius, each of them composed of 8 regularly distributed microphones, as shown in Figure 2. The centers of both arrays are separated by a distance of 0.8 m. The middle point between them is considered the origin of the coordinate reference system. A detailed description of the meeting room can be found in [64].
The dataset is composed of several sequences of recordings, synchronously sampled at 16 kHz, with a wide range of experimental conditions regarding the number of speakers involved and their activity.
Some of the available audio sequences are assigned a corresponding annotation file containing the real ground truth positions (3D coordinates) of the speaker's mouth at every time frame in which that speaker was talking.The segmentation of acoustic frames with speech activity was first checked manually at certain time instances by a human operator in order to ensure its correctness, and later extended to cover the rest of recording time by means of interpolation techniques.
The frame shift resolution was defined to be 40 ms. The complete dataset is fully accessible online at [65]. In this paper we focus on all the annotated sequences of this dataset featuring a single speaker, whose main characteristics are shown in Table 2. This allows us to directly compare our performance with the state-of-the-art method presented in [20]. Note that the first three sequences feature a speaker who remains static while speaking at different positions, and the last two a moving speaker, with a different speaker in each sequence. We will refer to these sequences as s01, s02, s03, s11 and s15 for brevity.
* In Table 2, the average speaker height is referenced to the system coordinates and refers to the speaker's mouth height.

Albayzin Phonetic Corpus: for Semi-Synthetic Dataset Generation
The Albayzin Phonetic Corpus [66] consists of 3 sub-corpora of 16 kHz, 16-bit signals, recorded by 304 Castilian Spanish speakers in a professional recording studio using high quality close-talk microphones.
We use this dataset to generate semi-synthetic data as described in Section 3.3.1. From the 3 sub-corpora, we only use the so-called phonetic corpus [67], composed of 6800 utterances of phonetically balanced sentences. This phonetic balance makes the dataset well suited for generating our semi-synthetic data, as it covers a broad range of acoustic contexts.

Training and Fine Tuning Details
In the semi-synthetic dataset generation procedure, described in Section 3.3.1, we generate random positions $\mathbf{q}$ with uniformly distributed values in the following intervals: $q_x \in [0, 3.6]$ m, $q_y \in [0, 8.2]$ m and $q_z \in [0.92, 1.53]$ m, which correspond to the possible distribution of the speaker's mouth positions in the IDIAP room [63].
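A minimal sketch of this sampling step is shown below, using the intervals quoted above. The names `POSITION_RANGES` and `random_source_position`, as well as the fixed seed, are illustrative choices.

```python
import random

# Intervals (in metres) quoted in the text for the IDIAP room.
POSITION_RANGES = {"x": (0.0, 3.6), "y": (0.0, 8.2), "z": (0.92, 1.53)}


def random_source_position(rng):
    """Draw one source position q = (q_x, q_y, q_z) uniformly within the ranges."""
    return tuple(rng.uniform(low, high) for low, high in POSITION_RANGES.values())


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
positions = [random_source_position(rng) for _ in range(200)]
print(positions[0])
```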
Regarding the optimization strategy for the loss function described by equation (4), we employ the ADAM [68] optimizer (a variant of SGD with variable learning rate) over 200 epochs with a batch size of 100 samples. 7200 different frames of input data per epoch are randomly generated during the training phase, and another 800 for validation.
The experiments will be performed with three different window lengths (80ms, 160ms and 320ms), so the training phase will be run once per window length, obtaining three different network models.In each training, 200 audio recordings are randomly chosen and 40 different windows are randomly extracted from each.In the same way, 200 acoustic source position q vectors are randomly generated so that each position generates 40 windows of the same signal.
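The figures above can be checked with a few lines of arithmetic. The sketch assumes the 16 kHz sampling rate of the corpora, and it reads the 200 × 40 windows as the same pool that is split into 7200 training and 800 validation frames, which is an interpretation rather than something stated explicitly in the text.

```python
FS = 16_000                       # Hz, sampling rate of both corpora
WINDOW_LENGTHS_MS = (80, 160, 320)
RECORDINGS = 200                  # recordings (and source positions) per training
WINDOWS_PER_RECORDING = 40
BATCH_SIZE = 100


def samples_per_window(window_ms, fs=FS):
    """Number of samples N in a window of duration w_s = N / fs."""
    return window_ms * fs // 1000


frames = RECORDINGS * WINDOWS_PER_RECORDING
for window_ms in WINDOW_LENGTHS_MS:
    print(f"{window_ms} ms -> {samples_per_window(window_ms)} samples per microphone")
print(f"{frames} frames in total, {7200 // BATCH_SIZE} training batches per epoch")
```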
For the fine tuning procedure described in Section 3.3.2, we will mainly use sequences s11 and s15, which feature a speaker moving in the room while speaking, and also sequences s01, s02 and s03 in a final experiment.
As will be described in Section 4.6, we will also run experiments to assess the relevance of adding sequences s01, s02 and s03 to complement the fine tuning data provided by s11 and s15. We will also refer to gender and height issues in the fine tuning and evaluation data.

Experimental Setup
In our experiments, sequences s01, s02 and s03 are used for testing the performance of our network and, as explained above, to complement sequences s11 and s15 for fine tuning.
In this work, we use a simple microphone array configuration, aimed at evaluating our proposal in a resource-restricted environment, as was done in [20]. To do so, we use 4 microphones (numbers 1, 5, 11 and 15, out of the 16 available in the AV16.3 data set), grouped in two microphone pairs. The selected microphone pair configurations are shown in Figure 2.c, in which microphones with the same color are considered as belonging to the same microphone pair. We provide results depending on the length of the acoustic frame, for 80 ms, 160 ms and 320 ms, to precisely assess to what extent the improvements are consistent with varying acoustic time resolutions.
The main interest of our experimental work is assessing whether the end-to-end CNN based approach (that we will refer to as CNN) is competitive as compared with state-of-the-art localization methods.We will compare this CNN approach with the standard SRP-PHAT method, and the recent strategy proposed in [20] that we will refer to as GMBF.This GMBF method is based on fitting a generative model to the GCC-PHAT signals using sparse constraints, and it reported significant improvements over SRP-PHAT in the IDIAP dataset [20,69].
After providing baseline results comparing SRP-PHAT, GMBF and our proposal without the fine tuning procedure, we will describe four experiments, which we briefly summarize here:
• In the first experiment, we will evaluate the performance improvements when using a single sequence for the fine tuning procedure.
• In the second experiment, we will evaluate the differences between the semi-synthetic training plus fine tuning approach and just training the network from scratch.
• In the third experiment, we will evaluate the impact of adding an additional fine tuning sequence.
• In the last experiment, we will evaluate the final performance improvements when also adding static sequences to the refinement process.

Evaluation metrics
Our CNN based approach yields a set of spatial coordinates $\mathbf{s}_k = (s_{k,x}, s_{k,y}, s_{k,z})^\top$ that are estimations of the current speaker position at time instant $k$. These position estimates will be compared, by means of the Euclidean distance, to the ones labeled in a transcription file containing the real positions $\mathbf{s}_k^{GT}$ (ground truth) of the speaker.
We evaluate performance adopting the same metric used in [20] and developed under the CHIL project [70]. It is known as MOTP (Multiple Object Tracking Precision) and is defined as:
$$MOTP = \frac{1}{N_P} \sum_{k=1}^{N_P} \lVert \mathbf{s}_k - \mathbf{s}_k^{GT} \rVert$$
where $N_P$ denotes the total number of position estimations along time, $\mathbf{s}_k$ the estimated position vector and $\mathbf{s}_k^{GT}$ the labeled ground truth position vector. We will compare our experimental results, and those of the GMBF method, with those of SRP-PHAT, measuring the relative improvement in MOTP, which is defined as follows:
$$\Delta_r^{MOTP} = 100 \cdot \frac{MOTP_{SRP\text{-}PHAT} - MOTP_{proposal}}{MOTP_{SRP\text{-}PHAT}} \; \%$$
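Both metrics can be sketched in a few lines: MOTP as the average Euclidean distance between estimated and ground truth positions, and the relative improvement as the percentage reduction in MOTP with respect to a reference method. The function names and the example numbers are hypothetical and are not results from the paper.

```python
import math


def motp(estimates, ground_truth):
    """Mean Euclidean distance between estimated and ground truth positions."""
    if len(estimates) != len(ground_truth) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    distances = (math.dist(s, g) for s, g in zip(estimates, ground_truth))
    return sum(distances) / len(estimates)


def relative_improvement(motp_reference, motp_method):
    """Relative MOTP improvement over a reference method, in percent."""
    return 100.0 * (motp_reference - motp_method) / motp_reference


# Hypothetical positions in metres.
estimates = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
print("MOTP:", motp(estimates, ground_truth))
print("Improvement (%):", relative_improvement(0.8, 0.6))
```

A positive relative improvement means that the evaluated method has a lower (better) MOTP than the reference.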

Baseline Results
The baseline results are shown in Table 3 for sequences s01, s02 and s03, and all the evaluated time window sizes (in all the tables showing results in this paper, bold font highlight the best ones for a given data sequence and window length).The Table shows the results achieved by the SRP-PHAT standard algorithm strategy (columns SRP), the alternative described in [20] (columns GMBF), and the proposal in this paper without applying the fine-tuning procedure (columns CNN).We also show the relative improvements of GMBF and CNN as compared with SRP-PHAT.Table 3. Baseline results for the SRP-PHAT strategy (columns SRP); the one in [20] (columns GMBF), and the CNN trained with synthetic data without applying the fine-tuning procedure (columns CNN) for sequences s01, s02 and s03 for different window sizes.Relative improvements as compared to SRP-PHAT are shown below the MOTP values.The main conclusions from the baseline results are: • Best MOTP values for the standard SRP-PHAT algorithm are around 69cm, with averages between 76cm and 96cm.For the GMBF, best MOTP values are around 48cm, with averages between 59cm and 78cm.• MOTP values improve as the frame size increases, as expected, given that better correlation values will be estimated for longer window signal lengths.• The GMBF strategy, as described in [20], achieves very relevant improvements as compared with SRP-PHAT, with average relative improvements around 20%, and peak values of almost 30%.• Our CNN strategy, which at this point is only trained with semi-synthetic data, is very far from reaching the SRP-PHAT or GMBF in terms of performance.This result leads us to think that there are other effects only present in real data, such as reverberation, that are affecting the network.
Given the discussion above, we decided to apply the fine tuning strategy discussed in Section 3.3.2, with the experimental details described in Section 4.2. The results shown in Table 3 will therefore be compared below with those obtained by our CNN method under different fine tuning (and training) conditions.

Results and Discussion
The first experiment in which we applied the fine tuning procedure used s15 as the fine tuning subset. Table 4 shows the results obtained by GMBF (columns GMBF) and by the CNN with this fine tuning strategy (columns CNNf15). The table shows that CNNf15 is better than the SRP-PHAT baseline in most cases (except in two cases for s03, in which there was a slight degradation). The average performance shows a consistent improvement of CNNf15 compared with SRP-PHAT, between 1.8% and 11.3%. However, CNNf15 is still behind GMBF in all cases but one (for s02 and 80 ms). Our conclusion is that the fine tuning procedure is able to effectively complement the models trained from semi-synthetic data, leading to results that outperform SRP-PHAT. This is especially relevant because:
• The amount of fine tuning data is limited (only 36 seconds, corresponding to 436 frames, as shown in Table 2), thus opening the path to further improvements with a limited data recording effort.
• The speaker used for fine tuning was mostly moving while speaking, while in the testing sequences the speakers are static while speaking. This means that the fine tuning material includes far more active positions than the testing sequences, and the network is able to extract the relevant information for the tested positions.
• The speaker used for fine tuning is male, and the results obtained for the male speakers (sequences s01 and s03) and the female one (sequence s02) do not seem to show any gender-dependent bias, which suggests that the speaker's gender does not play a role in the adequate adaptation of the network models.
When comparing the results of Table 3 and Table 4, and given the large improvement obtained when applying the fine tuning strategy, one might think that the effect of the initial training with semi-synthetic data is limited. To check this, we ran an additional experiment in which we trained the network from scratch using only s15, aiming to assess the actual effect of semi-synthetic training + fine tuning versus training with real room data alone.
Table 5 shows the comparison between these two options: training from scratch using s15 (columns CNNt15) and semi-synthetic training + fine tuning with s15 (columns CNNf15). The average improvement of the latter approach varies between 1.8% and 11.3%, with an average improvement over all window lengths of 5.3%, while the average improvement when training from scratch varies between −20.6% and 4.3%, with an average value of −7.0%. These differences show that the training + fine tuning proposal outperforms training the network from scratch, thus validating our methodology. Despite the relevant improvements obtained with the fine tuning approach, the results are still far from making it competitive in the ASL scenario (given that the GMBF alternative is available), so we next aim at increasing the amount of fine tuning material.
In our third experiment, we applied the fine tuning procedure using an additional moving speaker sequence, that is, including s15 and s11 in the fine tuning subset.
Table 6 shows the results obtained by GMBF and by the CNN fine tuned with s15 and s11 (columns CNNf15+11). In this case, we see additional improvements over using only s15 for fine tuning, and there is only one case in which CNNf15+11 does not outperform SRP-PHAT (with a marginal degradation of −0.3%). The CNN based approach again shows a consistent average improvement compared with SRP-PHAT, between 7.5% and 16.2%.
In this case, the newly added fine tuning sequence (s11, with a duration of only 33 seconds) corresponds to a randomly moving male speaker. The results show that its addition contributes to further improvements in the CNN based proposal. The CNN is still behind GMBF in all cases but two, although the results are getting closer. This suggests that a further increase in the fine tuning material should be considered.
Our last experiment consists of fine tuning the network with additional static speaker sequences as well. To ensure that the training (including fine tuning) and testing material are fully independent, we fine tune with s15, s11 and the static sequences that are not tested in each experiment run, as shown in Table 7.

Test sequence | Fine tuning sequences
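The hold-out rule described above can be sketched as follows. This is a minimal illustration that assumes the static sequences available for fine tuning are exactly the three tested ones (s01, s02 and s03). The helper name `fine_tuning_subsets` is hypothetical and does not come from the original experimental code.

```python
MOVING = ["s15", "s11"]          # moving-speaker sequences, used in every run
STATIC = ["s01", "s02", "s03"]   # static sequences (assumed: only the tested ones)


def fine_tuning_subsets(moving, static):
    """Map each test sequence to the sequences allowed for fine tuning it."""
    return {
        test: moving + [seq for seq in static if seq != test]
        for test in static
    }


subsets = fine_tuning_subsets(MOVING, STATIC)
for test, tune in subsets.items():
    # The held-out test sequence must never appear in its own fine tuning set.
    assert test not in tune
    print(test, "->", ", ".join(tune))
```

Building the subsets programmatically, rather than by hand, makes it harder to leak a test sequence into the fine tuning material by accident.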
Table 8 shows the results obtained for this fine tuning scenario, and the main conclusions are:
• The CNN based method exhibits much better average behavior than GMBF for all window sizes. The average relative improvement over SRP-PHAT is more than 10 percentage points higher for the CNN than for GMBF, reaching 31.3% in the CNN case and 20.7% for GMBF.
• Considering individual sequences, the CNN is significantly better than GMBF for sequences s01 and s02, and slightly worse for s03.
• Considering the best individual result, the maximum improvement for the CNN is 41.6% (s01, 320 ms), while the top result for GMBF is 29.9% (s03, 320 ms).
• The effect of adding static sequences is beneficial, as expected, given that the acoustic fine tuning examples are generated from positions that are similar, but not identical, to the tested ones, as the speakers have varying heights and their position in the room is not strictly equal from sequence to sequence.
• The improvements obtained are significant and come at the cost of additional fine tuning sequences. However, this extra cost is still reasonable, as the extra fine tuning material is of limited duration, around 400 seconds on average (6.65 minutes).

Finally, to summarize, Figure 3 shows the average MOTP relative improvements over SRP-PHAT obtained by our CNN proposal using different fine tuning subsets, and their comparison with the GMBF results, for all the signal window sizes. From the results obtained by our proposal, it is clear that the highest contribution to the improvements over the bare CNN training comes from the fine tuning procedure with limited data (CNNf15, comparing Tables 3 and 4), while adding further fine tuning material consistently improves the results (Tables 6 and 8). It is again worth noting that these improvements are consistently independent of the gender of the considered speaker, and of whether or not there is a match between the static or dynamic activity of the speakers used in the fine tuning subsets. This suggests that the network is actually learning the acoustic cues that are related to the localization problem, so we can conclude that our proposal is a suitable and promising strategy for solving the ASL task.

Conclusions
We have presented in this paper the first audio localization CNN that is trained end-to-end from the audio signals to the source position. We show that this method is very promising, outperforming the state-of-the-art methods [20,69] and those using SRP-PHAT, provided that sufficient fine tuning data is available. In addition, our experiments show that the CNN method is robust to the speaker's gender and to different window sizes, as compared with the baseline methods. Given that the amount of data recordings for audio localization is currently limited, we have proposed to first train the network using semi-synthetic data and then fine tune it using a small amount of real data. This has been a common strategy in other fields to prevent overfitting, and we show in the paper that it significantly improves the system performance as compared with training the network from scratch using real data.
As future work, we plan to improve the generation of semi-synthetic data by including reverberation effects, and to test in detail the effects of gender and language on the system performance. In addition, we plan to include more real data by developing a large corpus for audio localization, which will be made available to the scientific community for research purposes. An extensive evaluation will also be carried out to assess the impact of the proposal in more complex acquisition scenarios (comprising a higher number of microphone pairs).

Figure 2. (a) Simplified top view of the IDIAP Smart Meeting Room; (b) a real picture of the room extracted from a video frame; (c) microphone setup used in this proposal.

Figure 3. MOTP relative improvements over SRP-PHAT for GMBF and CNN using different fine tuning subsets (for all window sizes).

Table 1. Summary of the network convolutional layers.

Table 2. Sequences used from the IDIAP Smart Meeting Room.

Table 5. Results for the CNN proposal, either trained from scratch with sequence s15 (columns CNNt15) or fine tuned with sequence s15 (columns CNNf15).

Table 7. Fine tuning material used in the experiment corresponding to Table 8 (columns CNNf15+11+st).