Unsupervised Adaptation of Deep Speech Activity Detection Models to Unseen Domains

Abstract: Speech activity detection (SAD) aims to accurately classify audio fragments containing human speech. Current state-of-the-art systems for the SAD task are mainly based on deep learning solutions. These applications usually show a significant drop in performance when test data differ from training data because of the domain shift observed. Furthermore, machine learning algorithms require large amounts of labelled data, which may be hard to obtain in real applications. Considering both ideas, in this paper we evaluate three unsupervised domain adaptation techniques applied to the SAD task. A baseline system is trained on a combination of data from different domains and then adapted to a new unseen domain, namely, data from Apollo space missions coming from the Fearless Steps challenge. Experimental results demonstrate that domain adaptation techniques seeking to minimise the statistical distribution shift provide the most promising results. In particular, the Deep CORAL method yields a 13% relative improvement in the original evaluation metric when compared to the unadapted baseline model. Further experiments show that the cascaded application of Deep CORAL and pseudo-labelling techniques can improve the results even further, yielding a significant 24% relative improvement in the evaluation metric when compared to the baseline system.


Introduction
Speech activity detection (SAD) aims to determine whether an audio signal contains speech or not, and its exact location in the signal. This constitutes an essential preprocessing step in several speech-related applications such as speech and speaker recognition, as well as speech enhancement. In many cases, SAD is used as a preliminary block to separate the segments of the signal that contain speech from those that are only noise, enabling the overall system to process only the speech segments. A large number of approaches have been proposed for the SAD task. Traditionally, statistical approaches have been used with relevant results under the assumption of quasi-stationary noise. Several works rely on the extraction of specific acoustic features [1,2]. Conversely, other methods are model-based [3,4], aiming to estimate a statistical model for the noisy signal. Additionally, some unsupervised approaches can also be cited: based on energy [5], or based on the estimation of the signal long-term spectral divergence [6]. Recently, deep learning approaches have become increasingly relevant in the SAD task. The research presented in [7] implements a SAD system based on a multilayer perceptron with energy efficiency as the main concern. A deep neural network (DNN) approach is used in [8] to perform SAD in a multi-room environment. In [9], new optimisation techniques based on the area under the ROC curve are explored in the framework of a deep learning SAD system.
Recurrent neural networks (RNN) are particularly useful when dealing with temporal sequences of information because they are able to model temporal dependencies by introducing a feedback loop between the input and output of the neural network. Several applications of long short-term memory (LSTM) networks [10] can be cited in the SAD task [11,12]. Some of our latest solutions for SAD in the context of diarisation applications obtained competitive results applying a bidirectional LSTM-based classifier [13,14]. Convolutional recurrent neural networks (CRNN) combine the capability of convolutional networks to capture frequency and time dependencies simultaneously, seeking to extract discriminative features, with the capability of recurrent networks to deal with time series. A number of examples of the use of CRNN models in audio processing can be found in the literature [15,16]. CRNN models have also been applied to the SAD task with relevant results [17].
In the last few years, a number of international evaluation campaigns have proposed the SAD task as one of their challenges, seeking to advance this kind of system in a variety of challenging domains. In this context, the National Institute of Standards and Technology (NIST) introduced the OpenSAT evaluations starting in 2017 [18]. Three domains were proposed for the SAD task: public safety communications, low resource languages and audio extracted from YouTube videos. Post-evaluation analysis revealed a significant difference in performance among the three domains for most participant teams. Similarly, aiming to motivate the research effort on a demanding domain such as audio from Apollo space missions, a series of annual challenges has been held since 2019 [19], proposing the SAD task among other speech-related tasks. This initiative has resulted in the digitisation of the original analogue recordings from the space missions. Part of this data has been made available through the Fearless Steps (FS) corpus, consisting of a cumulative 19,000 h of conversational speech coming from the Apollo 11 mission [20]. Audio data belong to 30 different communication channels, with multiple speakers in diverse locations. Most channels show a strong degradation with transmission noise or noise due to tape ageing.
Whereas current state-of-the-art SAD solutions rely on the use of deep learning techniques, these applications depend strongly on the amount of labelled data available. In some specific scenarios, obtaining labelled data can be significantly expensive or even impossible, which is why unsupervised domain adaptation techniques are an active research topic [21,22]. Domain adaptation techniques aim to transfer the knowledge obtained from a source domain to a target domain. In addition, unsupervised domain adaptation techniques work under the constraint that no labels are available in this new target domain. Inspired by our previous experiences participating in the Fearless Steps challenge [23,24], which introduced a new audio domain in the research community, in this paper we aim to explore unsupervised domain adaptation techniques in the context of the SAD task. Considering a SAD model trained on different well-known domains in the SAD task, such as broadcast or meetings, with huge amounts of labelled data available, we evaluate three possible ways to adapt the SAD model to a new unseen domain in an unsupervised way. Results are presented using the data provided in the Fearless Steps challenge; however, the techniques and methods in this paper are described in a general way so that domain adaptation can be applied to any other scenario. Unlike the work presented in [25], where the key idea is to perform a pretraining process on DNN models using unlabelled data, seeking to obtain a shared representation, this work aims to perform unsupervised domain adaptation directly on the model space, with the objective of fine-tuning a given model trained previously with labelled out-of-domain data.
The remainder of the paper is organised as follows. Section 2 introduces the domain adaptation problem, presents some approaches found in the literature on how to solve it, and introduces the methods evaluated. Section 3 presents the experimental setup, describing the neural network architecture, the datasets considered and the metrics used in the evaluation. In Section 4, results for the baseline SAD system and the unsupervised adaptation techniques are described. Finally, a summary of the work and conclusions are presented in Section 5.

Domain Adaptation
Machine learning algorithms usually require large amounts of manually labelled training examples in order to train a reliable model. In real applications, however, obtaining labelled data requires huge efforts and, in some cases, it is even impossible. A simple and straightforward solution is to train a model on a labelled dataset which is somehow related to the target data and then apply it to the data being considered. This approach is likely to lead to substantial drops in performance caused by the domain shift, observed in the different feature and label distributions [26]. In order to solve this problem, several domain adaptation techniques have arisen, aiming to learn from a source dataset and transfer the knowledge obtained to a target dataset. Domain adaptation is currently an active research topic in the machine learning community [27,28]. Focusing on speech technology applications, several works have also investigated the domain adaptation problem for speech recognition [29], speaker recognition [30] or speech enhancement [31].

Problem Formulation
In the following lines, a formal introduction to the domain adaptation problem and some definitions relevant to the topic are provided. In order to formulate the domain adaptation problem, two domains need to be defined: the source and the target domain. The source domain, $D_s$, is represented by a source dataset $\Pi_s$ of samples and their labels, and the target domain, $D_t$, by a target dataset $\Pi_t$ of samples $X_t$ and labels $Y_t$. In the case of unsupervised domain adaptation problems, such as the ones described in this work, no ground-truth labels $Y_t$ are available for the target dataset.
Traditionally, in supervised learning problems labelled training samples are assumed to be available. This is the case for the source domain. Accordingly, the learning problem is to determine a classifier $f_s(\Pi_s, \theta_s)$ that obtains high classification accuracy for test samples by exploiting the available training set $\Pi_s$. The classifier is described by a set of parameters $\theta_s$ specific to each family of classifiers.
In the domain adaptation framework, the problem becomes more complex as test samples are drawn from a target domain distribution different from the source domain distribution of the training samples. Considering the ideas described, the goal of domain adaptation techniques should be to develop a new classifier $f_t(\Pi_s, \theta_s, \Pi_t, \theta_t)$ that obtains an accurate prediction of test samples coming from the target domain by exploiting labelled training samples $\Pi_s$ from the source domain $D_s$ and unlabelled samples $\Pi_t$ from the target domain $D_t$. As for supervised classifiers, this new model adopted for classification is described by a set of parameters $\theta_s$ specific to each family of classifiers, and by a set of parameters $\theta_t$ which is specific to each domain adaptation technique.

Approaches to Domain Adaptation
In order to better understand the domain adaptation problem, a variety of works over the years have tried to categorise the diverse conditions found for the problem. We refer the reader to the following survey for a more detailed description of this categorisation [32]. The first level of categorisation refers to the relation between source and target domains. Under the assumption that the source and target domain are directly related, transferring knowledge can be performed in a single step. This is usually called one-step domain adaptation. In case that assumption fails, one-step domain adaptation may not be effective. Multi-step domain adaptation [33] aims to connect two unrelated domains via a series of intermediate bridges, and then perform one-step domain adaptation.
In this work, as human speech has characteristics that may not vary among domains, we assume that source and target domains are related. Because of that, we focus on one-step domain adaptation solutions. In this scenario, domain adaptation approaches can be summarised into three broad categories according to the work in [34]:
• Discrepancy-based: this family of solutions works under the assumption that fine-tuning a model using labelled or unlabelled data can reduce the shift between source and target domain. Under this idea, several criteria can be used to perform domain adaptation: some authors use class information to transfer knowledge between two domains [35]. The authors of [36,37] seek to align the statistical distribution shift between source and target domain. Other approaches also aim to improve generalisation by adjusting the architectures of DNNs, such as the work presented in [38].
• Adversarial-based: in this case, a domain discriminator tries to identify whether a data point belongs to the source or the target domain. This is used to encourage domain confusion through an adversarial objective that minimises the distance between source and target domain distributions. Two main groups can be observed when implementing this idea: those relying on generative models such as generative adversarial networks (GAN) [39], and those relying on non-generative models that aim to obtain a domain-invariant representation through a domain confusion loss [40].
• Reconstruction-based: this approach is based on the idea that data reconstruction of the source or target samples may be helpful in order to improve the domain adaptation process. This way the reconstructor is able to ensure both specificity of intra-domain representations and indistinguishability of inter-domain representations. Some examples of these methods can be cited, such as the use of an encoder-decoder reconstruction process [41], or an adversarial reconstruction obtained via a cycle GAN model [42].

Unsupervised Domain Adaptation Techniques
Following the categorisation previously explained, in this paper we focus our efforts on the evaluation of one-step, discrepancy-based domain adaptation techniques. Additionally, the three methods presented are fully unsupervised, meaning that no labels for the target domain are needed in order to obtain an adapted model. Descriptions of these methods are presented in the next subsections.

Pseudo-Labelling
The goal of pseudo-labelling (PL) [43] is to generate a set of predicted labels for unlabelled samples with a model trained on labelled data. This idea is an intuitive and straightforward application that can help overcome the challenge of collecting large labelled datasets. Several works in the literature have explored different algorithms for creating pseudo-labels. In [44], pseudo-labels are assigned to unlabelled samples using neighbourhood graphs. The idea of pseudo-labelling is extended in [45] by incorporating confidence scores for unlabelled samples. The authors of [46] present a new optimisation framework to iteratively update the obtained pseudo-labels. Our approach is inspired by the works in [43,47], where pseudo-labels are generated directly as the predictions of a trained neural network.
Our solution for pseudo-labelling domain adaptation can be described according to the following three steps:
1. Train source model: first, an initial model is trained in a supervised way on the source domain.
2. Predict target labels: the initial model is then used to predict speech presence or absence for the unlabelled target domain.
3. Adapt using predicted labels: finally, the initial model is retrained in a supervised way using the pseudo-labels as if they were true labels.
Furthermore, besides fine-tuning the initial source model to obtain the target model, this solution could also be used to train a target model from scratch using the obtained pseudo-labels or a combination of the source labelled data and the obtained pseudo-labels. Pseudo-labelling techniques have been used with relevant results in several audio processing applications, ranging from acoustic classification problems [48] to diarisation [49] and speech recognition [50]. In this work, we aim to extend pseudo-labelling techniques to the SAD task and evaluate their performance in the framework of domain adaptation.
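The three steps can be sketched in a few lines. This is only a minimal illustration under stated assumptions: a toy one-dimensional logistic classifier stands in for the neural network, the synthetic source and target data are invented, and the names `train`, `predict` and `pseudo_label_adapt`, as well as the 0.5 threshold, are hypothetical choices that do not come from the paper.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train(xs, ys, w=0.0, b=0.0, lr=0.1, epochs=200):
    """Fit a 1-D logistic classifier by gradient descent, starting from (w, b)."""
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(params, xs):
    """Return the speech score (posterior of class 1) for every sample."""
    w, b = params
    return [sigmoid(w * x + b) for x in xs]


def pseudo_label_adapt(src_x, src_y, tgt_x, threshold=0.5):
    # Step 1: train the initial model on labelled source data.
    source_model = train(src_x, src_y)
    # Step 2: predict labels for the unlabelled target data.
    scores = predict(source_model, tgt_x)
    pseudo = [1 if s >= threshold else 0 for s in scores]
    # Step 3: fine-tune the source model using the pseudo-labels as targets.
    w0, b0 = source_model
    target_model = train(tgt_x, pseudo, w=w0, b=b0)
    return source_model, pseudo, target_model


# Hypothetical toy data: the target domain is a shifted version of the source.
random.seed(0)
src_x = [random.gauss(-1.0, 0.5) for _ in range(100)] + \
        [random.gauss(1.0, 0.5) for _ in range(100)]
src_y = [0] * 100 + [1] * 100
tgt_x = [random.gauss(-0.5, 0.5) for _ in range(100)] + \
        [random.gauss(1.5, 0.5) for _ in range(100)]

source_model, pseudo, target_model = pseudo_label_adapt(src_x, src_y, tgt_x)
```

Training a target model from scratch instead of fine-tuning would correspond to calling `train(tgt_x, pseudo)` without the initial parameters.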

Knowledge Distillation
The knowledge distillation (KD) [51] framework was originally proposed as a model compression method in which two DNNs are involved. These two models are usually known as the teacher and the student model, in an analogy to the education process. The main idea is that the teacher model produces soft labels which are used to train the student model. Consequently, the student model should imitate the predictions of the teacher model. In order to do so, the Kullback-Leibler divergence (KLD) loss between the student and teacher distributions is minimised. The KLD loss can be formulated according to the following expression:
$$\mathrm{KLD}(p_t \,\|\, p_s) = \sum_i \sum_{y_i} p_t(y_i|x_i) \log \frac{p_t(y_i|x_i)}{p_s(y_i|x_i)} \tag{1}$$
where $i$ is the example index, $x_i$ is the input example, $p_t(y_i|x_i)$ is the output posterior probability of the label $y_i$ from the teacher model and $p_s(y_i|x_i)$ is the output posterior probability of the label $y_i$ from the student model for the same example. As done in most KD methods [52,53], the teacher model is usually frozen, relying on a pretrained model, in order to reduce complexity. In this case, where only the parameters of the student network need to be optimised, minimising the KLD loss expressed in Equation (1) is equivalent to minimising the expression shown in the following equation:
$$\mathrm{KLD}(p_t \,\|\, p_s) = -\sum_i \sum_{y_i} p_t(y_i|x_i) \log p_s(y_i|x_i) + \mathrm{const} \tag{2}$$
where const is a constant term as defined in [54]. As has just been explained, knowledge is transferred via the minimisation of a loss function whose target is the distribution of class probabilities predicted by the teacher model. This is the output of a softmax function applied to the teacher model logits. However, in most cases, this distribution provides a high probability for the correct class, with the other class probabilities close to zero. In order to address this issue, the authors of [51] introduced the concept of softmax temperature. The probability $p_i$ of the class $i$ is calculated from the neural network logits $z$ according to the following equation:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{3}$$
where $T$ is the temperature parameter. For the case of $T = 1$ the standard softmax function is obtained.
As T grows, the probability distribution generated becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. The proposed experimental setup for KD-based domain adaptation is shown schematically in Figure 1. It can be seen that both models, teacher and student, receive target data examples X t as input. Predictions from both models are obtained via softmax activations using the same temperature parameter T. Soft predictions from the teacher network are used to transfer knowledge to the student network, which aims at mirroring those predictions using the mentioned KLD loss. In this process, the teacher model is frozen and only the parameters of the student model are updated. In order to test the system, the teacher model is discarded, and final predictions are obtained using the student model with a standard softmax activation with T = 1. As implemented, this version of KD can be interpreted as a soft version of the pseudo-labelling method previously described.
Several examples of the use of KD techniques for solving the domain adaptation task can be found in the literature. The authors of [55] apply KD to improve acoustic models for automatic speech recognition (ASR) in application-specific conditions. Something similar is done in [56], which uses KD algorithms to improve ASR performance in noisy environments. Concerning the SAD task, we can also see several examples of the teacher-student architecture being used in domain adaptation solutions [57,58].
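The effect of the temperature and the KLD objective can be illustrated with a small numerical sketch. It assumes a two-class output (speech and non-speech logits), averages the divergence over the examples instead of summing it, and uses invented logit values; the function names `softmax_t`, `kld` and `distillation_loss` are hypothetical.

```python
import math


def softmax_t(logits, T=1.0):
    """Softmax with temperature T; T = 1 gives the standard softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kld(p_teacher, p_student, eps=1e-12):
    """Kullback-Leibler divergence KLD(p_teacher || p_student) for one example."""
    return sum(pt * math.log((pt + eps) / (ps + eps))
               for pt, ps in zip(p_teacher, p_student))


def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Mean KLD between teacher and student soft predictions over a batch."""
    total = 0.0
    for z_t, z_s in zip(teacher_logits, student_logits):
        total += kld(softmax_t(z_t, T), softmax_t(z_s, T))
    return total / len(teacher_logits)


# Hypothetical logits for two target-domain frames: [non-speech, speech].
teacher_logits = [[4.0, -1.0], [-2.0, 3.0]]
student_logits = [[1.0, 0.5], [0.0, 0.2]]

hard = softmax_t(teacher_logits[0], T=1.0)  # peaked distribution
soft = softmax_t(teacher_logits[0], T=4.0)  # softer distribution
loss = distillation_loss(teacher_logits, student_logits, T=4.0)
```

In an actual training loop only the student parameters would receive gradients from this loss, with the teacher outputs treated as fixed targets.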

Deep CORAL
The recently proposed Correlation Alignment (CORAL) method [37] is an unsupervised adaptation technique that aligns the second-order statistics of the source and the target distributions. However, this technique relies on a linear transformation and is not end-to-end. In order to address those issues, an extension of the CORAL method named Deep CORAL was proposed [59] with the idea of incorporating the CORAL technique directly into deep neural networks by constructing a differentiable loss function that minimises the difference between source and target correlations.
The CORAL loss between two domains for a single feature layer is described in the following lines. Suppose a set of training examples coming from the labelled source domain, $D_s$, described by $U_s = \{u_1^s, \ldots, u_N^s\}$ with $u \in \mathbb{R}^d$, and unlabelled target data $V_t = \{v_1^t, \ldots, v_M^t\}$ with $v \in \mathbb{R}^d$. The numbers of source and target examples are $N$ and $M$, respectively. As described in the original paper, $U_s$ and $V_t$ represent the $d$-dimensional deep layer activations of a deep neural network model. Considering the provided definitions, and stacking the examples as the rows of $U_s$ and $V_t$, the covariance matrices of the source and target data are given by the following equations:
$$C_s = \frac{1}{N-1}\left(U_s^\top U_s - \frac{1}{N}\left(\mathbf{1}^\top U_s\right)^\top \left(\mathbf{1}^\top U_s\right)\right) \tag{4}$$
$$C_t = \frac{1}{M-1}\left(V_t^\top V_t - \frac{1}{M}\left(\mathbf{1}^\top V_t\right)^\top \left(\mathbf{1}^\top V_t\right)\right) \tag{5}$$
where $\mathbf{1}$ is a column vector with all elements equal to 1. The CORAL loss is then defined as the distance between the second-order statistics of the source and target features. This is shown in Equation (6):
$$L_{CORAL} = \frac{1}{4d^2}\left\|C_s - C_t\right\|_F^2 \tag{6}$$
where $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm. The idea behind the Deep CORAL adaptation technique is to obtain a set of deep features that are both discriminative enough to train a strong classifier and invariant to the change observed between source and target domains. Minimising a classification loss by itself, as usually done in supervised learning approaches, may lead to overfitting on the source domain and a reduced performance on the target domain. By contrast, minimising the CORAL loss alone may lead to degenerate features. In order to match the conditions stated above, the final loss to be optimised is a combination of both a classification loss and the CORAL loss. The neural architecture needed to implement the Deep CORAL adaptation technique is shown in Figure 2. As shown, source features are forwarded through the DNN model and then a classification loss is computed using source labels. Similarly, target features are used in combination with source features to compute the CORAL loss. Network parameters are shared between the two DNN models.
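As an illustration of the quantities involved, the sketch below computes the covariance matrices and the CORAL loss for two small sets of activations using plain Python lists. It assumes the usual Deep CORAL normalisation by $4d^2$ and unbiased covariance estimates; the activations are invented values, and `covariance`, `coral_loss`, `joint_loss` and the weight `lam` are hypothetical names for this example only.

```python
def covariance(X):
    """Unbiased covariance matrix of the rows of X (n examples, d features)."""
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    return [[sum((row[j] - mean[j]) * (row[k] - mean[k]) for row in X) / (n - 1)
             for k in range(d)]
            for j in range(d)]


def coral_loss(U_s, V_t):
    """Squared Frobenius distance between source and target covariances, over 4 d^2."""
    d = len(U_s[0])
    C_s, C_t = covariance(U_s), covariance(V_t)
    frob_sq = sum((C_s[j][k] - C_t[j][k]) ** 2
                  for j in range(d) for k in range(d))
    return frob_sq / (4 * d * d)


def joint_loss(classif_loss, U_s, V_t, lam=1.0):
    """Classification loss on source data plus the weighted CORAL term."""
    return classif_loss + lam * coral_loss(U_s, V_t)


# Hypothetical 2-dimensional activations for three source and three target examples.
U_s = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
V_t = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]

loss = coral_loss(U_s, V_t)
shifted = [[u + 10.0 for u in row] for row in U_s]  # same covariance, different mean
```

Because only second-order statistics are compared, `coral_loss(U_s, shifted)` is zero up to rounding: a pure shift of the mean is not penalised by this term.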
Considering the described architecture, the joint optimisation target of classification loss and CORAL loss is described in Equation (7):
$$L = L_{classif} + \sum_{i=1}^{z} \lambda_i L_{CORAL_i} \tag{7}$$
where $L_{classif}$ is any traditional classification loss function such as cross entropy, $z$ is the number of CORAL loss layers in a deep network and $\lambda_i$ is a weight that trades off adaptation and classification accuracy on the source domain. The sum term represents the possibility of incorporating the CORAL loss on additional layers of the DNN architecture. However, as described in the original paper, the authors apply the CORAL loss only to the last classification layer in the DNN architecture. In our experiments, we apply the CORAL loss in the same way, simplifying Equation (7) to become Equation (8) in the case where $z = 1$:
$$L = L_{classif} + \lambda L_{CORAL} \tag{8}$$
More recently, further work has proposed a new approach built upon the Deep CORAL method. The authors of [60] argue that the Euclidean distance used in the original Deep CORAL proposal may not be the most appropriate way to measure the distance between the source and target domains. Since covariance matrices are positive semi-definite, they can be seen as two points lying on a Riemannian manifold, and the metrics defined therein should consider its non-Euclidean structure [61]. Therefore, the Euclidean distance as defined in Equation (6) may be suboptimal in such a space. Considering this information, the log-Euclidean distance is instead a Riemannian metric that better captures the manifold structure. This metric is defined according to the following equation:
$$d_{LE}(C_s, C_t) = \left\|\log(C_s) - \log(C_t)\right\|_F \tag{9}$$
where $\log(X)$ is the matrix logarithm of $X$ [62]. Similarly to the CORAL loss, the log CORAL loss can be obtained by replacing the Euclidean distance in Equation (6) with the log-Euclidean distance. This is shown in Equation (10):
$$L_{logCORAL} = \frac{1}{4d^2}\left\|\log(C_s) - \log(C_t)\right\|_F^2 \tag{10}$$
Through the eigenvalue decomposition of matrices $C_s$ and $C_t$ in Equation (10) we obtain the final expression for the log Deep CORAL loss:
$$L_{logCORAL} = \frac{1}{4d^2}\left\| S\,\mathrm{diag}(\log s_1, \ldots, \log s_d)\,S^\top - T\,\mathrm{diag}(\log t_1, \ldots, \log t_d)\,T^\top \right\|_F^2 \tag{11}$$
where $d$ denotes the dimension of the features whose covariances are to be aligned, as previously explained; $S$ and $T$ are the matrices that diagonalise $C_s$ and $C_t$, respectively; and $s_i$ and $t_i$ are the corresponding eigenvalues. The final setup for the log CORAL loss is the same as the one explained for the original CORAL loss, and follows the form of Equation (7): a classification loss is combined with the log CORAL loss to obtain the global loss term.

Data Description
The idea behind the baseline model training is to obtain a generic model exposed to a huge variety of data, so that an adaptation to new unseen domains can be performed later by transferring that general knowledge to a target dataset. Table 1 summarises the datasets used for training and evaluating the baseline SAD system. As can be observed, the baseline SAD system is trained on a combination of data coming from three domains: broadcast, telephone channel and meetings. For the broadcast domain, the system is trained on a combination of data from previous Albayzín evaluation campaigns (2010 and 2018) and data from the Multi-Genre Broadcast (MGB) challenge. For the meetings domain, the AMI and ICSI meetings corpora are used. Finally, in order to represent the telephonic domain, the summed partition of the NIST 2008 speaker recognition evaluation (SRE) is incorporated into training. In addition, 10% of each training dataset considered is reserved to generate a validation subset containing data across the three domains considered. As described in Table 1, we also reserve an additional dataset from each domain to evaluate the results obtained in that domain with the baseline SAD system. These datasets are the Albayzín 2020 evaluation partition for the broadcast domain, the CALLHOME dataset for the telephonic domain and the dataset originally released for the NIST Rich Transcription (RT) 2009 evaluation for the meetings domain.
The main goal of this work is to adapt SAD models trained on a variety of domains to a new unseen domain. This new domain is the one introduced in the Fearless Steps challenge, with audio from the Apollo missions featuring heavily degraded channels and several kinds of transmission noise. In the following lines, we describe the partitions provided originally in the second phase of the challenge (FS-02) and explain how they have been used in this work:
Note that the evaluation partition provided by the organisation was not used in this work. The scoring of this subset was performed by the organisers while running the challenge and labels have not been released publicly, making it impossible to obtain comparable results on this subset at the time of developing this work. Furthermore, all the results obtained in this work for the Fearless Steps data are under the challenge conditions because participants were allowed to use any available data in addition to the data provided by the organisation to train and tune their systems.

Feature Extraction
As a first preprocessing step, all audio considered for this work was downsampled to 8 kHz and converted to a single-channel input. As input features for the proposed neural network-based SAD system, we consider log Mel-filter bank energies. We use 64 log Mel-coefficients concatenated with the log energy of the frame. Considering an audio input sampled at 8 kHz, Mel filters span the frequency range between 64 Hz and 4 kHz. Features are computed every 10 ms using a 25 ms Hamming window. As a final step, each file is normalised using the mean and variance of its features. The set of features described in this section is shared among all experiments described in this paper.
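The framing and normalisation arithmetic can be sketched as follows. This is a simplified illustration that computes only the log energy per frame (the 64 Mel filters are omitted), assumes no padding at the signal edges, and uses a synthetic signal; `log_energy_frames` and `mean_var_normalise` are hypothetical names.

```python
import math

SAMPLE_RATE = 8000                  # Hz, after downsampling
WIN = SAMPLE_RATE * 25 // 1000      # 25 ms window -> 200 samples
HOP = SAMPLE_RATE * 10 // 1000      # 10 ms frame shift -> 80 samples


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def log_energy_frames(signal, win=WIN, hop=HOP, floor=1e-10):
    """Log energy of each Hamming-windowed frame (no edge padding assumed)."""
    window = hamming(win)
    n_frames = 1 + (len(signal) - win) // hop if len(signal) >= win else 0
    feats = []
    for f in range(n_frames):
        frame = signal[f * hop:f * hop + win]
        energy = sum((s * w) ** 2 for s, w in zip(frame, window))
        feats.append(math.log(energy + floor))
    return feats


def mean_var_normalise(feats):
    """Per-file normalisation to zero mean and unit variance."""
    n = len(feats)
    mean = sum(feats) / n
    var = sum((x - mean) ** 2 for x in feats) / n
    std = math.sqrt(var) or 1.0     # guard against constant features
    return [(x - mean) / std for x in feats]


# Hypothetical 1 s signal: a loud tone for 0.5 s followed by a quiet one.
signal = [math.sin(2.0 * math.pi * 440.0 * t / SAMPLE_RATE) * (1.0 if t < 4000 else 0.01)
          for t in range(SAMPLE_RATE)]
raw = log_energy_frames(signal)
feats = mean_var_normalise(raw)
```

With these settings a one-second file yields 98 frames; in the full feature set each frame would hold 65 values (64 log Mel-coefficients plus the log energy), each normalised in the same way.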

Neural Network Model
As the main element of the SAD system we opt for a CRNN-based classifier. In particular, we use the variant with 2D convolutions from the models already presented in our previous work [23]. The schematic representation of the proposed CRNN model is shown in Figure 3. As can be observed in the figure, the model is mainly composed of two elements. First, three 2D convolutional blocks are used to process the input features. Each of these blocks consists of a 2D convolutional layer with a 3 × 3 kernel and 64 filters, followed by a batch normalisation layer [70] and a rectified linear unit (ReLU) [71] activation function. Finally, a max-pooling mechanism is applied with a 4 × 1 stride, so that only the frequency axis is downsampled. The output of the last convolutional block is then fed to the RNN block, built by stacking three bidirectional LSTM layers with 128 neurons each. This block is followed by a linear layer that generates the speech class score as a single neuron output.
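The tensor shapes implied by this description can be traced with a short calculation. The sketch assumes 65 input features per frame (64 log Mel-coefficients plus log energy), convolutions that preserve the input size, a pooling kernel equal to its stride with floor rounding, and bidirectional outputs that are concatenated; none of these details is stated explicitly in the text, and `crnn_shapes` is a hypothetical helper.

```python
def crnn_shapes(n_frames, n_feats=65, n_blocks=3, n_filters=64,
                pool=4, lstm_units=128):
    """Trace (channels, frequency, time) shapes through the CRNN blocks."""
    shapes = [("input", (1, n_feats, n_frames))]
    freq = n_feats
    for block in range(1, n_blocks + 1):
        # Convolution keeps the size (assumed); pooling divides frequency only.
        freq = freq // pool
        shapes.append((f"conv_block_{block}", (n_filters, freq, n_frames)))
    # Channels and remaining frequency bins are flattened per frame.
    shapes.append(("rnn_input", (n_frames, n_filters * freq)))
    # Forward and backward LSTM outputs are concatenated (assumed).
    shapes.append(("blstm_output", (n_frames, 2 * lstm_units)))
    # Linear layer with a single output neuron per frame.
    shapes.append(("speech_score", (n_frames, 1)))
    return shapes


shapes = dict(crnn_shapes(n_frames=100))
```

Under these assumptions the frequency axis shrinks from 65 to 16, 4 and finally 1 bin, so each frame reaches the recurrent block as a 64-dimensional vector while the time resolution of 10 ms per frame is preserved.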

Evaluation Metrics
Two possible errors can be considered when dealing with SAD systems: a false positive (FP), that is, the identification of speech in a segment where the reference identifies non-speech, and a false negative (FN), that is, the missed identification of speech in a segment where the reference identifies speech. Using these two errors, the false positive rate (FPR) and false negative rate (FNR) can be computed according to the following equations:
$$\mathrm{FPR} = \frac{T_{FP}}{T_{non\text{-}speech}}, \qquad \mathrm{FNR} = \frac{T_{FN}}{T_{speech}}$$
where $T_{FP}$ and $T_{FN}$ are, respectively, the total false positive time and total false negative time for the SAD hypothesis, $T_{non\text{-}speech}$ represents the total annotated non-speech time in the reference, and $T_{speech}$ represents the total annotated speech time in the reference. Following the evaluation protocol originally proposed in the Fearless Steps challenge [72], results are reported according to the detection cost function (DCF), as shown in the following equation:
$$\mathrm{DCF} = 0.75 \cdot \mathrm{FNR} + 0.25 \cdot \mathrm{FPR}$$
As can be observed, false negative errors were considered more important than false positive errors in the original evaluation. In addition to FPR, FNR and DCF, which are metrics depending on the chosen threshold, results of the system are also reported using the area under the ROC curve (AUC) metric, measuring the area underneath the entire receiver operating characteristic (ROC) curve, and the equal error rate (EER), the error rate at which the FNR and FPR are equal. Both metrics provide an overall measurement of performance for all the possible thresholds applied to the neural network scores. Furthermore, the disaggregated performance for all possible operating points is described using the detection error trade-off (DET) curve, showing FPR values versus FNR values.
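A frame-level version of these metrics can be sketched as follows. It assumes that reference and hypothesis are given as binary labels on fixed-length frames, so that ratios of times reduce to ratios of frame counts, and it uses the 0.75 and 0.25 weights of the Fearless Steps DCF definition, which should be checked against [72]; the label sequences and the name `sad_metrics` are hypothetical.

```python
def sad_metrics(reference, hypothesis, w_fn=0.75, w_fp=0.25):
    """Return (FPR, FNR, DCF) from frame-level binary labels (1 = speech)."""
    n_fp = n_fn = n_speech = n_nonspeech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref == 1:
            n_speech += 1
            if hyp == 0:
                n_fn += 1       # speech frame missed by the system
        else:
            n_nonspeech += 1
            if hyp == 1:
                n_fp += 1       # non-speech frame detected as speech
    fpr = n_fp / n_nonspeech
    fnr = n_fn / n_speech
    dcf = w_fn * fnr + w_fp * fpr
    return fpr, fnr, dcf


# Hypothetical example: 4 speech frames (1 missed), 8 non-speech frames (1 false alarm).
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
hypothesis = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
fpr, fnr, dcf = sad_metrics(reference, hypothesis)
```

In this example FPR = 0.125 and FNR = 0.25, giving DCF = 0.75 × 0.25 + 0.25 × 0.125 = 0.21875, which shows how a missed speech frame weighs three times as much as a false alarm at equal rates.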

Baseline System
As a starting point for the experimentation, the main objective is to obtain a baseline system against which further experiments can be compared. This baseline model is the one considered as the unadapted model, trained only with out-of-domain data. This model is fine-tuned in the following experiments using the methods previously explained in order to obtain a model adapted to the unseen domain. The experimental setup is built upon the description provided in Section 3. Concerning the details of the optimisation process, the Adam optimiser is used with a learning rate that decays exponentially from $10^{-3}$ to $10^{-4}$ during 20 epochs. Minibatch size is chosen to maximise the GPU memory usage. Model selection is performed by choosing the best performing model in terms of frame classification accuracy on the validation subset. Unless stated otherwise, these optimisation details are common to all the following experiments described in this paper.
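One way to realise such a schedule is sketched below. It assumes that the learning rate is updated once per epoch and reaches its final value at the last of the 20 epochs; the text does not say whether the decay is applied per epoch or per step, and `exponential_lr` is a hypothetical helper.

```python
def exponential_lr(epoch, n_epochs=20, lr_start=1e-3, lr_end=1e-4):
    """Learning rate at a given epoch (0-based) for an exponential decay."""
    # Constant multiplicative factor so that epoch n_epochs - 1 reaches lr_end.
    gamma = (lr_end / lr_start) ** (1.0 / (n_epochs - 1))
    return lr_start * gamma ** epoch


schedule = [exponential_lr(e) for e in range(20)]
```

Under this assumption the per-epoch factor is roughly 0.886, that is, the learning rate shrinks by about 11% from one epoch to the next.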
Results for the baseline system trained on broadcast, telephonic and meetings domains in terms of AUC and EER can be observed in Table 2. Additionally, we also report the results of one of our submissions to the original FS02 challenge [23] that was trained using the same experimental setup but with data coming from the training partition provided for the challenge. This result is presented in order to provide an upper bound for the performance of the neural architecture when it is trained with in-domain data. First, we report results for the three domains that the SAD baseline system has seen in the training process. In general terms, it can be observed that competitive results are obtained on the three domains shown, with EER values in the range of 6 to 7%. Focusing on the results obtained for the FS02 development partition, it can be observed that, as expected, the baseline system underperforms when compared to the challenge submission trained with in-domain data. In particular, a drop in performance close to 50% can be observed in terms of EER. In the following subsections, we aim to close the gap between the baseline model and the upper bound provided by the system submitted to the FS challenge by using unsupervised domain adaptation techniques.
Results shown in Table 2 are complemented with those presented in Figure 4. In this figure, we show the DET curve and EER for the baseline system and the challenge submission system on the FS02 development partition. As can be observed, a drop in performance similar to the one measured by the EER metric applies to all points of the DET curve. Furthermore, the baseline system tends to provide higher FNR values, whereas the challenge submission curve shows a relatively constant slope over the whole displayed range. As a point of comparison, for FPR = 1%, the challenge submission provides FNR = 10%, while the unadapted baseline system yields an FNR greater than 50%. Now that the baseline system has been appropriately characterised and set in context, in the following subsections we present the results for the three domain adaptation techniques described in the paper.
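To make the metrics used in this comparison concrete, the sketch below computes DET operating points, an approximate EER and the FNR at a fixed FPR from frame-level speech scores. The function names and the toy scores are hypothetical and serve only as an illustration; the EER is approximated at the threshold where FPR and FNR are closest, which is one of several common conventions.

```python
def det_points(scores, labels):
    """Return (threshold, FPR, FNR) triples, one per candidate threshold.

    scores: speech scores, higher means more speech-like.
    labels: 1 for speech frames, 0 for non-speech frames.
    A frame is classified as speech when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    # An infinite threshold is added so that the FPR = 0 end of the curve always exists.
    for thr in sorted(set(scores)) + [float("inf")]:
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        points.append((thr, fp / n_neg, fn / n_pos))
    return points


def equal_error_rate(points):
    """Approximate the EER as the mean of FPR and FNR where they are closest."""
    _, fpr, fnr = min(points, key=lambda p: abs(p[1] - p[2]))
    return (fpr + fnr) / 2


def fnr_at_fpr(points, max_fpr):
    """Lowest FNR among operating points whose FPR does not exceed max_fpr."""
    return min(fnr for _, fpr, fnr in points if fpr <= max_fpr)


# Toy data for illustration only (not taken from the paper).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 1]
points = det_points(scores, labels)
eer = equal_error_rate(points)
```

Reading the FNR at a fixed FPR, as in `fnr_at_fpr(points, 0.01)`, corresponds to the kind of single-operating-point comparison quoted in the text.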

Pseudo-Label Domain Adaptation
As described in Section 2.3.1, pseudo-labelling is a method traditionally used to alleviate the need for labelled data. In the following experiments, we use it to adapt a model to a new unseen domain. The first step of this technique is to obtain a set of pseudo-labels for the target data. To do so, we run the FS02 training partition through the previously obtained baseline SAD model and store the speech scores, which are then thresholded to obtain the final pseudo-labels. In the next step, those pseudo-labels are considered as the ground truth for the target data and used to train a new model in a supervised way.
Even though labels for the FS02 training partition are not used in this paper to train any model, those labels are still available and can be used to obtain an objective evaluation of the pseudo-labels produced by the baseline model. This evaluation is shown in Table 3 in terms of AUC and EER. Furthermore, as this method relies heavily on the operating point chosen for the pseudo-labels, we also report FPR and FNR for three operating points: one with low FPR, one with balanced FPR and FNR, and one with low FNR. By using these operating points, three sets of pseudo-labels are obtained. Experiments are performed separately for each set of pseudo-labels in order to observe the influence of the operating point on this domain adaptation strategy. As can be observed, the values for AUC and EER are in line with those obtained with the baseline model on the development partition, yet the EER is slightly greater for the training partition. Concerning the operating points considered, it should be noted that, in general terms, by using these pseudo-labels the neural network is dealing with approximately 15% wrong labels in the adaptation process.
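The thresholding step and the evaluation of the resulting pseudo-labels can be sketched as follows. The scores, reference labels and threshold values are toy numbers chosen for illustration, and the function names are hypothetical; they do not correspond to the actual operating points reported in Table 3.

```python
def make_pseudo_labels(scores, threshold):
    """Binarise speech scores: 1 (speech) when score >= threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def pseudo_label_quality(pseudo, reference):
    """Return (FPR, FNR, fraction of wrong labels) of pseudo-labels against reference labels."""
    fp = sum(1 for p, r in zip(pseudo, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(pseudo, reference) if p == 0 and r == 1)
    n_neg = reference.count(0)
    n_pos = reference.count(1)
    return fp / n_neg, fn / n_pos, (fp + fn) / len(reference)


# Toy scores and reference labels, for illustration only (not data from the paper).
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.65, 0.80, 0.95]
reference = [0, 0, 1, 0, 0, 1, 1, 1]

# Three hypothetical thresholds standing in for the low-FPR, balanced and low-FNR points.
operating_points = {"low_fpr": 0.60, "balanced": 0.50, "low_fnr": 0.30}
quality = {
    name: pseudo_label_quality(make_pseudo_labels(scores, thr), reference)
    for name, thr in operating_points.items()
}
```

Raising the threshold trades false positives for false negatives, which is why each operating point yields a different set of pseudo-labels and a different proportion of label noise.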
Once pseudo-labels have been obtained using the baseline model, several alternatives can be used to obtain a new adapted model. In this paper, we explore two of those alternatives. On the one hand, we train a new model from scratch using the same experimental setup as the one described for the baseline model but using FS02 training partition audio and the obtained pseudo-labels as ground-truth. On the other hand, we also evaluate the possibility of fine-tuning the baseline model via the obtained pseudo-labels for the FS02 training partition, using a learning rate ten times smaller than the one used in the original training process. Table 4 describes the obtained results in terms of AUC and EER for each possible operating point and for both training strategies. When compared to the unadapted baseline system, it can be observed that the results presented using the pseudo-labelling method share two common characteristics: a minor improvement in terms of AUC metrics, while the EER remains similar to the one reported in the baseline system. No significant difference can be observed between the two training strategies presented. Concerning the operating points for the pseudo-labels, the low FPR operating point yields the lowest EER for both training strategies, while at the same time also reporting the lowest AUC values.
In order to further understand the behaviour of the pseudo-labelling strategy, Figure 5 shows the DET curve and EER on the FS02 development partition for the baseline system and the systems trained using pseudo-labelling domain adaptation methods. From Figure 5, it can be seen that the pseudo-labelling technique provides no significant improvement in EER values. On the other hand, the improvement in the AUC metric observed previously comes from the improvement over the baseline system in the areas of the DET curve with high FPR and FNR values. In general terms, experimental results suggest that pseudo-labelling techniques can help to obtain a DET curve with constant slope, reducing the error in areas with high FPR and FNR values, while not significantly modifying the EER value of the system used to obtain the pseudo-labels.

Knowledge Distillation Domain Adaptation
Following the theoretical explanation provided in Section 2.3.2, we perform domain adaptation using the knowledge distillation framework applied to the SAD task. The training process is as follows. First, teacher and student models are initialised using the unadapted baseline model. Teacher model weights are frozen during the entire training, and only student model weights are updated. The outputs of both models are compared using the KLD loss after going through a softmax activation with a temperature parameter T, shared by the teacher and student models. At inference, the softmax layer is used in its standard form with temperature T = 1. Results in terms of AUC and EER using knowledge distillation are shown in Table 5 for various values of the temperature parameter T. A decreasing tendency for EER can be observed when increasing the temperature parameter up to T = 50, achieving a best EER value of 6.05%. However, this tendency is not consistent for the AUC metric, which shows values between 97.80 and 97.86. When compared to the baseline system, the best temperature configuration reports an 8.10% relative improvement in the EER value, yet this improvement in EER only leads to a 0.29% relative improvement in the AUC metric. In general terms, an improvement in performance can be observed using the temperature softmax activation, as argued by [51]; however, this improvement is limited.
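The loss described above can be illustrated with a minimal sketch of a temperature softmax followed by a KL divergence between teacher and student outputs. The direction KL(teacher || student) is an assumption based on common practice, since the text only states that a KLD loss is used; the function names and the logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature T (T > 1 gives softer outputs)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature):
    """KLD between teacher and student outputs softened with the same temperature.

    Assumption: the teacher distribution is the reference, i.e. KL(teacher || student).
    """
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


# Toy two-class logits (non-speech, speech) for a single frame; illustrative values only.
teacher = [2.0, -1.0]
student = [0.5, 0.5]
loss_t1 = distillation_loss(teacher, student, temperature=1.0)
loss_t50 = distillation_loss(teacher, student, temperature=50.0)
```

With a large temperature both distributions approach uniform, so the loss value shrinks and the student is driven by the relative ordering of the teacher outputs rather than by its most confident predictions.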
Results in Table 5 are complemented with those presented in Figure 6. In this figure, we present the DET curve and EER for some of the best performing knowledge distillation systems compared to the unadapted baseline system. As observed in Figure 6, unlike the pseudo-labelling strategy, knowledge distillation seems to be able to decrease the EER point on the DET curve by using a large temperature parameter. However, it can also be seen that all curves are very close to each other; this translates into AUC values very similar to those obtained by the baseline system. In general terms, it can also be observed that knowledge distillation does not correct the tendency of the baseline system to provide high FNR values. This may be due to the fact that the KLD loss makes the student network mimic the predictions of the teacher network, so the probability distributions and the relations between speech and non-speech observations should remain similar.

Deep CORAL Domain Adaptation
As a third alternative to the previously evaluated unsupervised domain adaptation techniques, in the following lines we experimentally evaluate the feasibility of Deep CORAL and its variations for the SAD task. Following the theoretical explanation provided in Section 2.3.3, we train a new model with the Deep CORAL and Log Deep CORAL techniques using the same experimental setup: the baseline model is used to initialise the new adapted model, which is then fine-tuned for 10 epochs using a learning rate decaying exponentially from 10⁻⁴ to 10⁻⁵ (10 times smaller than the one used to train the baseline model). CORAL and log CORAL losses are applied only on the final linear layer of the DNN classifier. The final loss is then computed as the cross-entropy loss on the source labels plus the respective CORAL loss weighted by a factor λ. As described in the original paper, the value of λ was chosen so that, at the end of training, the classification loss and the CORAL loss are of the same order of magnitude. Results obtained using both the Deep CORAL and Log Deep CORAL methods are shown in Table 6 in terms of AUC and EER for three λ values of the same order of magnitude. As observed in Table 6, both Deep CORAL and Log Deep CORAL provide the lowest EER values obtained in this work so far through model adaptation. The best result in terms of EER is obtained using the Log Deep CORAL method, with an EER of 5.36%, which represents an 18.17% relative improvement when compared to the unadapted baseline system. Concerning the AUC values, the results are in line with the best performing system using the pseudo-labelling techniques, and are also better than those of the knowledge distillation method. As done in previous experiments, we also report the DET curves in order to observe the behaviour of the adapted systems at all possible operating points. 
These curves are shown for the Deep CORAL (left) and Log Deep CORAL (right) systems in Figure 7 for multiple values of λ, compared to the DET curve of the baseline system. As observed, CORAL-based domain adaptation techniques provide the best improvement in the DET curve so far in this paper. When compared to the baseline system, besides significantly decreasing the EER, an overall improvement can be seen in the DET curve for the areas reporting high FNR values. This is the reason why the reported AUC value increases with respect to the baseline system. Both Deep CORAL and Log Deep CORAL seem to be insensitive to the value of λ, showing a similar performance as long as λ remains in the same order of magnitude.
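For reference, the combined objective used in CORAL-based adaptation can be sketched as below, following the standard Deep CORAL formulation in which the loss is the squared Frobenius distance between source and target feature covariances scaled by 1/(4d²). Only the plain CORAL term is shown; the logarithmic variant additionally requires a matrix logarithm of the covariances and is omitted here. Function names and the toy activations are hypothetical.

```python
def covariance(features):
    """Unbiased covariance matrix (d x d) of a batch given as n rows of d activations."""
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in features]
    return [
        [sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def coral_loss(source, target):
    """Squared Frobenius distance between source and target covariances, scaled by 1/(4 d^2)."""
    d = len(source[0])
    cs, ct = covariance(source), covariance(target)
    frob_sq = sum((cs[i][j] - ct[i][j]) ** 2 for i in range(d) for j in range(d))
    return frob_sq / (4 * d * d)


def total_loss(cross_entropy_source, source, target, lam):
    """Classification loss on labelled source data plus the lambda-weighted CORAL term."""
    return cross_entropy_source + lam * coral_loss(source, target)


# Toy layer activations for a source batch and a target batch (illustrative values only).
source_batch = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
target_batch = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 4.0]]
loss = coral_loss(source_batch, target_batch)
```

Because only second-order statistics of the activations are compared, the target batch needs no labels; the cross-entropy term is computed on the labelled source batch alone.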
By observing the behaviour of the Log Deep CORAL method and the pseudo-label strategies previously characterised, it becomes apparent that both solutions may be complementary. While the Log Deep CORAL DET curve shows no improvement over the baseline for high FPR values, the DET curve for the pseudo-labelling method obtains its best results in that area, making it the best performing of the three evaluated methods for high FPR values. This fact suggests that combining both methods, applying them in cascade, might provide even further improvements to the SAD neural network. This idea is evaluated experimentally in the following subsection.

Cascaded Application of Domain Adaptation Methods
In view of the results presented in previous subsections, this final experiment evaluates the possibility of applying two domain adaptation methods in a cascaded setup in order to further improve the results on the new unseen domain. The aim is to combine the capability of CORAL-based adaptation to obtain an overall boost in performance with that of pseudo-label adaptation to obtain a DET curve with a constant slope. The baseline model is first adapted using the Log Deep CORAL method. Then, the adapted model is used to extract a new set of pseudo-labels, which are later used to obtain a final model using both training strategies described in previous experiments: either training a new model from scratch, or fine-tuning the previous model. Table 7 shows an objective evaluation of the pseudo-labels obtained via the Log Deep CORAL model in terms of AUC, EER, FPR and FNR for the same three operating points considered in previous experiments. As expected, the overall quality of the pseudo-labels has improved when compared to those obtained using the baseline model (see Table 3). The improvements are in line with the ones observed on the FS02 development partition when using the Log Deep CORAL method, with the EER metric decreasing from 7.28% to 5.63%. With this new set of pseudo-labels, the adaptation process feeds the neural network with approximately 11% wrong labels, distributed according to the three operating points shown. Results obtained using this new set of pseudo-labels are described in Table 8 in terms of AUC and EER for each operating point and for both training strategies described. The previously observed behaviour of the pseudo-labelling method is repeated in this new experiment. 
In general terms, compared to the Log Deep CORAL model (AUC = 98.25%, EER = 5.48%), it can be seen that pseudo-labelling strategies provide an improvement in the AUC metric while maintaining a similar EER value. As a matter of fact, all the AUC values reported outperform those from previous experiments, with a best-case AUC of 98.86% obtained by training a new model from scratch using low FPR pseudo-labels. Concerning the training strategies considered, experimental results suggest that no significant difference can be found between training a new model from scratch and fine-tuning the model from the previous stage. In terms of operating point, the EER value obtained using low FPR is slightly lower for both training strategies; however, this difference becomes insignificant when considering the AUC metric.

Discussion
Once all the results for the unsupervised domain adaptation methods have been described, in this subsection we provide a brief discussion of their behaviour, setting them in the context of the original FS challenge and using the original DCF metric in order to obtain a performance comparison. Table 9 presents a summary of the best results obtained using all methods explored (pseudo-labels, KD, Deep CORAL) and the cascaded application of two of them, in terms of the AUC, EER and DCF metrics as used in the FS challenge. We also report the relative improvement obtained in the DCF metric compared to the unadapted baseline system. For comparison, we present the challenge baseline result provided by the FS challenge organisation [72], and our submission to the challenge trained using in-domain data.
Table 9. AUC (%), EER (%), DCF (%) and DCF relative improvement (%) over the unadapted baseline model on the FS02 development partition using the three evaluated domain adaptation methods, and the cascaded application of two of them, with the best performing hyperparameter configuration in terms of the DCF metric, compared to the original challenge baseline and our submission to the challenge trained using in-domain data.
Concerning the results of the pseudo-labelling strategy, we can observe a relative improvement in the DCF metric between 11% and 12% when compared to the unadapted baseline system. This improvement comes mainly from the correction made to the DET curve, with those systems showing a more constant slope. This means that systems adapted using this method show a better performance in areas with high FPR or high FNR while, at the same time, the EER remains very similar to that of the baseline system. The knowledge distillation systems offer the lowest improvement in DCF of all methods evaluated. 
Even though their configuration with high softmax temperature still shows a non-negligible 5.79% relative improvement compared to the baseline system, this solution has limited applicability. Its DET curve is very similar to the one obtained by the baseline system, with limited improvement in AUC. Finally, CORAL-based methods show some of the most promising results of this study. The best hyperparameter configuration using Log Deep CORAL achieves a DCF relative improvement of 13.23% compared to the baseline system. This result shows the lowest EER value in this paper for the application of a single technique, while also reporting an overall improvement for the full DET curve.

The best results in terms of DCF are obtained by applying Log Deep CORAL and pseudo-labelling in a cascaded setup. In the case of training a new model from scratch using pseudo-labels from the previous step, the DCF is 3.65%, which is equivalent to a 24.59% relative improvement compared to the unadapted baseline system. Furthermore, experimental results confirm that both methods are complementary. As shown, the total relative improvement with both techniques is roughly equivalent to the sum of the relative improvements obtained when they are used separately.
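The two quantities discussed here can be sketched as follows. The weights of 0.75 for false negatives and 0.25 for false positives are stated as an assumption based on the definition of the DCF in the Fearless Steps challenge and should be checked against the challenge evaluation plan; the function names and numeric inputs are hypothetical and are not results from this paper.

```python
def detection_cost(p_fn, p_fp, w_fn=0.75, w_fp=0.25):
    """Weighted detection cost; the default weights are an assumption (FS challenge DCF)."""
    return w_fn * p_fn + w_fp * p_fp


def relative_improvement(baseline, adapted):
    """Relative reduction (in %) of an error metric with respect to the baseline."""
    return 100.0 * (baseline - adapted) / baseline


# Toy values for illustration only (not results from the paper).
toy_dcf = detection_cost(p_fn=0.04, p_fp=0.02)
toy_gain = relative_improvement(baseline=4.0, adapted=3.0)
```

As a sanity check on the reported figures, adding the 13.23% obtained with Log Deep CORAL to the 11% to 12% obtained with pseudo-labelling gives roughly 24% to 25%, which brackets the 24.59% of the cascaded system; relative improvements are not additive in general, so this agreement is an empirical observation rather than a guaranteed property.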
As a final point in this discussion, and in order to provide a condensed view of the best results obtained in this work, Figure 8 presents the DET curve and EER of the three evaluated methods by themselves (left), and of the cascaded application of Log Deep CORAL and pseudo-labelling (right), using the best hyperparameter configuration in terms of the DCF metric, compared to the unadapted baseline and to our submission to the challenge trained using in-domain data. Observing the comparison of the single-method systems (left), it can be seen again that the most promising result in terms of improving the DET curve is obtained by the Log Deep CORAL method, while the improvement of the knowledge distillation method is limited. Focusing on the cascaded application of Log Deep CORAL and pseudo-labelling (right), it can be observed that the DET curve obtained for the cascaded application of both methods (in pink) supports again the hypothesis that both techniques are complementary. Applying the pseudo-labelling strategy on top of the Log Deep CORAL model results in a DET curve with a similar EER value but a significantly improved performance in areas with high FPR and high FNR. With this method we achieve the best DET curve in this paper which, as already presented, allows the DCF metric to be decreased even further, as this metric is more heavily influenced by false negative errors. As an example operating point for comparison, at an FPR value of 1%, the unadapted baseline system shows an FNR value greater than 50%. While the Log Deep CORAL model reduces this value to approximately 40%, the improvement is much more significant with the combination of both adaptation methods, which reports an FNR value of 25%.
Even though there is still a gap between the best results obtained and a model trained using in-domain data, experimental results have shown that, in practical scenarios where no labels are available, unsupervised domain adaptation can significantly improve the performance of SAD systems. The best results are obtained using an approach that combines two methods. The application of this solution increases the computational complexity of the adaptation process at training time. However, the increase in complexity introduced by pseudo-labelling is minimal compared to the one already introduced by Log Deep CORAL. The latter implies a training process that computes two covariance matrices for the CORAL loss in addition to a classification loss, while the former only requires inference on the unlabelled data followed by a simple training process with a classification loss. Furthermore, inference complexity remains the same in all cases, as only the final adapted model needs to be evaluated to obtain SAD predictions.

Conclusions and Future Lines
In this paper, we have explored the use of unsupervised domain adaptation techniques in the context of the SAD task. An initial baseline model was trained on a variety of well-known domains with large amounts of labelled data available. Then, a study was performed on three methods that perform adaptation directly in the model space, with the objective of fine-tuning the aforementioned baseline model using only unlabelled data.
We have used the data provided in the FS challenge, coming from a singular domain such as Apollo space missions, to experimentally validate in the SAD task the methods presented. Yet, the methods are general enough so that they can be easily applied to other datasets. Furthermore, no labels are required for them to be used, significantly reducing the constraints for choosing them in practical applications.
Through the application of Deep CORAL based domain adaptation methods, results show a 13% relative improvement in the DCF metric of the original challenge. Furthermore, the cascaded application of Deep CORAL and pseudo-labelling techniques provides the best results in this study, with a significant 24% relative improvement compared to the baseline system. These experimental results suggest that Deep CORAL and pseudo-labelling techniques are complementary. The first one provides an overall improvement in the DET curve and reduces the EER. The second one improves the AUC value by modifying the DET curve so that its slope becomes constant, especially in areas with high FPR and FNR values. The improvements in performance observed substantially reduce the gap for the SAD task between a system trained using in-domain data and an approach based on fully unsupervised adaptation.
Even if the knowledge distillation method shows an improvement in performance compared to the unadapted baseline model, this improvement is limited compared to the one observed with CORAL-based techniques. This kind of domain adaptation method, which seeks to minimise the statistical distribution shift between source and target domains, provides some of the most promising results in this paper. Some recent work has introduced the use of higher-order statistics for unsupervised domain adaptation [73], generalising the idea presented in Deep CORAL as an arbitrary-order moment matching technique. Some of our future lines of work may point in this direction, applying this idea to the SAD task.
The ICSI Meetings corpus is available from the Linguistic Data Consortium (LDC) under catalogue numbers LDC2004S02 and LDC2004T04 for audio and transcripts respectively. The CALLHOME dataset can also be found at the LDC under catalogue numbers LDC97S42 and LDC97T14 for audio and transcripts respectively. Fearless Steps Challenge data was made available to the challenge participants. Both partitions used in this work can be requested from their respective authors through the following contact email: FearlessSteps@utdallas.edu.