Unsupervised Pre-Training for Voice Activation

Featured Application: The proposed way to use unsupervised pre-training in voice activation could be beneficial in cases of limited data resources, e.g., in low-resource domains or for customizing a product for the end user using his or her voice data. Furthermore, the presented dataset for the Lithuanian language can be used for further research on voice-related problems in low-resource languages.
Abstract: The problem of voice activation is to find a pre-defined word in the audio stream. Solutions such as the keyword spotter “Ok, Google” for Android devices or the keyword spotter “Alexa” for Amazon devices use tens of thousands to millions of keyword examples in training. In this paper, we explore the possibility of using pre-trained audio features to build voice activation with a small number of keyword examples. The contribution of this article consists of two parts. First, we investigate the dependence of the quality of the voice activation system on the number of examples in training for English and Russian and show that the use of pre-trained audio features, such as wav2vec, increases the accuracy of the system by up to 10% if only seven examples are available for each keyword during training. At the same time, the benefits of such features diminish and eventually disappear as the dataset size increases. Secondly, we prepare and provide for general use a dataset for training and testing voice activation for the Lithuanian language. We also provide training results on this dataset.


Introduction
Voice activation systems solve the task of finding predefined keywords or keyphrases in an audio stream [1]. This task has attracted both researchers and industry for decades. Since it is difficult to formulate an explicit algorithm for determining whether a keyphrase has been uttered in an audio stream, it is not surprising that heuristic algorithms and machine learning methods have long been used for the voice activation problem.
The history of voice activation models has gone through several important stages in parallel with solving the more general problem of automatic speech recognition (ASR). We would like to highlight the following important moments: the beginning of the use of hidden Markov models back in 1989 [2], the use of neural networks since 1990 [3][4][5], the use of pattern matching approaches, in particular dynamic time warping (DTW) [6], building systems of voice activation for non-English languages such as Chinese [7], Japanese [8], and Iranian [9], publications describing voice activation systems in mass products [10][11][12][13], as well as publishing open datasets to compare different approaches [14].
Voice activation systems find applications in various areas: telephony [15], speech spoofing detection [16,17], crime analysis [18], assistance systems in emergency situations [19], automated management of airports [20], and, naturally, personal voice assistants built into mobile phones and home devices [11].
One of the problems of current state-of-the-art solutions is the use of very large datasets for training neural networks, e.g., the authors of [10] used hundreds of thousands of samples per keyword, and the authors of [21] used millions. This raises the question of how to build a high-quality voice activation system when only limited training data are available. This can be useful for the following reasons:
• customizing a product by using user-defined keywords, e.g., for personal voice assistants,
• creating voice activation for low-resource languages such as Lithuanian, Latvian, and others.
In this work, we propose to use unsupervised pre-training for building voice activation systems with limited training data. We use the wav2vec method [22] and show that it can improve the quality of the resulting system if there are fewer than 20 samples per keyword for several datasets, namely Google Speech Commands [14] and our private Russian dataset, even though the wav2vec model was trained only on English recordings. We verify this finding on a new Lithuanian dataset [23], which we collected and present in this work.

Low-Resource Keyword Spotting
Keyword spotting in a low-resource setting is a difficult task, which attracts many researchers. For example, the authors of [24] investigated feature choice for DTW. This research was done to support United Nations humanitarian relief efforts in parts of Africa with severely under-resourced languages. The authors compared multilingual bottleneck features of a model trained on well-resourced, but out-of-domain languages, and a correspondence autoencoder trained in a zero-resource fashion, as well as their combination. They found that this combination improves the quality of the voice activation system compared to the Mel-frequency cepstral coefficients (MFCC), which are widely used in ASR and voice activation [1].
In [25], the authors applied DTW on Gaussian posteriorgrams from a Gaussian mixture model trained in an unsupervised fashion. The authors of [26] proposed to use tandem acoustic models on different languages to obtain good bottleneck features.

Unsupervised Pre-Training for Speech
Unsupervised pre-training is one of the methods to cope with limited resources and generally improve the quality of the resulting neural network [27]. The idea is to use a large corpus with no pre-existing labels to learn patterns in data and then to fine-tune the model on the data with labels.
There are works on how to apply pre-training in voice-related problems. For example, the authors of [28] used per-layer pre-training to improve the quality of deep neural network-based ASR.
A promising way to perform unsupervised pre-training for speech is to learn audio features instead of using classical MFCCs or log-Mel filter bank features. For example, a problem-agnostic speech encoder [29] is a feature extractor trained by jointly optimizing multiple self-supervised objectives. Autoregressive predictive coding [30] and contrastive predictive coding [31] are feature extractors trained with the objective of predicting some future frames, or information about them, given access to the information about the current and some past frames. The authors of these works tested audio features on problems of speech recognition, speaker identification, phone classification, and speech translation.
In our paper, we use the wav2vec model [22]. It is a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. The authors of [22] reported outperforming the best reported character-based system in the literature while using two orders of magnitude less labeled training data on the ASR task.
To the best of our knowledge, our work is the first attempt to use pre-trained audio features like wav2vec for a voice activation problem in a low-resource setup.

Datasets
We used the following three datasets in our experiments:
• English dataset: Google Speech Commands [14],
• Russian dataset: private dataset,
• Lithuanian dataset: collected by us [23].
The Google Speech Commands dataset [14] was released in August 2017 under a Creative Commons license. The dataset contains around 100,000 one-second-long utterances of 30 short words by thousands of different people, as well as background noise samples such as pink noise, white noise, and human-made sounds. Following the Google implementation [14], our task is to discriminate among 12 classes: "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", unknown, and silence.

Model
We used two types of audio features and two types of neural network architectures in our experiments. We used either log-Mel filter banks or pre-trained audio features from the wav2vec model [22].
The log-Mel filter bank features are often chosen for building voice activation or speech recognition systems [1,32]. We used the Kaldi [33] implementation of feature computation with the following parameters: a frame width of 25 ms, a frame shift of 10 ms, and 80 bins. Thus, we got a 98 × 80 feature matrix by computing log-Mel filter banks on one-second samples from the datasets. The method torchaudio.compliance.kaldi.fbank can be used in PyTorch [34] to reproduce this computation.
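The 98 × 80 shape follows from the framing arithmetic. The sketch below is only an illustration, assuming 16 kHz audio and a framing rule that keeps only frames lying entirely inside the signal (both are assumptions, not stated in the text); the function name is hypothetical.

```python
def num_frames(num_samples: int,
               sample_rate: int = 16000,       # assumed sampling rate
               frame_width_ms: float = 25.0,   # frame width from the text
               frame_shift_ms: float = 10.0) -> int:  # frame shift from the text
    """Count the frames that fit entirely inside a signal."""
    window = int(sample_rate * frame_width_ms / 1000)  # samples per frame
    shift = int(sample_rate * frame_shift_ms / 1000)   # samples between frame starts
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


# One second of audio with 80 filter bank bins per frame.
print(num_frames(16000), "x", 80)
```

Under these assumptions, a one-second sample gives 98 frames, which matches the feature matrix height reported above.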
In the case of wav2vec audio features, we used the pre-trained model from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md#wav2vec. We got a 98 × 512 feature matrix as an input for the neural network.
We used the following neural network architectures: a three-layer fully-connected neural network and residual neural networks (ResNets) as described in [32].
Our fully-connected neural network consisted of the following blocks:
• fully-connected layer of size 128,
• rectified linear unit (ReLU) as an activation function [35],
• fully-connected layer of size 64,
• ReLU,
• flattening of a T × 64 matrix into a 64T vector, where T is the number of frames in a sample (98 in all our experiments),
• fully-connected layer of size C, where C is the number of classes to discriminate,
• softmax layer.
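To make the layer sizes concrete, the following sketch counts the weights of the fully-connected stack listed above. It assumes that the first two layers are applied to each frame independently and that every layer has a bias term; neither detail is stated in the text, so the totals are illustrative and need not coincide with the reported parameter counts. The function name is hypothetical.

```python
def fc_param_count(feature_dim: int,
                   num_frames: int = 98,
                   num_classes: int = 12,
                   bias: bool = True) -> int:
    """Count trainable parameters of the three fully-connected layers."""
    layers = [
        (feature_dim, 128),             # per-frame layer of size 128
        (128, 64),                      # per-frame layer of size 64
        (64 * num_frames, num_classes), # output layer on the flattened 64T vector
    ]
    total = 0
    for fan_in, fan_out in layers:
        total += fan_in * fan_out       # weight matrix
        if bias:
            total += fan_out            # bias vector (assumed to be present)
    return total


for name, dim in [("log-Mel filter banks", 80), ("wav2vec", 512)]:
    print(name, fc_param_count(dim))
```

In this sketch, most of the parameters of the log-Mel variant sit in the output layer, because the flattened vector has 64 × 98 entries.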
The fully-connected architecture is presented in Figure 1. The ResNets that we used in our experiments were based on [36] and repeat the solutions found in [32]. The authors of [36] proposed that, if H(x) is the true mapping, it may be easier to learn the residual F(x) = H(x) − x, so that a block computes H(x) = F(x) + x, since it is empirically difficult to learn the identity mapping when the model has an unnecessary depth. In ResNets, residuals are expressed via connections between layers (see Figure 2), where the input of some layer is added to the output of some downstream layer.
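The skip connection can be illustrated with a toy example. This is a sketch on plain Python lists, not the convolutional blocks used in the experiments; both function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, branch: Callable[[Vector], Vector]) -> Vector:
    """Add the input of the block to the output of its learned branch."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("branch must preserve the input size")
    return [f + xi for f, xi in zip(fx, x)]


def zero_branch(x: Vector) -> Vector:
    """A branch whose weights are all zero, so it outputs zeros."""
    return [0.0] * len(x)


# With a zero branch, the block reduces to the identity mapping.
print(residual_block([1.0, -2.0, 3.0], zero_branch))
```

The example shows why unnecessary depth is less harmful with skip connections: a block only has to push its branch towards zero to behave like the identity.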
All the layers were zero-padded. For some variants, dilated convolutions were applied to increase the receptive field of the model. The parameters of all used ResNet architectures [32] can be seen in Table 1. The number of trainable parameters in the architectures used is reported in Table 2 for the Google Speech Commands dataset [14] and the log-Mel filter banks. The names of the residual networks follow [32]; FC stands for the fully-connected layer, and conv stands for the convolutional layer. More details about the residual architectures can be found in [32].

Training Procedure
Our experiments followed exactly the same procedure as the TensorFlow reference for the Google Speech Commands dataset [14]. The Speech Commands Dataset was split into training, validation, and test sets, with 80% training, 10% validation, and 10% test. This resulted in roughly 80,000 examples for training and 10,000 each for validation and testing. For the Russian dataset, these numbers were roughly 320,000 and 40,000. For the Lithuanian dataset [23], we had 326 records for training, 75 for validation, and 88 for testing (we skewed the distribution to ensure more stable test results). For consistency across runs, the SHA1-hashed name of the audio file from the dataset determined the split.
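A hash-based split of this kind can be sketched as follows. The sketch is modelled on the idea described above rather than copied from the reference code: the `_nohash_` file name convention, the bucket constant, and the function name are assumptions.

```python
import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # assumed bucket count for the hash


def which_set(filename: str,
              validation_percentage: float = 10.0,
              testing_percentage: float = 10.0) -> str:
    """Assign a file to a split using only a hash of its name."""
    base_name = os.path.basename(filename)
    # Assumed convention: anything after "_nohash_" is ignored, so that
    # repeated recordings of one speaker land in the same split.
    speaker_key = re.sub(r"_nohash_.*$", "", base_name)
    digest = hashlib.sha1(speaker_key.encode("utf-8")).hexdigest()
    percentage = ((int(digest, 16) % (MAX_NUM_WAVS_PER_CLASS + 1))
                  * (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage < validation_percentage:
        return "validation"
    if percentage < validation_percentage + testing_percentage:
        return "testing"
    return "training"


print(which_set("yes/0a7c2a8d_nohash_0.wav"))
```

Because the assignment depends only on the file name, adding new recordings later does not move existing files between the splits.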
To generate the training data, we followed the Google preprocessing procedure by adding background noise to each sample with a probability of 0.7 at every epoch, where the noise was chosen randomly from the background noises provided in the dataset.
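The augmentation step can be sketched as below. The mixing volume, the clipping to [-1, 1], and the function name are assumptions made for illustration; only the probability of 0.7 and the random choice of a noise recording come from the text.

```python
import random
from typing import List, Sequence


def maybe_add_noise(sample: Sequence[float],
                    noises: Sequence[Sequence[float]],
                    rng: random.Random,
                    prob: float = 0.7,     # probability from the text
                    volume: float = 0.1) -> List[float]:  # assumed noise gain
    """Mix a random crop of a random noise recording into the sample."""
    if not noises or rng.random() >= prob:
        return list(sample)
    noise = rng.choice(noises)
    # Assumes every noise recording is at least as long as the sample.
    offset = rng.randrange(len(noise) - len(sample) + 1)
    mixed = (s + volume * noise[offset + i] for i, s in enumerate(sample))
    return [max(-1.0, min(1.0, m)) for m in mixed]


rng = random.Random(0)
clean = [0.5, -0.5, 0.25]
print(maybe_add_noise(clean, [[1.0] * 10], rng, prob=1.0))
```

Calling the function anew at every epoch means that the same utterance is seen with different noise crops during training.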
Accuracy was our main metric of quality, which is simply measured as the fraction of classification decisions that are correct. For each input utterance, the model outputs the most likely predicted class.
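As a small illustration of this metric, the sketch below takes per-class scores, picks the most likely class for each utterance, and computes the fraction of correct decisions. The function names and the toy scores are hypothetical.

```python
from typing import Sequence


def predict(scores: Sequence[float]) -> int:
    """Return the index of the most likely class."""
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(all_scores: Sequence[Sequence[float]],
             labels: Sequence[int]) -> float:
    """Fraction of utterances whose predicted class equals the label."""
    if not labels or len(all_scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and aligned")
    correct = sum(predict(s) == y for s, y in zip(all_scores, labels))
    return correct / len(labels)


scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
print(accuracy(scores, [1, 0, 0]))
```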
We ran an extensive random hyperparameter search [38] for all experiments in order to reliably compare audio features and architectures. We used stochastic gradient descent with initial learning rate L, momentum 0.9, and mini-batch size BS (see Appendix A for the specific values of the hyperparameters). The validation metrics (cross-entropy loss and accuracy) were computed every S steps of optimization. The maximal validation accuracy was stored. If the new validation accuracy was not bigger than the maximal or if the cross-entropy loss obtained a "not a number" value, the weights of the best (by validation accuracy) step were loaded, and the learning rate was dropped by a factor of L_drop. The training process stopped when the learning rate drop happened for the sixth time. The test accuracy was computed exactly once: on the best model by the validation accuracy at the end of the training process. We report the test accuracy in this work.
We chose −log_10 L from U{0, 3}, log_2 BS from U{4, 7}, log_2 S from U{3, 12}, and L_drop from U{1.1, 10.0}, where U is a uniform distribution (discrete uniform distribution in the case of S and BS).
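The search space and the stopping rule can be sketched as follows. The sketch assumes that the best validation accuracy is tracked and that a drop is triggered whenever a validation step fails to improve on it or the loss is "not a number"; the class, function, and key names are hypothetical.

```python
import math
import random


def sample_hyperparameters(rng: random.Random) -> dict:
    """Draw one configuration from the search space described above."""
    return {
        "learning_rate": 10.0 ** (-rng.uniform(0.0, 3.0)),  # -log10 L in [0, 3]
        "batch_size": 2 ** rng.randint(4, 7),               # log2 BS in {4..7}
        "eval_every": 2 ** rng.randint(3, 12),              # log2 S in {3..12}
        "lr_drop": rng.uniform(1.1, 10.0),                  # learning rate drop factor
    }


class DropOnPlateau:
    """Track the best validation accuracy and drop the learning rate."""

    def __init__(self, learning_rate: float, lr_drop: float, max_drops: int = 6):
        self.learning_rate = learning_rate
        self.lr_drop = lr_drop
        self.max_drops = max_drops
        self.best_accuracy = -math.inf
        self.drops = 0

    def update(self, val_accuracy: float, val_loss: float) -> str:
        """Return "improved", "dropped", or "stop" for one validation step."""
        if not math.isnan(val_loss) and val_accuracy > self.best_accuracy:
            self.best_accuracy = val_accuracy
            return "improved"  # the caller saves the weights of this step
        self.drops += 1
        if self.drops >= self.max_drops:
            return "stop"      # the sixth drop ends the training
        self.learning_rate /= self.lr_drop
        return "dropped"       # the caller reloads the best weights


hp = sample_hyperparameters(random.Random(0))
print(hp)
```

A training loop would call `update` every `eval_every` optimization steps and act on the returned string; the test accuracy is then computed once on the best saved weights.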

Results
In this section, we present only the test metrics in order to not clutter the description. For the hyperparameters' choice, see Appendix A.
In order to get baseline metrics, we ran experiments on full datasets with both log-Mel filter banks and wav2vec features. The best results of these runs are presented in Table 3 for the English dataset. We got slightly better results than in [32]. This can be explained by the following reasons:
• the Google Speech Commands dataset [14] has been extended since its publication,
• we used a more extensive hyperparameter search.
We made the following conclusions from the results:
• wav2vec audio features give a competitive result for the voice activation problem with very simple downstream models such as the feedforward neural network,
• the benefit of unsupervised pre-training vanishes as the model gets more sophisticated and deep.
We repeated the same experiments for the Russian dataset and got similar results (Table 4): ff and ResNet8-narrow as the simplest models got better results with wav2vec audio features. However, the overall best result was still with log-Mel filter banks: 97.22%. The best result of wav2vec runs was 96.62%, which was worse, but still very competitive.
It is worth noting that the wav2vec model was trained on the LibriSpeech dataset [39], which contains only English audio books. It is promising that, using this model, it was possible to get good accuracy on both the Russian and Lithuanian datasets (see Table 5). Moreover, we got better results on the Lithuanian dataset using wav2vec than using log-Mel filter banks (90.77% vs. 89.23%).
Next, we ran experiments with a small amount of training data. In order to do that, we limited the number of training samples per keyword to 3, 5, 7, 10, and 20 for all the datasets. Note that the limit of 20 is effectively the same as using the whole dataset for the Lithuanian language. The size of the validation and test sets remained the same in order to get reliable and comparable results. We used random search with all the models and report the test accuracy of the best runs.
The motivation for these experiments is as follows. First of all, the authors of [22] reported that they got state-of-the-art results in automatic speech recognition with unsupervised pre-training in the case when limited training data were available. Secondly, our first set of experiments showed that wav2vec audio features are superior when the machine learning model is simple. Simpler models tend to perform better when a dataset is small. Therefore, it might be profitable to use unsupervised pre-trained audio features in this scenario.
The results of these experiments are summarized in Table 6. It can be seen that the use of pre-trained audio features such as wav2vec increases the system accuracy by approximately 10% when up to 10 samples are used per keyword for both the English and Russian languages, despite the fact that the model was only trained on English audio records. The increase is even bigger if five samples are used. It almost vanishes if up to 20 samples are used.
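The per-keyword limits used in the low-resource experiments can be sketched as a simple subsampling step on the training set. The fixed seed, the (path, label) representation, and the function name are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

Example = Tuple[str, str]  # (path to the audio file, keyword label)


def limit_per_keyword(examples: Sequence[Example],
                      limit: int,
                      seed: int = 0) -> List[Example]:
    """Keep at most `limit` randomly chosen training examples per keyword."""
    by_label = defaultdict(list)
    for path, label in examples:
        by_label[label].append(path)
    rng = random.Random(seed)
    subset: List[Example] = []
    for label in sorted(by_label):      # sorted for reproducibility
        paths = sorted(by_label[label])
        rng.shuffle(paths)
        subset.extend((p, label) for p in paths[:limit])
    return subset


train = [(f"yes_{i}.wav", "yes") for i in range(10)]
train += [(f"no_{i}.wav", "no") for i in range(2)]
print(limit_per_keyword(train, limit=3))
```

Only the training set is reduced in this way; the validation and test sets are left untouched, as described above.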

Collecting the Lithuanian Dataset
In order to boost voice activation research in Lithuanian, we prepared a dataset in the format of Google Speech Commands [14]. This section describes how we carried out the data collection and preparation.
We asked several volunteers to record these words in the specified order on their mobile devices. We did not restrict the speed of pronunciation, but asked them to make pauses between words. We collected 28 records ranging from 24 to 64 s. We checked all the records for the correct order of words.
The next step was to segment these records into one-second samples. We used Audacity v2.2.1 (Audacity® software is copyright © 1999-2020 Audacity Team. The name Audacity® is a registered trademark of Dominic Mazzoni) in the following way:
• apply the "Sound Finder" analysis tool with default parameters (Figure 3),
• if 20 sound segments were not found, remove all segments, and manually reselect them,
• listen to each sound segment; move the start or the end of the segment if necessary,
• make sure that segments have numbers from 1 to 20 as labels,
• export labels to a separate text file.
The resulting text file has the following format: "s_i e_i label", where s_i is the start of the i-th segment in seconds and e_i is the end. Using these files, we prepared the dataset with the following algorithm:
• skip the current segment if e_i − s_i > 1, because the word itself is longer than one second,
• skip the current segment if s_{i+1} − e_{i−1} < 1, because such a one-second interval will contain parts of neighboring words,
• for each segment, compute the range [A_i, B_i] where the start of the one-second interval can be chosen (ε = 0.1 in order to make at least a short pause before the start of the word; e_0 = 0; s_{i+1} for the last segment is equal to the whole utterance duration),
• pick S_i uniformly from [A_i, B_i], and cut the segment [S_i, S_i + 1] as a separate sample for the i-th word.
For each audio segment between the words that was longer than one second, we uniformly picked exactly one one-second sub-segment and used it as background noise. We got 292 no-speech segments in such a fashion.
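The cutting algorithm above can be sketched as follows. The bounds A_i = max(e_{i−1}, e_i − 1) and B_i = min(s_i − ε, s_{i+1} − 1) are an assumption derived from the stated constraints (the window must contain the word, must not touch neighboring words, and must start at least ε before the word); they are not taken from the published code. The function names are hypothetical, and the label file is assumed to be whitespace-separated.

```python
import random
from typing import List, Tuple

EPS = 0.1     # minimal pause before the word, from the text
WINDOW = 1.0  # length of one sample in seconds

Segment = Tuple[float, float, str]  # (start, end, label)


def parse_labels(text: str) -> List[Segment]:
    """Parse lines of the form "s_i e_i label"."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            segments.append((float(parts[0]), float(parts[1]), parts[2]))
    return segments


def cut_points(segments: List[Segment], total_duration: float,
               rng: random.Random) -> List[Tuple[str, float, float]]:
    """Choose a one-second window around each usable word segment."""
    cuts = []
    for i, (start, end, label) in enumerate(segments):
        prev_end = segments[i - 1][1] if i > 0 else 0.0
        next_start = (segments[i + 1][0] if i + 1 < len(segments)
                      else total_duration)
        if end - start > WINDOW:
            continue  # the word itself is longer than one second
        if next_start - prev_end < WINDOW:
            continue  # any window would contain neighboring words
        low = max(prev_end, end - WINDOW)            # assumed A_i
        high = min(start - EPS, next_start - WINDOW) # assumed B_i
        if low > high:
            continue  # assumed handling of an empty range
        begin = rng.uniform(low, high)
        cuts.append((label, begin, begin + WINDOW))
    return cuts


labels = parse_labels("0.5 1.0 1\n2.0 2.6 2\n")
print(cut_points(labels, total_duration=4.0, rng=random.Random(0)))
```

Each returned triple gives the label and the boundaries of the one-second window to be cut from the raw recording.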
The raw recordings, text files with labels, code to perform the cutting, and the dataset itself are available at https://github.com/kolesov93/lt_speech_commands.

Conclusions
In this work, we proposed to use pre-trained audio features for voice activation systems in the case of limited training data. The experiments on the Google Speech Commands dataset [14] and our Russian dataset showed that the proposed audio features improve the accuracy of the voice activation system by 10% when the number of samples per keyword is seven or fewer and by 29% if the number of samples per keyword is five or fewer. It is also worth noting that we only used the wav2vec model pre-trained on English audio recordings. The improvement, however, vanished when the whole datasets were used, which may indicate the limits of the proposed method. Furthermore, we collected a Lithuanian dataset [23] for voice activation and reproduced our results on it.
In future work, other methods of unsupervised pre-training for voice activation can be investigated and compared to the wav2vec method. Additionally, the influence of domain mismatch between unsupervised pre-training of audio features and the downstream voice activation task can be studied.
Author Contributions: Conceptualization, methodology, software, writing, original draft preparation, visualization: A.K.; writing, review and editing, supervision, project administration, data curation: D.Š. All authors read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In this Appendix, we provide the choice of the hyperparameters for the results presented in Section 3.4. See Table A1 for the results on the Google Speech Commands dataset [14], Table A2 for the results on the Russian dataset, and Table A3 for the results on the Lithuanian dataset [23].