A Method Improves Speech Recognition with Contrastive Learning in Low-Resource Languages
Abstract
1. Introduction
- We propose the false negatives impact elimination (FNIE) method and optimize the corresponding loss function to improve the quality of the speech negative sample set, allowing the model to learn better speech representations and achieve better results in low-resource speech recognition;
- We found that simply increasing the number of negative samples does not improve the model’s ability to learn speech representations;
- By training our model on a variety of low-resource languages, we have validated its feasibility and effectiveness.
2. Related Work
2.1. Contrastive Predictive Coding
2.2. wav2vec
2.3. wav2vec 2.0
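For context, the masked contrastive objective defined in the cited wav2vec 2.0 paper (Baevski et al., 2020) is the loss whose negative sample set FNIE later adjusts; it is reproduced here from that reference, not from this article's own text:

```latex
% wav2vec 2.0 contrastive loss at a masked time step t (Baevski et al., 2020).
% c_t : context ("support") vector, q_t : true quantized target,
% Q_t : q_t plus K sampled distractors (negative samples),
% sim : cosine similarity, \kappa : temperature.
\mathcal{L}_m = -\log
  \frac{\exp\!\big(\mathrm{sim}(\mathbf{c}_t, \mathbf{q}_t)/\kappa\big)}
       {\sum_{\tilde{\mathbf{q}} \in Q_t} \exp\!\big(\mathrm{sim}(\mathbf{c}_t, \tilde{\mathbf{q}})/\kappa\big)}
```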
3. Methods
3.1. Model
3.2. FNIE
3.2.1. Finding False Negative Samples
- For each time step $t$, the original audio feature $z_t$ is transformed by the Transformer function $g(\cdot)$ to produce the support vector $c_t = g(z_t)$.
- For each negative sample $\tilde{q}_i$ ($i = 1, \dots, K$) and the support vector $c_t$, calculate the similarity score $s_i = \mathrm{sim}(c_t, \tilde{q}_i)$.
- Sort the $K$ similarity scores for that step in descending order: $s_{(1)} \ge s_{(2)} \ge \dots \ge s_{(K)}$.
- Define the set of potential false negative samples $F_t = \{\tilde{q}_{(1)}, \dots, \tilde{q}_{(N)}\}$, i.e., the $N$ negative samples most similar to the support vector at that step, where $N \le K$ (a code sketch of this selection step follows the list).
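A minimal sketch of this selection step, assuming PyTorch tensors and cosine similarity as the scoring function; the function and variable names are illustrative rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def potential_false_negatives(c_t: torch.Tensor, negatives: torch.Tensor, n: int) -> torch.Tensor:
    """Return the indices of the n negatives most similar to the support vector.

    c_t:       (dim,)    support vector for one masked time step t
    negatives: (K, dim)  sampled negative candidates for the same step
    n:         number of most-similar negatives to flag (n <= K)
    """
    # Similarity score between the support vector and every negative sample.
    scores = F.cosine_similarity(negatives, c_t.unsqueeze(0), dim=-1)  # (K,)
    # Sort the scores in descending order and keep the n highest-scoring negatives.
    ranked = torch.argsort(scores, descending=True)
    return ranked[:n]

# Toy usage: K = 100 sampled negatives, flag the N = 2 most similar ones.
torch.manual_seed(0)
c_t = torch.randn(256)
negatives = torch.randn(100, 256)
flagged = potential_false_negatives(c_t, negatives, n=2)
```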
3.2.2. Delete
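A minimal sketch of the Delete variant under assumptions (PyTorch, an InfoNCE-style loss with cosine similarity and a temperature): the flagged samples are simply removed from the negative set before the loss is computed, so only the remaining K − N distractors appear in the denominator. This is an illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def delete_loss(c_t, q_t, negatives, flagged_idx, temperature=0.1):
    """InfoNCE-style loss with flagged false negatives removed from the negative set.

    c_t:         (dim,)   support vector at step t
    q_t:         (dim,)   positive (target) sample at step t
    negatives:   (K, dim) sampled negative samples
    flagged_idx: (N,)     indices of the potential false negatives to delete
    """
    keep = torch.ones(negatives.size(0), dtype=torch.bool)
    keep[flagged_idx] = False
    kept = negatives[keep]                                                    # (K - N, dim)

    pos = F.cosine_similarity(c_t, q_t, dim=-1) / temperature                 # scalar
    neg = F.cosine_similarity(kept, c_t.unsqueeze(0), dim=-1) / temperature   # (K - N,)

    logits = torch.cat([pos.unsqueeze(0), neg])     # positive logit first
    return -F.log_softmax(logits, dim=0)[0]         # -log p(positive | candidates)
```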
3.2.3. Assimilate
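The Assimilate variant is sketched below under a stronger assumption: instead of discarding the flagged samples, they are treated as additional positive-like terms and pulled toward the support vector (false negative attraction in the spirit of Huynh et al.), with their log-probabilities averaged with that of the true positive. This is one plausible reading, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def assimilate_loss(c_t, q_t, negatives, flagged_idx, temperature=0.1):
    """InfoNCE-style loss in which flagged false negatives are treated as extra
    positive-like samples rather than being pushed away from the support vector."""
    keep = torch.ones(negatives.size(0), dtype=torch.bool)
    keep[flagged_idx] = False
    kept, flagged = negatives[keep], negatives[~keep]

    def score(x):
        # Cosine similarity to the support vector, scaled by the temperature.
        return F.cosine_similarity(x, c_t.unsqueeze(0), dim=-1) / temperature

    pos = score(q_t.unsqueeze(0))    # (1,)     true positive
    extra = score(flagged)           # (N,)     assimilated (flagged) samples
    neg = score(kept)                # (K - N,) remaining negatives

    logits = torch.cat([pos, extra, neg])
    log_p = F.log_softmax(logits, dim=0)
    # Average over all positive-like terms so each assimilated sample contributes equally.
    return -log_p[: 1 + extra.numel()].mean()
```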
4. Experimental Methods
4.1. Corpus
4.2. Implementation Details
5. Experimental Results
5.1. Results of Deleting False Negative Samples
5.2. Results of Assimilating False Negative Samples
5.3. Discussion on the Number of Negative Samples
5.4. Results of Different Languages
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011; IEEE Signal Processing Society: Piscataway, NJ, USA, 2011.
- Li, J.; Ye, G.; Das, A.; Zhao, R.; Gong, Y. Advancing acoustic-to-word CTC model. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5794–5798.
- Karita, S.; Chen, N.; Hayashi, T.; Hori, T.; Inaguma, H.; Jiang, Z.; Someki, M.; Soplin, N.E.Y.; Yamamoto, R.; Wang, X.; et al. A comparative study on transformer vs rnn in speech applications. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 449–456.
- Nakatani, T. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration. In Proceedings of Interspeech 2019, Graz, Austria, 15–19 September 2019.
- Dong, L.; Xu, S.; Xu, B. Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5884–5888.
- Kim, C.; Gowda, D.; Lee, D.; Kim, J.; Kumar, A.; Kim, S.; Garg, A.; Han, C. A review of on-device fully neural end-to-end automatic speech recognition algorithms. In Proceedings of the 2020 54th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 1–5 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 277–283.
- David, M.; Simons, G.F.; Fennig, C.D. Ethnologue: Languages of the World, 26th ed.; SIL International: Dallas, TX, USA, 2023; Available online: http://www.ethnologue.com (accessed on 16 December 2022).
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; OpenAI: San Francisco, CA, USA, 2018.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: London, UK, 2020; pp. 1597–1607.
- Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738.
- Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 2020, 33, 9912–9924.
- Song, X.; Huang, L.; Xue, H.; Hu, S. Supervised prototypical contrastive learning for emotion recognition in conversation. arXiv 2022, arXiv:2210.08713.
- Zhang, R.; Ji, Y.; Zhang, Y.; Passonneau, R.J. Contrastive Data and Learning for Natural Language Processing. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts, Online, 10–15 July 2022; pp. 39–47.
- Thota, M.; Leontidis, G. Contrastive domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2209–2218.
- Wang, T.; Isola, P. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; PMLR: London, UK, 2020; pp. 9929–9939.
- Hoffmann, D.T.; Behrmann, N.; Gall, J.; Brox, T.; Noroozi, M. Ranking info noise contrastive estimation: Boosting contrastive learning via ranked positives. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 897–905.
- Huynh, T.; Kornblith, S.; Walter, M.R.; Maire, M.; Khademi, M. Boosting contrastive self-supervised learning with false negative cancellation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2785–2795.
- Kalantidis, Y.; Sariyildiz, M.B.; Pion, N.; Weinzaepfel, P.; Larlus, D. Hard negative mixing for contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21798–21809.
- Zhang, Y.; Zhang, R.; Mensah, S.; Liu, X.; Mao, Y. Unsupervised sentence representation via contrastive learning with mixing negatives. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 11730–11738.
- Cao, R.; Wang, Y.; Liang, Y.; Gao, L.; Zheng, J.; Ren, J.; Wang, Z. Exploring the Impact of Negative Samples of Contrastive Learning: A Case Study of Sentence Embedding. arXiv 2022, arXiv:2202.13093.
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
- Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010; JMLR Workshop and Conference Proceedings; pp. 297–304.
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862.
- Baevski, A.; Schneider, S.; Auli, M. vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv 2019, arXiv:1910.05453.
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460.
- Jang, E.; Gu, S.; Poole, B. Categorical reparameterization with gumbel-softmax. arXiv 2016, arXiv:1611.01144.
- Bai, J.; Li, B.; Zhang, Y.; Bapna, A.; Siddhartha, N.; Sim, K.C.; Sainath, T.N. Joint unsupervised and supervised training for multilingual asr. In Proceedings of the ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 6402–6406.
- Rao, K.; Sak, H.; Prabhavalkar, R. Exploring architectures, data and units for streaming end-to-end speech recognition with rnn-transducer. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 193–199.
- Zhu, H.; Wang, L.; Wang, J.; Cheng, G.; Zhang, P.; Yan, Y. Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR. arXiv 2021, arXiv:2110.04484.
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660.
- Li, M.; Huang, P.Y.; Chang, X.; Hu, J.; Yang, Y.; Hauptmann, A. Video pivoting unsupervised multi-modal machine translation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3918–3932.
- Afham, M.; Rodrigo, R. Visual-Semantic Contrastive Alignment for Few-Shot Image Classification. arXiv 2022, arXiv:2210.11000.
- Yan, C.; Chang, X.; Li, Z.; Guan, W.; Ge, Z.; Zhu, L.; Zheng, Q. Zeronas: Differentiable generative adversarial networks search for zero-shot learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9733–9740.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An asr corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 5206–5210.
- Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Republic of Korea, 1–3 November 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–5.
- Wang, D.; Zhang, X. Thchs-30: A free chinese speech corpus. arXiv 2015, arXiv:1512.01882.
- Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670.
- Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. arXiv 2019, arXiv:1904.01038.
- Heafield, K. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, Edinburgh, UK, 30–31 July 2011; pp. 187–197.
Dataset | Language | Duration | Format | Sampling Rate | Speakers | Units
---|---|---|---|---|---|---
LibriSpeech train-clean-100 | English | 100 h | FLAC | 16 kHz | 251 | subword
AISHELL-1 | Mandarin | 150 h | WAV | 16 kHz | 340 | character
THCHS-30 | Mandarin | 33.5 h | WAV | 16 kHz | 40 | character
Common Voice Corpus 11.0-Uyghur | Uyghur | 113 h | MP3 | 32 kHz | 747 | subword
Common Voice Corpus 11.0-Uzbek | Uzbek | 258 h | MP3 | 32 kHz | 2025 | subword
Results of deleting false negative samples. K is the number of sampled negative samples and N the number flagged as false negatives, so K − N negatives remain in the loss.

Mandarin:

Method | K | K − N | N | CER%
---|---|---|---|---
wav2vec 2.0 | 100 | 100 | 0 | 27.853
FNIE | 101 | 100 | 1 | 27.791
FNIE | 102 | 100 | 2 | 27.335
FNIE | 104 | 100 | 4 | 26.755

English:

Method | K | K − N | N | CER% | WER%
---|---|---|---|---|---
wav2vec 2.0 | 100 | 100 | 0 | 11.753 | 23.998
FNIE | 101 | 100 | 1 | 12.115 | 24.062
FNIE | 102 | 100 | 2 | 11.805 | 23.353
FNIE | 104 | 100 | 4 | 12.18 | 24.254
Results of assimilating false negative samples.

Mandarin:

Method | K | K − N | N | CER%
---|---|---|---|---
wav2vec 2.0 | 100 | 100 | 0 | 27.853
FNIE | 101 | 100 | 1 | 26.725
FNIE | 102 | 100 | 2 | 27.9

English:

Method | K | K − N | N | CER% | WER%
---|---|---|---|---|---
wav2vec 2.0 | 100 | 100 | 0 | 11.753 | 23.998
FNIE | 101 | 100 | 1 | 11.694 | 23.315
FNIE | 102 | 100 | 2 | 11.862 | 24.023
Discussion on the number of negative samples (English): FNIE compared with plain wav2vec 2.0 using more sampled negatives.

Method | Loss | K | K − N | N | CER% | WER%
---|---|---|---|---|---|---
FNIE | ass | 101 | 100 | 1 | 11.694 | 23.315
FNIE | del | 102 | 100 | 2 | 11.805 | 23.353
wav2vec 2.0 | - | 100 | 100 | 0 | 11.753 | 23.998
wav2vec 2.0 | - | 101 | 101 | 0 | 11.294 | 24.587
wav2vec 2.0 | - | 102 | 102 | 0 | 11.328 | 24.909
wav2vec 2.0 | - | 104 | 104 | 0 | 11.517 | 25.145
Results for different low-resource languages.

Uyghur:

Method | Loss | K | K − N | N | WER%
---|---|---|---|---|---
wav2vec 2.0 | - | 100 | 100 | 0 | 5.76
FNIE | ass | 101 | 100 | 1 | 5.423
FNIE | del | 102 | 100 | 2 | 4.936

Uzbek:

Method | Loss | K | K − N | N | WER%
---|---|---|---|---|---
wav2vec 2.0 | - | 100 | 100 | 0 | 16.789
FNIE | ass | 101 | 100 | 1 | 12.019
FNIE | del | 102 | 100 | 2 | 10.902