# Cross-Language End-to-End Speech Recognition Research Based on Transfer Learning for the Low-Resource Tujia Language


## Abstract


## 1. Introduction

## 2. Review of Related Work

#### 2.1. Feature Extraction Based on CNN

#### 2.2. End-to-End Speech Recognition Based on LSTM-CTC

## 3. Proposed Method

#### 3.1. Cross-Language Data Pre-processing and Feature Extraction

#### 3.1.1. Tujia Language Corpus

#### 3.1.2. Extended Speech Corpus

#### 3.1.3. Feature Extraction

#### 3.2. End-to-End Speech Recognition Model Based on Transfer Learning

**Algorithm 1.** Training the end-to-end speech recognition model based on transfer learning.

**Input:** $D_t=\left\{x_n,y_m\right\}$, $D_s=\left\{x_{n^{\prime}},y_{m^{\prime}}\right\}$, the training sets; $x_n,x_{n^{\prime}}$, the input features; $y_m,y_{m^{\prime}}$, the output labels; $\mu$, the learning rate.

**Output:** $\theta_1$, the CDS parameters; $\theta_2$, the TLDS parameters.

1. Initialize the CDS parameters $\theta_1$.
2. **While** the model does not converge on $\theta_1$ **do**:
3. Read a SortaGrad batch $S=\left\{x_i,y_j\right\}_{b=1}^{B}$ from $D_t,D_s$.
4. Train the model on $S$: $\theta_1\leftarrow\theta_1-\frac{\mu}{B}\sum_{b=1}^{B}L_{CTC}(x_i,y_j,\theta_1)$.
5. Calculate $\left\{w_t,w_s,w_c\right\}$ of the softmax layer using Equation (5).
6. Calculate $L_{CTC}$ using Equation (3) for $S$.
7. **End while**.
8. Initialize the TLDS parameters $\theta_2$.
9. **While** the model does not converge on $\theta_2$ **do**:
10. Read a SortaGrad batch $S^{\prime}=\left\{x_n,y_m\right\}_{b^{\prime}=1}^{B^{\prime}}$ from $D_t$.
11. Train the model on $S^{\prime}$: $\theta_2\leftarrow\theta_2-\frac{\mu}{B^{\prime}}\sum_{b^{\prime}=1}^{B^{\prime}}L_{CTC}(x_n,y_m,\theta_2)$.
12. Calculate $\left\{w_t^{\prime},w_c^{\prime}\right\}$ of the softmax layer using Equation (5).
13. Calculate $L_{CTC}^{\prime}$ using Equation (3) for $S^{\prime}$.
14. **End while**.
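The two-stage loop of Algorithm 1 can be sketched in code. The sketch below is illustrative only, not the authors' implementation: the model is reduced to a flat parameter list, and `loss_grad` stands in for the gradient of the CTC loss in Equation (3) (both names are assumptions). SortaGrad's ordering of batches by utterance length is shown with a plain sort.

```python
def sortagrad_batches(dataset, batch_size):
    """Yield minibatches; SortaGrad sorts utterances by length so the
    earliest batches hold short, easier examples. (Deep Speech 2 applies
    this only in the first epoch; here it is applied throughout for brevity.)"""
    ordered = sorted(dataset, key=lambda ex: len(ex[0]))
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

def train_stage(dataset, theta, loss_grad, lr=0.01, batch_size=4, epochs=20):
    """One stage of Algorithm 1: SGD over SortaGrad batches.
    `loss_grad(x, y, theta)` stands in for the CTC-loss gradient."""
    for _ in range(epochs):
        for batch in sortagrad_batches(dataset, batch_size):
            grad = [0.0] * len(theta)
            for x, y in batch:
                g = loss_grad(x, y, theta)
                grad = [a + b for a, b in zip(grad, g)]
            # Averaged gradient step, matching the update rule in Algorithm 1.
            theta = [t - lr * g / len(batch) for t, g in zip(theta, grad)]
    return theta

# Stage 1: train CDS parameters theta1 on the union of the Tujia set D_t
# and the Chinese source set D_s. Stage 2: fine-tune TLDS parameters
# theta2 on D_t alone, starting from the transferred weights theta1:
#   theta1 = train_stage(D_t + D_s, theta_init, ctc_loss_grad)
#   theta2 = train_stage(D_t, list(theta1), ctc_loss_grad)
```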

## 4. Experiments and Result

#### 4.1. Experimental Environment

#### 4.2. Parameters of the Models

#### 4.3. Experimental Results

## 5. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest


**Figure 4.** Cross-language end-to-end speech recognition model based on transfer learning for the Tujia language.

**Figure 5.** The change in the cost of the BiRNN, BiGRU, and BiLSTM models as the number of training passes increases.

| Label Type | Label Content |
|---|---|
| Broad IPA | lai^{55} xuã^{55} lã^{55} ti^{21} xua^{21}, mɨe^{35} su^{21} le^{53} |
| Chinese one-to-one translation | 今天早晨（的话）天亮（过） ("today morning (as for) daybreak (already)") |
| Chinese translation | 今天早上天亮以后 ("this morning, after daybreak") |

| Label Type | Label Content |
|---|---|
| Chinese characters | 菜 做好 了 一碗 清蒸 武昌鱼 一碗 蕃茄 炒鸡蛋 一碗 榨菜 干 子 炒肉丝 ("The dishes are ready: a bowl of steamed Wuchang fish, a bowl of tomato with scrambled eggs, and a bowl of shredded pork stir-fried with preserved mustard and dried tofu") |
| Chinese Pinyin | cai4 zuo4 hao3 le5 yi4 wan3 qing1 zheng1 wu3 chang1 yu2 yi4 wan3 fan1 qie2 chao3 ji1 dan4 yi4 wan3 zha4 cai4 gan1 zi3 chao3 rou4 si1 |
| Narrow IPA | tsʰai(51) tsuo(51) xɑu(214) lɤ i(51) uan(214) tɕʰiŋ(55) tʂəŋ(55) u(214) tʂʰɑŋ(55) y(35) i(51) uan(214) fan(55) tɕʰiɛ(35) tʂʰɑu(214) tɕi(55) tan(51) i(51) uan(214) tʂa(51) tsʰai(51) kan(55) tsɿ(214) tʂʰɑu(214) rou(51) sɿ(55) |
| Broad IPA | tsʰai(51) tsuo(51) xau(214) lɤ i(51) uan(214) tɕʰiŋ(55) tʂəŋ(55) u(214) tʂʰaŋ(55) y(35) i(51) uan(214) fan(55) tɕʰiɛ(35) tʂʰau(214) tɕi(55) tan(51) i(51) uan(214) tʂa(51) tsʰai(51) kan(55) tsi(214) tʂʰau(214) rou(51) si(55) |
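The pinyin labels above encode each syllable's tone as a trailing digit (tone 5 marking the neutral tone, as in "le5"). A small illustrative parser for this label format (not the authors' tooling):

```python
def parse_pinyin(label):
    """Split a tone-annotated pinyin label such as 'cai4 zuo4 hao3 le5'
    into (syllable, tone) pairs; tone 5 is the neutral tone."""
    pairs = []
    for token in label.split():
        if token[-1].isdigit():
            pairs.append((token[:-1], int(token[-1])))
        else:
            pairs.append((token, None))  # untoned token, kept as-is
    return pairs
```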

| Narrow IPA | Broad IPA |
|---|---|
| ɑ | a |
| iou | iu |
| uei | ui |
| iɛn | ian |
| ɿ | i |
| ʅ | i |
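The narrow-to-broad mapping above can be applied as a simple rewrite table; collapsing these allophonic distinctions shrinks the output label set. A minimal sketch (the function and table names are my own):

```python
# Narrow IPA symbols/sequences collapsed to broad IPA, per the table above.
NARROW_TO_BROAD = {
    "ɑ": "a", "iou": "iu", "uei": "ui", "iɛn": "ian", "ɿ": "i", "ʅ": "i",
}

def to_broad(syllable):
    """Rewrite a narrow-IPA syllable into broad IPA, replacing the
    longest matching narrow sequences first to avoid partial overlaps."""
    for narrow in sorted(NARROW_TO_BROAD, key=len, reverse=True):
        syllable = syllable.replace(narrow, NARROW_TO_BROAD[narrow])
    return syllable
```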

| Parameter | First Convolutional Layer | Second Convolutional Layer |
|---|---|---|
| Filter size | 11 × 41 | 11 × 21 |
| Number of input channels | 1 | 1 |
| Number of output channels | 32 | 32 |
| Stride size | 3 × 2 | 1 × 2 |
| Padding size | 5 × 20 | 5 × 10 |
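Given the filter, stride, and padding sizes above, each layer's output dimensions follow the standard convolution formula out = (in + 2·pad − filter) / stride + 1. The helper below is illustrative; the example input spectrogram size is an assumption, not a value from the paper:

```python
def conv_out(size, filt, stride, pad):
    """Output size of a convolution along one axis (floor division)."""
    return (size + 2 * pad - filt) // stride + 1

def conv2d_out(hw, filt, stride, pad):
    """Output (time, frequency) size of a 2-D convolution."""
    return tuple(conv_out(s, f, st, p)
                 for s, f, st, p in zip(hw, filt, stride, pad))

# Layer parameters taken from the table above (time x frequency).
layer1 = dict(filt=(11, 41), stride=(3, 2), pad=(5, 20))
layer2 = dict(filt=(11, 21), stride=(1, 2), pad=(5, 10))
```

Note that the 3 × 2 stride of the first layer downsamples along both time and frequency, while the second layer (stride 1 × 2) halves only the frequency axis again.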

**Table 5.** BiLSTM (bi-directional long short-term memory)-CTC (connectionist temporal classification) parameters.

| Network | Parameter Type | Parameter Content |
|---|---|---|
| BiLSTM | minibatch size | 16 |
| BiLSTM | LSTM layers | 3 |
| BiLSTM | cells per layer | 512 |
| BiLSTM | learning rate | 0.001 |
| CTC | decoder | beam search |
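For convenience, the Table 5 settings can be collected into a single configuration mapping when wiring up a training script. The layout and key names below are my own, not from the paper:

```python
# Hyperparameters from Table 5, grouped by sub-network.
BILSTM_CTC_CONFIG = {
    "bilstm": {
        "minibatch_size": 16,
        "lstm_layers": 3,
        "cells_per_layer": 512,
        "learning_rate": 0.001,
    },
    "ctc": {
        "decoder": "beam_search",
    },
}
```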

| RNN Cell | Dev | Test |
|---|---|---|
| BiRNN (bi-directional recurrent neural network) | 42.37% | 53.37% |
| BiGRU (bi-directional gated recurrent unit) | 37.09% | 51.95% |
| BiLSTM (bi-directional long short-term memory) | 35.82% | 48.30% |

| Model | Dev | Test |
|---|---|---|
| IDS (Improved Deep Speech 2) | 35.82% | 48.30% |
| CDS (Cross-language Deep Speech 2) | 40.11% | 50.26% |
| TLDS (Transfer Learning Deep Speech 2) | 31.11% | 46.19% |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Yu, C.; Chen, Y.; Li, Y.; Kang, M.; Xu, S.; Liu, X.
Cross-Language End-to-End Speech Recognition Research Based on Transfer Learning for the Low-Resource Tujia Language. *Symmetry* **2019**, *11*, 179.
https://doi.org/10.3390/sym11020179
