A Lite Romanian BERT: ALR-BERT
Abstract
1. Introduction
2. Related Works
3. ALR-BERT: Training the Romanian Language Model Using ALBERT
3.1. Corpus
- OPUS—OPUS is a collection of translated texts gathered from the Web [15]. It is an open-source parallel corpus compiled without human intervention, covering a wide range of text types, including medical prescriptions, legal documents, and movie subtitles. The OPUS corpus comprises about 4 GB of Romanian text in total.
- OSCAR—OSCAR, the Open Super-large Crawled ALMAnaCH coRpus, is a massive multilingual corpus derived from the Common Crawl corpus through language classification and filtering [16]. The Romanian portion contains approximately 11 GB of text, consisting of shuffled sentences that have been de-duplicated.
- Wikipedia—The Romanian Wikipedia is publicly available for download. We used the Wikipedia dump from February 2020, which amounted to approximately 0.4 GB of content after cleanup. (A minimal sketch of how the combined corpus could be assembled and de-duplicated is given after this list.)
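As a rough illustration only (not the authors' actual pipeline), the following Python sketch shows how the three sources could be merged with line-level de-duplication and how line/word counts comparable to the corpus statistics table could be obtained. The file names are illustrative assumptions, and a hash set over ~90 M lines is a simplification; a production pipeline would need a more memory-efficient scheme.

```python
# Minimal corpus-assembly sketch (illustrative file names, not the authors' exact pipeline).
# Reads one-sentence-per-line text files, de-duplicates lines across sources,
# and reports per-source line/word counts.
import hashlib
from pathlib import Path

SOURCES = {
    "OPUS": Path("opus_ro.txt"),        # assumed local dump of the Romanian OPUS text
    "OSCAR": Path("oscar_ro.txt"),      # assumed local dump of the Romanian OSCAR split
    "Wikipedia": Path("rowiki.txt"),    # assumed cleaned Romanian Wikipedia dump
}

def assemble(output_path: Path = Path("ro_corpus.txt")) -> None:
    seen = set()                        # hashes of lines already written (de-duplication)
    stats = {}
    with output_path.open("w", encoding="utf-8") as out:
        for name, path in SOURCES.items():
            lines = words = 0
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if not line:
                        continue
                    digest = hashlib.md5(line.encode("utf-8")).digest()
                    if digest in seen:  # skip exact duplicates across all sources
                        continue
                    seen.add(digest)
                    out.write(line + "\n")
                    lines += 1
                    words += len(line.split())
            stats[name] = (lines, words)
    for name, (lines, words) in stats.items():
        print(f"{name}: {lines:,} lines, {words:,} words")

if __name__ == "__main__":
    assemble()
```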
3.2. ALR-BERT
3.3. Pre-Training
4. Evaluations
4.1. Simple Universal Dependencies
4.2. Results
Ablation Studies
- The ALBERT authors report that factorized embedding parameterization performs well both when cross-layer parameters are shared and when they are not. Without sharing, larger embedding sizes yield better performance; with sharing, an embedding size of 128 dimensions already offers a satisfactory speed/accuracy trade-off (see the first sketch after this list).
- For cross-layer parameter sharing, the ALBERT model compares several settings: (a) no cross-layer sharing; (b) sharing only the feed-forward segments; (c) sharing only the attention segments; and (d) sharing all subsegments. Sharing the parameters of the attention segments turns out to be the most effective option [4], whereas sharing the feed-forward parameters contributes very little, which underlines the importance of the attention mechanism in transformer models. However, because all-segment sharing greatly reduces the number of parameters while performing only slightly worse than attention-only sharing, the authors chose to adopt all-segment sharing.
- A model pre-trained with NSP performs poorly when evaluated on the SOP task, while NSP evaluated on NSP and SOP evaluated on SOP both perform well, as expected. However, a model pre-trained with SOP also performs well on the NSP task. This implies that SOP captures aspects of sentence coherence that NSP does not, and therefore SOP produces a better outcome than NSP (see the second sketch after this list for how the two kinds of training pairs are built).
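To make the first two points concrete, the snippet below is a minimal PyTorch sketch (not the ALR-BERT training code) of ALBERT's two parameter-reduction ideas: a factorized embedding that maps the vocabulary into a small 128-dimensional space before projecting to the hidden size, and a single transformer layer whose parameters are reused at every depth (all-segment cross-layer sharing). The dimensions follow the ALBERT-base configuration; the vocabulary size is an illustrative assumption.

```python
# Sketch of ALBERT's two parameter-reduction techniques (PyTorch); illustrative, not the authors' code.
import torch
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """Vocabulary -> E (small) -> H (hidden): O(V*E + E*H) parameters instead of O(V*H)."""
    def __init__(self, vocab_size: int = 50000, embed_dim: int = 128, hidden_dim: int = 768):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)   # V x E  (e.g., 50k x 128 ~ 6.4M)
        self.project = nn.Linear(embed_dim, hidden_dim)       # E x H  (128 x 768 ~ 0.1M)
    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.project(self.word_emb(token_ids))

class SharedEncoder(nn.Module):
    """All-segment cross-layer sharing: one layer's weights applied num_layers times."""
    def __init__(self, hidden_dim: int = 768, num_heads: int = 12, num_layers: int = 12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, dim_feedforward=4 * hidden_dim,
            activation="gelu", batch_first=True)
        self.num_layers = num_layers
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):   # same parameters at every depth
            x = self.layer(x)
        return x

if __name__ == "__main__":
    embeddings = FactorizedEmbedding()
    encoder = SharedEncoder()
    hidden = encoder(embeddings(torch.randint(0, 50000, (2, 16))))
    print(hidden.shape)   # torch.Size([2, 16, 768])
```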
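The third point concerns how training pairs are built. A hedged sketch of SOP versus NSP pair construction is shown below; segment packing, masking, and special tokens are omitted, and the function names are illustrative.

```python
# Sketch of how SOP (ALBERT) and NSP (BERT) training pairs could be constructed; illustrative only.
import random

def make_sop_pair(document: list[str]) -> tuple[str, str, int]:
    """SOP: two consecutive segments from the SAME document; label 1 = original order, 0 = swapped."""
    assert len(document) >= 2
    i = random.randrange(len(document) - 1)
    first, second = document[i], document[i + 1]
    if random.random() < 0.5:
        return first, second, 1     # positive: original order
    return second, first, 0         # negative: same segments, order swapped

def make_nsp_pair(document: list[str], corpus: list[list[str]]) -> tuple[str, str, int]:
    """NSP: the negative's second segment comes from a DIFFERENT document,
    so the task can partly be solved by topic cues alone."""
    i = random.randrange(len(document) - 1)
    if random.random() < 0.5:
        return document[i], document[i + 1], 1
    other = random.choice([d for d in corpus if d is not document])
    return document[i], random.choice(other), 0
```

Because an SOP negative is built from the same document, topic information alone cannot solve it, which is why a model trained with SOP is forced to model inter-sentence coherence and also transfers well to the NSP task.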
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the NIPS 2014, Montreal, QC, Canada, 8–13 December 2014.
2. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
3. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL 2019, Minneapolis, MN, USA, 2–7 June 2019.
4. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv 2020, arXiv:1909.11942.
5. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, November 2018.
6. Williams, A.; Nangia, N.; Bowman, S.R. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the NAACL 2018, New Orleans, LA, USA, 1–6 June 2018.
7. Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2383–2392.
8. Rajpurkar, P.; Jia, R.; Liang, P. Know What You Don’t Know: Unanswerable Questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, 15–20 July 2018; pp. 784–789.
9. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, Edmonton, AB, Canada, 31 May–1 June 2003; pp. 142–147.
10. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
11. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving Pre-training by Representing and Predicting Spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77.
12. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z.; Wang, S.; Hu, G. Pre-Training with Whole Word Masking for Chinese BERT. arXiv 2019, arXiv:1906.08101.
13. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019.
14. Dumitrescu, S.; Avram, A.M.; Pyysalo, S. The Birth of Romanian BERT. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 4324–4328.
15. Tiedemann, J. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 21–27 May 2012; pp. 2214–2218.
16. Ortiz Suárez, P.J.; Sagot, B.; Romary, L. Asynchronous Pipelines for Processing Huge Corpora on Medium to Low Resource Infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), Cardiff, UK, 22 July 2019; pp. 9–16.
17. Hendrycks, D.; Gimpel, K. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. arXiv 2016, arXiv:1606.08415.
18. Mititelu, V.B.; Ion, R.; Simionescu, R.; Irimia, E.; Perez, C.A. The Romanian Treebank Annotated According to Universal Dependencies. In Proceedings of the Tenth International Conference on Natural Language Processing (HrTAL 2016), Dubrovnik, Croatia, 29 September–1 October 2016.
19. Zeman, D.; Hajič, J.; Popel, M.; Potthast, M.; Straka, M.; Ginter, F.; Nivre, J.; Petrov, S. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task, Brussels, Belgium, 31 October–1 November 2018.
| Corpus | Lines | Words | Size |
|---|---|---|---|
| OPUS | 55.1 M | 635.0 M | 3.8 GB |
| OSCAR | 33.6 M | 1725.8 M | 11 GB |
| Wikipedia | 1.5 M | 60.5 M | 0.4 GB |
| Total | 90.2 M | 2421.3 M | 15.2 GB |
| Model | Tokenized Sentence |
|---|---|
| M-BERT (uncased) | cinci bici ##cl ##isti au pl ##eca ##t din cr ##ai ##ova spre so ##par ##lita . |
| M-BERT (cased) | Ci ##nci bi ##ci ##cl ##iş ##ti au pl ##eca ##t din C ##rai ##ova spre Ş ##op ##âr ##li ##ţa . |
| Romanian BERT (uncased) | cinci biciclişti au plecat din craiova spre şopâr ##liţa . |
| Romanian BERT (cased) | Cinci biciclişti au plecat din Craiova spre Şo ##pâ ##r ##liţa . |
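The comparison in the table can be reproduced with the HuggingFace tokenizers; the snippet below is a minimal sketch for the example sentence "Cinci biciclişti au plecat din Craiova spre Şopârliţa." The multilingual checkpoints `bert-base-multilingual-uncased`/`-cased` are standard; the Romanian BERT [14] identifier shown is an assumption (it is commonly published as `dumitrescustefan/bert-base-romanian-cased-v1`), and an ALR-BERT row would require whatever identifier the released model is hosted under.

```python
# Sketch: compare subword tokenizations of the example sentence across models.
# Model identifiers other than the multilingual BERT ones are assumptions.
from transformers import AutoTokenizer

SENTENCE = "Cinci biciclişti au plecat din Craiova spre Şopârliţa."

MODELS = {
    "M-BERT (uncased)": "bert-base-multilingual-uncased",
    "M-BERT (cased)": "bert-base-multilingual-cased",
    # Assumed Hub identifier for the Romanian BERT of Dumitrescu et al. [14]:
    "Romanian BERT (cased)": "dumitrescustefan/bert-base-romanian-cased-v1",
}

for name, checkpoint in MODELS.items():
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    print(f"{name:25s} {' '.join(tokenizer.tokenize(SENTENCE))}")
```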
| Model | UPOS | XPOS | MLAS | AllTags |
|---|---|---|---|---|
| M-BERT (cased) | 93.87 | 89.89 | 90.01 | 87.04 |
| Romanian BERT (cased) | 95.56 | 95.35 | 92.78 | 93.22 |
| ALR-BERT (cased) | 87.38 | 84.05 | 79.82 | 78.82 |

| Model | UPOS | XPOS | MLAS | AllTags |
|---|---|---|---|---|
| M-BERT (cased) | 97.95 | 96.12 | 96.61 | 95.69 |
| Romanian BERT (cased) | 98.24 | 96.96 | 97.08 | 96.60 |
| ALR-BERT (cased) | 95.03 | 89.92 | 91.96 | 88.75 |
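For reference, UPOS, XPOS, and AllTags are per-token tagging accuracies over the Universal Dependencies test data (AllTags requires the universal POS tag, the language-specific tag, and the morphological features to all be correct simultaneously), while MLAS additionally involves the dependency structure and is best computed with the official CoNLL 2018 shared-task evaluation script [19]. Below is a minimal sketch of the simpler accuracies, assuming token-aligned gold and predicted CoNLL-U files; the file names are illustrative.

```python
# Sketch: per-token UPOS / XPOS / AllTags accuracy from aligned CoNLL-U files.
# (MLAS needs dependency heads/labels and is left to the official evaluation script.)
def read_conllu(path: str) -> list[tuple[str, str, str]]:
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if "-" in cols[0] or "." in cols[0]:   # skip multi-word and empty tokens
                continue
            rows.append((cols[3], cols[4], cols[5]))   # UPOS, XPOS, FEATS columns
    return rows

def accuracies(gold_path: str, pred_path: str) -> dict[str, float]:
    gold, pred = read_conllu(gold_path), read_conllu(pred_path)
    assert len(gold) == len(pred), "files must be token-aligned"
    n = len(gold)
    upos = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    xpos = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    alltags = sum(g == p for g, p in zip(gold, pred)) / n   # UPOS + XPOS + FEATS all correct
    return {"UPOS": upos, "XPOS": xpos, "AllTags": alltags}

# Example usage (assumed file names):
# print(accuracies("ro_rrt-ud-test.conllu", "predictions.conllu"))
```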