# Language Representation Models: An Overview


## Abstract


## 1. Introduction

## 2. Related Works

## 3. Targeted Literature Review

## 4. Key Concepts of Neural Language Models

#### 4.1. Word Representations

#### 4.2. Sequence-To-Sequence Models

#### 4.3. Encoder-Decoder Models

- Encoder: Extracts features by reading and converting the input into distributed representations, with one feature vector associated with each word position.
- Context: Either a feature vector or a list of feature vectors based on the extracted features. If it is a list of feature vectors, it has the benefit that each feature vector can be accessed independently of its position in the input.
- Decoder: Sequentially processes the context to generate the final output and solve the task.
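The encoder-context-decoder flow above can be sketched in a few lines. This is a toy illustration only: `encode` here is a stand-in that derives numbers from character codes, whereas a real encoder is a trained network; the function names are ours, not from any library.

```python
def encode(tokens):
    # Encoder: one feature vector per input word position (toy "distributed
    # representation" derived from character codes, purely for illustration).
    return [[(ord(tok[d % len(tok)]) % 10) / 10.0 for d in range(4)] for tok in tokens]

def decode(context, steps):
    # Decoder: sequentially processes the context to generate the output.
    # Because the context is a list, each feature vector can be accessed
    # independently of its position in the input.
    out = []
    for t in range(steps):
        vec = context[t % len(context)]
        out.append(max(range(len(vec)), key=lambda d: vec[d]))  # index of largest feature
    return out

tokens = ["a", "new", "store", "opened"]
context = encode(tokens)            # list of feature vectors, one per word position
output = decode(context, steps=3)   # decoder consumes the context step by step
```

The key structural point is that the context is a *list* of per-position vectors rather than a single compressed vector, which is what later attention mechanisms exploit.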

#### 4.4. Attention

#### 4.5. Transfer Learning and Pre-Trained Models

#### 4.6. Bidirectional Representations

In Masked Language Modelling (MLM), a fraction of the input tokens is selected, and each selected token is either replaced with a special `[MASK]` token, replaced with a random token, or left unchanged. This process is called masking. MLM aims to predict the original vocabulary token at each masked position solely on the basis of its context. After the prediction, the masked token is replaced with the predicted token, except in cases where the selected token was never replaced in the first place. The resulting embedding of a token incorporates information about both its left and its right neighbouring tokens: in other words, both directions of its context. The idea behind not always replacing the selected tokens with `[MASK]` is to mitigate the mismatch between pre-training and fine-tuning, since the `[MASK]` token does not appear during fine-tuning.

There are also other ways to achieve bidirectionality than MLM. XLNet (Generalized Auto-regressive Pre-training for Language Understanding) [27] uses permutations of each sentence to take both contexts into account and employs two-stream self-attention to capture the tokens' positional information.
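The masking procedure can be sketched as follows, using the proportions reported for BERT [13] (roughly 15% of tokens selected; of those, 80% replaced with `[MASK]`, 10% with a random token, 10% left unchanged). Function and variable names are illustrative, not from any library.

```python
import random

def mask_for_mlm(tokens, vocab, mask_prob=0.15, seed=0):
    # Select ~15% of positions as prediction targets; of those, replace
    # 80% with [MASK], 10% with a random token, and leave 10% unchanged.
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)  # random replacement
            # else: token kept unchanged, but still predicted
    return masked, targets

sentence = "a new store opened beside the new mall".split()
vocab = ["the", "a", "store", "mall", "opened", "new", "beside"]
masked, targets = mask_for_mlm(sentence, vocab, seed=3)
```

The model is trained to predict the entries of `targets` from the corrupted sequence `masked`; positions not selected contribute nothing to the MLM loss.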

In addition, BERT is pre-trained with Next Sentence Prediction (NSP): given a pair of sentences A and B, the model predicts whether B actually follows A in the original text (label: `IsNext`) or not (label: `NotNext`).
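Constructing NSP training pairs can be sketched as below. This is a simplified illustration with invented helper names; in practice, the random "not next" sentence is drawn from a different document, whereas this sketch samples from the same list (and may therefore coincidentally pick the true successor).

```python
import random

def make_nsp_pairs(sentences, seed=0):
    # Build NSP examples: ~half actual consecutive pairs (IsNext),
    # ~half pairs with a randomly chosen second sentence (NotNext).
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
        else:
            pairs.append((sentences[i], rng.choice(sentences), "NotNext"))
    return pairs

docs = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
pairs = make_nsp_pairs(docs, seed=1)
```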

#### 4.7. Disentangled Bidirectional Representations

- “a new store opened beside the new mall”
- “a new `<mask1>` opened beside the new `<mask2>`”

#### 4.8. Token Replacement

In ELECTRA's token replacement scheme, some input tokens are first masked out with the `[MASK]` token. The generator then learns to predict the original identities of the masked-out tokens and replaces them with its predictions. The discriminator is then trained to identify the tokens that have been replaced by the generator. Figure 11 depicts the process of masking, replacing, and discriminating tokens. Ref. [12] showed that this approach works well on relatively small computing resources and still outperforms previous models such as BERT.
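The discriminator's training targets can be sketched as follows (an illustrative helper, not ELECTRA's actual implementation). Note that when the generator happens to predict the original token, that position is labelled "original", which the inequality check captures automatically.

```python
def rtd_labels(original, corrupted):
    # Replaced-token-detection targets: 1 where the generator's sample
    # differs from the original token, 0 where it matches.
    return [int(o != c) for o, c in zip(original, corrupted)]

original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",    "the", "meal"]  # generator replaced "cooked"
labels = rtd_labels(original, corrupted)  # [0, 0, 1, 0, 0]
```

Because the discriminator receives a binary target for *every* input position, rather than only the ~15% masked positions, each training example yields a denser learning signal, which is one reason ELECTRA trains efficiently.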

#### 4.9. Parameter Reduction

#### 4.10. Multi-Task Learning

- **SQuAD:** `question: <question> context: <context>`
- **CNN/Daily Mail:** `summarize: <text>`
- **WMT English to German:** `translate English to German: <text>`
- **MRPC:** `mrpc sentence1: <sentence1> sentence2: <sentence2>`
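These text-to-text prefixes can be reproduced with a small formatting helper. The task identifiers and keyword names below are our own illustrative choices, not T5's API; only the prefix strings themselves come from the conventions above.

```python
def to_t5_input(task, **fields):
    # Cast every task as a single input string, following T5's
    # text-to-text prefix conventions.
    if task == "squad":
        return f"question: {fields['question']} context: {fields['context']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "wmt_en_de":
        return f"translate English to German: {fields['text']}"
    if task == "mrpc":
        return f"mrpc sentence1: {fields['sentence1']} sentence2: {fields['sentence2']}"
    raise ValueError(f"unknown task: {task}")

example = to_t5_input("wmt_en_de", text="That is good.")
# -> "translate English to German: That is good."
```

Because every task is reduced to string-in, string-out, a single encoder-decoder model with one vocabulary and one training objective can serve all of them.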

The baseline model, `T5-base`, is an encoder-decoder model. The encoder and the decoder are each similar in size to `bert-base` ([11], p. 11), resulting in a model roughly twice as large as `bert-base`, since BERT consists only of an encoder. In addition to these novel features, T5 incorporates techniques of the previously discussed ALBERT [10] (Section 4.9).

#### 4.11. Use Case of Transformer Models

#### 4.12. Further Techniques

## 5. Limitations

## 6. Discussion

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| MLM | Masked Language Modelling |
| NLP | Natural Language Processing |
| NLU | Natural Language Understanding |
| NSP | Next Sentence Prediction |
| RNN | Recurrent Neural Networks |
| SOP | Sentence-Order Prediction |

## References

1. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv **2018**, arXiv:1804.07461.
2. Jing, K.; Xu, J. A Survey on Neural Network Language Models. arXiv **2019**, arXiv:1906.03591.
3. Qiu, X.; Sun, T.; Xu, Y.; Shao, Y.; Dai, N.; Huang, X. Pre-Trained Models for Natural Language Processing: A Survey. Sci. China Technol. Sci. **2020**, 63, 1872–1897.
4. Babić, K.; Martinčić-Ipšić, S.; Meštrović, A. Survey of Neural Text Representation Models. Information **2020**, 11, 511.
5. Naseem, U.; Razzak, I.; Khan, S.K.; Prasad, M. A Comprehensive Survey on Word Representation Models: From Classical to State-of-the-Art Word Representation Language Models. Trans. Asian Low-Resour. Lang. Inf. Process. **2020**, 20, 74:1–74:35.
6. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv **2016**, arXiv:1409.0473.
7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Long Beach, CA, USA, 2017; pp. 5998–6008.
8. He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-Enhanced BERT with Disentangled Attention. arXiv **2020**, arXiv:2006.03654.
9. GLUE Benchmark. 2021. Available online: https://gluebenchmark.com/ (accessed on 2 January 2021).
10. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv **2020**, arXiv:1909.11942.
11. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv **2020**, arXiv:1910.10683.
12. Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-Training Text Encoders as Discriminators Rather Than Generators. arXiv **2020**, arXiv:2003.10555.
13. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv **2019**, arXiv:1810.04805.
14. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543.
15. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS'13, 5–10 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 2, pp. 3111–3119.
16. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. In Neurocomputing: Foundations of Research; MIT Press: Cambridge, MA, USA, 1986; pp. 533–536. ISBN 978-0-262-01097-9.
17. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; Adaptive Computation and Machine Learning; The MIT Press: Cambridge, MA, USA, 2016.
18. Olah, C. Understanding LSTM Networks. 2015. Available online: https://research.google/pubs/pub45500/ (accessed on 2 January 2021).
19. Cho, K.; van Merrienboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv **2014**, arXiv:1409.1259.
20. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv **2014**, arXiv:1409.3215.
21. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv **2014**, arXiv:1406.1078.
22. Li, J.; Luong, M.T.; Jurafsky, D. A Hierarchical Neural Autoencoder for Paragraphs and Documents. arXiv **2015**, arXiv:1506.01057.
23. Alammar, J. The Illustrated Transformer. Available online: https://jalammar.github.io/illustrated-transformer/ (accessed on 2 January 2021).
24. Huang, C.Z.A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck, D. Music Transformer. arXiv **2018**, arXiv:1809.04281.
25. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-Attention with Relative Position Representations. arXiv **2018**, arXiv:1803.02155.
26. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Preprint. Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 2 January 2021).
27. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv **2020**, arXiv:1906.08237.
28. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv **2019**, arXiv:1907.11692.
29. He, P.; Liu, X.; Gao, J.; Chen, W. Microsoft DeBERTa Surpasses Human Performance on SuperGLUE Benchmark. 6 January 2021. Available online: https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark (accessed on 2 January 2021).
30. You, Y.; Li, J.; Reddi, S.; Hseu, J.; Kumar, S.; Bhojanapalli, S.; Song, X.; Demmel, J.; Keutzer, K.; Hsieh, C.J. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes. arXiv **2019**, arXiv:1904.00962.
31. Liu, X.; He, P.; Chen, W.; Gao, J. Multi-Task Deep Neural Networks for Natural Language Understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4487–4496.
32. McCann, B.; Keskar, N.S.; Xiong, C.; Socher, R. The Natural Language Decathlon: Multitask Learning as Question Answering. arXiv **2018**, arXiv:1806.08730.
33. Liu, Y. Fine-Tune BERT for Extractive Summarization. arXiv **2019**, arXiv:1903.10318.
34. Schomacker, T.; Tropmann-Frick, M.; Zukunft, O. Application of Transformer-Based Methods to Latin Text Analysis.
35. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog **2019**, 9, 24.
36. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. arXiv **2020**, arXiv:2005.14165.
37. Gao, T.; Fisch, A.; Chen, D. Making Pre-Trained Language Models Better Few-Shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Stroudsburg, PA, USA, 1–6 August 2021; pp. 3816–3830.
38. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. arXiv **2019**, arXiv:1910.13461.
39. Ziegler, D.M.; Stiennon, N.; Wu, J.; Brown, T.B.; Radford, A.; Amodei, D.; Christiano, P.; Irving, G. Fine-Tuning Language Models from Human Preferences. arXiv **2020**, arXiv:1909.08593.

**Figure 1.** Selected overview of the milestones in neural language models of the last five years, in chronological order.

**Figure 3.** Example comparison of the representations of the same span of characters in a contextualized and in a non-contextualized model.

**Figure 4.**Schematic depiction of a Recurrent Neural Network (RNN) (modified from [18]).

**Figure 7.** Visualization of the Additive Attention mechanism (modified from [6]). The emphasized text gives further information about parts of the process.

**Figure 8.**Example of how self-attention influences the representation of the word “its” (modified from [7]).

**Figure 11.**Illustration of the replaced token detection mechanism used in ELECTRA (modified from [12]).

**Figure 12.**Example of inputs and outputs of T5 (Source: [11], (p. 3)).

**Figure 13.**Illustration of the architecture used in [34].

**Table 1.** Comparison of the GLUE score (* as reported by the authors in the respective papers), the number of training steps, and the number of parameters of the baseline versions of the key models further discussed in this paper.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Schomacker, T.; Tropmann-Frick, M.
Language Representation Models: An Overview. *Entropy* **2021**, *23*, 1422.
https://doi.org/10.3390/e23111422
