End-to-End Transformer-Based Models in Textual-Based NLP
Abstract
1. Introduction
2. Background
2.1. Recurrent Neural Network (RNN)
2.2. The Encoder–Decoder Framework
2.3. The Transformer
2.3.1. Attention Mechanism
2.3.2. Positional Encoding (PE)
2.3.3. Position-Wise Feed-Forward Networks
2.3.4. Full Transformer Encoder–Decoder
2.4. The Pre-Train and Fine-Tune Framework
Algorithm 1: Pre-training and fine-tuning steps for a TB model
2.4.1. Pre-Training Procedure
- Masked Language Modeling (MLM): MLM corrupts a sentence by replacing a subset of its tokens with a special mask token. The masked token vectors are fed to the SoftMax layer, which computes a probability distribution over the vocabulary before the cross-entropy loss is applied. MLM uses tokens from both the left and the right context and predicts the original tokens from the masked token vectors, so it is approached as a token-level classification task over the masked tokens. It has two drawbacks. Since only 15% of the tokens in each training sample are masked, masked token prediction exploits only a small portion of the training signal. Additionally, there is a mismatch between the pre-training and fine-tuning stages, because the model sees the special mask token only during pre-training (a minimal masking sketch follows this list).
- Causal Language Modeling (CLM): Alternatively known as unidirectional LM, CLM predicts the next word based on the context. CLM can handle both a left-to-right and a right-to-left sequence: in a left-to-right CLM the context contains all of the words on the left, whereas in a right-to-left CLM it contains all of the words on the right. CLM uses 100% of the tokens in each training sample for learning. Its major drawback is that it cannot use both contexts at once: a standard CLM cannot be trained with a bidirectional context, because doing so would allow a token to see itself, which would make prediction trivial.
- Translation Language Modeling (TLM): Also referred to as cross-lingual MLM (XMLM), TLM applies random masking to tokens from both sentences of a parallel sentence pair. TLM helps the model learn cross-lingual mappings, because predicting the masked tokens requires context from both sentences. Only 15% of the tokens in each training sample are used by TLM.
- Replaced Token Detection (RTD): The task is to detect which tokens have been replaced. RTD corrupts the sentence using tokens produced by a generator model trained with the MLM objective, and the model must decide, as a token-level binary classification task, whether or not each token has been replaced. RTD uses 100% of the tokens in each training sample for learning. Its only drawback is that it needs a separate generator to corrupt the input sequence, which is computationally expensive to train. Despite this, RTD is sample efficient.
- Shuffled Token Detection (STD): STD is a token-level discrimination task that identifies the shuffled tokens. The words are randomly shuffled with a probability of 0.15. STD uses 100% of the tokens in each training sample for learning.
- Random Token Substitution (RTS): The task is to identify the randomly substituted tokens. In RTS, 15% of the tokens are replaced at random with other tokens from the vocabulary. RTS is sample efficient, since it can corrupt the input sequence without a separate generator model, and it uses 100% of the tokens in each training sample for learning.
- Swapped Language Modeling (SLM): With a probability of 0.15, SLM corrupts the sequence by replacing tokens with tokens randomly selected from the vocabulary. Since only 15% of the input tokens are used, SLM is not sample efficient.
- Alternate Language Modeling (ALM): ALM is a pre-training task for cross-lingual language models. Its goal is to predict the masked tokens in code-switched sentences produced from parallel sentences. The settings for masking the tokens in ALM are the same as those in MLM. By being pre-trained on code-switched sentences, the model learns relationships between languages much more effectively.
- Span Boundary Objective (SBO): This pre-training task predicts the masked tokens using the span boundary tokens and position embeddings. SBO masks only consecutive token spans. It improves the model's performance on tasks that require span-based extraction, such as entity extraction, coreference resolution, and question answering.
- Next Sentence Prediction (NSP): It is a sentence-level pre-training task that helps the model understand relationships between sentences. It is a binary sentence-pair classification task that asks whether two sentences are consecutive. The sentence pairs are produced so that 50% of the instances are consecutive and the other 50% are not. Because NSP involves topic and coherence prediction, it enables the model to learn sentence-level semantics. NSP uses only 50% of the tokens in each training sample to promote learning.
- Sentence Order Prediction (SOP): It is a sentence-level pre-training task that considers only sentence coherence. Training instances are consecutive sentence pairs generated as in NSP, except that in 50% of them the order of the two sentences is swapped while the other 50% are left in order. SOP uses only 50% of the tokens in each training sample to facilitate learning.
- Sequence-to-Sequence LM (SSLM): The context includes every word in the masked input sequence and the left-side words in the predicted target sequence. The encoder receives the masked sequence, and the decoder predicts the masked words sequentially from left to right. SSLM is sample inefficient because it only reconstructs the masked tokens; only 15% of the tokens in each training sample are used for learning.
- Denoising Auto Encoder (DAE): It is an auto-encoder that is fed a corrupted data point as input and learns to predict the original, uncorrupted data point as output. By reconstructing the original text from the corrupted text, it aids the model's learning. DAE can be used to train encoder–decoder-based models. Because DAE involves reconstructing the entire original text, it provides a stronger training signal and is therefore more sample efficient: 100% of the tokens in each training sample are used for learning.
- Segment Level Recurrence (SLR): When the model processes the following new segment, the representations calculated for the prior segment are fixed and cached for later use as an extended context. By allowing contextual information to cross segment boundaries, this additional connection increases the largest possible dependency length by N times, where N is the depth of the network. The context fragmentation problem is also solved by this recurrence mechanism, giving tokens at the beginning of a new segment the context they require.
- Gap Sentences Generation (GSG): This pre-training objective targets abstractive summarization. Entire sentences are concealed and removed from the text as gap sentences, and the S2S model is pre-trained to generate them from the remaining text. Selecting sentences of apparent significance performs better than selecting sentences at random.
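As a concrete illustration of the MLM corruption described above, the following minimal Python sketch masks roughly 15% of the token positions and builds the corresponding labels; the mask id and the ignore index are hypothetical placeholders, not values prescribed by any particular model.

import random

MASK_ID = 103          # hypothetical id of the special mask token (placeholder)
IGNORE_INDEX = -100    # positions excluded from the cross-entropy loss

def mlm_corrupt(token_ids, mask_prob=0.15, seed=0):
    # Mask roughly `mask_prob` of the positions and return (inputs, labels).
    # Labels hold the original ids at masked positions and IGNORE_INDEX
    # elsewhere, so the loss is computed only over the masked tokens.
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok        # training signal only at this position
            inputs[i] = MASK_ID    # this special id is never seen at fine-tuning time
    return inputs, labels

inputs, labels = mlm_corrupt([7, 42, 9, 310, 58, 1024, 3, 77])

The sketch also makes the two drawbacks visible: most label entries equal IGNORE_INDEX (only a small portion of the training signal is used), and the mask id appears only during pre-training.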
2.4.2. Fine-Tuning Procedure
3. TB Models Applications and Architectures
3.1. Architecture-Based Models (ABM)
3.1.1. Auto-Encoding Transformers (AET)
3.1.2. Auto-Regressive Transformers (ART)
3.1.3. S2S Transformers (S2S)
3.2. Applications-Based Models (AppBM)
3.2.1. Language-Based Models (LBM)
Multilingual
Monolingual
3.2.2. Domain-Based Models (DBM)
Social Media
Programming
Health
Scientific
Legal
Finance
Cybersecurity
3.2.3. Task-Based Models (TBM)
Benchmarks
- RACE [298] is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions, collected from English examinations in China designed for middle school and high school students. The dataset provides training and test sets for machine comprehension.
- The GLUE [84] (General Language Understanding Evaluation) benchmark is a collection of nine natural language understanding tasks, including the single-sentence tasks CoLA and SST-2, the similarity and paraphrasing tasks MRPC, STS-B, and QQP, and the natural language inference tasks MNLI, QNLI, RTE, and WNLI (see the loading sketch after this list).
- SQuAD [85] (Stanford Question Answering Dataset) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
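As an illustration of how these benchmarks are commonly accessed in practice, the short sketch below uses the Hugging Face datasets package (a tooling assumption on our part; the benchmarks themselves do not prescribe any loader).

from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")   # one of the nine GLUE tasks
squad = load_dataset("squad")         # SQuAD reading comprehension

print(sst2["train"][0])    # a sentence with its sentiment label
print(squad["train"][0])   # a question, its context passage, and the answer span(s)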
Evaluation Metrics
- Accuracy [309]: Simply put, accuracy reflects how frequently the classifier predicts correctly. Accuracy is determined by dividing the number of correct predictions by the total number of predictions (see the computation sketch after this list).
- Precision [310]: Precision reveals how many of the instances predicted as positive are actually positive. When false positives are more problematic than false negatives, precision is helpful.
- Recall (Sensitivity) [310]: Recall describes how many of the actual positive cases the model was able to correctly identify. It is a valuable metric when false negatives are more costly than false positives; in medical situations it is crucial, because even at the cost of a false alarm, real positive cases should not go unnoticed. For a label, recall is the proportion of true positives among all actual positives.
- F1 Score [310]: It synthesizes the precision and recall measurements and reaches its optimum when precision and recall are equal. The F1 score is the harmonic mean of precision and recall.
- AUC-ROC [311]: The Receiver Operating Characteristic (ROC) is a probability curve that separates the “signal” from the “noise” by plotting the TPR (True Positive Rate) against the FPR (False Positive Rate) at different threshold values. A classifier’s capacity to distinguish between classes is measured by the area under the curve (AUC).
- Confusion Matrix: The confusion matrix is a performance indicator for ML classification problems in which the output can be two or more classes. It is a table of combinations of predicted and actual values.
- BLEU [312]: In NLP tasks, the BLEU metric is frequently used to assess how texts generated by an NLG model differ from reference texts. Its value ranges between 0.0 and 1.0: BLEU is 1.0 if two sentences match perfectly and 0.0 when there is no overlap between them. In particular, BLEU counts the number of n-gram matches between the generated text and the reference text.
- ROUGE [313]: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It is an automated summary evaluation method that uses a set of indicators to gauge the quality of automatically generated texts. To determine how closely the generated texts resemble a set of reference texts, the system compares them with the reference texts, counts the overlapping basic units (n-grams), and calculates the corresponding score. The denominator is the only distinction between BLEU-n and ROUGE-n, as shown in Table 6. The total number of n-grams in the generated texts is the denominator of BLEU-n, which focuses on precision, whereas the total number of n-grams in the reference texts serves as the denominator of ROUGE-n, which is recall-oriented. A higher value indicates better recall-oriented quality.
- Perplexity (PPL) [314]: PPL is a word-level technique for evaluating the strengths and weaknesses of a language probability model. A Language Probability Model (LPM) assigns a probability distribution over a given text, i.e., the likelihood of the next word given the n preceding words in the text. When a language probability model is trained on the training set of reference texts for the TG task, it is applied to score the generated text. A lower PPL value indicates more fluent generated text.
- Distinct-n [315]: Distinct-n is an n-gram-based statistic used in scenarios that aim to increase the diversity of generated texts. The numerator is the number of n-grams that appear only once in the generated texts, and the denominator is the total number of n-grams in the generated texts; diversity rises as the value increases. The value of Distinct-n is 1 when the count of unique n-grams equals the total number of n-grams.
- TER [322]: It counts how many edits, including insertions, deletions, shifts, and substitutions, were necessary to convert the MT output into the reference.
- CHRF [323]: Instead of using the word n-grams, it compares the MT output with the reference using the character n-grams. This facilitates matching word morphological variants.
- YISI-1 [324]: It uses contextual word embeddings such as BERT to determine the semantic similarity of phrases in the MT output with the reference.
- YISI-2 [325]: It is identical to YISI-1, with the exception that it calculates the degree to which the MT output and the source are comparable using cross-lingual embeddings.
- Enhanced Sequential Inference Model (ESIM) [326]: It is a trained neural model that first determines sentence representations from BERT embeddings and then determines how similar the two strings are to one another.
- Word Error Rate (WER) [327]: This is the accepted measurement for automatic speech recognition evaluation. WER is calculated by dividing the Levenshtein distance [328] between the words of the system output and the words of the reference translation by the length of the reference translation. To determine the best alignment between the MT output and the reference translation, the Levenshtein distance is computed using dynamic programming, with each word in the MT output aligning to either 1 or 0 words in the reference translation and vice versa.
- NIST [329]: The NIST precision measure was developed as an improvement over BLEU. It gives n-gram occurrences more weight based on their significance and minimizes unwanted consequences of BLEU’s brevity penalty. Significance is calculated from the frequency of the n-gram in the references: whereas BLEU pools numerous reference translations, NIST considers infrequent n-grams more significant than frequent ones. Its brevity penalty is intended to counteract BLEU’s slight favoritism toward brief candidates.
- Metric for Evaluation of Translation with Explicit ORdering (METEOR) [330]: It is a metric created expressly to address a number of BLEU’s recognized flaws. BLEU is typically a precision-oriented measure, whereas METEOR is recall-oriented. Unlike BLEU, which only computes precision, METEOR computes both recall and precision and then combines the two into a harmonic mean that heavily favors recall.
- Position-Independent Error Rate (PER) [331]: It is an effort to overcome WER’s word-ordering restriction by treating the reference and hypothesis as bags of words, which allows words from the hypothesis to be aligned to words in the reference regardless of position. As a result, the PER of an MT output is guaranteed to be lower than or equal to its WER. The drawback is that the metric cannot tell a correct translation from one in which the words have been jumbled.
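The definitions of accuracy, precision, recall, and the F1 score given above reduce to simple arithmetic over confusion-matrix counts; the following Python sketch (with illustrative variable names of our own) implements them.

def classification_metrics(tp, fp, fn, tn):
    # Accuracy, precision, recall, and F1 computed from confusion-matrix counts.
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # correctness of positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # coverage of actual positives
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Example with 90 true positives, 10 false positives, 5 false negatives, 95 true negatives.
print(classification_metrics(90, 10, 5, 95))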
NLP Downstream Tasks
4. Discussion
- Attention is limited to handling fixed-length text sequences. Before being fed to the system, the text must be divided into segments or chunks of a predetermined length, which may lead to a loss of context as a result of text chunking.
- The inability of TB models to process extended sequences is mainly caused by the computational and memory cost of the self-attention module, which grows quadratically with sequence length (illustrated in the sketch after this list).
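To make the second point concrete, the toy single-head attention below (a sketch with random inputs, not any specific model’s implementation) materializes an n × n score matrix, so its memory and time grow quadratically with the sequence length n.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Single-head attention: the score matrix has shape (n, n),
    # which is the source of the quadratic cost in sequence length.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, d = 1024, 64   # sequence length and head dimension (illustrative values)
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)   # internally stores a 1024 x 1024 score matrix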
4.1. Architecture Level
4.1.1. TB Model Configurations Evaluation
- RQ1: Does the model size have an impact on the model’s performance?
- RQ2: Does the number of blocks and attention layers have an effect on the model’s performance?
4.1.2. Tokenization Evaluation
- RQ1: What is the effect of preprocessing techniques on TB models?
- RQ2: What differentiates the tokenization of the TB models from the traditional methods?
- First, the success of TB models stems from a self-supervised pre-training task that promotes general language comprehension without considering the particular requirements of ranking tasks. There is a gap between the initial self-supervised training task and the final interaction ranker. Designing ranking-aware pre-training tasks may therefore result in gains in both effectiveness and efficiency, especially when data on the target task are scarce. By ranking tasks, we mean the downstream tasks that require the model to iterate during fine-tuning and learn ranking patterns in each iteration until achieving the best results; in other terms, the ranking task amounts to reducing the loss function used in each pre-training objective.
- Second, TB models combine long sequences with numerous layers, but it is not clear what value this rich semantics adds in terms of ranking. For example, the deep layers of BERT yielded some, albeit modest, performance improvements, yet it is still unclear how exactly the model learns to accurately understand the relevant patterns. Some attribute this to the MLM pre-training step. The masking method is what differentiates BERT from the original Transformer encoder block, but even the masking mechanism is not well explained in terms of which patterns are masked and how the model learns them.
- Third, TB models create distinct sequence representations by employing a different form of tokenization than feature-extraction techniques such as stemming and lemmatizing. For example, the constrained vocabulary of the BERT tokenizer affects a large number of long-tail tokens, resulting in significant efficiency gains at the expense of a negligible drop in effectiveness (see the tokenization sketch after this list).
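To illustrate the third point, the sketch below contrasts subword tokenization with a traditional stem-oriented view; it assumes the Hugging Face transformers package and a pre-trained WordPiece vocabulary, and the exact subword splits depend on that vocabulary.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A long-tail word is broken into subword pieces from the fixed vocabulary,
# whereas a stemmer or lemmatizer would map it to a single normalized form.
print(tokenizer.tokenize("lemmatization handles long-tail neologisms"))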
4.1.3. Improving the Transformer Architecture
- RQ1: How to improve the TB model’s time, training speed, memory, complexity, and efficiency?
- RQ2: What techniques were proposed to improve the Transformer architecture?
4.2. Application Level
4.2.1. Benchmarks Evaluation
- RQ1: Are the currently used benchmarks to evaluate the TB models efficient?
- RQ2: Pre-training vs. fine-tuning datasets for TB models?
4.2.2. Downstream Tasks Evaluation
- RQ1: What is the difference between fine-tuning pre-trained TB models and using feature-based approaches?
- RQ2: What techniques can be used to improve the fine-tuning phase?
4.2.3. Multilinguality Evaluation
- RQ1: How efficient are the cross-lingual TB models?
- RQ2: Is using a monolingual model better than relying on a multilingual one?
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
AE | Auto-Encoding |
AET | Auto-Encoding Transformers |
AI | Artificial Intelligence |
ALM | Alternate Language Modeling |
AR | Auto-Regression |
ART | Auto-Regressive Transformers |
bBPE | Byte Level Byte Pair Encoding |
BERT | Bidirectional Encoder Representations from Transformers |
BPE | Byte Pair Encoding |
CLM | Causal Language Modeling |
CNN | Convolutional Neural Networks |
DAE | Denoising Auto Encoder |
DBM | Domain-Based Models |
DL | Deep Learning |
DNN | Deep Neural Networks |
DS | Document Summarization |
GELU | Gaussian Error Linear Unit |
GRU | Gated Recurrent Unit |
GSG | Gap Sentences Generation |
K | Key |
LBM | Language-Based Models |
LM | Language Model |
LSTM | Long Short-Term Memory |
MAE | Mean Absolute Error |
ML | Machine Learning |
MLM | Masked Language Model |
MSE | Mean Square Error |
MT | Machine Translation |
NER | Named Entity Recognition |
NLP | Natural Language Processing |
NSP | Next Sentence Prediction |
PE | Positional Encoding |
Q | Query |
QA | Question Answering |
ReLU | Rectified Linear Unit |
RNN | Recurrent Neural Networks |
RTD | Replaced Token Detection |
RTS | Random Token Substitution |
SA | Sentiment Analysis |
SBO | Span Boundary Objective |
SLM | Swapped Language Modeling |
SLR | Segment Level Recurrence |
SOP | Sentence Order Prediction |
SSLM | Sequence-to-Sequence LM |
STD | Shuffled Token Detection |
TB | Transformer-Based |
TBM | Task-Based Models |
TC | Text Classification |
TG | Text Generation |
TLM | Translation Language Modeling |
V | Value |
WSO | Word Structural Objective |
References
- Mitkov, R. The Oxford Handbook of Computational Linguistics; Oxford University Press: Oxford, UK, 2022. [Google Scholar]
- Wilie, B.; Vincentio, K.; Winata, G.I.; Cahyawijaya, S.; Li, X.; Lim, Z.Y.; Soleman, S.; Mahendra, R.; Fung, P.; Bahar, S.; et al. Indonlu: Benchmark and resources for evaluating indonesian natural language understanding. arXiv 2020, arXiv:2009.05387. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Dumitrescu, S.D.; Avram, A.M.; Pyysalo, S. The birth of Romanian BERT. arXiv 2020, arXiv:2009.08712. [Google Scholar]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [Green Version]
- Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.H.; Jin, D.; Naumann, T.; McDermott, M. Publicly available clinical BERT embeddings. arXiv 2019, arXiv:1904.03323. [Google Scholar]
- Peng, Y.; Yan, S.; Lu, Z. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. arXiv 2019, arXiv:1906.05474. [Google Scholar]
- Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. (HEALTH) 2021, 3, 1–23. [Google Scholar] [CrossRef]
- Yang, Y.; Uy, M.C.S.; Huang, A. FinBERT: A Pretrained Language Model for Financial Communications. arXiv 2020, arXiv:2006.08097. [Google Scholar]
- Gururangan, S.; Marasovic, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964. [Google Scholar]
- Caselli, T.; Basile, V.; Mitrovic, J.; Granitzer, M. Hatebert: Retraining bert for abusive language detection in english. arXiv 2020, arXiv:2010.12472. [Google Scholar]
- Zhou, J.; Tian, J.; Wang, R.; Wu, Y.; Xiao, W.; He, L. Sentix: A sentiment-aware pre-trained model for cross-domain sentiment analysis. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; pp. 568–579. [Google Scholar]
- Muller, M.; Salathe, M.; Kummervold, P.E. Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. arXiv 2020, arXiv:2005.07503. [Google Scholar]
- Barbieri, F.; Camacho-Collados, J.; Neves, L.; Espinosa-Anke, L. Tweeteval: Unified benchmark and comparative evaluation for tweet classification. arXiv 2020, arXiv:2010.12421. [Google Scholar]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning (Adaptive Computation and Machine Learning Series); The MIT Press Cambridge: Cambridge, MA, USA, 2016; Volume 19, pp. 226, 305–307. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. [Google Scholar]
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef] [Green Version]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
- Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
- Chorowski, J.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. arXiv 2015, arXiv:1506.07503. [Google Scholar]
- Firat, O.; Cho, K.; Bengio, Y. Multi-way, multilingual neural machine translation with a shared attention mechanism. arXiv 2016, arXiv:1601.01073. [Google Scholar]
- Choi, H.; Cho, K.; Bengio, Y. Fine-grained attention mechanism for neural machine translation. Neurocomputing 2018, 284, 171–176. [Google Scholar] [CrossRef] [Green Version]
- Kumar, V.; Choudhary, A.; Cho, E. Data augmentation using pre-trained transformer models. arXiv 2020, arXiv:2003.02245. [Google Scholar]
- Shao, T.; Guo, Y.; Chen, H.; Hao, Z. Transformer-Based Neural Network for Answer Selection in Question Answering. IEEE Access 2019, 7, 26146–26156. [Google Scholar] [CrossRef]
- Kowsher, M.; Sobuj, M.S.I.; Shahriar, M.F.; Prottasha, N.J.; Arefin, M.S.; Dhar, P.K.; Koshiba, T. An Enhanced Neural Word Embedding Model for Transfer Learning. Appl. Sci. 2022, 12, 2848. [Google Scholar] [CrossRef]
- Bensoltane, R.; Zaki, T. Towards Arabic aspect-based sentiment analysis: A transfer learning-based approach. Soc. Netw. Anal. Min. 2022, 12, 7. [Google Scholar] [CrossRef]
- Prottasha, N.J.; Sami, A.A.; Kowsher, M.; Murad, S.A.; Bairagi, A.K.; Masud, M.; Baz, M. Transfer Learning for Sentiment Analysis Using BERT Based Supervised Fine-Tuning. Sensors 2022, 22, 4157. [Google Scholar] [CrossRef] [PubMed]
- Sasikala, S.; Ramesh, S.; Gomathi, S.; Balambigai, S.; Anbumani, V. Transfer learning based recurrent neural network algorithm for linguistic analysis. Concurr. Comput. Pract. Exp. 2022, 34, e6708. [Google Scholar] [CrossRef]
- Taneja, K.; Vashishtha, J. Comparison of Transfer Learning and Traditional Machine Learning Approach for Text Classification. In Proceedings of the IEEE 2022 9th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 23–25 March 2022; pp. 195–200. [Google Scholar]
- Qiao, Y.; Zhu, X.; Gong, H. BERT-Kcr: Prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models. Bioinformatics 2022, 38, 648–654. [Google Scholar] [CrossRef]
- Qasim, R.; Bangyal, W.H.; Alqarni, M.A.; Ali Almazroi, A. A fine-tuned BERT-based transfer learning approach for text classification. J. Healthc. Eng. 2022, 2022, 3498123. [Google Scholar] [CrossRef]
- Li, N.; Liu, S.; Liu, Y.; Zhao, S.; Liu, M. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 6706–6713. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
- Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient transformers: A survey. arXiv 2020, arXiv:2009.06732. [Google Scholar] [CrossRef]
- Lin, T.; Wang, Y.; Liu, X.; Qiu, X. A survey of transformers. arXiv 2021, arXiv:2106.04554. [Google Scholar] [CrossRef]
- Kalyan, K.S.; Rajasekharan, A.; Sangeetha, S. Ammus: A survey of transformer-based pretrained models in natural language processing. arXiv 2021, arXiv:2108.05542. [Google Scholar]
- Gillioz, A.; Casas, J.; Mugellini, E.; Khaled, O.A. Overview of the Transformer-based Models for NLP Tasks. In Proceedings of the 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria, 6–9 September 2020; pp. 179–183. [Google Scholar] [CrossRef]
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; Online, OpenAI. 2018. Available online: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 2 October 2022).
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. Xlnet: Generalized autoregressive pretraining for language understanding. In Proceedings of the NIPS’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5753–5763. [Google Scholar]
- Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv 2019, arXiv:1904.09223. [Google Scholar]
- Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Tian, H.; Wu, H.; Wang, H. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding. arXiv 2020, arXiv:1907.12412. [Google Scholar] [CrossRef]
- Wang, Z.; Ma, Y.; Liu, Z.; Tang, J. R-transformer: Recurrent neural network enhanced transformer. arXiv 2019, arXiv:1907.05572. [Google Scholar]
- Parisotto, E.; Song, F.; Rae, J.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.; Jaderberg, M.; Kaufman, R.L.; Clark, A.; Noury, S.; et al. Stabilizing transformers for reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual Event, 13–18 July 2020; pp. 7487–7498. [Google Scholar]
- Lakew, S.M.; Cettolo, M.; Federico, M. A comparison of transformer and recurrent neural networks on multilingual neural machine translation. arXiv 2018, arXiv:1806.06957. [Google Scholar]
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112. [Google Scholar]
- Chung, J.; Gulçehre, Ç.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar]
- Dai, Z.; Yang, Z.; Yang, Y.; Carbonell, J.; Le, Q.V.; Salakhutdinov, R. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv 2019, arXiv:1901.02860. [Google Scholar]
- Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Lample, G.; Conneau, A. Cross-lingual language model pretraining. arXiv 2019, arXiv:1901.07291. [Google Scholar]
- Sennrich, R.; Haddow, B.; Birch, A. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
- Kudo, T.; Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv 2018, arXiv:1808.06226. [Google Scholar]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; Hon, H.W. Unified language model pre-training for natural language understanding and generation. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; pp. 13063–13075. [Google Scholar]
- Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
- Panda, S.; Agrawal, A.; Ha, J.; Bloch, B. Shuffled-token detection for refining pre-trained roberta. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Online, 6–11 June 2021; pp. 88–93. [Google Scholar]
- Di Liello, L.; Gabburo, M.; Moschitti, A. Efficient pre-training objectives for transformers. arXiv 2021, arXiv:2104.09694. [Google Scholar]
- Chi, Z.; Dong, L.; Wei, F.; Wang, W.; Mao, X.L.; Huang, H. Cross-lingual natural language generation via pre-training. Artif. Intell. 2020, 34, 7570–7577. [Google Scholar] [CrossRef]
- Yang, J.; Ma, S.; Zhang, D.; Wu, S.; Li, Z.; Zhou, M. Alternating language modeling for cross-lingual pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, New York Hilton Midtown, NY, USA, 7–12 February 2020; Volume 34, pp. 9386–9393. [Google Scholar]
- Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. Spanbert: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77. [Google Scholar] [CrossRef]
- Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv 2019, arXiv:1910.10683. [Google Scholar]
- Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A massively multilingual pre-trained text-to-text transformer. arXiv 2020, arXiv:2010.11934. [Google Scholar]
- Song, K.; Tan, X.; Qin, T.; Lu, J.; Liu, T.Y. MASS: Masked Sequence to Sequence Pre-training for Language Generation. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv 2019, arXiv:1910.13461. [Google Scholar]
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. Tinybert: Distilling bert for natural language understanding. arXiv 2019, arXiv:1909.10351. [Google Scholar]
- Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A pretrained language model for scientific text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
- Huang, K.; Altosaar, J.; Ranganath, R. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv 2019, arXiv:1904.05342. [Google Scholar]
- Zhang, X.; Wei, F.; Zhou, M. HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 5059–5069. [Google Scholar] [CrossRef] [Green Version]
- Goyal, S.; Choudhary, A.R.; Chakaravarthy, V.; ManishRaje, S.; Sabharwal, Y.; Verma, A. PoWER-BERT: Accelerating BERT inference for Classification Tasks. arXiv 2020, arXiv:2001.08950. [Google Scholar]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Deng, H.; Ju, Q. FastBERT: A Self-distilling BERT with Adaptive Inference Time. arXiv 2020, arXiv:2004.02178. [Google Scholar]
- Wu, X.; Lv, S.; Zang, L.; Han, J.; Hu, S. Conditional BERT contextual augmentation. In International Conference on Computational Science; Springer: Berlin/Heidelberg, Germany, 2019; pp. 84–95. [Google Scholar]
- Wu, C.S.; Hoi, S.; Socher, R.; Xiong, C. Tod-bert: Pre-trained natural language understanding for task-oriented dialogues. arXiv 2020, arXiv:2004.06871. [Google Scholar]
- Mackenzie, J.; Benham, R.; Petri, M.; Trippas, J.R.; Culpepper, J.S.; Moffat, A. CC-News-En: A Large English News Corpus. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, 19–23 October 2020; pp. 3077–3084. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 353–355. [Google Scholar] [CrossRef] [Green Version]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX, USA, 1–5 November 2016; pp. 2383–2392. [Google Scholar]
- Reddy, S.; Chen, D.; Manning, C.D. CoQA: A Conversational Question Answering Challenge. Trans. Assoc. Comput. Linguist. 2019, 7, 249–266. [Google Scholar] [CrossRef]
- Yang, L.; Zhang, M.; Li, C.; Bendersky, M.; Najork, M. Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Online, 19–23 October 2020; pp. 1725–1734. [Google Scholar]
- Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
- He, P.; Liu, X.; Gao, J.; Chen, W. DeBERTa: Decoding-enhanced BERT with Disentangled Attention. arXiv 2021, arXiv:2006.03654. [Google Scholar]
- Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices. arXiv 2020, arXiv:2004.02984. [Google Scholar]
- de Wynter, A.; Perry, D. Optimal Subarchitecture Extraction For BERT. arXiv 2020, arXiv:2010.10499. [Google Scholar]
- Xin, J.; Tang, R.; Lee, J.; Yu, Y.; Lin, J.J. DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 2246–2251. [Google Scholar]
- Kanade, A.; Maniatis, P.; Balakrishnan, G.; Shi, K. Learning and Evaluating Contextual Embedding of Source Code. In Proceedings of the 37th International Conference on Machine Learning (ICML), Online, 13–18 July 2020; pp. 5110–5121. [Google Scholar]
- Hou, L.; Huang, Z.; Shang, L.; Jiang, X.; Liu, Q. DynaBERT: Dynamic BERT with Adaptive Width and Depth. arXiv 2020, arXiv:2004.04037. [Google Scholar]
- Zhang, W.; Hou, L.; Yin, Y.; Shang, L.; Chen, X.; Jiang, X.; Liu, Q. TernaryBERT: Distillation-aware Ultra-low Bit BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar]
- Kim, S.; Gholami, A.; Yao, Z.; Mahoney, M.W.; Keutzer, K. I-BERT: Integer-only BERT Quantization. arXiv 2021, arXiv:2101.01321. [Google Scholar]
- Jiang, Z.; Yu, W.; Zhou, D.; Chen, Y.; Feng, J.; Yan, S. ConvBERT: Improving BERT with Span-based Dynamic Convolution. arXiv 2020, arXiv:2008.02496. [Google Scholar]
- Iandola, F.N.; Shaw, A.E.; Krishna, R.; Keutzer, K. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? In Proceedings of the SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, Online, 20 November 2020; pp. 124–135. [Google Scholar]
- Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting Pre-Trained Models for Chinese Natural Language Processing. arXiv 2020, arXiv:2004.13922. [Google Scholar]
- Bai, H.; Zhang, W.; Hou, L.; Shang, L.; Jin, J.; Jiang, X.; Liu, Q.; Lyu, M.R.; King, I. BinaryBERT: Pushing the Limit of BERT Quantization. arXiv 2021, arXiv:2012.15701. [Google Scholar]
- Yin, Y.; Chen, C.; Shang, L.; Jiang, X.; Chen, X.; Liu, Q. AutoTinyBERT: Automatic Hyper-parameter Optimization for Efficient Pre-trained Language Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 2–5 August 2021; pp. 5146–5157. [Google Scholar]
- Keskar, N.S.; McCann, B.; Varshney, L.R.; Xiong, C.; Socher, R. Ctrl: A conditional transformer language model for controllable generation. arXiv 2019, arXiv:1909.05858. [Google Scholar]
- Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv 2019, arXiv:1909.08053. [Google Scholar]
- Ahmad, W.U.; Chakraborty, S.; Ray, B.; Chang, K.W. Unified pre-training for program understanding and generation. arXiv 2021, arXiv:2103.06333. [Google Scholar]
- Abdelfattah, A.; Tomov, S.; Dongarra, J. Investigating the benefit of FP16-enabled mixed-precision solvers for symmetric positive definite matrices using GPUs. In Computational Science—ICCS 2020. ICCS 2020; Lecture Notes in Computer Science; Springer Nature: Cham, Switzerland, 2020; Volume 12138, pp. 237–250. [Google Scholar] [CrossRef]
- Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020; Article No.: 1051. pp. 11328–11339. [Google Scholar]
- Bi, B.; Li, C.; Wu, C.; Yan, M.; Wang, W.; Huang, S.; Huang, F.; Si, L. Palm: Pre-training an autoencoding & autoregressive language model for context-conditioned generation. arXiv 2020, arXiv:2004.07159. [Google Scholar]
- Gaschi, F.; Plesse, F.; Rastin, P.; Toussaint, Y. Multilingual Transformer Encoders: A Word-Level Task-Agnostic Evaluation. In Proceedings of the WCCI2022—IEEE World Congress on Computational Intelligence, Padoue, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar]
- Chi, Z.; Dong, L.; Ma, S.; Mao, S.H.X.L.; Huang, H.; Wei, F. mt6: Multilingual pretrained text-to-text transformer with translation pairs. arXiv 2021, arXiv:2104.08692. [Google Scholar]
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzman, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
- Patel, J.M. Introduction to common crawl datasets. In Getting Structured Data from the Internet; Apress: Berkeley, CA, USA, 2020; pp. 277–324. [Google Scholar] [CrossRef]
- Chi, Z.; Huang, S.; Dong, L.; Ma, S.; Singhal, S.; Bajaj, P.; Song, X.; Wei, F. XLM-E: Cross-lingual language model pre-training via ELECTRA. arXiv 2021, arXiv:2106.16138. [Google Scholar]
- Jiang, X.; Liang, Y.; Chen, W.; Duan, N. XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge. arXiv 2021, arXiv:2109.12573. [Google Scholar] [CrossRef]
- Barbieri, F.; Anke, L.E.; Camacho-Collados, J. Xlm-t: A multilingual language model toolkit for twitter. arXiv 2021, arXiv:2104.12250. [Google Scholar]
- Barbieri, F.; Espinosa-Anke, L.; Camacho-Collados, J. XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond. In Proceedings of the Language Resources and Evaluation Conference (LREC), Marseille, France, 20–25 June 2022; pp. 20–25. [Google Scholar]
- Goyal, N.; Du, J.; Ott, M.; Anantharaman, G.; Conneau, A. Larger-scale transformers for multilingual masked language modeling. arXiv 2021, arXiv:2105.00572. [Google Scholar]
- Khanuja, S.; Bansal, D.; Mehtani, S.; Khosla, S.; Dey, A.; Gopalan, B.; Margam, D.K.; Aggarwal, P.; Nagipogu, R.T.; Dave, S.; et al. Muril: Multilingual representations for indian languages. arXiv 2021, arXiv:2103.10730. [Google Scholar]
- Huang, H.; Liang, Y.; Duan, N.; Gong, M.; Shou, L.; Jiang, D.; Zhou, M. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. arXiv 2019, arXiv:1909.00964. [Google Scholar]
- Koto, F.; Rahimi, A.; Lau, J.H.; Baldwin, T. IndoLEM and IndoBERT: A benchmark dataset and pre-trained language model for Indonesian NLP. arXiv 2020, arXiv:2011.00677. [Google Scholar]
- Le, H.; Vial, L.; Frej, J.; Segonne, V.; Coavoux, M.; Lecouteux, B.; Allauzen, A.; Crabbe, B.; Besacier, L.; Schwab, D. Flaubert: Unsupervised language model pre-training for french. arXiv 2019, arXiv:1912.05372. [Google Scholar]
- Rybak, P.; Mroczkowski, R.; Tracz, J.; Gawlik, I. KLEJ: Comprehensive benchmark for polish language understanding. arXiv 2020, arXiv:2005.00630. [Google Scholar]
- Park, S.; Moon, J.; Kim, S.; Cho, W.I.; Han, J.; Park, J.; Song, C.; Kim, J.; Song, Y.; Oh, T.; et al. Klue: Korean language understanding evaluation. arXiv 2021, arXiv:2105.09680. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. Arabert: Transformer-based model for arabic language understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
- Nguyen, D.Q.; Nguyen, A.T. PhoBERT: Pre-trained language models for Vietnamese. arXiv 2020, arXiv:2003.00744. [Google Scholar]
- Martin, L.; Muller, B.; Suarez, P.J.O.; Dupont, Y.; Romary, L.; de La Clergerie, E.V.; Seddah, D.; Sagot, B. CamemBERT: A tasty French language model. arXiv 2019, arXiv:1911.03894. [Google Scholar]
- Malmsten, M.; Borjeson, L.; Haffenden, C. Playing with Words at the National Library of Sweden–Making a Swedish BERT. arXiv 2020, arXiv:2007.01658. [Google Scholar]
- Dadas, S.; Perelkiewicz, M.; Poswiata, R. Pre-training polish transformer-based language models at scale. In Proceedings of the Artificial Intelligence and Soft Computing: 19th International Conference, ICAISC 2020, Zakopane, Poland, 12–14 October 2020; Proceedings Part II. pp. 301–314. [Google Scholar] [CrossRef]
- de Vries, W.; van Cranenburgh, A.; Bisazza, A.; Caselli, T.; van Noord, G.; Nissim, M. Bertje: A dutch bert model. arXiv 2019, arXiv:1912.09582. [Google Scholar]
- Virtanen, A.; Kanerva, J.; Ilo, R.; Luoma, J.; Luotolahti, J.; Salakoski, T.; Ginter, F.; Pyysalo, S. Multilingual is not enough: BERT for Finnish. arXiv 2019, arXiv:1912.07076. [Google Scholar]
- Polignano, M.; Basile, P.; De Gemmis, M.; Semeraro, G.; Basile, V. Alberto: Italian BERT language understanding model for NLP challenging tasks based on tweets. In Proceedings of the 6th Italian Conference on Computational Linguistics, CLiC-it 2019, Bari, Italy, 13–15 November 2019; Volume 2481, pp. 1–6. [Google Scholar]
- Souza, F.; Nogueira, R.; Lotufo, R. BERTimbau: Pretrained BERT models for Brazilian Portuguese. In Intelligent Systems. BRACIS 2020; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; Volume 12319, pp. 403–417. [Google Scholar] [CrossRef]
- Kuratov, Y.; Arkhipov, M. Adaptation of deep bidirectional multilingual transformers for russian language. arXiv 2019, arXiv:1905.07213. [Google Scholar]
- Bhattacharjee, A.; Hasan, T.; Samin, K.; Rahman, M.S.; Iqbal, A.; Shahriyar, R. Banglabert: Combating embedding barrier for low-resource language understanding. arXiv 2021, arXiv:2101.00204. [Google Scholar]
- Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B. ARBERT and MARBERT: Deep bidirectional transformers for Arabic. arXiv 2020, arXiv:2101.01785. [Google Scholar]
- Farahani, M.; Gharachorloo, M.; Farahani, M.; Manthouri, M. Parsbert: Transformer-based model for persian language understanding. Neural Process. Lett. 2021, 53, 3831–3847. [Google Scholar] [CrossRef]
- Antoun, W.; Baly, F.; Hajj, H. Aragpt2: Pre-trained transformer for arabic language generation. arXiv 2020, arXiv:2012.15520. [Google Scholar]
- Roy, A.; Sharma, I.; Sarkar, S.; Goyal, P. Meta-ED: Cross-lingual Event Detection using Meta-learning for Indian Languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022. [Google Scholar] [CrossRef]
- Lowphansirikul, L.; Polpanumas, C.; Jantrakulchai, N.; Nutanong, S. Wangchanberta: Pretraining transformer-based thai language models. arXiv 2021, arXiv:2101.09635. [Google Scholar]
- Carmo, D.; Piau, M.; Campiotti, I.; Nogueira, R.; Lotufo, R. PTT5: Pretraining and validating the T5 model on Brazilian Portuguese data. arXiv 2020, arXiv:2008.09144. [Google Scholar]
- Wagner, J.; Wilkens, R.; Idiart, M.A.P.; Villavicencio, A. The brWaC Corpus: A New Open Resource for Brazilian Portuguese. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. Araelectra: Pre-training text discriminators for arabic language understanding. arXiv 2020, arXiv:2012.15516. [Google Scholar]
- Cahyawijaya, S.; Winata, G.I.; Wilie, B.; Vincentio, K.; Li, X.; Kuncoro, A.; Ruder, S.; Lim, Z.Y.; Bahar, S.; Khodra, M.L.; et al. Indonlg: Benchmark and resources for evaluating indonesian natural language generation. arXiv 2021, arXiv:2104.08200. [Google Scholar]
- Lee, H.; Yoon, J.; Hwang, B.; Joe, S.; Min, S.; Gwon, Y. Korealbert: Pretraining a lite bert model for korean language understanding. In Proceedings of the IEEE 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5551–5557. [Google Scholar]
- Straka, M.; Naplava, J.; Strakova, J.; Samuel, D. RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model. In Proceedings of the 24th International Conference on Text, Speech, and Dialogue (TSD 2021), Olomouc, Czech Republic, 6–9 September 2021; Springer: Cham, Switzerland, 2021; pp. 197–209. [Google Scholar]
- Canete, J.; Chaperon, G.; Fuentes, R.; Ho, J.H.; Kang, H.; Perez, J. Spanish pre-trained bert model and evaluation data. In Proceedings of the Practical Machine Learning for Developing Countries Workshop (PML4DC) at the Eleventh International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26 April 2020. [Google Scholar]
- Zampieri, M.; Malmasi, S.; Nakov, P.; Rosenthal, S.; Farra, N.; Kumar, R. SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 75–86. [Google Scholar]
- Caselli, T.; Basile, V.; Mitrovic, J.; Kartoziya, I.; Granitzer, M. I Feel Offended, Don’t Be Abusive! Implicit/Explicit Messages in Offensive and Abusive Language. In Proceedings of the Twelfth Language Resources and Evaluation Conference (LREC), Marseille, France, 11–16 May 2020; pp. 6193–6202. [Google Scholar]
- Basile, V.; Bosco, C.; Fersini, E.; Nozza, D.; Patti, V.; Pardo, F.M.R.; Rosso, P.; Sanguinetti, M. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 54–63. [Google Scholar]
- Nguyen, D.Q.; Vu, T.; Nguyen, A.T. BERTweet: A pre-trained language model for English Tweets. arXiv 2020, arXiv:2005.10200. [Google Scholar]
- Rahali, A.; Akhloufi, M.A.; Therien-Daniel, A.M.; Brassard-Gourdeau, E. Automatic Misogyny Detection in Social Media Platforms using Attention-based Bidirectional-LSTM. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, Australia, 17–20 October 2021; pp. 2706–2711. [Google Scholar]
- Sawhney, R.; Neerkaje, A.T.; Gaur, M. A Risk-Averse Mechanism for Suicidality Assessment on Social Media. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 628–635. [Google Scholar]
- Ta, H.T.; Rahman, A.B.S.; Najjar, L.; Gelbukh, A.F. Multi-Task Learning for Detection of Aggressive and Violent Incidents from Social Media. In Proceedings of the 2022 Iberian Languages Evaluation Forum, IberLEF 2022, A Coruna, Spain, 20 September 2022. [Google Scholar]
- Sakhrani, H.; Parekh, S.; Ratadiya, P. Contextualized Embedding based Approaches for Social Media-specific Sentiment Analysis. In Proceedings of the 2021 International Conference on Data Mining Workshops (ICDMW), Auckland, New Zealand, 7–10 December 2021; pp. 186–193. [Google Scholar]
- Ahmed, T.; Kabir, M.; Ivan, S.; Mahmud, H.; Hasan, K. Am I Being Bullied on Social Media? An Ensemble Approach to Categorize Cyberbullying. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Online, 15–18 December 2021; pp. 2442–2453. [Google Scholar]
- Perez, J.M.; Furman, D.A.; Alemany, L.A.; Luque, F.M. RoBERTuito: A pre-trained language model for social media text in Spanish. arXiv 2021, arXiv:2111.09453. [Google Scholar]
- Wang, C.; Gou, J.; Fan, Z. News Recommendation Based On Multi-Feature Sequence Transformer. In Proceedings of the 2021 11th International Conference on Information Technology in Medicine and Education (ITME), Wuyishan, China, 19–21 November 2021; pp. 132–139. [Google Scholar]
- Aljohani, A.A.; Rakrouki, M.A.; Alharbe, N.; Alluhaibi, R. A Self-Attention Mask Learning-Based Recommendation System. IEEE Access 2022, 10, 93017–93028. [Google Scholar] [CrossRef]
- Bhumika; Das, D. MARRS: A Framework for multi-objective risk-aware route recommendation using Multitask-Transformer. In Proceedings of the 16th ACM Conference on Recommender Systems, Seattle, WA, USA, 18–23 September 2022. [Google Scholar]
- Ghorbanpour, F.; Ramezani, M.; Fazli, M.A.; Rabiee, H.R. FNR: A Similarity and Transformer-Based Approach to Detect Multi-Modal Fake News in Social Media. arXiv 2021, arXiv:2112.01131. [Google Scholar]
- Chen, B.; Chen, B.; Gao, D.; Chen, Q.; Huo, C.; Meng, X.; Ren, W.; Zhou, Y. Transformer-based Language Model Fine-tuning Methods for COVID-19 Fake News Detection. arXiv 2021, arXiv:2101.05509. [Google Scholar]
- Mehta, D.; Dwivedi, A.; Patra, A.; Kumar, M.A. A transformer-based architecture for fake news classification. Soc. Netw. Anal. Min. 2021, 11, 39. [Google Scholar] [CrossRef]
- Hande, A.; Puranik, K.; Priyadharshini, R.; Thavareesan, S.; Chakravarthi, B.R. Evaluating Pretrained Transformer-based Models for COVID-19 Fake News Detection. In Proceedings of the 2021 5th International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 8–10 April 2021; pp. 766–772. [Google Scholar]
- Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664. [Google Scholar]
- Guo, D.; Ren, S.; Lu, S.; Feng, Z.; Tang, D.; Liu, S.; Zhou, L.; Duan, N.; Svyatkovskiy, A.; Fu, S.; et al. Graphcodebert: Pre-training code representations with data flow. arXiv 2020, arXiv:2009.08366. [Google Scholar]
- Phan, L.; Tran, H.; Le, D.; Nguyen, H.; Anibal, J.; Peltekian, A.; Ye, Y. Cotext: Multi-task learning with code-text transformer. arXiv 2021, arXiv:2105.08645. [Google Scholar]
- Feng, Z.; Guo, D.; Tang, D.; Duan, N.; Feng, X.; Gong, M.; Shou, L.; Qin, B.; Liu, T.; Jiang, D.; et al. Codebert: A pre-trained model for programming and natural languages. arXiv 2020, arXiv:2002.08155. [Google Scholar]
- Sherstinsky, A. Fundamentals of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) Network. arXiv 2018, arXiv:1808.03314. [Google Scholar] [CrossRef] [Green Version]
- O’Shea, K.; Nash, R. An Introduction to Convolutional Neural Networks. arXiv 2015, arXiv:1511.08458. [Google Scholar]
- Kalyan, K.S.; Rajasekharan, A.; Sangeetha, S. AMMU—A Survey of Transformer-based Biomedical Pretrained Language Models. J. Biomed. Inform. 2022, 126, 103982. [Google Scholar] [CrossRef]
- Journal, I. Transformer Health Monitoring System Using Internet of Things. In Proceedings of the 2018 2nd IEEE International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES), Delhi, India, 22–24 October 2018. [Google Scholar] [CrossRef]
- Roitero, K.; Bozzato, C.; Mea, V.D.; Mizzaro, S.; Serra, G. Twitter goes to the Doctor: Detecting Medical Tweets using Machine Learning and BERT. In Proceedings of the Workshop on Semantic Indexing and Information Retrieval for Health from Heterogeneous Content Types and Languages Co-Located with 42nd European Conference on Information Retrieval, SIIRH@ECIR 2020, Lisbon, Portugal, 14 April 2020. [Google Scholar]
- Li, Y.; Rao, S.; Solares, J.R.A.; Hassaine, A.; Ramakrishnan, R.; Canoy, D.; Zhu, Y.; Rahimi, K.; Salimi-Khorshidi, G. BEHRT: Transformer for Electronic Health Records. Sci. Rep. 2020, 10, 7155. [Google Scholar] [CrossRef]
- Li, Y.; Mamouei, M.; Salimi-Khorshidi, G.; Rao, S.; Hassaine, A.; Canoy, D.; Lukasiewicz, T.; Rahimi, K. Hi-BEHRT: Hierarchical Transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. arXiv 2021, arXiv:2106.11360. [Google Scholar] [CrossRef]
- Taghizadeh, N.; Doostmohammadi, E.; Seifossadat, E.; Rabiee, H.R.; Tahaei, M.S. SINA-BERT: A pre-trained Language Model for Analysis of Medical Texts in Persian. arXiv 2021, arXiv:2104.07613. [Google Scholar]
- Balouchzahi, F.; Sidorov, G.; Shashirekha, H.L. ADOP FERT-Automatic Detection of Occupations and Profession in Medical Texts using Flair and BERT. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) Co-Located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), IberLEF@SEPLN 2021, Malaga, Spain, 21 September 2021. [Google Scholar]
- Kim, Y.; Kim, J.H.; Lee, J.M.; Jang, M.J.; Yum, Y.J.; Kim, S.; Shin, U.; Kim, Y.M.; Joo, H.J.; Song, S. A pre-trained BERT for Korean medical natural language processing. Sci. Rep. 2022, 12, 13847. [Google Scholar] [CrossRef] [PubMed]
- Wada, S.; Takeda, T.; Manabe, S.; Konishi, S.; Kamohara, J.; Matsumura, Y. A pre-training technique to localize medical BERT and enhance BioBERT. arXiv 2020, arXiv:2005.07202. [Google Scholar]
- Mutinda, F.W.; Nigo, S.; Wakamiya, S.; Aramaki, E. Detecting Redundancy in Electronic Medical Records Using Clinical BERT. In Proceedings of the 26th Annual Conference of the Association for Natural Language Processing (NLP2020), Online, 16–19 March 2020; Available online: https://www.anlp.jp/proceedings/annual_meeting/2020/pdf_dir/E3-3.pdf (accessed on 2 October 2022).
- Davari, M.; Kosseim, L.; Bui, T.D. TIMBERT: Toponym Identifier For The Medical Domain Based on BERT. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020. [Google Scholar]
- Wu, Z.L.; Ge, S.; Wu, X. A BERT-Based Framework for Chinese Medical Entity Type Inference. 2020. Available online: https://bj.bcebos.com/v1/conference/ccks2020/eval_paper/ccks2020_eval_paper_1_1_3.pdf (accessed on 2 October 2022).
- Guo, Y.; Ge, Y.; Al-Garadi, M.A.; Sarker, A. Pre-trained Transformer-based Classification and Span Detection Models for Social Media Health Applications. In Proceedings of the Sixth Social Media Mining for Health (SMM4H) Workshop and Shared Task, Mexico City, Mexico, 10 June 2021; pp. 52–57. [Google Scholar]
- Çelikten, A.; Bulut, H. Turkish Medical Text Classification Using BERT. In Proceedings of the 2021 29th Signal Processing and Communications Applications Conference (SIU), Istanbul, Turkey, 9–11 June 2021; pp. 1–4. [Google Scholar]
- Wang, X.; Tao, M.; Wang, R.; Zhang, L. Reduce the medical burden: An automatic medical triage system using text classification BERT based on Transformer structure. In Proceedings of the 2021 2nd International Conference on Big Data and Artificial Intelligence and Software Engineering (ICBASE), Zhuhai, China, 24–26 September 2021; pp. 679–685. [Google Scholar]
- Aji, A.F.; Nityasya, M.N.; Wibowo, H.A.; Prasojo, R.E.; Fatyanosa, T.N. BERT Goes Brrr: A Venture Towards the Lesser Error in Classifying Medical Self-Reporters on Twitter. In Proceedings of the Sixth Social Media Mining for Health (SMM4H) Workshop and Shared Task, Mexico City, Mexico, 10 June 2021; pp. 58–64. [Google Scholar]
- Lahlou, C.; Crayton, A.; Trier, C.; Willett, E.J. Explainable Health Risk Predictor with Transformer-based Medicare Claim Encoder. arXiv 2021, arXiv:2105.09428. [Google Scholar]
- Qin, Q.; Zhao, S.; Liu, C. A BERT-BiGRU-CRF Model for Entity Recognition of Chinese Electronic Medical Records. Complex. 2021, 2021, 6631837:1–6631837:11. [Google Scholar] [CrossRef]
- Li, Z.; Yun, H.; Guo, Z.; Qi, J. Medical Named Entity Recognition Based on Multi Feature Fusion of BERT. In Proceedings of the 2021 4th International Conference on Big Data Technologies, Zibo, China, 24–26 September 2021. [Google Scholar]
- Xue, K.; Zhou, Y.; Ma, Z.; Ruan, T.; Zhang, H.; He, P. Fine-tuning BERT for Joint Entity and Relation Extraction in Chinese Medical Text. In Proceedings of the 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), San Diego, CA, USA, 18–21 November 2019; pp. 892–897. [Google Scholar]
- He, Y.; Zhu, Z.; Zhang, Y.; Chen, Q.; Caverlee, J. Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar]
- Kieuvongngam, V.; Tan, B.; Niu, Y. Automatic Text Summarization of COVID-19 Medical Research Articles using BERT and GPT-2. arXiv 2020, arXiv:2006.01997. [Google Scholar]
- Heo, T.S.; Yoo, Y.; Park, Y.; Jo, B.C. Medical Code Prediction from Discharge Summary: Document to Sequence BERT using Sequence Attention. In Proceedings of the 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), Pasadena, CA, USA, 13–16 December 2021; pp. 1239–1244. [Google Scholar]
- Wang, J.; Zhang, G.; Wang, W.; Zhang, K.; Sheng, Y. Cloud-based intelligent self-diagnosis and department recommendation service using Chinese medical BERT. J. Cloud Comput. 2021, 10, 4. [Google Scholar] [CrossRef]
- Roy, A.; Pan, S. Incorporating medical knowledge in BERT for clinical relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
- Adrian Schiegl, D.T. Disease-Symptom Relation Extraction from Medical Text Corpora with BERT. Available online: https://web.archive.org/web/20210629045352/https://repositum.tuwien.at/bitstream/20.500.12708/17874/1/Schiegl%20Adrian%20-%202021%20-%20Disease-Symptom%20relation%20extraction%20from%20medical%20text...pdf (accessed on 2 October 2022).
- Gao, S.; Du, J.; Zhang, X. Research on Relation Extraction Method of Chinese Electronic Medical Records Based on BERT. In Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence, Tianjin, China, 23–26 April 2020. [Google Scholar]
- Peng, S.; Yuan, K.; Gao, L.; Tang, Z. Mathbert: A pre-trained model for mathematical formula understanding. arXiv 2021, arXiv:2105.00377. [Google Scholar]
- Liu, X.; Yin, D.; Zhang, X.; Su, K.; Wu, K.; Yang, H.; Tang, J. Oag-bert: Pre-train heterogeneous entity-augmented academic language models. arXiv 2021, arXiv:2103.02410. [Google Scholar]
- Liu, H.; Qiu, Q.; Wu, L.; Li, W.; Wang, B.; Zhou, Y. Few-shot learning for name entity recognition in geological text based on GeoBERT. Earth Sci. Inform. 2022, 15, 979–991. [Google Scholar] [CrossRef]
- Xu, Z.; Li, J.; Yang, Z.; Li, S.; Li, H. SwinOCSR: End-to-end optical chemical structure recognition using a Swin Transformer. J. Cheminform. 2022, 14, 41. [Google Scholar] [CrossRef] [PubMed]
- Quatra, M.L.; Cagliero, L. Transformer-based highlights extraction from scientific papers. Knowl. Based Syst. 2022, 252, 109382. [Google Scholar] [CrossRef]
- Glazkova, A.; Glazkov, M. Detecting Generated Scientific Papers using an Ensemble of Transformer Models. arXiv 2022, arXiv:2209.08283. [Google Scholar]
- Balabin, H.; Hoyt, C.T.; Birkenbihl, C.; Gyori, B.M.; Bachman, J.A.; Kodamullil, A.T.; Ploger, P.G.; Hofmann-Apitius, M.; Domingo-Fernandez, D. STonKGs: A sophisticated transformer trained on biomedical text and knowledge graphs. Bioinformatics 2021, 38, 1648–1656. [Google Scholar] [CrossRef] [PubMed]
- Phan, L.; Anibal, J.T.; Tran, H.; Chanana, S.; Bahadroglu, E.; Peltekian, A.; Altan-Bonnet, G. SciFive: A text-to-text transformer model for biomedical literature. arXiv 2021, arXiv:2106.03598. [Google Scholar]
- Parrilla-Gutierrez, J.M. Predicting Real-time Scientific Experiments Using Transformer models and Reinforcement Learning. In Proceedings of the 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), Pasadena, CA, USA, 13–16 December 2021; pp. 502–506. [Google Scholar]
- Ghosh, S.; Chopra, A. Using Transformer based Ensemble Learning to classify Scientific Articles. In Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2021; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12705. [Google Scholar] [CrossRef]
- Zaratiana, U.; Holat, P.; Tomeh, N.; Charnois, T. Hierarchical Transformer Model for Scientific Named Entity Recognition. arXiv 2022, arXiv:2203.14710. [Google Scholar]
- Santosh, T.Y.S.; Chakraborty, P.; Dutta, S.; Sanyal, D.K.; Das, P.P. Joint Entity and Relation Extraction from Scientific Documents: Role of Linguistic Information and Entity Types. In Proceedings of the 2nd Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents (JCDL 2021), Online, IL, USA, 30 September 2021. [Google Scholar]
- Kubal, D.R.; Nagvenkar, A. Effective Ensembling of Transformer based Language Models for Acronyms Identification. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Online, 9 February 2021. [Google Scholar]
- Tian, X.; Wang, J. Retrieval of Scientific Documents Based on HFS and BERT. IEEE Access 2021, 9, 8708–8717. [Google Scholar] [CrossRef]
- Grail, Q. Globalizing BERT-based Transformer Architectures for Long Document Summarization. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 1792–1810. [Google Scholar]
- Leivaditi, S.; Rossi, J.; Kanoulas, E. A benchmark for lease contract review. arXiv 2020, arXiv:2010.10386. [Google Scholar]
- Chalkidis, I.; Fergadiotis, M.; Malakasiotis, P.; Aletras, N.; Androutsopoulos, I. LEGAL-BERT: The muppets straight out of law school. arXiv 2020, arXiv:2010.02559. [Google Scholar]
- Paul, S.; Mandal, A.; Goyal, P.; Ghosh, S. Pre-training Transformers on Indian Legal Text. arXiv 2022, arXiv:2209.06049. [Google Scholar]
- Thanh, N.H.; Nguyen, L.M. Logical Structure-based Pretrained Models for Legal Text Processing. Available online: https://www.scitepress.org/Papers/2022/108520/108520.pdf (accessed on 2 October 2022).
- Savelka, J.; Ashley, K.D. Discovering Explanatory Sentences in Legal Case Decisions Using Pre-trained Language Models. arXiv 2021, arXiv:2112.07165. [Google Scholar]
- Shaheen, Z.; Wohlgenannt, G.; Muromtsev, D. Zero-Shot Cross-Lingual Transfer in Legal Domain Using Transformer Models. In Proceedings of the 2021 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 26–29 July 2021; pp. 450–456. [Google Scholar]
- Garneau, N.; Gaumond, E.; Lamontagne, L.; Deziel, P.L. CriminelBART: A French Canadian legal language model specialized in criminal law. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, São Paulo, Brazil, 21–25 June 2021. [Google Scholar]
- Peric, L.; Mijic, S.; Stammbach, D.; Ash, E. Legal Language Modeling with Transformers. In Proceedings of the Automated Semantic Analysis of Information in Legal Text at 33rd International Conference on Legal Knowledge and Information Systems (ASAIL@JURIX), Online Event, Brno, Czech Republic, 9–11 December 2020. [Google Scholar]
- Cemri, M.; Çukur, T.; Koç, A. Unsupervised Simplification of Legal Texts. arXiv 2022, arXiv:2209.00557. [Google Scholar]
- Klaus, S.; Hecke, R.V.; Naini, K.D.; Altingovde, I.S.; Bernabe-Moreno, J.; Herrera-Viedma, E.E. Summarizing Legal Regulatory Documents using Transformers. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 2426–2430. [Google Scholar]
- Yoon, J.; Junaid, M.; Ali, S.; Lee, J. Abstractive Summarization of Korean Legal Cases using Pre-trained Language Models. In Proceedings of the 2022 16th International Conference on Ubiquitous Information Management and Communication (IMCOM), Seoul, Korea, 3–5 January 2022; pp. 1–7. [Google Scholar]
- Aumiller, D.; Almasian, S.; Lackner, S.; Gertz, M. Structural text segmentation of legal documents. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, São Paulo, Brazil, 21–25 June 2021; pp. 2–11. [Google Scholar]
- Mullick, A.; Nandy, A.; Kapadnis, M.N.; Patnaik, S.; Raghav, R. Fine-grained Intent Classification in the Legal Domain. arXiv 2022, arXiv:2205.03509. [Google Scholar]
- Prasad, N.; Boughanem, M.; Dkaki, T. Effect of Hierarchical Domain-specific Language Models and Attention in the Classification of Decisions for Legal Cases. In Proceedings of the CIRCLE (Joint Conference of the Information Retrieval Communities in Europe), Samatan, Gers, France, 4–7 July 2022. [Google Scholar]
- Nghiem, M.Q.; Baylis, P.; Freitas, A.; Ananiadou, S. Text Classification and Prediction in the Legal Domain. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20–25 June 2022; pp. 4717–4722. [Google Scholar]
- Braun, D.; Matthes, F. Clause Topic Classification in German and English Standard Form Contracts. In Proceedings of the Fifth Workshop on e-Commerce and NLP (ECNLP 5), Online, 26 May 2022. [Google Scholar]
- Papaloukas, C.; Chalkidis, I.; Athinaios, K.; Pantazi, D.A.; Koubarakis, M. Multi-granular Legal Topic Classification on Greek Legislation. arXiv 2021, arXiv:2109.15298. [Google Scholar]
- Bambroo, P.; Awasthi, A. LegalDB: Long DistilBERT for Legal Document Classification. In Proceedings of the 2021 International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT), Bhilai, India, 19–20 February 2021; pp. 1–4. [Google Scholar]
- Shaheen, Z.; Wohlgenannt, G.; Filtz, E. Large Scale Legal Text Classification Using Transformer Models. arXiv 2020, arXiv:2010.12871. [Google Scholar]
- Ni, Z. Key Information Extraction of Food Environmental Safety Criminal Judgment Documents Based on Deep Learning. J. Environ. Public Health 2022, 2022, 4661166. [Google Scholar] [CrossRef]
- Kim, M.Y.; Rabelo, J.; Okeke, K.; Goebel, R. Legal Information Retrieval and Entailment Based on BM25, Transformer and Semantic Thesaurus Methods. Rev. Socionetw. Strateg. 2022, 16, 157–174. [Google Scholar] [CrossRef]
- Trias, F.; Wang, H.; Jaume, S.; Idreos, S. Named Entity Recognition in Historic Legal Text: A Transformer and State Machine Ensemble Method. In Proceedings of the Natural Legal Language Processing Workshop 2021, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
- Thanh, N.H.; Nguyen, P.M.; Vuong, T.H.Y.; Bui, M.Q.; Nguyen, M.C.; Dang, B.; Tran, V.D.; Nguyen, L.M.; Satoh, K. Transformer-Based Approaches for Legal Text Processing. Rev. Socionetw. Strateg. 2022, 16, 135–155. [Google Scholar] [CrossRef]
- Sun, M.; Guo, Z.; Deng, X. Intelligent BERT-BiLSTM-CRF Based Legal Case Entity Recognition Method. In Proceedings of the ACM Turing Award Celebration Conference - China (ACM TURC 2021), Hefei, China, 30 July–1 August 2021; pp. 186–191. [Google Scholar] [CrossRef]
- Caballero, E.Q.; Rahman, M.S.; Cerny, T.; Rivas, P.; Bejarano, G. Study of Question Answering on Legal Software Document using BERT based models. In Proceedings of the LatinX in Natural Language Processing Research Workshop, Seattle, WA, USA, 10 July 2022. [Google Scholar]
- Khazaeli, S.; Punuru, J.; Morris, C.; Sharma, S.; Staub, B.; Cole, M.; Chiu-Webster, S.; Sakalley, D. A Free Format Legal Question Answering System. In Proceedings of the Natural Legal Language Processing Workshop 2021, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
- Vold, A.; Conrad, J.G. Using transformers to improve answer retrieval for legal questions. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, Online, 21–25 June 2021. [Google Scholar]
- Huang, Y.; Shen, X.; Li, C.; Ge, J.; Luo, B. Dependency Learning for Legal Judgment Prediction with a Unified Text-to-Text Transformer. arXiv 2021, arXiv:2112.06370. [Google Scholar]
- Dong, Q.; Niu, S. Legal Judgment Prediction via Relational Learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, 11–15 July 2021. [Google Scholar]
- Sukanya, G.; Priyadarshini, J. A Meta Analysis of Attention Models on Legal Judgment Prediction System. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 2021, 12, 531–538. [Google Scholar] [CrossRef]
- Masala, M.; Iacob, R.C.A.; Uban, A.S.; Cidotã, M.A.; Velicu, H.; Rebedea, T.; Popescu, M.C. jurBERT: A Romanian BERT Model for Legal Judgement Prediction. In Proceedings of the Natural Legal Language Processing Workshop 2021, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
- Salaun, O.; Langlais, P.; Benyekhlef, K. Exploiting Domain-Specific Knowledge for Judgment Prediction Is No Panacea. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), Online, 1–3 September 2021. [Google Scholar]
- Zhu, K.; Guo, R.; Hu, W.; Li, Z.; Li, Y. Legal Judgment Prediction Based on Multiclass Information Fusion. Complexity 2020, 2020, 3089189:1–3089189:12. [Google Scholar] [CrossRef]
- Lian, M.; Li, J. Financial product recommendation system based on transformer. In Proceedings of the 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chongqing, China, 12–14 June 2020; Volume 1, pp. 2547–2551. [Google Scholar]
- Goel, T.; Chauhan, V.; Verma, I.; Dasgupta, T.; Dey, L. TCS WITM 2021 @FinSim-2: Transformer based Models for Automatic Classification of Financial Terms. In Proceedings of the WWW ’21: Companion Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 311–315. [Google Scholar] [CrossRef]
- Yang, L.; Li, J.; Dong, R.; Zhang, Y.; Smyth, B. NumHTML: Numeric-Oriented Hierarchical Transformer Model for Multi-task Financial Forecasting. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022. [Google Scholar]
- Ding, Q.; Wu, S.; Sun, H.; Guo, J.; Guo, J. Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), Special Track on AI in FinTech, Yokohama, Japan, 7–15 January 2021. [Google Scholar]
- Yoo, J.; Soun, Y.; Park, Y.; Kang, U. Accurate Multivariate Stock Movement Prediction via Data-Axis Transformer with Multi-Level Contexts. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Virtual Event, Singapore, 14–18 August 2021. [Google Scholar]
- Hu, J. Local-constraint transformer network for stock movement prediction. Int. J. Comput. Sci. Eng. 2021, 24, 429–437. [Google Scholar] [CrossRef]
- Daiya, D.; Lin, C. Stock Movement Prediction and Portfolio Management via Multimodal Learning with Transformer. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3305–3309. [Google Scholar]
- Caron, M.; Müller, O. Hardening Soft Information: A Transformer-Based Approach to Forecasting Stock Return Volatility. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 4383–4391. [Google Scholar]
- Chen, Q. Stock Movement Prediction with Financial News using Contextualized Embedding from BERT. arXiv 2021, arXiv:2107.08721. [Google Scholar]
- Kim, A.S.; Yoon, S. Corporate Bankruptcy Prediction with BERT Model. In Proceedings of the Third Workshop on Economics and Natural Language Processing, Punta Cana, Dominican Republic, 11 November 2021; pp. 26–36. [Google Scholar]
- Wan, C.X.; Li, B. Financial causal sentence recognition based on BERT-CNN text classification. J. Supercomput. 2022, 78, 6503–6527. [Google Scholar] [CrossRef]
- Arslan, Y.; Allix, K.; Veiber, L.; Lothritz, C.; Bissyande, T.F.; Klein, J.; Goujon, A. A Comparison of Pre-Trained Language Models for Multi-Class Text Classification in the Financial Domain. In Proceedings of the WWW ’21: Companion Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 260–268. [Google Scholar] [CrossRef]
- Zhong, A.; Han, Q. Automated Investor Sentiment Classification using Financial Social Media. In Proceedings of the CONF-CDS 2021: The 2nd International Conference on Computing and Data Science, Stanford, CA, USA, 28–30 January 2021; pp. 356–361. [Google Scholar]
- Chapman, C.; Hillebrand, L.P.; Stenzel, M.R.; Deusser, T.; Biesner, D.; Bauckhage, C.; Sifa, R. Towards Generating Financial Reports from Tabular Data Using Transformers. In Machine Learning and Knowledge Extraction. CD-MAKE 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13480. [Google Scholar] [CrossRef]
- Agrawal, Y.; Anand, V.; Gupta, M.; Arunachalam, S.; Varma, V. Goal-Directed Extractive Summarization of Financial Reports. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Gold Coast, QLD, Australia, 1–5 November 2021. [Google Scholar]
- Singh, A.K. PoinT-5: Pointer Network and T-5 based Financial Narrative Summarisation. arXiv 2020, arXiv:2010.04191. [Google Scholar]
- Li, Q.; Zhang, Q. A Unified Model for Financial Event Classification, Detection and Summarization. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence Special Track on AI in FinTech, Yokohama, Japan, 11–17 July 2020; pp. 4668–4674. [Google Scholar] [CrossRef]
- Kamal, S.; Sharma, S. A Comprehensive Review on Summarizing Financial News Using Deep Learning. arXiv 2021, arXiv:2109.10118. [Google Scholar]
- Zhao, L.; Li, L.; Zheng, X. A BERT based Sentiment Analysis and Key Entity Detection Approach for Online Financial Texts. In Proceedings of the 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), Dalian, China, 5–7 May 2021; pp. 1233–1238. [Google Scholar]
- Hiew, J.Z.G.; Huang, X.; Mou, H.; Li, D.; Wu, Q.; Xu, Y. BERT-based Financial Sentiment Index and LSTM-based Stock Return Predictability. arXiv 2019, arXiv:1906.09024. [Google Scholar]
- Salunkhe, A.; Mhaske, S. Aspect Based Sentiment Analysis on Financial Data using Transferred Learning Approach using Pre-Trained BERT and Regressor Model. Int. Res. J. Eng. Technol. (IRJET) 2019, 6, 1097–1101. [Google Scholar]
- Qian, T.; Xie, A.; Bruckmann, C. Sensitivity Analysis on Transferred Neural Architectures of BERT and GPT-2 for Financial Sentiment Analysis. arXiv 2022, arXiv:2207.03037. [Google Scholar]
- Ghosh, S.; Naskar, S.K. FiNCAT: Financial Numeral Claim Analysis Tool. In Proceedings of the Companion Proceedings of the Web Conference 2022, Virtual Event, Lyon, France, 25–29 April 2022; pp. 583–585. [Google Scholar] [CrossRef]
- Soong, G.H.; Tan, C.C. Sentiment Analysis on 10-K Financial Reports using Machine Learning Approaches. In Proceedings of the 2021 IEEE 11th International Conference on System Engineering and Technology (ICSET), Shah Alam, Malaysia, 6 November 2021; pp. 124–129. [Google Scholar]
- Gutiérrez-Fandiño, A.; Noguer i Alonso, M.; Kolm, P.N.; Armengol-Estapé, J. FinEAS: Financial Embedding Analysis of Sentiment. J. Financ. Data Sci. 2022. [Google Scholar] [CrossRef]
- Mansar, Y.; Kang, J.; Maarouf, I.E. The FinSim-2 2021 Shared Task: Learning Semantic Similarities for the Financial Domain. In Proceedings of the Companion Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021. [Google Scholar]
- Li, M.; Chen, L.; Zhao, J.; Li, Q. Sentiment analysis of Chinese stock reviews based on BERT model. Appl. Intell. 2021, 51, 5016–5024. [Google Scholar] [CrossRef]
- Li, M.; Chen, L.; Zhao, J.; Li, Q. A Chinese Stock Reviews Sentiment Analysis Based on BERT Model. 2020. Available online: https://www.researchsquare.com/article/rs-69958/latest (accessed on 2 October 2022).
- Hillebrand, L.P.; Deusser, T.; Khameneh, T.D.; Kliem, B.; Loitz, R.; Bauckhage, C.; Sifa, R. KPI-BERT: A Joint Named Entity Recognition and Relation Extraction Model for Financial Reports. arXiv 2022, arXiv:2208.02140. [Google Scholar]
- Liao, L.; Yang, C. Enterprise risk information extraction based on BERT. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 15–17 April 2022; pp. 1454–1457. [Google Scholar]
- Cao, L.; Zhang, S.; Chen, J. CBCP: A Method of Causality Extraction from Unstructured Financial Text. In Proceedings of the 2021 5th International Conference on Natural Language Processing and Information Retrieval (NLPIR), Sanya, China, 17–20 December 2021. [Google Scholar]
- Zhang, Y.; Zhang, H. FinBERT-MRC: Financial named entity recognition using BERT under the machine reading comprehension paradigm. arXiv 2022, arXiv:2205.15485. [Google Scholar]
- Loukas, L.; Fergadiotis, M.; Chalkidis, I.; Spyropoulou, E.; Malakasiotis, P.; Androutsopoulos, I.; Paliouras, G. FiNER: Financial Numeric Entity Recognition for XBRL Tagging. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022. [Google Scholar]
- Reyes, D.; Barcelos, A.; Vieira, R.; Manssour, I.H. Related Named Entities Classification in the Economic-Financial Context. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation, Online, 19 April 2021; pp. 8–15. [Google Scholar]
- Liang, Y.C.; Chen, M.; Yeh, W.C.; Chang, Y.C. Numerical Relation Detection in Financial Tweets using Dependency-aware Deep Neural Network. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021), Taoyuan, Taiwan, 15–16 October 2021; pp. 218–225. [Google Scholar]
- Sangaraju, V.R.; Bolla, B.K.; Nayak, D.; Kh, J. Topic Modelling on Consumer Financial Protection Bureau Data: An Approach Using BERT Based Embeddings. In Proceedings of the 2022 IEEE 7th International conference for Convergence in Technology (I2CT), Mumbai, India, 7–9 April 2022; pp. 1–6. [Google Scholar]
- Wang, Z.; Liu, Z.; Luo, L.; Chen, X. A Multi-Neural Network Fusion Based Method for Financial Event Subject Extraction. In Proceedings of the 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), Shenzhen, China, 24–26 April 2020; pp. 360–365. [Google Scholar]
- Lin, H.; Wu, J.S.; Huang, Y.S.; Tsai, M.F.; Wang, C.J. NFinBERT: A Number-Aware Language Model for Financial Disclosures (short paper). In Proceedings of the Swiss Text Analytics Conference 2021, Online, Winterthur, Switzerland, 14–16 June 2021. [Google Scholar]
- Liu, Z.; Huang, D.; Huang, K.; Li, Z.; Zhao, J. FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, Special Track on AI in FinTech, Yokohama, Japan, 7–15 January 2021; pp. 4513–4519. [Google Scholar]
- Lu, Q.; Zhang, H.; Kinawi, H.; Niu, D. Self-Attentive Models for Real-Time Malware Classification. IEEE Access 2022, 10, 95970–95985. [Google Scholar] [CrossRef]
- Ameri, K.; Hempel, M.; Sharif, H.R.; Lopez, J.; Perumalla, K.S. CyBERT: Cybersecurity Claim Classification by Fine-Tuning the BERT Language Model. J. Cybersecur. Priv. 2021, 1, 615–637. [Google Scholar] [CrossRef]
- Ampel, B.; Samtani, S.; Ullman, S.; Chen, H. Linking Common Vulnerabilities and Exposures to the MITRE ATT&CK Framework: A Self-Distillation Approach. arXiv 2021, arXiv:2108.01696. [Google Scholar]
- Rahali, A.; Akhloufi, M.A. MalBERT: Malware Detection using Bidirectional Encoder Representations from Transformers. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, Australia, 17–20 October 2021; pp. 3226–3231. [Google Scholar]
- Kale, A.S.; Pandya, V.; Troia, F.D.; Stamp, M. Malware classification with Word2Vec, HMM2Vec, BERT, and ELMo. J. Comput. Virol. Hacking Tech. 2022. [Google Scholar] [CrossRef]
- Yesir, S.; Sogukpinar, I. Malware Detection and Classification Using fastText and BERT. In Proceedings of the 2021 9th International Symposium on Digital Forensics and Security (ISDFS), Elazig, Turkey, 28–29 June 2021; pp. 1–6. [Google Scholar]
- Jahromi, M.Z.; Jahromi, A.A.; Kundur, D.; Sanner, S.; Kassouf, M. Data analytics for cybersecurity enhancement of transformer protection. ACM Sigenergy Energy Inform. Rev. 2021, 1, 12–19. [Google Scholar] [CrossRef]
- Jahromi, M.Z.; Jahromi, A.A.; Sanner, S.; Kundur, D.; Kassouf, M. Cybersecurity Enhancement of Transformer Differential Protection Using Machine Learning. In Proceedings of the 2020 IEEE Power and Energy Society General Meeting (PESGM), Virtual Event, 3–6 August 2020; pp. 1–5. [Google Scholar]
- Liu, Y.; Pan, S.; Wang, Y.G.; Xiong, F.; Wang, L.; Lee, V.C.S. Anomaly Detection in Dynamic Graphs via Transformer. arXiv 2021, arXiv:2106.09876. [Google Scholar] [CrossRef]
- Lin, L.H.; Hsiao, S.W. Attack Tactic Identification by Transfer Learning of Language Model. arXiv 2022, arXiv:2209.00263. [Google Scholar]
- Ghourabi, A. A Security Model Based on LightGBM and Transformer to Protect Healthcare Systems From Cyberattacks. IEEE Access 2022, 10, 48890–48903. [Google Scholar] [CrossRef]
- Hemalatha, J.; Roseline, S.A.; Geetha, S.; Kadry, S.N.; Damavsevicius, R. An Efficient DenseNet-Based Deep Learning Model for Malware Detection. Entropy 2021, 23, 344. [Google Scholar] [CrossRef] [PubMed]
- Ranade, P.; Piplai, A.; Mittal, S.; Joshi, A.; Finin, T. Generating Fake Cyber Threat Intelligence Using Transformer-Based Models. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–9. [Google Scholar]
- Alam, M.T.; Bhusal, D.; Park, Y.; Rastogi, N. CyNER: A Python Library for Cybersecurity Named Entity Recognition. arXiv 2022, arXiv:2204.05754. [Google Scholar]
- Evangelatos, P.; Iliou, C.; Mavropoulos, T.; Apostolou, K.; Tsikrika, T.; Vrochidis, S.; Kompatsiaris, Y. Named Entity Recognition in Cyber Threat Intelligence Using Transformer-based Models. In Proceedings of the 2021 IEEE International Conference on Cyber Security and Resilience (CSR), Rhodes, Greece, 26–28 July 2021; pp. 348–353. [Google Scholar]
- Lai, G.; Xie, Q.; Liu, H.; Yang, Y.; Hovy, E. Race: Large-scale reading comprehension dataset from examinations. arXiv 2017, arXiv:1704.04683. [Google Scholar]
- Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
- Warstadt, A.; Singh, A.; Bowman, S.R. Neural Network Acceptability Judgments. arXiv 2018, arXiv:1805.12471. [Google Scholar] [CrossRef]
- Dolan, W.B.; Brockett, C. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), Jeju Island, Korea, 14 October 2005. [Google Scholar]
- Iyer, S.; Dandekar, N.; Csernai, K. First Quora Dataset Release: Question Pairs. 18 May 2017. Available online: https://karthikrevanuru.github.io/assets/documents/projects/Quora_Pairs.pdf (accessed on 2 October 2022).
- Williams, A.; Nangia, N.; Bowman, S.R. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 1112–1122. [Google Scholar]
- Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. arXiv 2015, arXiv:1508.05326. [Google Scholar]
- Levesque, H.; Davis, E.; Morgenstern, L. The winograd schema challenge. In Proceedings of the Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, Rome, Italy, 10–14 June 2012. [Google Scholar]
- Dagan, I.; Glickman, O.; Magnini, B. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges Workshop; Springer: Berlin/Heidelberg, Germany, 2005; pp. 177–190. [Google Scholar] [CrossRef]
- Cer, D.; Diab, M.; Agirre, E.; Lopez-Gazpio, I.; Specia, L. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv 2017, arXiv:1708.00055. [Google Scholar]
- Wang, A.; Pruksachatkun, Y.; Nangia, N.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. Superglue: A stickier benchmark for general-purpose language understanding systems. In Proceedings of the NIPS’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Article No.: 294. pp. 3266–3280. [Google Scholar]
- Diebold, F.X.; Mariano, R.S. Comparing Predictive Accuracy. J. Bus. Econ. Stat. 2002, 20, 134–144. [Google Scholar] [CrossRef]
- Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar]
- Woods, K.S.; Bowyer, K. Generating ROC curves for artificial neural networks. IEEE Trans. Med. Imaging 1997, 16, 329–337. [Google Scholar] [CrossRef] [Green Version]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef] [Green Version]
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
- Jelinek, F.; Mercer, R.L.; Bahl, L.R.; Baker, J. Perplexity—A measure of the difficulty of speech recognition tasks. J. Acoust. Soc. Am. 1977, 62. [Google Scholar] [CrossRef] [Green Version]
- Li, J.; Galley, M.; Brockett, C.; Gao, J.; Dolan, W.B. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016. [Google Scholar]
- Kusner, M.J.; Sun, Y.; Kolkin, N.I.; Weinberger, K.Q. From Word Embeddings To Document Distances. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; Volume 37, pp. 957–966. [Google Scholar]
- Lo, C. MEANT 2.0: Accurate semantic MT evaluation for any output language. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, 7 September 2017; pp. 589–597. [Google Scholar]
- Yujian, L.; Bo, L. A Normalized Levenshtein Distance Metric. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1091–1095. [Google Scholar] [CrossRef] [PubMed]
- Liu, C.; Dahlmeier, D.; Ng, H.T. TESLA: Translation Evaluation of Sentences with Linear-Programming-Based Analysis. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics (MATR), Uppsala, Sweden, 15–16 July 2010; pp. 354–359. [Google Scholar]
- Agirre, E.; Gonzalez-Agirre, A.; Lopez-Gazpio, I.; Maritxalar, M.; Rigau, G.; Uria, L. SemEval-2016 Task 2: Interpretable Semantic Textual Similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, CA, USA, 16–17 June 2016; pp. 512–524. [Google Scholar]
- Huang, P.S.; He, X.; Gao, J.; Deng, L.; Acero, A.; Heck, L. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information and Knowledge Management, San Francisco, CA, USA, 27 October–1 November 2013. [Google Scholar]
- Agarwal, A.; Lavie, A. Meteor, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output. In Natural Language Processing and Information Systems. NLDB 2009; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2008; Volume 5723. [Google Scholar] [CrossRef]
- Popovic, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; pp. 392–395. [Google Scholar]
- Lo, C. Extended Study on Using Pretrained Language Models and YiSi-1 for Machine Translation Evaluation. In Proceedings of the Fifth Conference on Machine Translation, Online, 19–20 November 2020; pp. 895–902. [Google Scholar]
- Lo, C.; Larkin, S. Machine Translation Reference-less Evaluation using YiSi-2 with Bilingual Mappings of Massive Multilingual Language Model. In Proceedings of the Fifth Conference on Machine Translation, Online, 19–20 November 2020; pp. 903–910. [Google Scholar]
- Chen, Q.; Zhu, X.D.; Ling, Z.; Wei, S.; Jiang, H.; Inkpen, D. Enhanced LSTM for Natural Language Inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1657–1668. [Google Scholar]
- Och, F.J. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, 7–12 July 2003; pp. 160–167. [Google Scholar]
- Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1965, 10, 707–710. [Google Scholar]
- Doddington, G.R.; Przybocki, M.A.; Martin, A.F.; Reynolds, D.A. The NIST speaker recognition evaluation—Overview, methodology, systems, results, perspective. Speech Commun. 2000, 31, 225–254. [Google Scholar] [CrossRef]
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
- Mauser, A.; Hasan, S.; Ney, H. Automatic Evaluation Measures for Statistical Machine Translation System Optimization. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, 28–30 May 2008. [Google Scholar]
- Mathur, N.; Baldwin, T.; Cohn, T. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics. arXiv 2020, arXiv:2006.06264. [Google Scholar]
- Velankar, A.; Patil, H.; Joshi, R. Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi. arXiv 2022, arXiv:2204.08669. [Google Scholar]
- Alammary, A.S. BERT Models for Arabic Text Classification: A Systematic Review. Appl. Sci. 2022, 12, 5720. [Google Scholar] [CrossRef]
- Dai, X.; Chalkidis, I.; Darkner, S.; Elliott, D. Revisiting Transformer-based Models for Long Document Classification. arXiv 2022, arXiv:2204.06683. [Google Scholar]
- Hamid, A.; Ahmad, R.M.T.R.L.; Hassan, W.A.W.; Majudi, S.; Fahmy, S. Text Classification on Social Media using Bidirectional Encoder Representations from Transformers (BERT) for Zakat Sentiment Analysis. Int. J. Synerg. Eng. Technol. 2022, 3, 79–87. [Google Scholar]
- Li, Z.; Si, S.; Wang, J.; Xiao, J. Federated Split BERT for Heterogeneous Text Classification. arXiv 2022, arXiv:2205.13299. [Google Scholar]
- Rahali, A.; Akhloufi, M.A. MalBERT: Using Transformers for Cybersecurity and Malicious Software Detection. arXiv 2021, arXiv:2103.03806. [Google Scholar]
- Tezgider, M.; Yildiz, B.; Aydin, G. Text classification using improved bidirectional transformer. Concurr. Comput. Pract. Exp. 2022, 34, e6486. [Google Scholar] [CrossRef]
- Zhang, Y.; Sun, S.; Galley, M.; Chen, Y.C.; Brockett, C.; Gao, X.; Gao, J.; Liu, J.; Dolan, B. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv 2019, arXiv:1911.00536. [Google Scholar]
- Peng, B.; Li, C.; Li, J.; Shayandeh, S.; Liden, L.; Gao, J. Soloist: Few-shot task-oriented dialog with a single pre-trained auto-regressive model. arXiv 2020, arXiv:2005.05298. [Google Scholar]
- Lamsiyah, S.; Mahdaouy, A.E.; Ouatik, S.E.A.; Espinasse, B. Unsupervised extractive multi-document summarization method based on transfer learning from BERT multi-task fine-tuning. J. Inf. Sci. 2021, 0165551521990616. [Google Scholar] [CrossRef]
- Khandelwal, U.; Clark, K.; Jurafsky, D.; Kaiser, L. Sample efficient text summarization using a single pre-trained transformer. arXiv 2019, arXiv:1905.08836. [Google Scholar]
- Liu, Y.; Lapata, M. Text summarization with pretrained encoders. arXiv 2019, arXiv:1908.08345. [Google Scholar]
- Zhang, H.; Xu, J.; Wang, J. Pretraining-based natural language generation for text summarization. arXiv 2019, arXiv:1902.09243. [Google Scholar]
- Reda, A.; Salah, N.; Adel, J.; Ehab, M.; Ahmed, I.; Magdy, M.; Khoriba, G.; Mohamed, E.H. A Hybrid Arabic Text Summarization Approach based on Transformers. In Proceedings of the IEEE 2022 2nd International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), Cairo, Egypt, 8–9 May 2022; pp. 56–62. [Google Scholar]
- Dathathri, S.; Madotto, A.; Lan, J.; Hung, J.; Frank, E.; Molino, P.; Yosinski, J.; Liu, R. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. arXiv 2020, arXiv:1912.02164. [Google Scholar]
- Wang, T.; Wan, X.; Jin, H. AMR-To-Text Generation with Graph Transformer. Trans. Assoc. Comput. Linguist. 2020, 8, 19–33. [Google Scholar] [CrossRef]
- Zhao, K.; Ding, H.; Ye, K.; Cui, X. A Transformer-Based Hierarchical Variational AutoEncoder Combined Hidden Markov Model for Long Text Generation. Entropy 2021, 23, 1277. [Google Scholar] [CrossRef] [PubMed]
- Diao, S.; Shen, X.; Shum, K.; Song, Y.; Zhang, T. TILGAN: Transformer-based Implicit Latent GAN for Diverse and Coherent Text Generation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 4844–4858. [Google Scholar]
- Chan, A.; Ong, Y.; Pung, B.T.W.; Zhang, A.; Fu, J. CoCon: A Self-Supervised Approach for Controlled Text Generation. arXiv 2021, arXiv:2006.03535. [Google Scholar]
- Wang, Y.; Xu, C.; Hu, H.; Tao, C.; Wan, S.; Dras, M.; Johnson, M.; Jiang, D. Neural Rule-Execution Tracking Machine For Transformer-Based Text Generation. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online, 6–14 December 2021. [Google Scholar]
- Kulkarni, A.; Shivananda, A.; Kulkarni, A. Named-Entity Recognition Using CRF and BERT. In Natural Language Processing Projects; Apress: Berkeley, CA, USA, 2021. [Google Scholar] [CrossRef]
- Li, X.; Yan, H.; Qiu, X.; Huang, X. Flat: Chinese ner using flat-lattice transformer. arXiv 2020, arXiv:2004.11795. [Google Scholar]
- Ma, L.; Jian, X.; Li, X. PAI at SemEval-2022 Task 11: Name Entity Recognition with Contextualized Entity Representations and Robust Loss Functions. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Seattle, WA, USA, 14–15 July 2022; pp. 1665–1670. [Google Scholar] [CrossRef]
- Jarrar, M.; Khalilia, M.; Ghanem, S. Wojood: Nested arabic named entity corpus and recognition using bert. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20–25 June 2022. [Google Scholar]
- Yu, Y.; Wang, Y.; Mu, J.; Li, W.; Jiao, S.; Wang, Z.; Lv, P.; Zhu, Y. Chinese mineral named entity recognition based on BERT model. J. Expert Syst. Appl. 2022, 206, 117727. [Google Scholar] [CrossRef]
- Wu, S.; Song, X.; Feng, Z. Mect: Multi-metadata embedding based cross-transformer for chinese named entity recognition. arXiv 2021, arXiv:2107.05418. [Google Scholar]
- Xuan, Z.; Bao, R.; Jiang, S. FGN: Fusion glyph network for Chinese named entity recognition. In China Conference on Knowledge Graph and Semantic Computing; Springer: Berlin/Heidelberg, Germany, 2020; pp. 28–40. [Google Scholar]
- Sehanobish, A.; Song, C.H. Using chinese glyphs for named entity recognition. arXiv 2019, arXiv:1909.09922. [Google Scholar]
- Jibril, E.C.; Tantuğ, A.C. ANEC: An Amharic Named Entity Corpus and Transformer Based Recognizer. arXiv 2022, arXiv:2207.00785. [Google Scholar]
- Schneider, E.; Rivera-Zavala, R.M.; Martinez, P.; Moro, C.; Paraiso, E.C. UC3M-PUCPR at SemEval-2022 Task 11: An Ensemble Method of Transformer-based Models for Complex Named Entity Recognition. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Seattle, WA, USA, 14–15 July 2022; pp. 1448–1456. [Google Scholar]
- He, J.; Uppal, A.; Mamatha, N.; Vignesh, S.; Kumar, D.; Sarda, A.K. Infrrd. ai at SemEval-2022 Task 11: A system for named entity recognition using data augmentation, transformer-based sequence labeling model, and EnsembleCRF. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Seattle, WA, USA, 14–15 July 2022; pp. 1501–1510. [Google Scholar]
- Ren, K.; Li, H.; Zeng, Y.; Zhang, Y. Named Entity Recognition with CRF Based on ALBERT: A Natural Language Processing Model. In China Conference on Command and Control; Springer: Berlin/Heidelberg, Germany, 2022; pp. 498–507. [Google Scholar]
- Basmatkar, P.; Maurya, M. An Overview of Contextual Topic Modeling Using Bidirectional Encoder Representations from Transformers. In Proceedings of the Third International Conference on Communication, Computing and Electronics Systems; Lecture Notes in Electrical Engineering; Springer: Singapore, 2022; Volume 844. [Google Scholar] [CrossRef]
- Alcoforado, A.; Ferraz, T.P.; Gerber, R.; Bustos, E.; Oliveira, A.S.; Veloso, B.M.; Siqueira, F.L.; Costa, A.H.R. ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling. In Computational Processing of the Portuguese Language. PROPOR 2022; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13208. [Google Scholar] [CrossRef]
- Baird, A.; Xia, Y.; Cheng, Y. Consumer perceptions of telehealth for mental health or substance abuse: A Twitter-based topic modeling analysis. JAMIA Open 2022, 5, ooac028. [Google Scholar] [CrossRef]
- Elaffendi, M.; Alrajhi, K. Beyond the Transformer: A Novel Polynomial Inherent Attention (PIA) Model and Its Great Impact on Neural Machine Translation. Comput. Intell. Neurosci. 2022. [Google Scholar] [CrossRef]
- Li, D.; Luo, Z. An Improved Transformer-Based Neural Machine Translation Strategy: Interacting-Head Attention. Comput. Intell. Neurosci. 2022, 2022, 2998242. [Google Scholar] [CrossRef]
- Dione, C.M.B.; Lo, A.; Nguer, E.M.; Oumar, S. Low-resource Neural Machine Translation: Benchmarking State-of-the-art Transformer for Wolof French. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), Marseille, France, 20–25 June 2022; pp. 6654–6661. [Google Scholar]
- Tho, C.; Heryadi, Y.; Kartowisastro, I.H.; Budiharto, W. A Comparison of Lexicon-based and Transformer-based Sentiment Analysis on Code-mixed of Low-Resource Languages. In Proceedings of the 2021 1st International Conference on Computer Science and Artificial Intelligence (ICCSAI), Jakarta, Indonesia, 28 October 2021; Volume 1, pp. 81–85. [Google Scholar]
- Fu, Q.; Teng, Z.; White, J.; Schmidt, D.C. A Transformer-based Approach for Translating Natural Language to Bash Commands. In Proceedings of the 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), Pasadena, CA, USA, 13–16 December 2021; pp. 1245–1248. [Google Scholar]
- Zhao, L.; Gao, W.; Fang, J. High-Performance English–Chinese Machine Translation Based on GPU-Enabled Deep Neural Networks with Domain Corpus. Appl. Sci. 2021, 11, 10915. [Google Scholar] [CrossRef]
- Ali, Z. Research Chinese-Urdu Machine Translation Based on Deep Learning. J. Auton. Intell. 2021, 3, 34–44. [Google Scholar] [CrossRef]
- Jing, H.; Yang, C. Chinese text sentiment analysis based on transformer model. In Proceedings of the 2022 3rd International Conference on Electronic Communication and Artificial Intelligence (IWECAI), Zhuhai, China, 14–16 January 2022; pp. 185–189. [Google Scholar]
- Tiwari, D.; Nagpal, B. KEAHT: A Knowledge-Enriched Attention-Based Hybrid Transformer Model for Social Sentiment Analysis. New Gener. Comput. 2022, 40, 1165–1202. [Google Scholar] [CrossRef] [PubMed]
- Potamias, R.A.; Siolas, G.; Stafylopatis, A. A transformer-based approach to irony and sarcasm detection. Neural Comput. Appl. 2020, 32, 17309–17320. [Google Scholar] [CrossRef]
- Mandal, R.; Chen, J.; Becken, S.; Stantic, B. Tweets Topic Classification and Sentiment Analysis based on Transformer-based Language Models. Vietnam. J. Comput. Sci. 2022. [Google Scholar] [CrossRef]
- Zhao, T.; Du, J.; Xue, Z.; Li, A.; Guan, Z. Aspect-Based Sentiment Analysis using Local Context Focus Mechanism with DeBERTa. arXiv 2022, arXiv:2207.02424. [Google Scholar]
- Kokab, S.T.; Asghar, S.; Naz, S. Transformer-based deep learning models for the sentiment analysis of social media data. Array 2022, 14, 100157. [Google Scholar] [CrossRef]
- Ashok Kumar, J.; Cambria, E.; Trueman, T.E. Transformer-Based Bidirectional Encoder Representations for Emotion Detection from Text. In Proceedings of the 2021 IEEE Symposium Series on Computational Intelligence (SSCI), Orlando, FL, USA, 5–7 December 2021; pp. 1–6. [Google Scholar]
- Yue, T.; Jing, M. Sentiment Analysis Based on Bert and Transformer. In Springer Proceedings in Business and Economics; Springer: Cham, Switzerland, 2022. [Google Scholar] [CrossRef]
- Fedus, W.; Zoph, B.; Shazeer, N.M. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv 2021, arXiv:2101.03961. [Google Scholar]
- Ontanon, S.; Ainslie, J.; Cvicek, V.; Fisher, Z. Making transformers solve compositional tasks. arXiv 2021, arXiv:2108.04378. [Google Scholar]
- Li, Z.; Wallace, E.; Shen, S.; Lin, K.; Keutzer, K.; Klein, D.; Gonzalez, J. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. arXiv 2020, arXiv:2002.11794. [Google Scholar]
- Child, R.; Gray, S.; Radford, A.; Sutskever, I. Generating long sequences with sparse transformers. arXiv 2019, arXiv:1904.10509. [Google Scholar]
- Ye, Z.; Guo, Q.; Gan, Q.; Qiu, X.; Zhang, Z. Bp-transformer: Modelling long-range context via binary partitioning. arXiv 2019, arXiv:1911.04070. [Google Scholar]
- Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451. [Google Scholar]
- Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking Attention with Performers. arXiv 2020, arXiv:2009.14794. [Google Scholar]
- Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In Proceedings of the ICML’20: Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual, 13–18 July 2020; Article No.: 478. pp. 5156–5165. [Google Scholar]
- Su, J.; Lu, Y.; Pan, S.; Wen, B.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. arXiv 2021, arXiv:2104.09864. [Google Scholar]
- Zhang, T.; Wu, F.; Katiyar, A.; Weinberger, K.Q.; Artzi, Y. Revisiting few-sample BERT fine-tuning. arXiv 2020, arXiv:2006.05987. [Google Scholar]
- Chang, P. Advanced Techniques for Fine-Tuning Transformers. 2021. Available online: https://towardsdatascience.com/advanced-techniques-for-fine-tuning-transformers-82e4e61e16e (accessed on 2 October 2022).
- Singh, T.; Giovanardi, D. How much does pre-trained information help? Partially re-initializing BERT during fine-tuning to analyze the contribution of layers. In Stanford CS224N Natural Language Processing with Deep Learning. 2020. Available online: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/reports/custom/report13.pdf (accessed on 2 October 2022).
- Li, Y.; Lin, Y.; Xiao, T.; Zhu, J. An Efficient Transformer Decoder with Compressed Sub-layers. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Virtual, 2–9 February 2021. [Google Scholar]
- Song, Y.; Wang, J.; Liang, Z.; Liu, Z.; Jiang, T. Utilizing BERT Intermediate Layers for Aspect Based Sentiment Analysis and Natural Language Inference. arXiv 2020, arXiv:2002.04815. [Google Scholar]
- Zou, W.; Ding, J.; Wang, C. Utilizing BERT Intermediate Layers for Multimodal Sentiment Analysis. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
- Evci, U.; Dumoulin, V.; Larochelle, H.; Mozer, M.C. Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning. In Proceedings of the 39th International Conference on Machine Learning, PMLR (2022), Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
- Lewkowycz, A. How to decay your learning rate. arXiv 2021, arXiv:2103.12682. [Google Scholar]
- Lee, C.; Cho, K.; Kang, W. Mixout: Effective regularization to finetune large-scale pretrained language models. arXiv 2019, arXiv:1909.11299. [Google Scholar]
- Baldi, P.; Sadowski, P.J. Understanding dropout. In Proceedings of the NIPS’13: Proceedings of the 26th International Conference on Neural Information Processing Systems-Volume 2, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2814–2822. [Google Scholar]
- Wan, L.; Zeiler, M.; Zhang, S.; Le Cun, Y.; Fergus, R. Regularization of neural networks using dropconnect. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013; pp. 1058–1066. [Google Scholar]
- Hua, H.; Li, X.; Dou, D.; Xu, C.; Luo, J. Fine-tuning Pre-trained Language Models with Noise Stability Regularization. arXiv 2022, arXiv:2206.05658. [Google Scholar]
- Ishii, M.; Sato, A. Layer-wise weight decay for deep neural networks. In Pacific-Rim Symposium on Image and Video Technology; Springer: Berlin/Heidelberg, Germany, 2017; pp. 276–289. [Google Scholar]
- Yu, H.; Cao, Y.; Cheng, G.; Xie, P.; Yang, Y.; Yu, P. Relation Extraction with BERT-based Pre-trained Model. In Proceedings of the 2020 International Wireless Communications and Mobile Computing (IWCMC), Limassol, Cyprus, 15–19 June 2020; pp. 1382–1387. [Google Scholar]
- Cao, Q.; Trivedi, H.; Balasubramanian, A.; Balasubramanian, N. DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering. arXiv 2020, arXiv:2005.00697. [Google Scholar]
- Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D.; Wilson, A.G. Averaging weights leads to wider optima and better generalization. arXiv 2018, arXiv:1803.05407. [Google Scholar]
- Khurana, U.; Nalisnick, E.T.; Fokkens, A. How Emotionally Stable is ALBERT? Testing Robustness with Stochastic Weight Averaging on a Sentiment Analysis Task. arXiv 2021, arXiv:2111.09612. [Google Scholar]
- Smith, S.L.; Kindermans, P.J.; Ying, C.; Le, Q.V. Don’t decay the learning rate, increase the batch size. arXiv 2017, arXiv:1711.00489. [Google Scholar]
- Dong, C.; Wang, G.; Xu, H.; Peng, J.; Ren, X.; Liang, X. EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Online, Punta Cana, Dominican Republic, 7–11 November 2021. [Google Scholar]
- Liu, C.L.; Hsu, T.Y.; Chuang, Y.S.; Lee, H.Y. A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT. arXiv 2020, arXiv:2004.09205. [Google Scholar]
- Lauscher, A.; Ravishankar, V.; Vulic, I.; Glavas, G. From Zero to Hero: On the Limitations of Zero-Shot Language Transfer with Multilingual Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020. [Google Scholar]
Type | Alignment | Score Function
---|---|---
Multiplicative Attention | Dot Product | $score(q,k) = q^{\top}k$
 | Scaled Dot Product | $score(q,k) = \frac{q^{\top}k}{\sqrt{d_k}}$
 | Cosine Similarity | $score(q,k) = \frac{q^{\top}k}{\lVert q \rVert\,\lVert k \rVert}$
Additive Attention | Concat | $score(q,k) = v^{\top}\tanh\left(W[q;k]\right)$
 | Additive | $score(q,k) = v^{\top}\tanh\left(W_1 q + W_2 k\right)$
Scoring Attention | General | $score(q,k) = q^{\top}Wk$
 | Biased General | $score(q,k) = k^{\top}Wq + b$
 | Activated General | $score(q,k) = act\left(k^{\top}Wq + b\right)$
Specific Attention | Location Based | $score(q) = Wq$ (the alignment depends only on the target position)
 | Feature Based | $score(q,k) = v^{\top}act\left(W_1\phi_1(k) + W_2\phi_2(k) + b\right)$
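For concreteness, the multiplicative (scaled dot-product) alignment above can be written out in a few lines of NumPy. This is a minimal sketch with illustrative shapes and variable names (Q, K, V, d_k), not an excerpt from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot-product alignment scores
    weights = softmax(scores, axis=-1)   # attention distribution over the keys
    return weights @ V                   # weighted sum of the values

# toy usage
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 8))
out = scaled_dot_product_attention(Q, K, V)   # shape (2, 8)
```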
Abbreviations | Objective Task | Loss Function | Design Highlights | Example
---|---|---|---|---
CLM | Causal Language Modeling | $\mathcal{L}_{CLM} = -\sum_{i=1}^{N}\log P(x_i \mid x_{<i})$, where N represents the number of tokens in the sequence. | Token-Level, Predict Next Token, Unidirectional, Sample Efficient | GPT1 [43], UniLM [62]
MLM | Masked Language Modeling | $\mathcal{L}_{MLM} = -\sum_{i \in M_x}\log P(x_i \mid \hat{x})$, where $M_x$ represents the masked token positions in x and $\hat{x}$ is the masked version of x. | Token-Level, Token Masking, Bidirectional, Not Sample Efficient | BERT [3], RoBERTa [57]
RTD | Replaced Token Detection | $\mathcal{L}_{RTD} = -\sum_{i=1}^{N}\log P(d_i \mid \hat{x})$, where $d_i$ represents whether the token at position i is replaced or not and $\hat{x}$ is the corrupted sequence. | Token-Level, Bidirectional, Token Replacement, Sample Efficient | ELECTRA [63]
STD | Shuffled Token Detection | $\mathcal{L}_{STD} = -\sum_{i=1}^{N}\big[d_i\log D(\hat{x},i) + (1-d_i)\log\big(1-D(\hat{x},i)\big)\big]$ with $D(\hat{x},i) = \sigma\big(W_2\,g(W_1 h_i + b_1) + b_2\big)$, where $d_i$ indicates whether the token at position i is shuffled, $h_i$ is the encoder output at position i, $W_1$, $b_1$, $W_2$ and $b_2$ are linear-layer parameters and g is the activation function [53]. | Token-Level, Token Shuffling, Bidirectional, Sample Efficient | Panda et al. [64]
RTS | Random Token Substitution | $\mathcal{L}_{RTS} = -\sum_{i=1}^{N}\log P(d_i \mid \hat{x})$, where $d_i$ represents whether the token at position i is randomly substituted or not. | Token-Level, Random Token Substituted, Sample Efficient | Di et al. [65]
SLM | Swapped Language Modeling | $\mathcal{L}_{SLM} = -\sum_{i \in S_x}\log P(x_i \mid \hat{x})$, where $S_x$ represents the positions of swapped tokens and $\hat{x}$ is the corrupted sequence. | Token-Level, Token Swapping, Not Sample Efficient | Di et al. [65]
TLM | Translation Language Modeling | $\mathcal{L}_{TLM} = -\sum_{i \in M_x}\log P(x_i \mid \hat{x},\hat{y}) - \sum_{j \in M_y}\log P(y_j \mid \hat{x},\hat{y})$, where $M_x$ and $M_y$ represent the masked positions and $\hat{x}$ and $\hat{y}$ represent the masked versions of x and y. | Random Token Masking, Bidirectional, Not Sample Efficient | XLM [58], XNLG [66]
ALM | Alternate Language Modeling | $\mathcal{L}_{ALM} = -\sum_{i \in M}\log P(z_i \mid \hat{z})$, where z is the code-switched sentence generated from x and y, $\hat{z}$ represents the masked version of z and M represents the set of masked token positions in $\hat{z}$. | Sentence-Level, Sentence Code-Switched, Bidirectional, Not Sample Efficient | Yang et al. [67]
SBO | Sentence Boundary Objective | $\mathcal{L}_{SBO} = -\sum_{i \in S}\log P(x_i \mid y_i)$ with $y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1})$, where f is a two-layered feed-forward neural network, S represents the positions of the tokens in the masked span, $\ell = e - s + 1$ represents the length of the span, s and e represent the start and end positions of the span, and p represents the position embedding. | Span-Level, Span Masking, Not Sample Efficient | SpanBERT [68]
NSP | Next Sentence Prediction | $\mathcal{L}_{NSP} = -\log P(d \mid x, y)$, where d represents whether the sentences are consecutive or not. | Sentence-Level, Consecutive Sentence Swapping, Not Sample Efficient | BERT [3]
SOP | Sentence Order Prediction | $\mathcal{L}_{SOP} = -\log P(d \mid x, y)$, where d represents whether the sentences are swapped or not. | Sentence-Level, Sentence Swapping, Not Sample Efficient | ALBERT [69]
SSLM | Sequence-To-Sequence LM | $\mathcal{L}_{SSLM} = -\sum_{i=1}^{\ell}\log P(x_i \mid \hat{x}, x_{<i})$, where $\hat{x}$ is the masked version of x with a masked n-gram span and $\ell$ represents the length of the masked n-gram span. | Token-Level, Unidirectional, Token Masking, Not Sample Efficient | T5 [70], mT5 [71], MASS [72]
DAE | Denoising Auto Encoder | $\mathcal{L}_{DAE} = -\sum_{i=1}^{N}\log P(x_i \mid \hat{x}, x_{<i})$, where $\hat{x}$ is the corrupted version of x. | Sentence-Level, Sentence Reconstructing, Noise Reduction, Sample Efficient | BART [73]
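To illustrate how a token-level masking objective reduces to a cross-entropy computed only over the masked positions, the following PyTorch sketch evaluates an MLM-style loss. The tensor shapes, the toy labels, and the use of -100 as an ignore index follow common practice and are assumptions for illustration, not the exact implementation of any model cited in the table.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) with
    non-masked positions set to -100 so they are ignored by the loss."""
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.view(-1, vocab_size),   # flatten to (batch*seq_len, vocab_size)
        labels.view(-1),               # flatten to (batch*seq_len,)
        ignore_index=-100,             # only the masked positions contribute
    )

# toy usage: 2 sequences of 8 tokens, vocabulary of 100 tokens
logits = torch.randn(2, 8, 100)
labels = torch.full((2, 8), -100)
labels[0, 3], labels[1, 5] = 42, 7     # only two positions are masked targets
loss = mlm_loss(logits, labels)
```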
Benchmark | Task | Variants | Size | Split | Ref
---|---|---|---|---|---
RACE | Reading Comprehension | | 24.26 MiB | test 1045, train 18,728, validation 1021 | [298]
GLUE | Sentiment Analysis | SST-2 | 7.09 MiB | test 1821, train 67,349, validation 872 | [299]
 | | CoLA | 368.14 KiB | test 1063, train 8551, validation 1043 | [300]
 | | MRPC | 1.43 MiB | test 1725, train 3668, validation 408 | [301]
 | | QQP | 57.73 MiB | test 390,965, train 363,849, validation 40,430 | [302]
 | Natural Language Inference | MNLI | 298.29 MiB | test 9796, train 392,702, validation 9815 | [303]
 | | QNLI | 10.14 MiB | test 5463, train 104,743, validation 5463 | [304]
 | | WNLI | 28.32 KiB | test 146, train 635, validation 71 | [305]
 | Semantic Textual Similarity | RTE | 680.81 KiB | test 3000, train 2490, validation 277 | [306]
 | | STS-B | 784.05 KiB | test 1379, train 5749, validation 1500 | [307]
SQuAD | Question Answering | SQuAD2.0, SQuAD1.1 | 94.04 MiB | train 87,599, validation 10,570 | [85]
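For readers who want to reproduce the splits listed above, the GLUE variants can be loaded by configuration name with the Hugging Face `datasets` library. The snippet below is a minimal sketch; the printed sizes should correspond to the SST-2 row of the table.

```python
from datasets import load_dataset  # pip install datasets

# Load the SST-2 configuration of GLUE; the other variants in the table
# ("cola", "mrpc", "qqp", "mnli", "qnli", "wnli", "rte", "stsb") load the same way.
sst2 = load_dataset("glue", "sst2")

for split, ds in sst2.items():
    print(split, len(ds))          # e.g. train 67349, validation 872, test 1821

print(sst2["train"][0])            # {'sentence': ..., 'label': ..., 'idx': ...}
```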
N (Parameters) | TB Models | MNLI-m (Acc) | MNLI-mm (Acc) | QQP (F1) | QNLI (Acc) | SST-2 (Acc) | CoLA (MC) | STS-B (Spearman Correlation) | MRPC (F1) | RTE (Acc) | WNLI (Acc) | SQuAD 1.1 (F1) | SQuAD 2.0 (F1) | RACE (Acc)
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
14.5 M | | 75.4 | 74.9 | 66.5 | 84.8 | 87.6 | 19.5 | 77.1 | 83.2 | 62.6 | - | - | - | -
29.2 M | | 77.6 | 77.0 | 68.1 | 86.4 | 89.7 | 27.8 | 77.0 | 83.4 | 61.8 | - | - | - | -
110 M | | 84.6 | 83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | - | 88.5 | - | -
335 M | | 86.7 | 85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | - | 90.9 | 81.9 | -
235 M | | 91.3 | 91.0 | 74.2 | - | 97.1 | 69.1 | 92.5 | 93.4 | 89.2 | 91.8 | 94.1 | 88.1 | 82.3
335 M | | 88.1 | 87.7 | 71.9 | 94.3 | 94.8 | 64.3 | 89.9 | 90.9 | 79.0 | 65.1 | 94.6 | 88.7 | -
15.1 M | | 81.5 | 81.6 | 68.9 | 89.5 | 91.7 | 46.7 | 80.1 | 87.9 | 65.1 | 65.1 | - | - | -
356 M | | 90.2 | 90.2 | 92.2 | 94.7 | 96.4 | 68.0 | 92.4 | 90.9 | 86.6 | 89.0 | 94.6 | 89.4 | 83.2
14.5 M | | 82.5 | 81.8 | 71.3 | 87.7 | 92.6 | 44.1 | 80.4 | 86.4 | 66.6 | - | - | - | -
340 M | | 88.0 | - | 74.1 | 95.7 | 95.2 | 65.3 | 90.3 | 92.0 | 83.1 | 65.1 | - | - | -
110 M | | 85.5 | - | 72.0 | 92.6 | 94.7 | 57.2 | 88.5 | 89.9 | 76.9 | 65.1 | 90.6 | - | -
52.2 M | | 78.9 | 78.0 | 68.5 | 85.2 | 91.4 | 32.8 | 76.1 | 82.4 | 54.1 | 65.1 | - | - | -
335 M | | 91.3 | 90.8 | 90.8 | 95.8 | 97.1 | 71.7 | 92.5 | 90.7 | 89.8 | 92.5 | 94.9 | 91.4 | -
550 M | | 89.1 | 88.5 | 73.2 | 94.0 | 95.6 | 62.9 | 88.8 | 90.7 | 76.0 | 71.9 | - | - | -
260 B | | 92.3 | 91.7 | 75.2 | 97.3 | 97.8 | 75.5 | 93.0 | 93.9 | 92.6 | 95.9 | - | - | -
340 M | | 87.0 | 85.9 | - | 92.7 | 94.5 | 61.1 | - | - | 70.9 | - | - | 83.4 | -
11 B | | 92.2 | 91.9 | 75.1 | 96.9 | 97.5 | 71.6 | 93.1 | 92.8 | 92.8 | 94.5 | 96.2 | - | -
140 M | | 89.9 | 90.1 | 92.5 | 94.9 | 96.6 | 62.8 | 91.2 | 90.4 | 87.0 | - | 94.6 | 89.2 | -
117 M | | 82.1 | 81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | - | - | - | -
1.5 B | | 82.1 | 81.4 | 70.3 | 88.1 | 91.3 | 45.4 | 82.0 | 82.3 | 56.0 | - | - | - | 59.0
117 M | | 85.8 | 85.4 | - | - | 92.6 | - | - | - | - | - | - | 81.3 | 66.0
360 M | | 89.8 | 91.0 | 91.8 | 93.9 | 95.6 | 63.6 | 91.8 | 89.2 | 83.8 | 92.5 | 94.5 | 88.8 | 81.8
530 M | | 91.4 | 91.4 | 92.7 | - | - | - | - | - | - | - | 95.5 | 91.2 | 89.5
Metric | Formula
---|---
Accuracy [309] | $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision [310] | $\mathrm{Precision} = \frac{TP}{TP + FP}$
Recall [310] | $\mathrm{Recall} = \frac{TP}{TP + FN}$
F1-Score [310] | $\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$

TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively.
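As a quick sanity check on these formulas, the short sketch below computes them from raw counts and compares the result against scikit-learn; the toy labels are invented for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy binary predictions (invented for illustration)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Counts behind the formulas in the table
tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn should agree with the manual computation
assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-9
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9
print(accuracy, precision, recall, f1)
```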
Metric | Formula
---|---
BLEU [312] | $\mathrm{BLEU} = \frac{\sum_{n\text{-gram} \in x} \mathrm{Count}_{match}(n\text{-gram})}{\sum_{n\text{-gram} \in x} \mathrm{Count}_{G}(n\text{-gram})}$, where x is the generated text sequence to be compared, $\mathrm{Count}_{match}(n\text{-gram})$ counts the n-grams that appear in both the generated and reference texts, the denominator is the number of n-grams in the sequence, and G is an abbreviation for the generated text.
ROUGE [313] | $\mathrm{ROUGE} = \frac{\sum_{n\text{-gram} \in R} \mathrm{Count}_{match}(n\text{-gram})}{\sum_{n\text{-gram} \in R} \mathrm{Count}_{R}(n\text{-gram})}$, where R is an abbreviation for the reference text.
PPL [314] | Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized sequence $W = (w_1, w_2, \ldots, w_n)$, then the perplexity of W is $\mathrm{PPL}(W) = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log p(w_i \mid w_{<i})\right)$, where n is the number of words in the generated text, $w_i$ is the i-th word in the generated text, and $p(w_i \mid w_{<i})$ is the probability of the word under an LM trained with reference texts.
DISTINCT [315] | $\mathrm{DISTINCT}\text{-}n = \frac{\mathrm{Count}(\text{singleton } n\text{-grams})}{\mathrm{Count}(\text{all } n\text{-grams})}$, where the numerator is the number of n-grams that appear only once in the generated text and the denominator is the total number of n-grams in the generated text.
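The generation metrics PPL and DISTINCT are simple enough to compute directly from model outputs. The sketch below mirrors the two formulas above; the token log-probabilities and the toy sentence are invented for illustration.

```python
import math
from collections import Counter

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood (natural-log probabilities)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

def distinct_n(tokens, n=2):
    """Share of n-grams that occur exactly once in the generated text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(ngrams)

# Toy example: log-probabilities assigned by some LM to 4 generated tokens
print(perplexity([-1.2, -0.7, -2.3, -0.9]))              # lower is better
print(distinct_n("the cat sat on the cat".split(), n=2)) # 3 singleton bigrams / 5 total
```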
Method | Highlights | Examples |
---|---|---|
Debiasing Omission (DO) [392] | The most popular optimizer for fine-tuning BERT is BERTAdam, a modified version of the Adam first-order stochastic optimization technique that omits the bias-correction step of the original algorithm. This omission is one of the main causes of BERT fine-tuning instability, since it distorts the effective learning rate, particularly early in training. | BERTAdam [392]
Re-Initializing Transformer Layers (RTL) [393] | This method re-initializes the top N layers of the Transformer. Choosing the number of re-initialized layers is crucial, because going past the ideal point may lead to subpar results. | [394] [395]
Utilizing Intermediate Layers (UIL) [396] | Most existing BERT-based works use only the final output layer of BERT and ignore the semantic information carried by the intermediate layers. This approach exploits the intermediate-layer representations to improve the effectiveness of fine-tuning BERT. | [396] [397] [398]
Layer-wise Learning Rate Decay (LLRD) [399] | With the learning rate distribution method known as LLRD, the top layers learn at higher rates than the bottom layers, which learn at lower rates. In order to achieve this, the learning rate is first set for the top layer and then decreased layer by layer from top to bottom using a multiplicative decay rate. | XLNet [44] ELECTRA [63] |
Mixout Regularization (MR) [400] | Mixout is a stochastic regularization technique inspired by Dropout [401] and DropConnect [402]. At each training iteration, each model parameter is replaced with its pre-trained value with a given probability. This constrains the fine-tuned model from deviating too far from the pre-trained initialization, helping to prevent catastrophic forgetting. | [400] [403]
Pre-trained Weight Decay (PWD) [404] | Weight decay (WD) is a common regularization technique. For fine-tuning pre-trained models, it is adapted so that the parameters decay toward their pre-trained values rather than toward zero, i.e., a fixed fraction of the difference between the current and the pre-trained parameters is subtracted at each update. Pre-trained weight decay stabilizes Transformer fine-tuning and performs better than conventional weight decay. | [405], BioBERT [5], DeFormer [406]
Stochastic Weight Averaging (SWA) [407] | A deep neural network training method that uses a modified learning rate schedule and averages the weights of the networks traversed during optimization. Averaging the weights leads to wider optima and better generalization. | ALBERT [408]
Learning Rate Warm-up Steps (LRWS) [409] | A linear schedule with warm-up steps has two phases. During the warm-up phase, which comes first, the learning rate increases linearly from 0 to the initial learning rate set in the optimizer. In the normal phase that follows, the learning rate decreases linearly until it reaches 0. A combined sketch of several of these techniques is given after this table. | EfficientBERT [410]
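Several of the stabilization techniques above can be combined in a single fine-tuning setup. The sketch below illustrates re-initializing the top Transformer layers (RTL), layer-wise learning rate decay (LLRD), and linear warm-up (LRWS) for a BERT-style classifier with PyTorch and Hugging Face Transformers; the hyperparameter values (base learning rate, decay factor, number of re-initialized layers, step counts) are illustrative assumptions rather than values recommended by the cited works.

```python
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# --- Re-Initializing Transformer Layers (RTL): re-init the top 2 encoder layers ---
for layer in model.bert.encoder.layer[-2:]:
    layer.apply(model._init_weights)          # reuse the model's own weight-init routine

# --- Layer-wise Learning Rate Decay (LLRD): lower layers learn at lower rates ---
base_lr, decay = 2e-5, 0.95                   # illustrative values
groups = [{"params": list(model.classifier.parameters())
                     + list(model.bert.pooler.parameters()), "lr": base_lr}]
layers = [model.bert.embeddings] + list(model.bert.encoder.layer)
for depth, layer in enumerate(reversed(layers)):          # top encoder layer first
    groups.append({"params": layer.parameters(), "lr": base_lr * decay ** (depth + 1)})
optimizer = torch.optim.AdamW(groups, lr=base_lr)         # bias correction kept (see DO row)

# --- Learning Rate Warm-up Steps (LRWS): linear warm-up, then linear decay to 0 ---
num_training_steps = 1000                                 # illustrative
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps)

# inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Mixout and SWA can be layered on top of this setup (PyTorch exposes `torch.optim.swa_utils.AveragedModel` for the latter), and keeping AdamW's default bias correction already addresses the debiasing-omission issue described in the first row of the table.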
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).