Comparing the Quality and Speed of Sentence Classiﬁcation with Modern Language Models

: After the advent of Glove and Word2vec, the dynamic development of language models (LMs) used to generate word embeddings has enabled the creation of better text classifier frameworks. With the vector representations of words generated by newer LMs, embeddings are no longer static but are context-aware. However, the quality of results provided by state-of-the-art LMs comes at the price of speed. Our goal was to present a benchmark to provide insight into the speed–quality trade-off of a sentence classifier framework based on word embeddings provided by selected LMs. We used a recurrent neural network with gated recurrent units to create sentence-level vector representations from word embeddings provided by an LM and a single fully connected layer for classification. Benchmarking was performed on two sentence classification data sets: The Sixth Text REtrieval Conference (TREC6)set and a 1000-sentence data set of our design. Our Monte Carlo cross-validated results based on these two data sources demonstrated that the newest deep learning LMs provided improvements over Glove and FastText in terms of weighted Matthews correlation coefficient (MCC) scores. We postulate that progress in LMs is more apparent when more difficult classification tasks are addressed.


Introduction
Recent years have seen substantial development in natural language processing (NLP), owing to the creation of vector representations of text called embeddings. The breakthrough language models (LMs) Glove [1] and Word2 Vec [2] enabled the creation of static word-level embeddings, which have the same form regardless of the context in which the word is found. Over time, researchers developed LMs capable of creating vector representations of text entities of different lengths, e.g., by analyzing text at the character [3][4][5] or sub-word level [6]. Recent progress in this field is exemplified by the development of many LMs that create contextualized word embeddings, i.e., their form depends on the context in which the token is found, as in Embeddings from Language Models (ELMo) [7], Bidirectional Encoder Representations from Transformers (BERT) [8], or Generalized Autoregressive Pretraining for Language Understanding (XLNet) [9].
The modern sentence classification approach relies on token level vector representations provided by a chosen LM from which sentence level embeddings are created. LMs providing more sophisticated contextualized embeddings should increase the quality of sentence classification but at the expense of speed. Maintaining a balance between quality and speed is considered important by many researchers, as evidenced by the struggles in developing efficient NLP frameworks [10][11][12]. However, the time cost

Text Classification
In this research, we used Flair v. 0.4.4 [11], an NLP framework written in PyTorch and published by the authors of the Flair LM [20]; this program enables the creation of text classification models based on various state-of-the-art LMs. With this framework, we instantiated sentence-level classifiers by using a unidirectional RNN with gated recurrent units (GRU RNN) with a single fully connected layer for final sentence classification. The GRU RNN task was to create a constant-length, sentence-level vector representation from token level embeddings provided by the selected LMs. An overall scheme presenting this workflow is presented in Figure 1. In the Flair framework, if we chose an LM enabling fine-tuning on the downstream task, e.g., BERT, the process of training the GRU RNN along with the classification layer automatically included fine-tuning of the LM.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 2 of 13 results in downstream tasks remain unclear. Determining this information is not trivial because when a new LM is published, authors usually provide comparisons of quality but not information regarding the training time (TT) and inference time (IT). Comparing results from various frameworks can also be difficult because of the use of different methods for creating sentence embeddings from lower-level embeddings, e.g., by naïvely computing averages of word embeddings or by computing the element-wise sum of the representations at each word position [13], or by using a deep neural network [13], recurrent neural network (RNN) [14], bidirectional long short-term memory RNN (BiLSTM) [15], or hierarchy by max pooling (HBMP) [16] model. Therefore, two factors often change at one time and influence the quality of sentence-level embeddings, namely the LM providing token embeddings and the method of creating sentence-level embeddings. Another hindrance when comparing the quality of LMs is that authors often choose to publish evaluations carried out on a variety of data sets yet provide results for only the best model obtained after many training runs without reporting relevant statistical information regarding the average models trained in their applied frameworks.
In the present study, we compared the embeddings created by various LMs in a single sentence classification framework that used the same RNN to create sentence-level embeddings to ensure that the overall performance was unbiased by the choice of algorithm for the creation of sentence-level embeddings. Additionally, to improve the comparability of the selected LMs, we performed benchmarking analysis on the same data sets, namely The Sixth Text REtrieval Conference (TREC6) set [17] and our preliminary meetings preliminary data set (MPD) [18] consisting of 1000 labeled sentences, which are published along with this paper. Furthermore, we performed Monte Carlo crossvalidation (MCCV) [19] to assess the quality of predictions and the TT and IT achieved with the selected LMs. All tests were conducted on the same computing machine. To further improve comparability, for all compared LMs, we used the same fine-tuning regimen developed during a preliminary study.

Text Classification
In this research, we used Flair v. 0.4.4 [11], an NLP framework written in PyTorch and published by the authors of the Flair LM [20]; this program enables the creation of text classification models based on various state-of-the-art LMs. With this framework, we instantiated sentence-level classifiers by using a unidirectional RNN with gated recurrent units (GRU RNN) with a single fully connected layer for final sentence classification. The GRU RNN task was to create a constant-length, sentencelevel vector representation from token level embeddings provided by the selected LMs. An overall scheme presenting this workflow is presented in Figure 1. In the Flair framework, if we chose an LM enabling fine-tuning on the downstream task, e.g., BERT, the process of training the GRU RNN along with the classification layer automatically included fine-tuning of the LM.

Selection of the LMs
The LMs selected for testing are presented in Table 1. We chose classic LMs that provide static word-level embeddings, including Glove [1], FastText [21], a size optimized LM called BPEmb [22] that operates on a sub-word level, and a variety of recently developed deep learning LMs that create high-quality contextualized embeddings.
The complex architecture of recent models, such as BERT, Robustly Optimized BERT Pretraining Approach (RoBERTa), and XLNet, allows for choosing from many layers (heads) of the LM model that output usable token embeddings. The quality of these embeddings varies across the LM layers and even across language tasks [23]. For these LMs, we compared embeddings from the default output layer and mixing embeddings from all available output layers, as proposed in a technique called scalar mix [23].

Data Used for Benchmarking
To benchmark the selected LMs, we decided to address two-sentence classification tasks. The first was defined as the TREC6 set, a data set consisting of 5452 training questions and 500 test questions divided into six classes. The task was to classify questions as being about abbreviation, entity, description or abstract concept, human being, location, or numeric values. The class definitions in the TREC6 set were not overlapping and generally constituted a relatively easy classification task. The second task was defined by the MPD, consisting of 1000 artificial sentences that appeared as if they could Appl. Sci. 2020, 10, 3386 4 of 12 have been written by an anonymous attendee of a meeting and could have come from several-sentence comments posted after that meeting. The task was to classify whether the sentence was about the following: time and punctuality, performance assessment, subject of the meeting, recommendations, or technical issues. Because the definitions of the classes were somewhat overlapping, the data set was created by three researchers who wrote and labeled the sentences independently, and because each sentence was considered outside of a wider context, this classification task was considered difficult. Notably, both benchmarked data sets had imbalanced classes (distributions presented in Table 2).

Performance Metrics
For drawing conclusions and discussing the statistically significant differences, the benchmark relied on the Matthews correlation coefficient (MCC) introduced by Matthews [27] because it is known to be more informative than the F1 score (F1) and class-balanced accuracy (BA) [28]. However, because F1 and CBA are widely used, we also computed them and included them in the full results published along with the MPD [18]. To address the speed of training and inference of the sentence classification framework, we report the time (hours) taken for framework training (TT) per model and time (seconds) taken to classify the entire test set (IT).

Training and Testing Procedure
All computations were carried out on a single NVIDIA Titan RTX 24 GB RAM GPU. We carried out ten test runs on data splits created according to the MCCV method [29] to counter the high variance in results due to the small sizes of the data sets. The TREC6 training data were split into 80 percent training and 20 percent validation data, and tests were carried out on a set of 500 sentences officially labeled as a test. With the MPD, the procedure involved randomly choosing 100 sentences for the testing of all LMs, and according to MCCV, splitting the remaining 900 sentences ten times into 80 percent training and 20 percent validation data.

Statistics
In this paper, the results are based on statistically significant differences in the assessed parameters. Our procedure for statistical analysis began with Shapiro-Wilk [30] testing for the normality of the distribution, which led us to the conclusion that in most cases, the two distributions were normal. In the second step, we carried out one-way ANOVA. If the ANOVA indicated statistically significant differences, we carried out a third step involving Tukey HSD multiple comparison tests to assess the trials that had significant differences between them. For a test case in which the values of only one Appl. Sci. 2020, 10, 3386 5 of 12 pair of parameters were compared, a Wilcoxon signed-rank test was used instead of a Tukey HSD test. The statistical analysis was carried out in Python3 (Statsmodels version ==0.10.1 and Pingouin version==0.2.9 packages). Throughout the analysis, we used a significance threshold of p = 0.05. The full results of the statistical analysis were published, along with the MPD [18].

Preliminary Study to Identify Framework Parameters and the Training Regimen
For a given LM, it is possible to select from a range of training parameter values that will affect both the training time (TT) and final model performance. Generally, training a model for a small number of epochs will provide an advantage in training time (TT) but can cause under-fitting, i.e., a negative influence on performance. If a model is trained for more epochs, TT will increase and the final model performance can increase, but it is not guaranteed due to a negative over-fitting phenomenon.
Therefore, before benchmarking the effects of the chosen LMs on the quality and speed of the training and the sentence classification, we carried out a preliminary study to determine the impact of the hyperparameters and the training parameters of the framework on our classification tasks.
Our work examined various parameters from a group of learning-rate-related parameters: (a) Learning rate-the step size at each iteration while moving toward a minimum of a loss function [31]; we tested values of 0.3, 0.2, 0.15, 0.1, and 0.05. (b) Initial learning rate-the learning rate that the model is trained with during the first epoch. (c) Patience-the number of epochs the model is trained with a given learning rate after the last improvement of model performance in a given step and before annealing of the learning rate. For example, given patience equal to 10, if the performance improved during training at the current learning rate only in the 3rd epoch, even without improvement in performance in any other epoch of the executed step, the training will proceed up to 3 + 10 epochs at the given learning rate. In our experiments, values of 5, 10, 15, 20, and 25 were tested. (d) Annealing factor-the ratio of decreasing the learning rate. Annealing factor × previous learning rate = current learning rate. In our experiments values of 0.3, 0.4, and 0.5 were considered. (e) Minimal learning rate-a minimal threshold learning rate. When computing the learning rate for the next training step according to the equation: learning rate × anneal factor results in a value lower than minimal learning rate, the model training is terminated. Our work experimented with values of 0.01, 0.008, 0.002, 0.001, and 0.0001.
Other considered parameters affecting the speed and quality of the model performance included: (1) The size of a mini-batch, i.e., the number of data instances provided at the same time on a model input during training, where values of 4, 8, 32, and 64 were considered. (2) The RNN hidden size, which defines the size of the vector representation of a sentence created by the RNN from token level embeddings, where values of 2, 16, 128, 256, and 512 were considered. (3) Shuffling sentences provided to the framework during training (true or false) were analyzed. Other framework parameters were set to default, as proposed by the authors of the Flair framework.
All trials during the preliminary study were computed with the FastText ("en-news") LM [2] with 1 million static word vectors. The initial study was carried out on both the TREC6 set and the MPD to demonstrate possible differences that could be attributed to the data set selection. Preliminary results from the TREC6 data set were considered decisive regarding the selection of training parameters for the final experiment as this data set consisted of over five times more data instances. When assessing the results of our preliminary study regarding training-related parameters, we considered only the MCC scores and TT of the framework. However, because the hidden size can also affect the IT, in this case, we also included it in the results.
Because the purpose of the study was to compare various LMs in the same training conditions, which most likely do not provide the highest possible performance for all LMs, for the final experiment, we selected training parameters that probably caused the models to be slightly under-fitted but provided a rather low TT. If one aimed for a training procedure that guarantees the top performance, a much longer TT must be considered and the training regime should be fitted precisely to both the LM and the data set.

Principal Component Analysis for Visualization of the LMs' Quality
To visualize discrepancies between the baseline and state-of-the-art LMs, we carried out a two-component principal component analysis of sentence embeddings provided by the framework using GloVe and RoBERTa large scalar mix LMs. The sentence-level vector representations were computed for all 1000 sentences from the MPD. Each sentence embedding was accompanied by an original class label, which allowed us to inspect the quality of grouping each class using the analyzed LM. Figure 2 shows that increasing the size of the sentence-level vector representation created by the RNN increased the TT. Simultaneously, for the short questions in the TREC6 data set, sentence embeddings as short as 16 elements enabled high performance. A higher dimensionality of sentence embeddings did not appear to improve the MCC score since no statistically significant differences were found; however, owing to a relatively low increase in the TT, we used a hidden size of 256 to ensure that we provided sufficient space for the information present in the sentences.

Preliminary Study
Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 13 Figure 2 shows that increasing the size of the sentence-level vector representation created by the RNN increased the TT. Simultaneously, for the short questions in the TREC6 data set, sentence embeddings as short as 16 elements enabled high performance. A higher dimensionality of sentence embeddings did not appear to improve the MCC score since no statistically significant differences were found; however, owing to a relatively low increase in the TT, we used a hidden size of 256 to ensure that we provided sufficient space for the information present in the sentences.  Figure 3 illustrates the effect of the mini-batch size on the framework performance, given a timeconsuming training regimen. For the given setup, there were no statistically significant differences in the MCC scores with small mini-batch sizes of 4 and 8. However, the small mini-batch sizes were significantly superior to the higher values of 32 and 64. Additionally, the TT with a mini-batch size of 8 was significantly shorter than that with a batch size of 4. Therefore, we used a mini-batch size of 8. The full results of the statistical analysis have been published, along with the MPD and python code used in our research [18].  Figure 3 illustrates the effect of the mini-batch size on the framework performance, given a time-consuming training regimen. For the given setup, there were no statistically significant differences in the MCC scores with small mini-batch sizes of 4 and 8. However, the small mini-batch sizes were significantly superior to the higher values of 32 and 64. Additionally, the TT with a mini-batch size of 8 was significantly shorter than that with a batch size of 4. Therefore, we used a mini-batch size of 8. The full results of the statistical analysis have been published, along with the MPD and python code used in our research [18].     Table 3 depicts the influence of the training-rate-related parameters on the performance and TT of the adopted framework for the TREC6 set and the MPD. The performance discrepancies for the MPD and the TREC6 set measured in points of the mean MCC score were 0.05 and 0.036, respectively. When the mean TT was considered, the differences were up to 0.116 h for the MPD and 0.643 h for the TREC6 set. It can be observed that the relation of the TT achieved in various test runs was very similar in both data sets. This allowed us to hypothesize that the relation of the TT between test runs did not vary strongly with the choice of data set. However, this was not the case when the LMs quality was concerned. We believe that the proposed training regimes for the TREC6 data set, when applied to the MPD, caused each model to be as well trained, with only one exception of a very short training scheme p5lr.2mlr.008a.5. This was likely caused by the smaller size of the MPD, which allowed the LMs to fit the data in a shorter time.     Table 3 depicts the influence of the training-rate-related parameters on the performance and TT of the adopted framework for the TREC6 set and the MPD. The performance discrepancies for the MPD and the TREC6 set measured in points of the mean MCC score were 0.05 and 0.036, respectively. When the mean TT was considered, the differences were up to 0.116 h for the MPD and 0.643 h for the TREC6 set. It can be observed that the relation of the TT achieved in various test runs was very similar in both data sets. This allowed us to hypothesize that the relation of the TT between test runs did not vary strongly with the choice of data set. However, this was not the case when the LMs quality was concerned. We believe that the proposed training regimes for the TREC6 data set, when applied to the MPD, caused each model to be as well trained, with only one exception of a very short training scheme p5lr.2mlr.008a.5. This was likely caused by the smaller size of the MPD, which allowed the LMs to fit the data in a shorter time.  Table 3 depicts the influence of the training-rate-related parameters on the performance and TT of the adopted framework for the TREC6 set and the MPD. The performance discrepancies for the MPD and the TREC6 set measured in points of the mean MCC score were 0.05 and 0.036, respectively. When the mean TT was considered, the differences were up to 0.116 h for the MPD and 0.643 h for the TREC6 set. It can be observed that the relation of the TT achieved in various test runs was very similar in both data sets. This allowed us to hypothesize that the relation of the TT between test runs did not vary strongly with the choice of data set. However, this was not the case when the LMs quality was concerned. We believe that the proposed training regimes for the TREC6 data set, when applied to the MPD, caused each model to be as well trained, with only one exception of a very short training scheme p5lr.2mlr.008a.5. This was likely caused by the smaller size of the MPD, which allowed the LMs to fit the data in a shorter time.

Preliminary Study
Based on this exploratory study, we decided to use a training regimen providing almost the best performance for both the TREC6 set and the MPD, namely p20lr.1mlr.002a.5. For the TREC6 set, the selected training procedure allowed for a 0.933 mean MCC score at a decent TT cost of a median of 0.269 h per trained model. Therefore, the configuration of the adopted framework in the final LM comparison was as follows: patience = 20, initial learning rate = 0.1, minimal learning rate = 0.002, annealing factor = 0.5, mini-batch size = 8, hidden size = 256, with example shuffling during training. Table 3. Impact of the learning-rate-related parameters on the framework performance for the TREC6 set and the MPD given a mini-batch size = 8, hidden size = 256, shuffle = True. Compared variants were named with use of abbreviations: p-patience, lr-initial learning rate, mlr-minimal learning rate, a-annealing. For example, p5lr.1mlr.0001a.5 stands for patience = 5, initial learning rate = 0.1, minimal learning rate = 0.0001, and annealing factor = 0.5. The final training regime selected for the final LM comparison is highlighted in bold.

Results of the Main Study
A comparison of selected LMs in the previously described framework was carried out on two data sets. The results obtained with the TREC6 data set are presented in Figure 5. The results obtained with the preliminary data set are shown in Figure 6.
The results presented in Figure 5 indicated that the baseline provided by static embeddings created by Glove and FastText was high and reached a 0.933 ± 0.005 MCC score. Recent transformer-based models outperformed this baseline and have achieved up to a 0.96 ± 0.006 MCC score for the BERT large uncased scalar mix version. Simultaneously, the MCC scores achieved by RoBERTa, XLNet, and variants of these models were unexpectedly worse than those of BERT for several possible reasons, such as incorrect training regimens for these LMs and data set properties. Generally, these results may indicate that, given the small differences in quality and the small data set size, to distinguish among very good models, a similar benchmark analysis must be used on an n-times-larger data set.
Simultaneously, the results achieved on the MPD (see Figure 6) revealed a different picture. A large and statistically significant discrepancy was observed between the MCC scores of the baseline LMs providing static word embeddings, such as Glove (0.402 ± 0.051) or FastText (0.47 ± 0.047), and state-of-the-art context-aware LMs, such as RoBERTa (0.59 ± 0.026). Leveraging the scalar mix technique allowed us to use the output from all available layers of these complex models and thus achieve the best results in all tested model pairs. The default layer of the XLNet achieved a very poor result of 0.377 ± 0.063, whereas the scalar mix version achieved 0.521 ± 0.038; the straightforward BERT scored 0.532 ± 0.043 and its scalar mix counterpart scored 0.596 ± 0.027; and the RoBERTa and RoBERTa scalar mix version achieved 0.59 ± 0.026 and 0.603 ± 0.023, respectively. Therefore, the scalar mix technique appeared to be highly useful, in agreement with findings from Liu et al. [32]. However, the increase in quality came at the expense of both TT and IT. As expected, the IT was the lowest for LMs providing static embeddings, reaching as low as 0.111 ± 0.011 s per 100 sentences for the FastText LM. The highest inference time of 3.822 ± 0.015 was obtained with the XLNet scalar mix LM. Additionally, the TTs of the contextualized LMs in their scalar mix versions were the highest, reaching 1.444 ± 0.175 h for the XLNet scalar mix LM, whereas the baseline Glove-based framework required only 0.069 ± 0.011 h to train. Appl. Sci. 2020, 10, x FOR PEER REVIEW 9 of 13 Figure 5. Comparison of framework performance with selected language models on the TREC6 data set. Explanation of the abbreviated names of LMs: S-scalar mix, L-large model version. Example notation RoBERTa LS stands for RoBERTa large scalar mix model.  The results for the MPD allowed us to visualize different qualities of various LMs grouping sentences by classes. Figure 7 presents a visualization based on a principal component analysis, which indicates that the Glove-based framework was unable to clearly divide classes, whereas the RoBERTa large scalar mix LM was able to distinguish groups with significantly higher quality.
providing static embeddings, reaching as low as 0.111 ± 0.011 s per 100 sentences for the FastText LM. The highest inference time of 3.822 ± 0.015 was obtained with the XLNet scalar mix LM. Additionally, the TTs of the contextualized LMs in their scalar mix versions were the highest, reaching 1.444 ± 0.175 h for the XLNet scalar mix LM, whereas the baseline Glove-based framework required only 0.069 ± 0.011 h to train.
The results for the MPD allowed us to visualize different qualities of various LMs grouping sentences by classes. Figure 7 presents a visualization based on a principal component analysis, which indicates that the Glove-based framework was unable to clearly divide classes, whereas the RoBERTa large scalar mix LM was able to distinguish groups with significantly higher quality.  and RoBERTa large scalar mix (right) LMs. Class labels are represented by colors: red-performance assessment, green-subject, blue-recommendations, black-technical, and magenta-time.

Conclusions
A comparison of the results obtained for the TREC6 set and the MPD suggested that they differed strongly in the magnitude of the MCC scores. For the TREC6 set, almost all LMs achieved MCC scores greater than 0.9, whereas on the MPD, the highest values were only slightly above 0.6, and the lowest values were below 0.4. The finding that even LMs creating static embeddings achieved very good results on the TREC6 data set was due to the simplicity of the classification tasks. Consequently, there was little room for improvement of state-of-the-art LMs, thus making any conclusions regarding the quality of newer LMs difficult to draw. In contrast, the MPD was challenging because using simple static word embeddings allowed for a 0.47 ± 0.047 MCC score at best. However, interestingly, the sophisticated context-aware LMs demonstrated their superiority on the MPD since they were able to locate the necessary information in sentences full of contextual meaning and achieve scores of up to 0.603 ± 0.023. From these results, we postulate that for context-aware LMs to fully demonstrate their advantages, a sufficiently difficult task must be used.
A similar phenomenon was encountered in a deep learning task from another domain, namely the detection of objects in image analysis. Owing to the progress in the quality of solutions offered by models using complex convolutional neural network architectures, the differences in performance on once broadly addressed Pascal Voc data set became very small. Many models were able to achieve up to an 80 percent mean average precision score [33]. A solution was proposed by the MS CoCo challenge [34], in which the metrics emphasized the need for more precise localization of detected objects, thus increasing the task difficulty. As a result, the state-of-the-art models achieved a less than 50 percent mean average precision score in this challenge [35], thus leaving a lot of room for further improvement and demonstration of high-quality solutions. Unfortunately, owing to the very small size of the MPD, other explanations for this phenomenon cannot be excluded: some newer deep learning models might enable the framework to learn properly even given the small amount of data, or a framework based on baseline LMs might simply need more data. To rule out these possibilities, a larger yet still challenging data set would be required.

Study Limitations and Future Work
The presented comparison was carried out on two small data sets, thus resulting in a high standard deviation of the results. In assessing differences between models achieving similar median MCC scores, i.e., differing by less than 0.01 point on the TREC6 set, before drawing any conclusions, the standard deviation exceeding 0.01 points in some cases must be considered. The TREC6 data set appeared to be insufficiently difficult for examining differences in state-of-the-art LMs. Therefore, our future work will aim at increasing the MPD size to rule out all the negative effects of the small analyzed data sample and to allow us to draw stronger conclusions on a more challenging data set. Another limitation of the study was associated with our preliminary study, in which we decided to select one set of hyperparameters and training parameters for all selected LMs. Such a decision provided a common ground for the whole comparison. However, it also had a negative impact, namely the selected training regime was most likely favorable for some of the LMs and we have no information about which ones. A contradictory approach could be to search for training parameters separately for each selected LM. Pursuing this line of the experiment would lead to a situation where probably almost all LMs should have their original training regimes and one would end up comparing "training-optimized" versions of each LM. Such an approach would also have its imperfections; for example, it can be very time-consuming for state-of-the-art LMs, and in the end, would put the researcher in a situation of comparing LMs trained with entirely different procedures.