From General Language Understanding to Noisy Text Comprehension

Obtaining meaning-rich representations of social media inputs, such as Tweets (unstructured and noisy text), from general-purpose pre-trained language models has become challenging, as these inputs typically deviate from mainstream English usage. The proposed research establishes effective methods for improving the comprehension of noisy texts. For this, we propose a new generic methodology to derive a diverse set of sentence vectors combining and extracting various linguistic characteristics from latent representations of multi-layer, pre-trained language models. Further, we clearly establish how BERT, a state-of-the-art pre-trained language model, comprehends the linguistic attributes of Tweets to identify appropriate sentence representations. Five new probing tasks are developed for Tweets, which can serve as benchmark probing tasks to study noisy text comprehension. Experiments are carried out for classification accuracy by deriving the sentence vectors from GloVe-based pre-trained models and Sentence-BERT, and by using different hidden layers from the BERT model. We show that the initial and middle layers of BERT have better capability for capturing the key linguistic characteristics of noisy texts than its latter layers. With complex predictive models, we further show that the sentence vector length is less important for capturing linguistic information, and the proposed sentence vectors for noisy texts perform better than the existing state-of-the-art sentence vectors.


Introduction
Natural Language Processing (NLP) and its subfield, Natural Language Understanding (NLU), primarily focus on the well-known complex problem of machine reading comprehension. Among several challenges facing NLU, the representation of sentences incorporating all their linguistic elements is considered to be highly complex. Because accurate sentence representations benefit tasks such as sentence classification, text summarization, and machine translation, it has become necessary to explore new NLU methods that incorporate linguistic components, such as syntax and semantics, to improve accuracy. While a plethora of techniques have already been proposed, representing sentences as vectors of real numbers in high dimensional continuous space is still attracting attention [1,2].
For vector representation, both word and sentence embeddings have influenced the representation, following the rapid rise of Word2Vec [3]. Recently, unsupervised, pretrained language models, such as Bidirectional Encoder Representations from Transformers (BERT) [4], were successful in achieving state-of-the-art results in various NLP tasks, e.g., at the sentence level, thereby introducing a major paradigm shift in sentence representations. It may be noted that unlike shallow word vector models (i.e., Word2Vec [3] and Global Vectors for Word Representation (GloVe) [5]), deep models, such as BERT, are contextual.
Widespread use cases, such as sentiment analysis and intent analysis, mandate sophisticated sentence representations since these models essentially involve the identification of intricate linguistic patterns [6,7]. With the increasing proliferation of social media data, such as Tweets, it has further become essential to represent noisy texts as vectors to improve the model performance. For this reason, the BERT model is extensively used with Tweets to achieve state-of-the-art accuracy [8,9,10,11].
However, the application of pre-trained language models, such as BERT, in such scenarios is not easy because Tweets follow a different distribution [12,13] than the training inputs. While the BERT model is pre-trained on BookCorpus and English Wikipedia, the Tweets exhibit a significant deviation from this mainstream English language usage. Further, such challenges become extremely overwhelming, as Tweets cover different domains (e.g., day-to-day activities, sports, politics, and science); hence, they are significantly different. For these reasons, the language representation should clearly express non-task-specific general-purpose priors to develop artificially intelligent systems [14].
Although BERT is a general-purpose language model, the reason behind its overall success is not understood clearly. Goldberg [15] and Jawahar et al. [16] made efforts to understand BERT's ability to learn the structure and syntax of the English language. It was observed that different layers and regions of BERT capture different traits of the English language. However, it was not reported how these findings can enhance the quality of word or sentence embeddings. Indeed, Kumar et al. [17] demonstrated a drastic fall in BERT's performance with an increase in noise level. Apart from this, there was also the recent emergence of various pre-trained language models comprising multi-layer architectures [18]. Thus, a technique based on the latent representations of multi-layer models is vital for optimizing the vector representations to be used for use cases involving unstructured and noisy texts.
To address these research gaps, we use BERT as the multi-layer pre-trained language model and Tweets to represent noisy texts. We propose a systematic approach to derive a diverse set of sentence vectors combining and extracting various linguistic characteristics. For this, we have developed new probing datasets, using noisy texts based on the definition of specific probing tasks in [19] to analyze BERT's behavior across different linguistic territories centered on noisy texts. We derive generalizable sentence representations for noisy texts, comprising the most important linguistic characteristics to capture the meaning of a sentence. More specifically, our key contributions for enabling BERT to derive meaning-rich sentence representations from noisy text are as follows:
• New noisy probing datasets: these new datasets can serve as benchmark datasets for future researchers to study the linguistic characteristics of unstructured and noisy texts. These datasets are publicly available (https://bit.ly/3rK0g7P) and available on request.
• New methodology: this allows studying the linguistic comprehension of multi-layer language models.
• Generic technique: used for sentence vector generation, using a pre-trained multi-layer language model.
The rest of this paper is organized as follows. Section 2 provides relevant background information related to BERT's language understanding ability and probing tasks. Section 3 discusses the probing dataset generation approach and the strategy to generate sentence embeddings. Section 4 presents various experimental results across different probing tasks. The results are analyzed and discussed in Sections 5 and 6, respectively. Finally, Section 7 presents the conclusion.

Pre-Trained Language Models
Recently, word embedding [20] has become popular as a de facto starting point for representing the meaning of words. However, static methods, such as Word2Vec [3], GloVe [5], and FastText [21], generate a fixed representation for each word in a vocabulary. Hence, these techniques cannot easily be adapted to identify the contextual meaning of a word. More recent pre-trained language representations, such as ELMo, a deep contextualized word representation [22], and BERT [4], produce dynamic representations of a word based on its context. The BERT architecture includes a multi-layer bidirectional Transformer [23] and an attention mechanism that learns contextual relations between words (or sub-words) in a text. The Transformer consists of two separate mechanisms: an encoder that processes the input, and a decoder that generates a prediction for the task. BERT, which is trained bidirectionally on a large corpus of unlabeled text, including the entirety of Wikipedia and BookCorpus, allows its models to understand the meaning of a language more correctly.
Further, several other Transformer-based language models perform well at a broader range of tasks beyond document classification, such as commonsense reasoning, semantic similarity, and reading comprehension. Transformer-XL [24], a Transformer-based autoregressive model, enables capturing longer-term dependencies in a sentence and achieves better performance on NLP tasks for both short and long sequences. Generative Pretrained Transformer 3 (GPT-3) [25], the third generation language prediction model in the GPT-n series created by OpenAI, is an auto-regressive Transformer model that performs reasonably well on unseen NLP tasks.
These recent models capture many facets of language relevant for downstream tasks, such as long-term dependencies, hierarchical relations, and context, to provide state-of-the-art performance [15,26]. Further, previous research [20,27,28] demonstrated that deep learning models with complex architectures that leverage the contextual meaning of the words can significantly improve the learning abilities.

Language Understanding with BERT
Goldberg [15] assesses the extent to which the BERT model captures the syntactic structure of a sentence, using three stimuli tasks related to subject-verb agreement. Though the results are not directly comparable with previous work, due to BERT's bidirectional nature, the results suggest that purely attention-based BERT models are likely capable of capturing syntactic information at least as well as the sequence models, and probably better.
Jawahar et al. [16] performed a series of experiments, using conventional and standard English sentences extracted from books, to identify the linguistic information learned by BERT. These experiments were based on the probing datasets developed by [19], using the Toronto BookCorpus dataset [29], which was one of the two data sources used to train the BERT model. They showed that BERT's intermediate layers encode a rich set of linguistic characteristics, with surface features at the bottom, syntactic features in the middle, and semantic features at the top. This indicates that specific regions or layers of BERT are better suited for comprehending different aspects of the English language.
Similarly, Liu et al. [30] examined the linguistic knowledge captured by contextual word representations derived from different layers of large-scale neural language models. They showed that the frozen contextual representations are competitive with state-of-the-art, task-specific models in many cases but fail on tasks requiring fine-grained linguistic knowledge. These studies focused only on structured and clean English sentences. They paid little attention to combining the layer representations based on linguistic knowledge to derive a meaning-rich sentence vector. Tenney et al. [31] introduced "edge probing" tasks, covering syntax, semantic meaning and dependency relation phenomena, to study how contextual representations encode sentence structures. Their results using BERT and a few other pre-trained language models indicated that these models encode syntactic phenomena strongly but offer comparatively minor improvements on semantic tasks over a non-contextual baseline. However, they worked only with the top layer activations of the BERT model. Further, Hewitt and Manning [32] showed that the contextual word representations provided by pre-trained language models, such as BERT, embed syntax trees in their vector representations. Nevertheless, they focused mainly on the syntactic structure.
On the other hand, Clark et al. [26] analyzed BERT's attention mechanism and showed that a specific set of attention heads correspond well to linguistic notions of syntax and coreference. Further, they demonstrated the ability of BERT's attention heads to capture important syntactic information, using an attention-based probing classifier.
However, Wang et al. [33] more recently concluded that the popular complex pretrained language models do not necessarily translate noisy text to better representations. Further, they highlighted that more exploration is needed in this area.

Probing Tasks
Shi et al. [34] and Adi et al. [35] introduced general prediction tasks to understand the language information captured by sentence vectors. Shi et al. [34] investigated, by analyzing syntactic structure, whether Neural Machine Translation (NMT) systems learn source language syntax as a by-product of training. Adi et al. [35] proposed a framework that facilitates a better understanding of the encoded representations, using tasks to predict a sentence's length, detect a change in word order, and identify the words in a sentence.
Extending the work of [34,35], the authors of [19] introduced ten classification problems known as probing tasks. A probing task is a text classification problem that groups sentences based on simple linguistic characteristics. The performance of this classification model depends on the richness of the linguistic information packed into a sentence representation. Further, these probing tasks are assigned to three groups: surface information, syntactic information, and semantic information, based on the primary linguistic feature required to perform the task effectively. The surface information tasks can rely only on surface properties (e.g., sentence length) to perform the classification successfully, and no linguistic knowledge is required. The tasks grouped under syntactic information are sensitive to a sentence's syntactic properties (e.g., depth of the syntactic tree). In contrast, semantic information-related tasks require some understanding of the meaning of a sentence and the semantic structure.

Methodology
This section introduces our methodology for leveraging probing tasks to efficiently validate BERT's ability to capture linguistic information and to derive meaning-rich sentence representations for noisy and unstructured text.
We propose a systematic approach to study the linguistic behaviors of multi-layer pre-trained language models by dividing the layers into multiple regions. Hence, in our methodology, we introduce a novel technique to generate sentence embeddings by dividing BERT into three regions (Figure 1) and then combining the hidden layers and token vectors, using two pooling operations. This allows us to analyze a diverse set of sentence vectors and their ability to capture linguistic information representing different linguistic domains. Next, we discuss our approach to generate probing datasets covering five probing tasks under noisy text conditions. These noisy probing datasets are crucial in determining each sentence vector's ability to capture the linguistic patterns needed to classify sentences into the target classes of each probing task. This framework can be easily extended to study the language comprehension capabilities of similar multi-layer language models.
The details of the methodology and its components are presented below.

Sentence Vector Generation
Our proposed methodology uses pre-trained language models to generate sentence representations. We use the "BERT-BASE-uncased" model [4] to obtain word embeddings from different hidden layers to produce sentence vectors. This allows for exploration of the linguistic features of unstructured and noisy text, such as Tweets, as learned by different hidden layers of the BERT model. Further, to link BERT's learning ability with specific linguistic components, inspired by the work of Jawahar et al. [16], we divide BERT's hidden layers into three regions as shown in Figure 1. Jawahar et al. [16] showed that BERT's hidden layers encode a rich hierarchy of linguistic information, with surface features at the bottom, syntactic features in the middle and semantic features at the top. These linguistic components are crucial to represent the meaning of a sentence. Hence, in our methodology, we propose a novel technique to generate region-wise sentence embeddings by dividing BERT into three regions. In addition, we use pre-trained Word2Vec [3] and Stanford's GloVe [5] models to derive sentence vectors. In contrast to BERT, these models, although shallow and non-contextual, offer a vocabulary 10 to 100 times larger. This larger vocabulary may outweigh the benefits of a context-aware pre-trained model with a small vocabulary (e.g., BERT), especially for noisy data [36]. Moreover, we employ word vectors trained with the GloVe algorithm on two billion Tweets to evaluate the impact of Twitter-specific pre-trained language models.
The following section explains the strategy to generate multiple sentence embeddings, using the pre-trained BERT-BASE-uncased model. In the remainder of the paper, the term BERT refers to BERT-BASE-uncased.

Sentence Representations Using Multi-Layer Pre-Trained Language Models
An input sentence is represented as a sequence of input tokens $T = [t_0, t_1, \ldots, t_n]$, where $t_0$ is the special [CLS] token that needs to be prepended for the out-of-the-box pooling schema to work. BERT produces a set of hidden layer activations $H^{(0)}, H^{(1)}, \ldots, H^{(L)}$, where $H^{(l)} = [h^{(l)}_0, h^{(l)}_1, \ldots, h^{(l)}_n]$ are the activation vectors of the $l$th hidden layer. We ignore $H^{(0)}$, which consists of non-contextual word-piece embeddings, when generating sentence representations.
To generate a sentence representation based on multiple hidden layers, we propose to generate a token representation vector $w_i$ for each token $t_i$ in $T$, using a layer pooling strategy. A layer pooling strategy combines different representations of the same token across multiple hidden layers. For this, three layer pooling strategies are studied: (i) SUM-layer-strategy, (ii) MEAN-layer-strategy, and (iii) CONCAT-layer-strategy. The SUM-layer-strategy and the MEAN-layer-strategy calculate the sum and mean of all the activation vectors $h_i \in \mathbb{R}^d$ of the selected hidden layers, respectively, producing $w_i \in \mathbb{R}^d$, where $d$ is the size of the hidden vector $h$. Thus, for each sentence, the MEAN-layer-strategy and SUM-layer-strategy produce a matrix $W \in \mathbb{R}^{n \times d}$. On the other hand, the CONCAT-layer-strategy concatenates the corresponding hidden activation vectors $h_i$ in the order of the layer numbers to generate $w_i \in \mathbb{R}^{kd}$, where $k$ is the number of BERT layers selected to generate the sentence representation. The CONCAT-layer-strategy therefore produces a sentence representation $W \in \mathbb{R}^{n \times kd}$.
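The three layer pooling strategies can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name `pool_layers`, the strategy labels, and the random arrays standing in for BERT activations are all assumptions made for the example.

```python
import numpy as np

def pool_layers(hidden_layers, strategy):
    """Combine k hidden layers, each of shape (n_tokens, d), into one token matrix W.

    hidden_layers: list of k arrays standing in for the selected BERT layers.
    strategy: "sum", "mean", or "concat" (illustrative labels).
    """
    if strategy == "sum":
        return np.stack(hidden_layers, axis=0).sum(axis=0)    # (n_tokens, d)
    if strategy == "mean":
        return np.stack(hidden_layers, axis=0).mean(axis=0)   # (n_tokens, d)
    if strategy == "concat":
        # Concatenate per token, in order of layer number: (n_tokens, k * d).
        return np.concatenate(hidden_layers, axis=1)
    raise ValueError(f"unknown layer pooling strategy: {strategy}")

# Random stand-ins for four hidden layers of a 6-token input with d = 768.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(6, 768)) for _ in range(4)]

W_sum = pool_layers(layers, "sum")
W_mean = pool_layers(layers, "mean")
W_concat = pool_layers(layers, "concat")
```

With four layers and d = 768, the sum and mean strategies keep the token width at 768, while concatenation widens each token vector to 4 × 768 = 3072.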
Then, to derive the sentence vector $S = [s_1, s_2, \ldots, s_{\|w_i\|}]$, we apply multiple token pooling strategies to the sentence representation $W$ (obtained after applying the layer pooling strategy), where each token representation $w_i$ is a row. A token pooling strategy merges all the token embeddings of a sentence into a single vector. For this, we study two token pooling operations: (i) MEAN-token-strategy, and (ii) MAX-token-strategy. The MEAN-token-strategy and MAX-token-strategy are calculated as $s_j = \mathbb{E}_{1 \le i \le n} W_{ij}$ and $s_j = \max_{1 \le i \le n} W_{ij}$, respectively, where $i$ indexes the tokens and $j$ the vector dimensions. Further, the MEAN-MAX-token-strategy we propose concatenates the MEAN-token-strategy output vector and the MAX-token-strategy output vector to derive a sentence vector twice the size of $w_i$.
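A minimal sketch of the token pooling operations is given below, assuming the token matrix is available as a NumPy array with one row per token. The function name `pool_tokens` and the strategy labels are hypothetical names chosen for illustration.

```python
import numpy as np

def pool_tokens(W, strategy):
    """Collapse a token matrix W of shape (n_tokens, width) into one sentence vector."""
    if strategy == "mean":
        return W.mean(axis=0)          # average over tokens, per dimension
    if strategy == "max":
        return W.max(axis=0)           # maximum over tokens, per dimension
    if strategy == "mean_max":
        # Concatenation of both outputs: twice the width of a token vector.
        return np.concatenate([W.mean(axis=0), W.max(axis=0)])
    raise ValueError(f"unknown token pooling strategy: {strategy}")

# Toy matrix: 3 tokens, 2 dimensions.
W = np.array([[1.0, 4.0],
              [3.0, 0.0],
              [2.0, 2.0]])

s_mean = pool_tokens(W, "mean")          # [2.0, 2.0]
s_max = pool_tokens(W, "max")            # [3.0, 4.0]
s_mean_max = pool_tokens(W, "mean_max")  # [2.0, 2.0, 3.0, 4.0]
```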
As shown in Figure 1, for each region Rn (n ∈ {1, 2, 3}), different combinations of four layers are considered to generate sentence embeddings. We apply the layer pooling and token pooling strategy combinations listed in Table 1 across each BERT region Rn to systematically generate a diverse set of sentence embeddings, using the pre-trained BERT model.

Table 1. Strategy to generate sentence embeddings from each region (ref. Figure 1) of the BERT model. Rn-i represents the ith layer in the nth region. We combine each layer pooling strategy with every token pooling strategy across identified layers to generate multiple sentence embeddings. Layer pooling is not applicable for the sentence embeddings generated using a single vector.
Layers | No. of Layers | Layer Pooling | Token Pooling

Our experiments also utilize the state-of-the-art sentence embedding model, Sentence-BERT (SBERT) [37], which uses Siamese and triplet network structures to derive semantically meaningful sentence vectors from the pre-trained BERT model. We propose to use a pre-trained model optimized for Semantic Textual Similarity (STS), as this model is recommended for general-purpose use. SBERT uses a mean pooling strategy to derive sentence vectors from word embeddings.

Static Embeddings
We propose two shallow pre-trained models, namely Word2Vec [3] and GloVe [5], to generate sentence vectors for unstructured and noisy sentences, as these language models are rich in vocabulary, compared to BERT. It is known that any social media data, such as Tweets, often lack grammatical structure and can contain misspelled words and acronyms. Hence, a language model (e.g., Word2Vec and GloVe) that ensures a lower percentage of out-of-vocabulary (OOV) words may provide better sentence representations than a deep pre-trained model with a smaller vocabulary [36].
We use the MEAN-token-strategy to derive sentence embeddings, using Word2Vec and GloVe.
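As a rough illustration of the MEAN-token-strategy applied to static embeddings, the sketch below averages the vectors of the tokens found in a lookup table. The toy vectors, the function name `mean_static_embedding`, and the handling of OOV tokens (skipped, with a zero vector when nothing is in vocabulary) are assumptions of this example; the text does not specify how OOV tokens are treated.

```python
def mean_static_embedding(tokens, vectors, dim):
    """Average the static word vectors of the in-vocabulary tokens.

    Assumption: OOV tokens are skipped; a zero vector is returned
    when no token is in the vocabulary.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each column is averaged.
    return [sum(column) / len(known) for column in zip(*known)]

# Toy 2-dimensional "pre-trained" vectors (illustrative values only).
toy_vectors = {"good": [1.0, 0.0], "game": [0.0, 1.0]}

sentence = ["good", "game", "lol"]   # "lol" is OOV in the toy table
embedding = mean_static_embedding(sentence, toy_vectors, dim=2)
```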

Noisy Probing Datasets
Probing datasets have a crucial role in the proposed study, as they validate the model's ability to comprehend linguistic characteristics. Studies reported earlier (e.g., [19]) have focused only on language comprehension of structured and grammatical sentences. Hence, the existing probing datasets [19] contain structured and grammatical sentences and rely on the pre-trained Probabilistic Context-Free Grammar (PCFG) model [38] and part-of-speech, constituency, and dependency parsing information provided by the Stanford Parser. Although the PCFG model reported close to 87% accuracy for regular English sentences, it is poorly suited for noisy texts [39,40]. Further, the available Twitter-specific dependency parsers reported a low overall accuracy level with further reductions if the test set topics differed from the training dataset. Thus, the use of automatic part-of-speech or automatic dependency parsing as suggested by [19] is not a feasible option for noisy probing datasets. Hence, we propose to use a noisy dataset manually annotated with the required linguistic labels to generate quality probing datasets.
For this, we use "Tweebank v2", a collection of English Tweets [41], annotated in Universal Dependencies [42], as it can be exploited to generate the required noisy probing datasets. Authors of [41] followed a rigorous two-stage process to develop 3550 manually labeled Tweets. They automatically annotated the Tweets, using a parser trained on a sample set of Tweets manually annotated in the first stage. In the second stage, they manually corrected the parsed data. These high-quality labels are crucial to developing gold standard probing datasets for noisy text data. However, this research [41] did not focus on specific aspects of linguistics, such as dependency parsing information. Due to the unavailability of these linguistic labels, we are focusing only on a selected subset of probing tasks out of the ten probing tasks proposed by [19]. Nevertheless, the selected probing tasks continue to cover the three important linguistic categories (i.e., surface, syntactic and semantic), thereby enabling us to analyze the richness of the sentence vectors across all three levels of linguistic information and ensuring the quality of the findings. Further, we introduce additional criteria explained below to adapt the dataset to noisy conditions. The probing tasks that are focused on in this study are explained in the following sections.

Word content
We consider a 10-class classification task with ten words as targets, considering the available manually annotated instances. The aim is to predict which of the target words appears in the given sentence. Words that are not part of the vocabulary are split by BERT into subwords and characters. In this case, word embeddings might not reflect the best meaning of the word. Hence, we propose to use only the words that appear in the BERT vocabulary as target words. We construct the data by picking the first ten lower-cased words occurring in the corpus vocabulary, ordered by frequency and having a length of at least four characters. Because the dataset is noisy, this length constraint improves its reliability. Each sentence contains a single target word, and the word occurs precisely once in the sentence. The task is referred to as "WC" in the paper.
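The selection procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the helper names `select_targets` and `build_wc_examples`, the toy Tweets, and the toy model vocabulary are assumptions.

```python
from collections import Counter

def select_targets(tweets, model_vocab, num_targets=10, min_len=4):
    """Pick the most frequent lower-cased words that are long enough
    and present in the model vocabulary."""
    counts = Counter(tok.lower() for tweet in tweets for tok in tweet)
    targets = []
    for word, _ in counts.most_common():
        if len(word) >= min_len and word in model_vocab:
            targets.append(word)
        if len(targets) == num_targets:
            break
    return targets

def build_wc_examples(tweets, targets):
    """Keep Tweets containing exactly one target occurrence; label with that word."""
    target_set = set(targets)
    examples = []
    for tweet in tweets:
        hits = [tok.lower() for tok in tweet if tok.lower() in target_set]
        if len(hits) == 1:
            examples.append((tweet, hits[0]))
    return examples

# Toy tokenized Tweets and a toy model vocabulary (illustrative only).
tweets = [["Love", "this", "game"], ["this", "game", "rocks"],
          ["love", "love", "u"], ["Game", "time"]]
vocab = {"love", "game", "this", "time"}

targets = select_targets(tweets, vocab, num_targets=2)
examples = build_wc_examples(tweets, targets)
```

In this toy run the first and third Tweets are discarded because they contain two target occurrences, which mirrors the requirement that each sentence contains a single target word exactly once.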

Bigram shift
The purpose of the Bigram Shift task is to test whether an encoder is sensitive to legal word orders. Two adjacent words in a Tweet are inverted, and the classifier performs a binary classification to identify inverted and non-inverted Tweets. The task is referred to as "BShift" in the paper.
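A minimal sketch of how an inverted example might be generated is shown below. The function name `bigram_shift`, the uniform choice of the swap position, and the seeded generator are assumptions of this example; the text only states that two adjacent words are inverted.

```python
import random

def bigram_shift(tokens, rng):
    """Return a copy of tokens with one randomly chosen adjacent pair swapped."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to swap a bigram")
    i = rng.randrange(len(tokens) - 1)   # position of the left word in the pair
    shifted = list(tokens)
    shifted[i], shifted[i + 1] = shifted[i + 1], shifted[i]
    return shifted

rng = random.Random(0)                   # seeded for reproducibility
original = ["see", "you", "at", "noon"]
shifted = bigram_shift(original, rng)

# Label 1 marks an inverted Tweet, label 0 an untouched one.
examples = [(shifted, 1), (original, 0)]
```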

Tree depth
The Tree Depth task evaluates the encoded sentence's ability to capture hierarchical structure by having the classification model predict the depth of the longest path from the root to any leaf in the Tweet's parse tree. The dataset contains six different classes (two to seven) based on the tree depth. The task is referred to as "TreeDepth" in the paper.
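One way to compute such a depth from dependency annotations is sketched below. It assumes CoNLL-U style head indices (1-based, with 0 marking a root) and counts the tokens on the longest root-to-leaf path; both the counting convention and the function name `tree_depth` are assumptions, since the text does not state how depth is measured.

```python
def tree_depth(heads):
    """Depth of a dependency tree given as a list of head indices.

    heads[i] is the 1-based head of token i + 1; 0 marks a root.
    Depth is counted in tokens along the longest root-to-leaf path
    (an assumed convention for this sketch).
    """
    def depth_of(token_id):
        depth = 1
        while heads[token_id - 1] != 0:
            token_id = heads[token_id - 1]
            depth += 1
            if depth > len(heads):       # guard against malformed input
                raise ValueError("cycle detected in head indices")
        return depth

    return max(depth_of(t) for t in range(1, len(heads) + 1))

# "I love this game": I -> love, love -> root, this -> game, game -> love
heads = [2, 0, 4, 2]
depth = tree_depth(heads)   # path this -> game -> love has 3 tokens
```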

Semantic odd man out
The Tweets are modified by replacing a random noun or a verb o with another noun or verb r. The task of the classifier is to identify whether the sentence gets modified due to this change. The task is called "SOMO" in the paper.
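The replacement step might look like the sketch below. It assumes Universal Dependencies part-of-speech tags and a pool of candidate words per tag, and it replaces a noun with a noun or a verb with a verb. The same-tag constraint, the pool, and the function name `somo_replace` are assumptions; the text does not describe how the replacement word r is chosen.

```python
import random

def somo_replace(tokens, upos, pool, rng):
    """Replace one random NOUN or VERB with a different word of the same tag.

    Returns the modified token list, or None if no replacement is possible.
    """
    candidates = [i for i, tag in enumerate(upos) if tag in ("NOUN", "VERB")]
    if not candidates:
        return None
    i = rng.choice(candidates)
    options = [w for w in pool.get(upos[i], []) if w != tokens[i].lower()]
    if not options:
        return None
    modified = list(tokens)
    modified[i] = rng.choice(options)
    return modified

rng = random.Random(1)
tokens = ["cats", "chase", "mice"]
upos = ["NOUN", "VERB", "NOUN"]
pool = {"NOUN": ["cats", "mice", "dogs"], "VERB": ["chase", "eat"]}  # toy pool

modified = somo_replace(tokens, upos, pool, rng)
```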
These five probing tasks, covering the three key linguistic information levels, are presented in Table 2.

Sentence Vector Evaluation Framework
The most commonly used approach to generate sentence vectors is to average the BERT output layer (BERT embeddings) or to use the output of the first token (the [CLS] token). We extend the common sentence vector generation with our sentence embedding generation technique and combine it with the new probing datasets to develop a sentence vector evaluation framework, as shown in Figure 2. This framework enables us to assess the ability of various sentence vectors to capture linguistic information that can be useful for various downstream tasks. Probing datasets consist of the noisy datasets we developed, using manually annotated Tweets. As discussed in Section 3.1.1, the Embedding Generator generates a diverse set of sentence vectors based on the BERT model, and it also generates sentence vectors using various other pre-trained models. Next, sentence vectors are forwarded to a classification model. We propose to use a Logistic Regression (LR) model and a Multi-Layer Perceptron (MLP) model to analyze the relationship between different sentence vectors and the shallowness or the deepness of the network.

Dataset Development
As discussed in Section 3.2, we have developed five different probing datasets for these different probing tasks. The probing datasets are developed based on the Tweebank v2 dataset (https://github.com/Oneplus/Tweebank, accessed on 10 August 2020) developed by [41]. Tweebank v2, a collection of English Tweets annotated in Universal Dependencies [42], is useful since it can be exploited for training NLP systems to enhance their performance on social media texts. Tweebank v2 dataset contains 3550 Tweets, which includes tokenization, part-of-speech-tagging, and labeled Universal Dependencies. This dataset is split into train, development, and test sets as shown in Table 3. We use these tokenization, part-of-speech tagging and labeled dependencies to generate five probing datasets as discussed in Section 3.2. Table 4 shows the distribution of Tweets for training, validation and tests in each of the probing datasets. Our splits are based on the original splits of the Tweebank v2 dataset.

Sentence Embedding Generation
As shown in Table 5, we leverage a few commonly used pre-trained language models and the Sentence-BERT embeddings model under each of the base language models discussed in Section 3.1. The "GoogleNews" and "glove_6b" pre-trained models were trained on standard sentences from the Google News dataset and Wikipedia, while the BERT-BASE model was trained using BookCorpus and Wikipedia data. Similarly, the SBERT-NLI-base sentence transformer was trained on the SNLI [43] dataset, whereas the "glove_twitter" language model was trained with a large number of Tweets.

Probing Task Classification
We use the SentEval toolkit [44] to evaluate different sentence encoders. As in [45], we use both a deeper network (MLP) and a Logistic Regression classifier to make the findings more practical while reducing undesirable side effects, such as a preference for embeddings of a larger size. We use the classifier and the validator provided with the SentEval toolkit (https://github.com/facebookresearch/SentEval/, accessed on 12 August 2020) [44] after modifying it to accommodate the proposed sentence embeddings. Following Conneau et al. [44], we use the parameters shown in Table 6 for Logistic Regression and MLP. However, to cope with the computational constraints, we modify the value of the "batch_size" parameter to 32.

Results
This section first analyses the effectiveness of the proposed pooling strategies: layer pooling and token pooling. Next, we analyze the distribution of the language understanding (surface, syntactic and semantic) across the various regions of the BERT model proposed for this study. Finally, we analyze the performance of the sentence vectors generated by combining these findings along with the existing sentence vector generation mechanisms, including the state-of-the-art techniques.

Pooling strategy analysis:
For this study, we consider sentence embeddings derived using all four layers of each BERT region. Table 7 shows the resulting sentence vector sizes for each combination of layer and token pooling strategies when applied to four hidden layers of BERT. The CONCAT-layer-strategy and MEAN-MAX-token-strategy significantly increase the resulting sentence vector size, by four times and two times, respectively. From the results shown in Table 8, we note that the Logistic Regression model achieves the best results with sentence vectors of size 6144, whereas the MLP model achieves the best results, in most cases, with a vector size of 1536. This suggests that simpler models, such as Logistic Regression, require very large sentence vectors to identify linguistic patterns, while complex models can achieve improved results with significantly smaller sentence vectors. Similarly, Table 9 shows that the Logistic Regression model achieves, in most cases, the best accuracy with the CONCAT-layer-strategy. However, one of the syntactic information tasks and the semantic information task obtain the best results with the MEAN-layer-strategy. On the other hand, the MLP model performs satisfactorily with the MEAN-layer-strategy and SUM-layer-strategy. Both the Logistic Regression and MLP models prefer the MEAN-MAX-token-strategy or MEAN-token-strategy, while the MAX-token-strategy performs poorly across all the probing tasks. In the rest of the analyses, the results derived with the MEAN-layer-strategy and MEAN-token-strategy using the MLP classifier are used. This enables easy comparisons of the BERT-based sentence embeddings with vectors derived from static pre-trained models by calculating the average of the word embeddings. Further, Sentence-BERT internally uses the mean of the token embeddings to generate sentence embeddings.
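The vector sizes discussed above follow directly from the pooling definitions. The short sketch below reproduces the arithmetic, assuming the BERT-base hidden size of 768 and four pooled layers; the function name `sentence_vector_size` and the strategy labels are illustrative.

```python
HIDDEN_SIZE = 768   # hidden size of BERT-base
NUM_LAYERS = 4      # layers pooled per region

def sentence_vector_size(layer_pooling, token_pooling, d=HIDDEN_SIZE, k=NUM_LAYERS):
    """Size of the sentence vector for one layer/token pooling combination."""
    width = k * d if layer_pooling == "CONCAT" else d            # CONCAT: x k
    return 2 * width if token_pooling == "MEAN-MAX" else width   # MEAN-MAX: x 2

sizes = {
    (lp, tp): sentence_vector_size(lp, tp)
    for lp in ("SUM", "MEAN", "CONCAT")
    for tp in ("MEAN", "MAX", "MEAN-MAX")
}
```

Under these assumptions, CONCAT with MEAN-MAX gives 4 × 768 × 2 = 6144, and SUM or MEAN with MEAN-MAX gives 768 × 2 = 1536, matching the two sizes singled out in the text.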
Region-wise analysis: Figure 3 shows a heat map of the accuracies (darker colors equate to higher accuracy) of each probing task with sentence vectors generated using each hidden layer of the BERT model. The SentLen and WC tasks in the surface information group achieve better accuracy with sentence vectors derived from hidden layers in the first region (R1), and the performance gradually decreases as we move toward the last layers of the BERT model. On the other hand, higher accuracies are obtained for the syntactic information tasks (BShift and TreeDepth) with the sentence vectors generated using the hidden layers from the second region (R2). The initial layers of R2 contribute the most to the accuracy, while the hidden layers from R1 contribute poorly to the syntactic information group tasks. Further, the hidden layers that contribute to increasing the sentence vectors' richness for the semantic information task are found at the border of R2 and R3. Overall, in the context of noisy texts, the hidden layers in the region R1 contain most of the linguistic characteristics required to address probing tasks in the surface group. In contrast, the syntactic and semantic group tasks are able to identify necessary linguistic patterns from R1 and R2. Nevertheless, the performance of the sentence vectors derived from hidden layers in the last region (R3) ranges from low to marginal, indicating their inability to capture linguistic information from noisy texts.
Overall accuracy: Table 10 presents the classification accuracies for probing tasks with sentence vectors derived from GloVe-based pre-trained models, Sentence-BERT, and different hidden layers of the BERT-Base-uncased model. In the context of BERT-based sentence vectors, we have considered sentence vectors derived from the last hidden layer, the last four hidden layers, and all 12 layers. Devlin et al. [4] achieved comparable results for the feature-based approach by using those layers as input to an artificial recurrent neural network. Based on our findings, we propose two separate approaches for noisy texts. The first is based on BERT's first hidden layer, while the second combines the first hidden layer of each BERT region, i.e., layers 1, 5 and 9 (1-5-9).
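The two proposed layer selections can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used in the experiments: the hidden states are random stand-ins with index 0 reserved for the embedding output (so that index k is hidden layer k), the three regions are taken to span four consecutive layers each, and the MEAN-layer-strategy and MEAN-token-strategy are applied. The helper names `region_first_layers` and `sentence_vector_from_layers` are hypothetical.

```python
import numpy as np

NUM_LAYERS = 12    # hidden layers in BERT-Base
REGION_SIZE = 4    # assumed: R1 = layers 1-4, R2 = 5-8, R3 = 9-12
HIDDEN_SIZE = 768  # hidden size of BERT-Base


def region_first_layers(num_layers=NUM_LAYERS, region_size=REGION_SIZE):
    """Return the first hidden layer of each region (1-indexed)."""
    return list(range(1, num_layers + 1, region_size))


def sentence_vector_from_layers(hidden_states, layer_ids):
    """Average the selected layers, then average over tokens.

    hidden_states has shape (num_layers + 1, num_tokens, hidden); index 0 is
    assumed to hold the embedding output, so index k is hidden layer k.
    """
    selected = hidden_states[layer_ids]   # (len(layer_ids), num_tokens, hidden)
    token_matrix = selected.mean(axis=0)  # MEAN-layer-strategy
    return token_matrix.mean(axis=0)      # MEAN-token-strategy


# Random stand-in for the hidden states of a single 10-token Tweet.
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((NUM_LAYERS + 1, 10, HIDDEN_SIZE))

layer_ids = region_first_layers()
vec_first = sentence_vector_from_layers(hidden_states, [1])      # first approach
vec_159 = sentence_vector_from_layers(hidden_states, layer_ids)  # 1-5-9 approach

print(layer_ids, vec_first.shape, vec_159.shape)
```

Because both selections use mean pooling over layers and tokens, the resulting sentence vectors keep the 768 dimensions of a single BERT layer, which keeps them directly comparable with averaged static word embeddings.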
The MLP model achieves the best accuracy for all the probing tasks, except for the SOMO task, which is in the semantic information group. The Logistic Regression model reaches its best results on the surface information probing tasks with the BERT-based sentence vectors derived using only the first hidden layer. However, Logistic Regression performs better on the syntactic and semantic information probing tasks with sentence vectors generated using all 12 hidden layers of the BERT model.
On the other hand, the best results for the MLP model are mostly achieved with the sentence vectors derived using the 1-5-9 hidden layers. Only the semantic information task achieves the best accuracy with all 12 hidden layers. The WC probing task performs well with the first hidden layer, and the second-best accuracy is obtained with the 1-5-9 hidden layers.

Discussion
The experimental results comparing BERT sentence vectors with GloVe and Word2Vec are given in Table 10. It can be observed that the BERT sentence vectors performed exceptionally well on all the probing tasks and performed better than GloVe and Word2Vec, despite these two representations having a rich vocabulary. Specifically, the GloVe model, despite being trained on a large corpus of Tweets, performed poorly. This overall performance observed for noisy texts is in agreement with the superior performance reported earlier [30][31][32] of contextual representations derived using BERT over non-contextual baselines on standard English sentences. Further, the sentence vectors derived from BERT's hidden layers achieved significantly better results than the state-of-the-art Sentence-BERT model. This underpins the importance of combining useful linguistic components to derive superior sentence representations.
However, as we can see from Figure 3, the later hidden layers of BERT performed poorly in capturing linguistic information compared to the shallow layers. We observe that the unstructured nature of Tweets benefits more from the initial layers, which capture shallow information, than from the last layers, which capture more complex hidden information. Since the results reported by the authors of the BERT model [4], the top layers of the BERT model have been commonly used to derive sentence vectors for NLP tasks with both standard and noisy texts [7,31,36,46]. Nevertheless, our results confirm that the initial layers of the pre-trained BERT model are more efficient at comprehending noisy text. Further, the earlier layers of each region are observed to contribute more significantly toward encoding specific linguistic components.
Further, as we can see from Table 7, the experiments relating to the length of the sentence vector also revealed that the simpler predictive models perform better with large sentence vectors, while complex models are observed to prefer significantly smaller vectors. This underpins the fact that the complex models are better at identifying intricate patterns from compressed vectors that contain rich information. However, simpler models need higher dimensions of sentence vectors to achieve better results.
The presented methodology systematically analyzes the knowledge distribution within a multi-layer pre-trained language model while generating sentence vectors that capture various linguistic characteristics. This technique, being generic, can be directly applied to most multi-layer pre-trained language models to understand the linguistic properties captured by latent representations. The method can guide the design of similar strategies for generating sentence embeddings from other Transformer-based models, such as Transformer-XL and GPT-3. The new probing datasets and the proposed framework can be used to study the ability of these models to comprehend natural language. Moreover, the noisy probing datasets generated in this study can lead to further research in NLU by providing additional datasets that cover the domain of noisy data.
Future research should also focus on understanding how preprocessing Tweets to reduce their noise level affects the linguistic knowledge distribution and the derived sentence representations. Moreover, the same probing datasets could be used to examine the relationship between BERT's attention layers and the meaning-rich sentence embeddings. This could help to derive more meaning-rich sentence vectors.

Conclusions
The research work reported in this paper demonstrates that the general language understanding of pre-trained language models, such as BERT, can be effectively exploited to comprehend noisy texts. Further, the proposed methodology can effectively generate sentence vectors encoding different linguistic aspects, using latent representations of multi-layer pre-trained language models. We observe that the shallow layers of the BERT model are better at capturing linguistic information of noisy and unstructured texts than the deeper layers for general English sentences [16]. Further, it can be noted that simple predictive models prefer large sentence vectors, while complex models are more successful with significantly smaller sentence vectors. It is worth noting that the first layer, or a combination of the first BERT layer from each region, can be used to derive generalizable sentence vectors for noisy and unstructured texts.
We believe that our new noisy probing datasets can serve as benchmark datasets for future researchers to study the linguistic characteristics of unstructured and noisy texts. Currently, work is in progress on developing new and larger probing datasets for noisy texts, covering all 10 probing tasks.

Data Availability Statement:
We publish new datasets that can serve as benchmark datasets for future researchers to study the linguistic characteristics of unstructured and noisy texts. These datasets are available in the public domain (https://bit.ly/3rK0g7P) and on request.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

AI    Artificial Intelligence
BERT  Bidirectional Encoder Representations from Transformers
NLP   Natural Language Processing
NLU   Natural Language Understanding
NMT   Neural Machine Translation
PCFG  Probabilistic Context-free Grammar