Contextualizer: Connecting the Dots of Context with Second-Order Attention

Abstract: Composing the representation of a sentence from the tokens that it comprises is difficult, because such a representation needs to account for how the words present relate to each other. The Transformer architecture does this by iteratively changing token representations with respect to one another. This has the drawback of requiring computation that grows quadratically with respect to the number of tokens. Furthermore, the scalar attention mechanism used by Transformers requires multiple sets of parameters to operate over different features. The present paper proposes a lighter algorithm for sentence representation with complexity linear in sequence length. This algorithm begins with a presumably erroneous value of a context vector and adjusts this value with respect to the tokens at hand. In order to achieve this, representations of words are built combining their symbolic embedding with a positional encoding into single vectors. The algorithm then iteratively weighs and aggregates these vectors using a second-order attention mechanism, which allows different feature pairs to interact with each other separately. Our models report strong results in several well-known text classification tasks.


Introduction
The representation of natural language utterances is a central issue of the application of machine learning techniques to natural language. Indeed, natural language occurrences are difficult to represent using the mathematical objects on which algorithms may operate. Manually constructed symbolic representations tend to "leak", to be unable to capture edge cases, leading to the favoring of learned representations [1,2]. However, it is difficult to learn compact, efficient representations. Language is very sparse: not only does a vast array of local patterns exist, but they will often combine in very few and variable ways. This is not an issue of the granularity of fragmentation. For example, one could choose to break up sentences into characters rather than words. In doing so, the vocabulary of base tokens is greatly reduced in number but data grow sparser. More importantly, larger patterns remain difficult to sanction. That is, whether one opts to break sentences up into characters or words, words will still exist, and only a select few combinations thereof will be conceivable, fewer still will be observed.
As such, it is difficult to establish ways in which to construct suitable utterance representations from base components. Some, such as bag-of-words representations will opt to be deliberately simplified and eliminate the need for the learning of this construction. Other, more complex ones, such as topic models, will allow for some learning of utterance representations but require dedicated learning objectives. In contrast, through backpropagation, neural network approaches allow the learning of complex procedures for constructing utterance representations while still allowing for variety in downstream training objectives. Regardless, while the exact parametrization of these operations is to be inferred from the data, there is still considerable structure to provide. That is, the operations in the neural network and their composition need to be specified. Some aspects of this structure are somewhat imposed by practical issues, such as the ability to handle variable length data. Other, more deliberate choices are informed by prior beliefs about language. For example, tree-structured recursive neural networks are predicated on the notion that complex sentences are recursively constructed from their parts [3,4].
As mentioned, language is sparse and combinatorially difficult: even in highly synthetic languages, combinations of words will carry some semantic sense that the words themselves do not carry. This can be broadly construed as an issue of context: parts of utterances need to be put into the context of the whole in order to be understood.
Contextualization is fundamentally difficult because it is a circular problem. One cannot recompose a whole by putting its parts in the context of said whole without already knowing what the whole is. One potential solution to this is to iteratively adjust the context against which the parts are compared. That is, to begin with a presumably erroneous value of the context representation and adjust this value with respect to the tokens at hand. This is what Transformer encoders [5] do: First, the tokens in a sequence are compared against each other by the self-attention mechanism [6]. This consists of having each token attend over all tokens in the sentence, with a parametric attention function producing a scalar weighting of the importance of each token with respect to the attender. Then, a new representation for each token is produced through this weighting. The process is then repeated a set number of times, as shown in Figure 1. In doing so, the word representations produced by this encoder are put into the context of the whole. Transformers have achieved much success in various Natural Language Processing (NLP) tasks [7][8][9][10], ranging from sentiment analysis to question-answering and natural language inference [11].
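To make the cost of this scheme concrete, the following NumPy sketch implements one single-head self-attention step with scaled dot-product scoring, in the style of [5]. The function and variable names are illustrative, and the projection matrices are random rather than trained; the sketch is only meant to show that the weights form an n × n matrix, with one scalar per pair of tokens.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Numerically stable softmax along the given axis.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention_step(X, Wq, Wk, Wv):
    """One single-head self-attention step over n token vectors (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # One scalar weight per (attender, attended) pair: an n x n matrix.
    weights = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)
    # Each new token representation is a weighted average of the values.
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d = 6, 8, 4  # sequence length, token dimension, attention dimension
X = rng.normal(size=(n, m))
Wq, Wk, Wv = (rng.normal(size=(m, d)) for _ in range(3))
new_X, weights = self_attention_step(X, Wq, Wk, Wv)
print(weights.shape)  # (6, 6)
```

Doubling n quadruples the number of entries in `weights`, which is the quadratic growth discussed above.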
Figure 1. Panels: Transformer; Contextualizer.
Nonetheless, the self-attention mechanism on which Transformers are built has two chief disadvantages. Firstly, because all word pairs are evaluated, the complexity is quadratic with respect to the length of the utterance. Secondly, the weighting provided by the self-attention mechanism is based on bilinear forms, mapping each pair of word vectors to a single scalar. As such, Transformers require multiple sets of self-attention parameters, called heads, so that separate heads might focus on different features of the word vectors. To address these issues, we propose a new architecture, the Contextualizer, based on iteratively adjusting a context vector using a second-order attention mechanism. Its computational complexity grows linearly with respect to the sequence length, as opposed to quadratically. This article is organized as follows. Section 2 introduces the proposed approach. Section 4 describes experiments conducted in a few well-known document classification tasks and the results obtained. Finally, Section 5 concludes this article.

Contextualizer
The proposed encoder utilizes the approach illustrated in Figure 1. It proceeds as follows: Over a set number of steps, token representations are matched to the context representation to produce a contextualized representation. These representations are then aggregated into a new context representation, and the process begins anew. The present section details how these computations are carried out in a general-purpose setting.
Let $w_1, \dots, w_n$ be a sequence of tokens forming a document of length $n$, indexed by $i = 1, \dots, n$. Before the contextualization steps, each token is mapped to a single real vector combining information about its identity and position in the sequence. The former is provided by a symbol embedding (e.g., pretrained word vectors) of dimension $m$, $e(w_i) \in \mathbb{R}^m$. The latter is based on a positional encoding inspired by Maupomé et al. [12]. This positional encoding is parametrized by a vector $s \in \mathbb{R}^m$: the $j$th component of the encoding for position $i$, $p(i)$, is obtained through a softmax taken across the positions of the sequence, with $s_j$ acting as a multiplicative constant. Multiplicative constants amplify or dampen the peak of a softmax application. By applying the softmax across tokens in the sequence, the parameter vector $s$ allows the model to modulate certain positions for different components of $p$. Combining these two aspects, the token and its position, the vector representation of token $w_i$ is:

$$x_i = e(w_i) * p(i),$$

where $*$ denotes the Hadamard product.
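The construction of the token vectors can be sketched as follows. The exact score fed to the softmax is not spelled out in the text above, so this sketch assumes it to be $s_j \cdot i / n$; that choice, like the function names, is an assumption made for illustration and may differ from the authors' implementation.

```python
import numpy as np

def positional_encoding(n, s):
    """Return an (n, m) array whose row i-1 holds p(i).

    ASSUMPTION: the softmax score of position i for component j is
    s[j] * i / n; the source only states that s acts as a
    multiplicative constant inside a softmax taken across positions.
    """
    positions = np.arange(1, n + 1) / n           # shape (n,)
    scores = positions[:, None] * s[None, :]      # shape (n, m)
    scores -= scores.max(axis=0, keepdims=True)   # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=0, keepdims=True)   # softmax across positions

def token_representations(E, s):
    """x_i = e(w_i) * p(i), with * the Hadamard product; E has shape (n, m)."""
    return E * positional_encoding(E.shape[0], s)

# A negative s_j favors early positions, a positive one favors late
# positions, and s_j = 0 weighs all positions equally.
P = positional_encoding(5, np.array([-4.0, 0.0, 4.0]))
print(P.argmax(axis=0))
```

Under this assumption, a larger magnitude of $s_j$ sharpens the peak of component $j$ around one end of the sequence, which is one way to read the remark that multiplicative constants amplify or dampen the peak of a softmax.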
Next, there are $K$ contextualization steps, indexed by $k = 1, \dots, K$. Each of these steps will produce a new context vector, $c^{(k)}$. The default context used at the first step of contextualization, $c^{(0)}$, can be set to a constant or a learned parameter, for example. This context vector will contextualize the tokens, which will then be aggregated into a new context. An attention mechanism provides the contextualizing function called at every iteration. Using any of the various attention mechanisms in the literature, contextualizing each token would amount to producing a scalar weight, $\alpha_i$, for each token depending on its content and that of the attender (the context vector in our case). The contextualization of token $x_i$ at step $k$ with respect to the previous context, $c^{(k-1)}$, would then be $\alpha_i x_i$. However, the use of scalar attention weights requires that each component in the operands interacts only with its homolog, collapsing all information to a single number. One must therefore compute several of these interactions with different sets of parameters, called attention heads, so that each of these may focus on different features. This is particularly important when using distributed token representations, where each component might carry a different semantic sense. As such, Transformers contain several attention heads. In contrast, to have the weight of each token be a vector, $\boldsymbol{\alpha}_i$, rather than a scalar, $\alpha_i$, would let each component of the token representation have a separate salience with respect to the current context: $\boldsymbol{\alpha}_i * x_i$. This is illustrated in Figure 2.

Figure 2. Panels: Multi-Head Attention; Second-Order Attention.
Such a mechanism would eliminate the need for several heads, as each feature of each token can interact with each feature of the context by a different parameter. However, a second-order attention weighting would require parametrization by a tensor of degree three, which would take the parameter count of the model to $O(m^3)$, as both the input and the context vectors are of dimension $m$. For token representations of even modest size, this would result in a computationally intensive model. Moreover, if the transformation to be learned does not require a full-rank degree-three tensor, such a parametrization would ostensibly be prone to overfitting because of its excess capacity. Instead, a tensor of rank $u$ can be used, with $u$ becoming a hyperparameter (the rank of a tensor is analogous to the rank of a matrix and is not to be confused with its degree or order); see [13][14][15]. Using this approach, the attention vector for token $x_i$ would be computed from $x_i$ and the previous context, $c^{(k-1)}$, through the factors of this decomposition: $U^{(k)}, V^{(k)} \in \mathbb{R}^{u \times m}$ and $W^{(k)} \in \mathbb{R}^{m \times u}$ are the matrices of parameters for the $k$th contextualization step, and $b^{(k)}$ is the corresponding bias vector.
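The saving offered by the rank-u parametrization can be illustrated with a short NumPy sketch. It assumes the usual low-rank bilinear form, W(Ux * Vc) + b, with U applied to the token, V applied to the context, and no squashing function; this specific arrangement is an assumption consistent with the stated matrix shapes, not a transcription of the authors' equation.

```python
import numpy as np

def second_order_attention(x, c, U, V, W, b):
    """Attention vector for one token x given the context c.

    ASSUMPTION: low-rank bilinear form W (Ux * Vc) + b, where * is the
    Hadamard product; the source only gives the shapes of U, V, W and b.
    """
    return W @ ((U @ x) * (V @ c)) + b

def full_tensor_attention(x, c, T, b):
    """Same interaction with an explicit degree-three tensor T of shape (m, m, m)."""
    return np.einsum("jpq,p,q->j", T, x, c) + b

rng = np.random.default_rng(0)
m, u = 8, 3
U, V = rng.normal(size=(u, m)), rng.normal(size=(u, m))
W, b = rng.normal(size=(m, u)), rng.normal(size=m)
x, c = rng.normal(size=m), rng.normal(size=m)

# The factors implicitly define T[j, p, q] = sum_r W[j, r] * U[r, p] * V[r, q].
T = np.einsum("jr,rp,rq->jpq", W, U, V)
alpha = second_order_attention(x, c, U, V, W, b)
assert np.allclose(alpha, full_tensor_attention(x, c, T, b))

# Parameter counts: m^3 for the full tensor versus 3um for its factors.
print(T.size, U.size + V.size + W.size)  # 512 72
```

Every feature of the token still interacts with every feature of the context, but the interaction is constrained to rank u, so the parameter count grows as O(um) rather than O(m^3).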
The newly computed attention vectors serve to update the context vector. This update, $z^{(k)}$, is obtained by adding the contextualized token vectors together, followed by layer normalization [16]:

$$z^{(k)} = \mathrm{LayerNorm}\left(\sum_{i=1}^{n} \boldsymbol{\alpha}_i^{(k)} * x_i\right).$$

The update is then applied to the previous context vector, $c^{(k-1)}$, to obtain its new value, $c^{(k)}$. As mentioned, this process is repeated over a set number of steps, allowing information from different sets of tokens to inform the context. The final context vector, $c^{(K)}$, then contains a fixed-size summary of the sequence of tokens at hand.
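Putting the steps together, the whole encoder can be sketched in a few lines of NumPy. Several details here are assumptions rather than the authors' specification: the attention vector is taken to be W(Ux * Vc) + b, the update is applied additively (c ← c + z), and layer normalization is used without learned gain and bias. All names are illustrative.

```python
import numpy as np

def layer_norm(v, eps=1e-5):
    # Layer normalization without learned gain and bias (simplification).
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def contextualize(X, params, c0):
    """Run K contextualization steps and return the final context vector.

    X: (n, m) token vectors; params: list of K tuples (U, V, W, b);
    c0: (m,) default context.
    """
    c = c0
    for U, V, W, b in params:
        # Attention vectors for all n tokens at once, shape (n, m).
        # ASSUMPTION: low-rank bilinear form W (Ux * Vc) + b.
        A = ((X @ U.T) * (V @ c)) @ W.T + b
        # Aggregate the contextualized tokens, then normalize.
        z = layer_norm((A * X).sum(axis=0))
        # ASSUMPTION: the update is applied additively.
        c = c + z
    return c

rng = np.random.default_rng(0)
n, m, u, K = 7, 8, 3, 5
X = rng.normal(size=(n, m))
params = [(rng.normal(size=(u, m)), rng.normal(size=(u, m)),
           rng.normal(size=(m, u)), np.zeros(m)) for _ in range(K)]
summary = contextualize(X, params, c0=np.ones(m))
print(summary.shape)  # (8,)
```

Each step visits every token once and always reads the original token vectors X, so the work per step is linear in n and the output size does not depend on the sequence length.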
By reducing sequences of arbitrary length to fixed-size encodings, the proposed approach could potentially squash some information, whereas Transformers encode their input into a sequence of vectors. In return, the number of comparisons in one iteration of the Contextualizer algorithm grows linearly with respect to the number of tokens, as opposed to quadratically for the Transformer. Table 1 presents the computational complexities of these two algorithms as well as recurrent and convolutional layers. In addition, as illustrated by Figure 1, the Transformer has the drawback of losing sight of the original representation of the tokens, whereas the Contextualizer does not.
Table 1. Complexities of common layers used in NLP; $l$ designates the kernel size, $h$ the number of attention heads, and $n$, $m$ and $u$ designate the length of the sequence, the dimension of the word representations and the multiplicative dimension, respectively. For Transformers and Contextualizers, the complexity is for a single contextualization step.

Related Work
The computational complexity of Transformers makes them unwieldy for long sequences. As such, there have been several efforts to simplify the computation of full token-to-token self-attention to a lighter, more computationally efficient version.
For example, self-attention can be limited to local neighborhoods [17]. That is, instead of having tokens attend to each other throughout the sequence, tokens can be restricted to attending over a local portion of the sequence. This approach can be complemented by having sliding-window neighborhoods [18,19]. It can also be combined with causal masking: allowing tokens to attend only over preceding tokens. In doing so, segments can be chained recursively, allowing deeper contextualization levels to receive information from earlier segments [20].
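As an illustration of these restrictions, the sketch below builds a boolean mask for sliding-window attention, optionally combined with causal masking. The function name and the `window` parameter are hypothetical, and the sketch is not drawn from any of the cited implementations.

```python
import numpy as np

def local_attention_mask(n, window, causal=False):
    """Boolean (n, n) mask; entry [i, j] is True if token i may attend to token j."""
    idx = np.arange(n)
    offset = idx[None, :] - idx[:, None]   # j - i for every pair
    mask = np.abs(offset) <= window        # keep neighbors within the window
    if causal:
        mask &= offset <= 0                # keep only preceding tokens (and self)
    return mask

mask = local_attention_mask(6, window=1)
causal_mask = local_attention_mask(6, window=1, causal=True)
print(mask.sum(), causal_mask.sum())  # 16 11
```

In a masked softmax, scores at positions where the mask is False would be set to negative infinity before normalization. The number of allowed pairs grows as O(n · window) rather than O(n²), although an efficient implementation would compute only the banded entries instead of materializing the full matrix as done here.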
More sophisticated, dynamic approaches can rely on inferred neighborhoods. Rather than determining neighbors by position, tokens can be bucketed by locality-sensitive hashing [21] or clustering [22]. Alternatively, neighborhoods can be determined by the syntax tree of the utterance at hand [23].
Yet another approach is to replace softmax self-attention with a lighter variant. For example, by replacing the exponential kernel implicit to softmax self-attention by a polynomial kernel, key-value products can be shared across queries, reducing computational complexity [24]. In the same vein, softmax self-attention can be approximated via random feature maps, thus reducing the dimension of the attention space as a function of sequence length rather than eliminating token pairs [25].
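The sharing of key-value products across queries can be illustrated with a degree-two polynomial kernel, (q · k)², whose explicit feature map is the flattened outer product of a vector with itself. The degree, the normalization and all names below are assumptions chosen for illustration; the cited methods differ in their details.

```python
import numpy as np

def poly2_features(Z):
    """Feature map phi with phi(q) . phi(k) = (q . k)^2, applied row-wise."""
    n, d = Z.shape
    return np.einsum("ni,nj->nij", Z, Z).reshape(n, d * d)

def kernel_attention_quadratic(Q, K, V):
    # Explicit n x n kernel matrix: cost quadratic in sequence length.
    S = (Q @ K.T) ** 2
    return (S / S.sum(axis=1, keepdims=True)) @ V

def kernel_attention_linear(Q, K, V):
    phi_q, phi_k = poly2_features(Q), poly2_features(K)
    kv = phi_k.T @ V                      # (d*d, d_v), computed once for all queries
    norm = phi_q @ phi_k.sum(axis=0)      # (n,) normalizers
    return (phi_q @ kv) / norm[:, None]

rng = np.random.default_rng(0)
n, d, d_v = 10, 4, 5
Q, K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d_v))
out_quadratic = kernel_attention_quadratic(Q, K, V)
out_linear = kernel_attention_linear(Q, K, V)
print(np.allclose(out_quadratic, out_linear))
```

Both functions compute the same quantity; the second never forms the n × n matrix, trading it for a feature dimension of d², which is the kind of trade-off these lighter variants exploit.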

Exploratory Experiments
We began with experiments on the well-known Rotten Tomatoes dataset (MR) [26] (available at https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.home.html, accessed on 15 May 2022). It consists of 11 k English-language sentences from film reviews classed as either positive or negative in equal proportions. The documents are fairly short, with 95% of them being 45 words long or shorter.
The first experiments sought to compare the test set classification accuracy and computation time of Transformer and Contextualizer models of comparable sizes. For each architecture, four models of 0.5, 1, 1.5 and 2 million parameters were trained and tested. These parameter counts, which exclude the initial token embedding layer, were chosen to be relatively small in accordance with the limited size of the dataset. Following the Transformer approaches [7,10,27], the documents were tokenized into word-piece tokens [28,29]. The hyperparameters of the models were set following Vaswani et al. [5] while adjusting for the smaller model size. The dimension of the embedding space, $m$, was set to 128. The number of contextualization steps (the number of encoding layers in the Transformer) was set to five for all models. The number of attention heads for Transformers was set to four. Hence, the variable adjusted to increase the parameter count of the models was the dimension of the attention space for Transformers and the rank of the tensor decomposition, $u$, for Contextualizers. Models were trained on an Intel Core i7-7700 machine with 32 GiB of memory over 10 epochs with batches of 32 examples using the Adam [30] optimizer, with a learning rate of $1 \times 10^{-4}$. The best model on a 10% validation set over the 10 epochs was selected for testing.
Results are presented in Table 2. As shown, both architectures show comparable results across model sizes.
As expected, computation time is much greater for Transformer networks.
We then proceeded with experiments measuring the effect on performance of the nature of the default context. All Contextualizer models shared the same configurations except the default context, which was set to be either a constant, $c^{(0)} = \mathbf{1}$, a vector of learned parameters, $c^{(0)} = c_d$, or a random vector redrawn for every document from a uniform distribution, $c^{(0)} \sim \mathcal{U}(-1, 1)$. We hypothesized that using a random default context would make the network more robust by reducing dependence on prior beliefs and therefore mitigating overfitting. For the same reasons, one could expect a learned default context to be more likely to overfit than a constant one. Table 3 summarizes the results. As one might expect, a random starting context vector hurts performance when contextualization is performed but once. The models are quick to adjust, as all choices of default context seem to arrive at very similar final accuracies.
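The three choices of default context compared above can be written down as a small helper. This is an illustrative sketch: the function name, its arguments and the string labels are hypothetical, and the learned vector is simply passed in, since its training is outside the scope of the example.

```python
import numpy as np

def default_context(kind, m, rng=None, learned=None):
    """Return the default context c^(0) of dimension m.

    kind: "constant" (all ones), "learned" (a parameter vector c_d supplied
    by the caller), or "random" (redrawn from U(-1, 1) at every call,
    i.e., for every document).
    """
    if kind == "constant":
        return np.ones(m)
    if kind == "learned":
        return np.asarray(learned, dtype=float)
    if kind == "random":
        return rng.uniform(-1.0, 1.0, size=m)
    raise ValueError(f"unknown default context: {kind}")

rng = np.random.default_rng(0)
c_constant = default_context("constant", 4)
c_learned = default_context("learned", 4, learned=[0.1, -0.2, 0.3, 0.0])
c_random = default_context("random", 4, rng=rng)
print(c_constant)
```

Only the random variant changes from one document to the next, which is what the robustness hypothesis above relies on.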

Further Results
We continue with experiments in binary document classification on other well-known English-language datasets in order to compare the performance of the proposed approach to the Transformer-based Universal Sentence Encoder architectures (USE) [8]. The Subjectivity dataset (available at https://www.cs.cornell.edu/people/pabo/movie-review-data/, accessed on 15 May 2022) (SUBJ) [31], released in June 2004, comprises 10 k sentences about films classed as subjective or objective. Annotation is automatic based on whether the sentence is a synopsis (objective) or an appreciation (subjective). The Customer Reviews dataset (CR) (available at http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar, accessed on 15 May 2022), introduced by Hu and Liu [32], comprises 3775 reviews of electronic products. These reviews were extracted from Amazon and CNET and manually annotated. They are equally divided between positive and negative sentiment. Finally, the Multi-Perspective Question Answering dataset (available at https://mpqa.cs.pitt.edu/corpora/mpqa_corpus/, accessed on 15 May 2022) (MPQA) [33] deals again in sentiment polarity, containing 10,606 phrases from press articles.
Table 4 reports the test accuracy of our Contextualizer models on several benchmark text classification tasks, compared to the Universal Sentence Encoders (USE) [8] in their Transformer-based (T) and Deep-Averaging-Network-based (D) variants. The results of these experiments demonstrate that the Contextualizer architecture can perform competitively, even with small models and relatively small datasets.

Conclusions
We proposed an algorithm for constructing sentence representations based on the notion of iteratively adjusting a central context vector. This algorithm was closely related to the encoder part of the Transformer algorithm. One key difference was the use of the proposed second-order attention mechanism, replacing multiple attention heads.
Another important difference was the computational complexity, which was linear in sequence length. Transformer models have been the driving force behind the expansive use and development of large models such as BERT [7], RoBERTa [27], GPT-3 [10] and Electra [9], among others, which are extensively trained by adapted language-modeling tasks. The reduced complexity of the Contextualizer model would be of use both in terms of pretraining and in terms of wielding these large models in downstream tasks. This is important given how the computational cost of these large models can further the economic divide between low- and high-resource laboratories and companies [34]. Additionally, this computational demand also incurs a significant environmental impact [35]. Therefore, lighter-computation approaches could help mitigate these concerns.
Yet, as seen in Section 4, the Contextualizer achieved results comparable to those of a Transformer when controlling for model size. Our approach was also able to achieve competitive results in benchmark document classification tasks even with low parameter counts. Furthermore, our results suggested the approach was robust to different choices of the number of contextualization steps and default contexts. Further work will be conducted in this direction, as well as in formally characterizing the conditions that stabilize the context vector as the number of contextualization steps increases.

Data Availability Statement:
The Contextualizer source code is available under the GNU GPL v3 license to ensure reproducibility. It can be found in the following repository: https://gitlab.labikb.ca/ikb-lab/nlp/contextualizer, accessed on 15 May 2022.