Changing the Geometry of Representations: α-Embeddings for NLP Tasks

Word embeddings based on a conditional model are commonly used in Natural Language Processing (NLP) tasks to embed the words of a dictionary in a low-dimensional linear space. Their computation is based on the maximization of the likelihood of a conditional probability distribution for each word of the dictionary. These distributions form a Riemannian statistical manifold, where word embeddings can be interpreted as vectors in the tangent space of a specific reference measure on the manifold. A novel family of word embeddings, called α-embeddings, has recently been introduced; it derives from the geometrical deformation of the simplex of probabilities through a parameter α, using notions from Information Geometry. After introducing the α-embeddings, we show how the deformation of the simplex, controlled by α, provides an extra handle to improve the performance on several intrinsic and extrinsic tasks in NLP. We test the α-embeddings on different tasks with models of increasing complexity, showing that the advantages associated with the use of α-embeddings are also present for models with a large number of parameters. Finally, we show that tuning α allows for higher performance than larger models in which a transformation of the embeddings is additionally learned during training, as experimentally verified in attention models.


Introduction
Word embeddings are used as a compact representation for the words of a dictionary. They are learned starting from one-hot encodings by maximizing the likelihood of a chosen probabilistic model. Rumelhart et al. [1] first introduced the idea of using the internal representation of a neural network to construct a word embedding. Bengio et al. [2] employed a neural network to predict the probability of the next word given the previous ones. Mikolov et al. [3] proposed the use of a language model based on recurrent neural networks to learn the vector representations. More recently, this approach has been exploited further and with great success by means of bidirectional long short-term memory (LSTM) networks [4] and transformers [5–7].
In this paper, we focus on Skip-Gram (SG), a well-known log-linear model for the conditional probability of the context of a given central word. Together with the continuous bag of words (which instead predicts the central word given the context), SG has been shown to efficiently capture syntactic and semantic information [8,9]. Skip-Gram is at the basis of many popular word embedding algorithms, such as Word2Vec [8,9], and of models based on weighted matrix factorization of the global co-occurrences, such as GloVe [10], cf. Levy and Goldberg [11]. These methods are deeply related: Levy and Goldberg showed that Word2Vec SG with negative sampling effectively performs a matrix factorization of the shifted positive pointwise mutual information [11].
Mikolov et al. [12] noted how, once the embedding space has been learned, syntactic and semantic analogies between words translate into linear relations between the respective word vectors. Numerous works have investigated the reason for the correspondence between linear properties and word relations. Pennington et al. gave a very intuitive explanation of this behavior in their paper on GloVe [10]. More recently, Arora et al. [13] investigated this property by introducing a hidden Markov model, under some regularity assumptions on the distribution of the word embedding vectors, cf. [14].
Word embeddings are often used as input for other computational models, to solve more complex inference tasks. The quality of a word embedding, which ideally should encode syntactic and semantic information, is not easy to determine, and different approaches have been proposed in the literature. The evaluation can be carried out in terms of performance on intrinsic tasks, such as word similarity [10,15–17] or word analogies [8,12]. However, several authors [18,19] have shown a low degree of correlation between the quality of an embedding on word similarities and analogies on one side and its quality on downstream tasks on the other, for instance classification or prediction tasks to which the embeddings are given as input. This observation points out the need for a complete experimental evaluation of word embeddings on both intrinsic and extrinsic tasks.
Several works have highlighted the effectiveness of post-processing techniques [15,16], such as Principal Component Analysis (PCA) [14,20], focusing on the fact that certain dominant components carry neither semantic nor syntactic information, and thus act like noise for specific tasks of interest. Recently, we proposed in [21,22] a different approach which acts on the learned vectors after training, similarly to a post-processing step, by using a geometrical framework based on Information Geometry [23,24], in which word embeddings are represented as vectors in the tangent space of the probability simplex. A family of word embeddings called natural α-embeddings is introduced, where α is a deformation parameter for the geometry of the probability simplex, known in Information Geometry in the context of α-connections. Notably, α-embeddings include the standard word embeddings as a special case for α = 1. In this paper, we revisit the natural α-embeddings and evaluate them over different tasks. We show how the α parameter provides an extra handle that, by deforming the word embeddings, allows for an improvement of the performance on different intrinsic and extrinsic tasks in Natural Language Processing (NLP).
Recently, the use of Riemannian methods has attracted considerable interest in the NLP literature; recent applications of Riemannian optimization algorithms can be found in [25,26]. In particular, approaches learning word embeddings on a Riemannian manifold have been devised, such as Poincaré GloVe [27,28] on the Poincaré disk and the Joint Spherical Embeddings (JoSE) [29] on the sphere.
This article is an extended version of [30] and is organized as follows. In Section 2, we introduce the word embeddings based on conditional models, while in Section 3, we review the geometrical framework for α-embeddings. In Section 4, we assess the impact of α-embeddings on the performance of different intrinsic and extrinsic tasks in NLP, with particular emphasis on attention mechanisms, where we show that α-embeddings (controlled by a single scalar) are able to provide better performance than transformations of the embeddings requiring a large number of parameters. Finally, in Section 5, we conclude the paper and present future perspectives.

Word Embeddings Based on Conditional Models
Linear conditional models are among the simplest models that can be used for the unsupervised training of a set of word embeddings. The Skip-Gram conditional model [9,10] trains the embeddings by predicting the conditional probability of any word χ to be in the context of a central word w

$$p(\chi \mid w) = p_w(\chi) = \frac{\exp(u_w^T v_\chi)}{Z_w}, \tag{1}$$

where $Z_w = \sum_{\chi \in D} \exp(u_w^T v_\chi)$ is the normalization constant. The model associates two column vectors $u, v \in \mathbb{R}^d$ with each word. The sets of vectors $u_w$ and $v_w$ for $w \in D$, arranged by rows, compose two n × d matrices U and V, respectively. Such matrices are typically learned from data by maximum likelihood estimation [8,10,11].
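As a small numerical illustration of Equation (1), the sketch below builds the full table of conditional probabilities from two toy matrices. The function name `skipgram_conditionals`, the random matrices, and the sizes n = 6 and d = 3 are assumptions made for the example; in practice the trained U and V would be used.

```python
import numpy as np

def skipgram_conditionals(U, V):
    """Return P with P[w, chi] = exp(u_w^T v_chi) / Z_w."""
    logits = U @ V.T                                     # logits[w, chi] = u_w^T v_chi
    logits = logits - logits.max(axis=1, keepdims=True)  # per-row shift, cancels in the ratio
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)              # normalise each row by Z_w

rng = np.random.default_rng(0)
n, d = 6, 3                      # toy dictionary size and embedding size (illustrative)
U = rng.normal(size=(n, d))      # one row u_w per central word
V = rng.normal(size=(n, d))      # one row v_chi per context word
P = skipgram_conditionals(U, V)  # row w is the distribution p_w over the dictionary
```

Each row of `P` is one point $p_w$ of the simplex; subtracting the row maximum only guards against overflow and does not change the probabilities.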
Equation (1) represents an over-parametrized exponential family in the open (n − 1)-dimensional simplex $P_n$, parametrized by two matrices U and V of size n × d, where n is the cardinality of the dictionary D and d is the size of the embeddings. Notice that the number of free parameters (2dn) is greater than the number n of sufficient statistics $1_\chi$, corresponding to the one-hot encoding of the words of the dictionary. We will refer to the columns of the matrix V as $V_k$ and to its rows as $v_\chi$, seen as column vectors. Analogous notation will be used for U. It is common practice in the literature on word embeddings to consider $u_w$ or alternatively $u_w + v_w$ as embedding vectors for a word w from the dictionary, see [8–10,16,20]. In the remaining part of this section, we review the natural α-embeddings and limit embeddings originally proposed in [21,22], based on notions of Information Geometry [23,24].
After the inference procedure for the estimation of the model parameters, the matrices V and U are fixed. For each word w, the conditional model $p_w(\chi)$ from Equation (1) belongs to a d-dimensional exponential family $E_V$ in the (n − 1)-dimensional open simplex $P_n$, which models the probability of a word χ in the context of the central word w. From this perspective, the exponential model $E_V$ has d sufficient statistics corresponding to the columns of V, while each row $u_w$ of U corresponds to an assignment for the natural parameters, which identifies a probability distribution in the model. During training, both matrices U and V are updated to maximize the likelihood of the observed data in the corpus. This implies that training updates both the sufficient statistics of the exponential model $E_V$, by changing the columns of V, and the assignment of the natural parameters $u_w$ of each conditional distribution $p_w$.
Each conditional model p(χ|w) lies inside a face of the (n × n − 1)-dimensional simplex, corresponding to the ambient space for the joint distribution p(χ, w) parametrized by U, V. Since the conditional models are defined over the same sample space and have the same sufficient statistics determined by V, we can identify them with a single exponential family $E_V$ embedded in $P_n$, as depicted in Figure 1.

Figure 1. The Skip-Gram model defines a joint curved model in the (n × n − 1)-dimensional simplex. Some faces of this model correspond to the conditional models p(χ|w) for some w. Since the conditional models are defined over the same sample space and have the same sufficient statistics determined by V, they represent, in fact, different points on the same exponential family $E_V$ embedded in $P_n$. At each training step, the model $E_V$ varies with V.

α-Embeddings
In Information Geometry, a statistical model is represented as a Riemannian manifold endowed with a Riemannian metric given by the Fisher information matrix [23,24,31]. The Fisher matrix for the exponential family (1) corresponds to the covariance matrix of the centered sufficient statistics

$$I(p_0) = E_{p_0}\!\left[\Delta v_\chi(p_0)\,\Delta v_\chi(p_0)^T\right] = \Delta V(p_0)^T \operatorname{diag}(p_0)\,\Delta V(p_0),$$

where $\Delta V(p_0) = V - E_{p_0}[V]$ are the centered sufficient statistics evaluated over the dictionary and $\Delta v_\chi(p_0)$ corresponds to a row of $\Delta V(p_0)$ expressed as a column vector. The geometry of a statistical manifold defined by a metric and a connection can be induced by a divergence [23]. Taking two positive measures p and q, the family of α-divergences, for α ∈ R, is defined as

$$D^{(\alpha)}(p \,\|\, q) = \frac{4}{1-\alpha^2} \sum_{\chi \in D} \left[ \frac{1-\alpha}{2}\,p(\chi) + \frac{1+\alpha}{2}\,q(\chi) - p(\chi)^{\frac{1-\alpha}{2}}\,q(\chi)^{\frac{1+\alpha}{2}} \right], \qquad \alpha \neq \pm 1,$$

with the cases α = ±1 obtained as limits. It is a known fact that α-divergences are also f-divergences and thus induce the same metric on the manifold, which is the Fisher metric [32]. Indeed, by taking the Hessian of an α-divergence between infinitesimally close probability distributions, we obtain the Fisher information matrix $I(p_0)$ for any α. The exponential family, endowed with the family of α-divergences, is a dually-flat manifold, meaning that the α-divergences define a corresponding family of α-connections [23], which are dually coupled with respect to the metric. For α = 0, we obtain the Levi-Civita connection, which is by definition compatible with the metric and thus self-dual. It is possible to prove that the $h_\alpha$-representation provides a parametrization in which the corresponding α-connection is flat.
In our previous papers [21,22], using an information geometric framework, we introduced a novel family of embeddings called natural α-embeddings. Given a reference measure $p_0$ in the exponential family $E_V$, the natural α-embedding of a given word w from the dictionary is defined as the α-projection $\Pi^{(\alpha)}$, onto the tangent space of the model at $p_0$, of the α-logarithmic map of $p_w$ at $p_0$, where both maps are represented by means of the $h_\alpha$-representation, used to deform probability distributions in the simplex [23,32].
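The two objects introduced in this paragraph can be checked numerically on toy data. The sketch below assumes that the Fisher matrix at $p_0$ is the $p_0$-weighted covariance of the rows of V, and that the α-divergence takes the standard form for positive measures, $\frac{4}{1-\alpha^2}\sum_\chi [\frac{1-\alpha}{2}p + \frac{1+\alpha}{2}q - p^{(1-\alpha)/2}q^{(1+\alpha)/2}]$ for α ≠ ±1. Function names and random inputs are illustrative assumptions.

```python
import numpy as np

def fisher_information(V, p0):
    """Covariance of the sufficient statistics (columns of V) under p0."""
    mean = p0 @ V                        # E_{p0}[V], a vector of length d
    dV = V - mean                        # centered statistics, one row per word
    fisher = dV.T @ (p0[:, None] * dV)   # dV^T diag(p0) dV, a d x d matrix
    return fisher, dV

def alpha_divergence(p, q, alpha):
    """Alpha-divergence between positive measures, assumed form, alpha != +-1."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha**2) * np.sum(a * p + b * q - p**a * q**b)

rng = np.random.default_rng(0)
n, d = 8, 3                              # toy sizes (illustrative)
V = rng.normal(size=(n, d))
p0 = rng.random(n)
p0 = p0 / p0.sum()                       # reference distribution
q = rng.random(n)
q = q / q.sum()                          # a second distribution for the divergence
I0, dV = fisher_information(V, p0)
```

Under these assumptions the divergence vanishes when both arguments coincide, and swapping the arguments is equivalent to changing the sign of α, which reflects the duality between the α- and (−α)-connections mentioned in the text.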
The main intuition behind this definition is that a word embedding for w corresponds to the vector in the tangent space of $p_0$ that allows one to reach $p_w$ starting from $p_0$. Since the $h_\alpha$-representation, the logarithmic map, and the projection are expressed as a function of the same parameter α, a family of natural α-embeddings $W^{(\alpha)}$ is obtained as α varies. In the following, we report the main formula for the computation of the natural α-embeddings; all the detailed derivations can be found in [21,22]. By combining the formula for the α-projection and the α-logarithmic map, we obtained the following formula for the natural α-embeddings

$$W^{(\alpha)}_{p_0}(w) = I(p_0)^{-1}\,\Delta V(p_0)^T \operatorname{diag}(p_0)\, l^{\alpha}_{p_0 w}, \tag{5}$$

where, employing a slight abuse of notation, $p_0$ is a vector, $\operatorname{diag}(p_0)$ is a diagonal matrix with diagonal $p_0$, and the vector $l^{\alpha}_{p_0 w}$, with one entry $l^{\alpha}_{p_0 w}(\chi)$ per word χ of the dictionary, is given by Equation (6). We summarize the calculation of the α-embeddings with the pseudo-code in Algorithm 1.

Algorithm 1: α-embeddings.
Data: U, V matrices obtained from the training of GloVe.

α-embeddings can be used both for downstream tasks and to evaluate similarities and analogies in the tangent space of the manifold [21,22]. Given two words a and b, a measure of similarity is defined by the geometric cosine similarity between α-embeddings

$$\frac{W^{(\alpha)}_{p_0}(a)^T\, I(p_0)\, W^{(\alpha)}_{p_0}(b)}{\|W^{(\alpha)}_{p_0}(a)\|_{I(p_0)}\, \|W^{(\alpha)}_{p_0}(b)\|_{I(p_0)}}, \tag{7}$$

where $\|x\|_{I(p_0)} = \sqrt{x^T I(p_0)\, x}$ is the norm induced by the Fisher information matrix. Moreover, analogies of the form a : b = c : d can be solved by minimizing an analogy measure κ, given by Equation (8). It has been shown in [21,22] that for α = 1, if $p_0$ equals the uniform distribution over the dictionary, the embeddings of Equation (5) reduce to the standard vectors $u_w$. Furthermore, by substituting the Fisher information matrix $I(p_0)$ with the identity matrix, Equations (7) and (8) reduce to the standard formulas used in the literature for similarities and analogies [8–10]. Proposition 3 in [21], or equivalently Proposition 1 in [22], provides conditions under which the Fisher information matrix is isotropic, i.e., proportional to the identity.
It is quite common practice in the literature to use the embedding vectors u + v, which have been shown to provide better results [10] than the u vectors alone. In the context of natural α-embeddings, the vectors u + v can be interpreted as a shift of the natural parameters u of the exponential family. It can be demonstrated [21,22] that this corresponds to a reweighting of the probabilities in Equation (1), given by Equation (9), in which $N_w$ is an additional factor emerging from the normalization. Equation (9) represents a change of reference measure proportional to $\exp(v_w^T v_\chi)$, i.e., it gives more importance to those words χ whose v vectors are aligned with that of the central word w. This defines an analogous notion of u + v embeddings (popularly used in the literature) in the context of α-embeddings $W^{(\alpha)}$.
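A minimal NumPy sketch of the computation summarized above, assuming that the embedding is obtained as $I(p_0)^{-1}\Delta V(p_0)^T\operatorname{diag}(p_0)\,l$ and that the entries of $l$ are the deformed log-ratio $\frac{2}{1-\alpha}[(p_w(\chi)/p_0(\chi))^{(1-\alpha)/2} - 1]$, with $\log(p_w(\chi)/p_0(\chi))$ at α = 1. The exact form of $l$, the function names, and the toy data are assumptions of this sketch; the derivations in [21,22] remain the reference. The constant −1 has no effect on the result, because the sufficient statistics are centered with respect to $p_0$.

```python
import numpy as np

def l_alpha(ratio, alpha):
    """Deformed log-ratio (assumed form); equals log(ratio) at alpha = 1."""
    if np.isclose(alpha, 1.0):
        return np.log(ratio)
    e = (1.0 - alpha) / 2.0
    return (ratio**e - 1.0) / e          # = 2/(1-alpha) * (ratio^((1-alpha)/2) - 1)

def alpha_embeddings(U, V, p0, alpha):
    """Row w of the result is I(p0)^{-1} dV^T diag(p0) l_alpha(p_w / p0)."""
    logits = U @ V.T
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)    # P[w, chi] = p_w(chi)
    dV = V - p0 @ V                      # centered sufficient statistics
    fisher = dV.T @ (p0[:, None] * dV)   # Fisher matrix at p0
    L = l_alpha(P / p0, alpha)           # L[w, chi], ratio taken entry by entry
    W = np.linalg.solve(fisher, dV.T @ (p0[:, None] * L.T)).T
    return W, fisher

def fisher_cosine(x, y, fisher):
    """Cosine similarity with the inner product x^T I(p0) y."""
    return (x @ fisher @ y) / np.sqrt((x @ fisher @ x) * (y @ fisher @ y))

rng = np.random.default_rng(0)
n, d = 8, 3                              # toy sizes (illustrative)
U = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
uniform = np.full(n, 1.0 / n)            # reference measure "0" of the experiments
W1, fisher = alpha_embeddings(U, V, uniform, alpha=1.0)
W_neg, _ = alpha_embeddings(U, V, uniform, alpha=-3.0)
sim = fisher_cosine(W_neg[0], W_neg[1], fisher)
```

With the uniform reference measure and α = 1 the sketch returns the rows of U, in line with the special case recalled in the text; passing the identity matrix instead of `fisher` to `fisher_cosine` gives the usual cosine similarity.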

Limit Embeddings
The behavior of the α-embeddings for α progressively approaching minus infinity turns out to be of particular interest. Indeed, in this case, $l^{\alpha}_{p_0 w}(\chi)$ is progressively more and more peaked on the words χ which have the largest ratio $p_w(\chi)/p_0(\chi)$, up to the point of corresponding to a delta distribution over the set

$$\chi^*_w = \arg\max_{\chi \in D} \frac{p_w(\chi)}{p_0(\chi)}. \tag{11}$$

Notice that the norm of $l^{\alpha}_{p_0 w}$ tends to infinity as α tends to minus infinity, since 1 − α tends to infinity and thus the maximum of the probability ratio (which is always greater than or equal to 1 for any two distributions) is progressively predominant, see Equation (6). Since for all tasks of interest we always use normalized α-embeddings (either with the identity matrix or with the Fisher metric), we can consider only the direction of the tangent vectors. In the limit of α going to minus infinity, the un-normalized limit embeddings simplify to

$$I(p_0)^{-1}\,\Delta V(p_0)^T \operatorname{diag}(p_0)\, 1_{\chi^*_w}, \tag{12}$$

where $1_{\chi^*_w}$ is the indicator function for the words in $\chi^*_w$ from the dictionary. Notice that $\operatorname{diag}(p_0)$ weights the rows of ∆V, while the indicator function $1_{\chi^*_w}$ selects only a restricted number of rows of the matrix, which are then premultiplied by the inverse of the Fisher information matrix. In most cases, the ratio has a unique argmax, hence the limit embeddings depend on one single row of ∆V

$$p_0(\chi^*_w)\, I(p_0)^{-1}\, \Delta v_{\chi^*_w}(p_0).$$

This simple formula allows the straightforward implementation of geometrical methods which are based on un-normalized α-embeddings in the limit case of α going to minus infinity. Additionally, let us notice that if for two words w and w′ we have $\chi^*_w = \chi^*_{w'}$, then the associated α-embeddings will tend to coincide as α → −∞; thus limit embeddings also naturally induce a clustering in the embedding space.
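The unique-argmax case described above reduces to selecting one row of the centered matrix ∆V per word. The sketch below assumes exactly that case (ties are ignored) and uses illustrative names and random toy data; it also groups the words by their selected context word, which is the clustering effect mentioned at the end of the paragraph.

```python
import numpy as np

def limit_embeddings(U, V, p0):
    """Un-normalized limit embeddings, assuming a unique argmax of p_w / p0."""
    logits = U @ V.T
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)               # P[w, chi] = p_w(chi)
    dV = V - p0 @ V                                 # centered sufficient statistics
    fisher = dV.T @ (p0[:, None] * dV)
    chi_star = np.argmax(P / p0, axis=1)            # selected context word for each w
    rows = p0[chi_star][:, None] * dV[chi_star]     # p0(chi*) times the selected row
    return np.linalg.solve(fisher, rows.T).T, chi_star

rng = np.random.default_rng(0)
n, d = 8, 3                                         # toy sizes (illustrative)
U = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
p0 = rng.random(n)
p0 = p0 / p0.sum()
LE, chi_star = limit_embeddings(U, V, p0)

# Words sharing the same chi* receive the same limit embedding.
clusters = {}
for w, c in enumerate(chi_star):
    clusters.setdefault(int(c), []).append(w)
```

Since the embedding of a word depends only on its selected context word, the number of distinct limit embeddings is at most the number of distinct values in `chi_star`.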

Experiments
We considered two corpora: the English Wikipedia dump of October 2017 (enwiki), with 1.5B words, and its augmented version composed of Gutenberg [33], English Wikipedia, and BookCorpus [34–36] (geb), with 1.8B words. We used the WikiExtractor Python script [37] to parse the Wikipedia dump XML file. A minimal preprocessing was performed, by lowercasing all the letters and removing stop-words and punctuation.
For each corpus, we trained a set of word embeddings with vector sizes of 50 and 300. We employed a cut-off minimum frequency (m0) of 1000, obtaining a dictionary of about 67k words for both enwiki and geb. For GloVe, we used the code at [38]; the window size was set to 10 as in [10], with a decaying weighting rate of 1/d for the calculation of co-occurrences (d being here the distance in words from the central word). We trained the models for a maximum of 1000 iterations. For Word2Vec SG, we used the code at [39] with window 10 and negative sampling 5. We trained the models for 100 epochs.
The embeddings in Equation (5) will be denoted with E in all the figures and tables of this section, while the limit embeddings in Equation (12) will be denoted with LE. Embeddings have been normalized either with the Fisher information matrix (F) or the identity matrix (I). Similarly, scalar products are computed either with the Fisher information matrix (F) or the identity matrix (I). In the following, whenever both inner product and normalization are used in the same experiment, they are computed with respect to the same metric (either F or I). For the reference distribution needed for the computation of the α-embeddings, we have chosen the uniform distribution (0), the unigram distribution of the model (u), obtained by marginalization of the joint distribution learned by the model, or the unigram distribution estimated from the corpus (ud). Embeddings are denoted by U if in the computation of Equations (5) and (12) the formula used for $p_w$ is Equation (1), while they are denoted by U+V if Equation (9) is used instead.
We evaluate the α-embeddings on intrinsic tasks such as similarities, analogies, and concept categorization, as well as on extrinsic ones like document classification, sentiment analysis, and sentence entailment. For similarities and analogies, we consider α in the interval (−10, 10) with step 0.1. For concept categorization, document classification, sentiment analysis, and sentence entailment, we perform experiments with step 0.1 for α ∈ (−2, 2), step 0.2 for α ∈ (−10, −2) ∪ (2, 10), and step 1 for α ∈ (−30, −10) ∪ (10, 30).
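The two grids of α values described above can be generated as in the following sketch. Whether the interval endpoints are included is not stated in the text; including −30, −10, −2, 2, 10, and 30 once each is an assumption, as are the function names. Integer ranges are divided afterwards to avoid the rounding problems of floating-point steps.

```python
import numpy as np

def alpha_grid():
    """Alpha values for concept categorization and the extrinsic tasks."""
    pieces = [
        np.arange(-30, -10) * 1.0,   # step 1 from -30 up to -11
        np.arange(-50, -10) / 5.0,   # step 0.2 from -10 up to -2.2
        np.arange(-20, 20) / 10.0,   # step 0.1 from -2 up to 1.9
        np.arange(10, 50) / 5.0,     # step 0.2 from 2 up to 9.8
        np.arange(10, 31) * 1.0,     # step 1 from 10 up to 30
    ]
    return np.unique(np.round(np.concatenate(pieces), 1))

def similarity_grid():
    """Alpha values for similarities and analogies: step 0.1 from -10 to 10."""
    return np.round(np.arange(-100, 101) / 10.0, 1)

grid = alpha_grid()
```

Both grids contain α = 1, so the standard embeddings are always among the evaluated configurations.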

Similarities, Analogies, and Concept Categorization
In Figure 2, we report results for similarities and analogies with embeddings of size 300. For similarities, we consider the following datasets: ws353 [40], mc [41], rg [42], scws [43], men [44], mturk287 [45], rw [46], and simlex999 [47]. For analogies, we use the Google analogy dataset [8] split, as is common practice in the literature, into semantic analogies (sem) and syntactic analogies (syn), or alternatively considering all of them (tot). The limit embeddings (colored dotted lines) achieve good performance on both tasks, above the competitor methods from the literature U and U+V based on GloVe vectors centered and normalized by column, as described in Pennington et al. [10]. A comparison with baseline methods from the literature on word similarity is presented in Table 1. We compare with the limit embeddings, since they usually perform well on the similarity task, see Figure 2, top row. The limit embedding methods reported in the table outperform the Wiki Giga 5 pretrained vectors [10] (6B words corpus) and other comparable baselines from the literature.
In Table 2, we report the best performance for the analogy task on α-embeddings, where α is selected with cross-validation. For the syn dataset, using the embeddings trained on the enwiki corpus, the limit embeddings have been found to work better instead. The standard deviations reported are obtained by averaging the performance on test of the top three α selected on the basis of the best performance on validation. The standard deviations obtained are relatively small, which indicates that tuning α is easy also on tasks with small amounts of data in cross-validation. The best tuned α on the geb dataset outperforms the baselines for all experiments.

Table 1. Spearman correlations for the similarity tasks. WG5 denotes the wikigiga5 pretrained vectors on 6B words [10], tested for comparison on the dictionary of the smaller corpora enwiki and geb. U and U+V are the standard methods either for GloVe or Word2Vec. PSM refers to the accuracies reported by Pennington et al. [10] on enwiki, BDK is the best setup across tasks (as a result of hyperparameter tuning) reported by Baroni et al. [48], and LGD are the best methods in cross-validation with fixed window sizes of 5 and 10 (as a result of hyperparameter tuning) reported by Levy et al. [17].
[29] 73.9 -----74.8 --33.9 -

Table 2. Accuracy on analogy tasks for the different methods for the enwiki and geb corpora. The best α is selected with a 3-fold cross-validation (α between −10 and 10, with step 0.1), unless the limit embedding is the one performing best. The best α values are reported in parentheses. PSM are the accuracies reported by Pennington et al. [10] on enwiki, BDK is the best setup across tasks (as a result of hyperparameter tuning) reported by Baroni et al. [48].

The last intrinsic task considered is cluster purity on the concept categorization datasets AP [49] and BLESS [50]. For each dataset and for each set of embeddings, we run a spherical k-means algorithm with the help of the Python package spherecluster [51,52].

More specifically, we normalize the embeddings in the tangent space $T_{h_\alpha(p_0)} E^{(\alpha)}_V$ to obtain points on a sphere embedded in the tangent space itself, and then we compute distances on such a sphere with the arccosine of the cosine similarity in Equation (7). We set n_init = 300 and n_clusters equal to the number of groups of the dataset, and use default parameters otherwise. We run the clustering algorithm 10 times and select the best results. In Table 3, we report cluster purity on the geb word embeddings of dimension 300. Tuning the value of α allows us to obtain a considerable improvement in cluster purity with respect to the standard GloVe baseline (GloVe U+V). Interestingly, the purity values are superior to the values reported in the literature and comparable only with Baroni et al. [48], where the authors employ hyperparameter tuning for the training of GloVe. The purity curves (Figure 3) are noisier than those for similarities and analogies, because the datasets available for this task are quite limited in size. Almost all curves exhibit a peak, which is relatively more pronounced for smaller embedding sizes, while the limit behavior for large negative α performs better for a larger embedding size. This points to the fact that the clustering induced by the limit embeddings of Equation (12) is better behaved when the dimension of the embeddings, and thus the number of sufficient statistics, is larger.

Table 3. Clustering purity (×100) with the spherical clustering method described in the main text, compared with numbers from the literature. The max, average, and standard deviation are obtained over 10 runs. BDK is the best setup across tasks (as a result of hyperparameter tuning) reported by Baroni et al. [48].
[53] 82.0 -Word2Vec [53] 81.0 -
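The purity values discussed in this subsection can be computed from gold categories and cluster assignments as in the sketch below. It assumes the usual definition of purity (each cluster is credited with its most frequent gold category), which the text does not spell out; the function name and the toy labels are invented for illustration.

```python
import numpy as np

def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority gold category of their cluster."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    total = 0
    for cluster in np.unique(labels_pred):
        members = labels_true[labels_pred == cluster]      # gold labels inside this cluster
        _, counts = np.unique(members, return_counts=True)
        total += counts.max()                              # size of the majority category
    return total / labels_true.size

gold = [0, 0, 0, 1, 1, 2]   # toy gold categories
pred = [0, 0, 1, 1, 1, 1]   # toy cluster assignments
score = purity(gold, pred)  # multiply by 100 to match the scale of Table 3
```

The cluster assignments would come from the spherical k-means run on the normalized α-embeddings; any clustering output with one label per word can be scored in the same way.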

Document Classification and Sentiment Analysis
In this subsection, we present results on the 20 Newsgroups multi-class classification [54] and the IMDBReviews sentiment analysis [55]. The α-embeddings are normalized before training either with I or F. We use a linear architecture (BatchNorm+Dense) for both tasks, while for sentiment analysis we also use a recurrent architecture (Bidirectional LSTM 32 channels, GlobalMaxPool1D, Dense 20 + Dropout 0.05, Dense). When using the linear architecture, a continuous bag of words representation is used. In Tables 4 and 5, we report the best α chosen with respect to the validation set and the best performance for the limit embeddings of size 300. Limit embeddings have been generalized by considering the words associated with the t largest values of the probability ratio in Equation (11), instead of a single one. We denote this modification by -t1/3/5, for t = 1, 3, 5. Furthermore, we indicate with -w the experiments in which the $\chi^*$ rows of ∆V in Equation (12) are weighted with $p_w(\chi)/p_0(\chi)$ instead of $p_0(\chi)$. The improvements reported in Tables 4 and 5 are small but consistent (at least 0.5% in accuracy) on both Newsgroups and IMDBReviews; such increases in performance are also present when network architectures of increased complexity are used, such as the bidirectional LSTM. Figure 4 reports the curves for the values on test with early stopping based on the validation set, for embedding sizes of 50 and 300. The improvements when α is tuned are higher for size 50, exhibiting a more evident peak. For size 300, the improvements are smaller but consistent. In particular, a peak performance for α can always be easily identified for a chosen reference distribution and a chosen normalization.
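The -t and -w variants of the limit embeddings described above can be sketched as follows, under one possible reading of the text: the t context words with the largest ratio $p_w(\chi)/p_0(\chi)$ are kept for each word w, and the corresponding rows of ∆V are weighted by $p_0(\chi)$ or, for the -w variant, by the ratio itself. The function name, the `ratio_weights` flag, and the toy data are assumptions of this sketch.

```python
import numpy as np

def top_t_limit_embeddings(U, V, p0, t=3, ratio_weights=False):
    """Limit embeddings built from the t largest ratios p_w(chi) / p0(chi)."""
    logits = U @ V.T
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P = P / P.sum(axis=1, keepdims=True)          # P[w, chi] = p_w(chi)
    ratio = P / p0                                # ratio[w, chi]
    dV = V - p0 @ V                               # centered sufficient statistics
    fisher = dV.T @ (p0[:, None] * dV)
    n = V.shape[0]
    top = np.argsort(-ratio, axis=1)[:, :t]       # t selected context words per word w
    rows = np.arange(n)[:, None]
    weights = np.zeros((n, n))                    # weights[w, chi], zero if not selected
    weights[rows, top] = ratio[rows, top] if ratio_weights else p0[top]
    return np.linalg.solve(fisher, dV.T @ weights.T).T, top

rng = np.random.default_rng(0)
n, d = 8, 3                                       # toy sizes (illustrative)
U = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
p0 = rng.random(n)
p0 = p0 / p0.sum()
LE_t1, top1 = top_t_limit_embeddings(U, V, p0, t=1)
LE_t3, top3 = top_t_limit_embeddings(U, V, p0, t=3)
LE_t3w, _ = top_t_limit_embeddings(U, V, p0, t=3, ratio_weights=True)
```

With t = 1 and the default weighting, the sketch falls back to the single-row limit embedding.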

Sentence Entailment
In this subsection, we evaluate the impact of α-embeddings on the task of sentence entailment, solved by a neural network with a more complex architecture. We consider the Stanford Natural Language Inference (SNLI) dataset [56], consisting of pairs of sentences (a, b). The task is to predict whether a is entailed by b, b contradicts a, or whether their relationship is neutral. To perform the task, we choose the decomposable attention model from Parikh et al. [57], implementing the attention mechanism from Bahdanau et al. [58]. The decomposable attention model breaks the sentences apart into subsections and aligns them to check their similarities or differences, thus determining whether the sentences are entailed or not. The model consists of three trainable components along with a part for input representation: Attend, Compare, and Aggregate. All three components consist of separate neural networks (with attention mechanisms) which are trained jointly. Intra-sentence attention is used in the case we implemented.
The model was trained as follows. The batch size was set to 32 and the dropout ratio used before all the ReLU layers was fixed to 0.2. Batch normalization was used in the attention layers to ensure robustness and faster convergence. The learning rate was set to 0.05, along with a decay rate of 0.1 after every 20 epochs. The experiments were run for 200 epochs, with the Adagrad optimizer. The weights of the network were initialized with a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. We implemented the attention model in PyTorch [59], starting from the code by Kim [60] and Li [61]. In the first step of preprocessing, we removed punctuation and stop-words from the sentences in the dataset. During training, we used a maximum sentence length of 50 words. Each sentence was tokenized, and tokens for padding and unknown words were added. The 300-dimensional geb α-embeddings were used. Each vector was normalized with either the Fisher or the identity matrix. All embeddings remained fixed during training.
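Two details of this setup can be made concrete with a short sketch. It assumes that sentences longer than 50 tokens are truncated and shorter ones padded, and that the "decay rate of 0.1 after every 20 epochs" means multiplying the learning rate by 0.1 every 20 epochs; both are readings of the text rather than statements from it. The token names `<pad>` and `<unk>`, the toy vocabulary, and the function names are hypothetical.

```python
PAD, UNK = "<pad>", "<unk>"   # hypothetical names for the two special tokens

def encode(tokens, vocab, max_len=50):
    """Map tokens to ids, truncating or padding to max_len (assumed handling)."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))

def learning_rate(epoch, base_lr=0.05, decay=0.1, every=20):
    """Step schedule: the rate is multiplied by `decay` once every `every` epochs."""
    return base_lr * decay ** (epoch // every)

vocab = {PAD: 0, UNK: 1, "a": 2, "dog": 3, "runs": 4}   # toy vocabulary
ids = encode(["a", "dog", "barks"], vocab)               # "barks" maps to the unknown token
schedule = [learning_rate(epoch) for epoch in range(200)]
```

Under the same reading, the schedule corresponds in PyTorch to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)` applied to a `torch.optim.Adagrad` optimizer created with `lr=0.05`.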
Two types of experiments were performed. In one set of experiments, the embedding vectors were linearly transformed by means of a matrix whose entries are learned during training. In the original paper by Parikh et al. [57], such a linear transformation projects the word embeddings to a 200-dimensional space; however, we decided to keep the dimension fixed to 300 to compare the performance with that of the second set of experiments, where no projection matrix is used. In the following, we refer to the linear transformation as a projection matrix.
The prediction accuracy for the sentence entailment task as a function of α is reported in Figures 5 and 6. For the case with a trainable projection matrix (Figure 6), we observe that the baseline accuracy is higher and the gain deriving from the use of α-embeddings is smaller. This is expected, as the projection matrix already provides a linear transformation of the embeddings (task-dependent fine-tuning) before the attention mechanism, which limits the improvement that α-embeddings can have over the baseline. It should be noted that using a projection matrix of dimension 300 × 300 adds about 12.4 percent more trainable parameters to the architecture (which has $\approx 7.25 \times 10^5$ parameters without the projection layer). For the case where α-embeddings are used without the projection matrix, there is a larger improvement in accuracy, but the baseline is lower. It is worth noticing that α-embeddings always provide an improvement compared to the regular embeddings given by α = 1, even on the more complex attention model with projection. Interestingly, for certain values of α, the accuracy of the α-embeddings without projection surpasses the baseline values for the same task when the projection is used (and is even comparable with the best α with projection), see Table 6. This points to the fact that using α-embeddings and tuning the value of α can be an alternative to the use of more complicated architectures where a linear transformation of the embeddings is learned, reducing the computational effort and obtaining better performance.

Table 6. Accuracy of α-embeddings on test for the Stanford Natural Language Inference (SNLI) sentence entailment task, compared to GloVe and Word2Vec baseline vectors. We report experiments both with and without a projection matrix. The best values for α are reported in parentheses. The values presenting the largest improvement over the baselines are marked in bold.

Method No Projection Projection
GloVe U+V Word2Vec U+V
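The parameter overhead of the projection layer quoted in this subsection follows from a short calculation, reproduced below. It assumes that the projection is a plain 300 × 300 weight matrix without a bias term and takes the approximate model size reported in the text.

```python
d = 300                                   # embedding dimension kept by the projection
base_params = 7.25e5                      # approximate model size without projection (from the text)
projection_params = d * d                 # 300 x 300 weights, assuming no bias term
overhead = projection_params / base_params
print(f"{projection_params} extra weights, {overhead:.1%} of the base model")
```

Tuning α, by contrast, introduces a single scalar hyperparameter and no additional trainable weights.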

Conclusions
In this paper, we have experimentally evaluated the performance of α-embeddings on several intrinsic and extrinsic tasks in NLP. For word similarities and analogies, the α-embeddings provide significant improvements over standard embedding methods corresponding to α = 1 and over baselines from the literature. Improvements are present on all the tasks tested, with different margins depending on the value of α, on the chosen reference distribution (0, u, ud), and on the normalization method (I, F). We observe that the best value of α depends both on the task and on the dataset. Thus, α-embeddings provide an extra hyperparameter in the optimization problem when solving the specific task, which allows one to choose the best deformation of the space based on data. Negative values of α, and more generally values lower than 1, seem to be preferred across most tasks. Limit embeddings provide a simple alternative that does not require validation over α but can still offer an improvement on several tasks of interest. Furthermore, limit embeddings induce a clustering in the space of the representations learned by the SG model during training. The performance of the limit embeddings grows with the dimension of the embedding on Newsgroups and IMDBReviews, pointing to the possibility that limit embeddings perform better than α-embeddings in higher-dimensional spaces.
On the decomposable attention model, the accuracy of α-embeddings without projection surpasses the baseline values for the same task with projection and is also comparable with the best α with projection. This is an indication that using α-embeddings and tuning the value of α makes it possible to save the extra parameters used to learn a transformation of the embeddings during training, which is costly, thus reducing the computational effort while obtaining better performance.
In the present work, α is chosen on the basis of the performance on the validation set. As future work, we advocate for the design of training algorithms that optimize α automatically during training, and thus learn the best geometry for the task at hand, leading to the definition of an α GloVe loss function and an α attention mechanism.

Conflicts of Interest:
The authors declare no conflict of interest.