# Changing the Geometry of Representations: α-Embeddings for NLP Tasks


## Abstract


## 1. Introduction

## 2. Word Embeddings Based on Conditional Models

## 3. $\mathit{\alpha}$-Embeddings

Algorithm 1: $\alpha$-embeddings.

#### Limit Embeddings

## 4. Experiments

#### 4.1. Similarities, Analogies, and Concept Categorization

#### 4.2. Document Classification and Sentiment Analysis

#### 4.3. Sentence Entailment

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature **1986**, 323, 533–536.
2. Bengio, Y.; Ducharme, R.; Vincent, P.; Jauvin, C. A neural probabilistic language model. J. Mach. Learn. Res. **2003**, 3, 1137–1155.
3. Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. In Proceedings of the Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010.
4. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, LA, USA, 1–6 June 2018.
5. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. Available online: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf (accessed on 25 February 2021).
6. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), Minneapolis, MN, USA, 2–7 June 2019.
7. Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019.
8. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR), Scottsdale, AZ, USA, 2–4 May 2013.
9. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Harrahs and Harveys, Stateline, NV, USA, 5–10 December 2013.
10. Pennington, J.; Socher, R.; Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014.
11. Levy, O.; Goldberg, Y. Neural Word Embedding as Implicit Matrix Factorization; NIPS: Quebec, QC, Canada, 2014; p. 9.
12. Mikolov, T.; Yih, W.T.; Zweig, G. Linguistic Regularities in Continuous Space Word Representations; NAACL-HLT: Atlanta, GA, USA, 2013.
13. Arora, S.; Li, Y.; Liang, Y.; Ma, T.; Risteski, A. Rand-walk: A latent variable model approach to word embeddings. arXiv **2016**, arXiv:1502.03520.
14. Mu, J.; Bhat, S.; Viswanath, P. All-But-the-Top: Simple and Effective Postprocessing for Word Representations. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018.
15. Bullinaria, J.A.; Levy, J.P. Extracting semantic representations from word co-occurrence statistics: A computational study. Behav. Res. Methods **2007**, 39, 510–526.
16. Bullinaria, J.A.; Levy, J.P. Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behav. Res. Methods **2012**, 44, 890–907.
17. Levy, O.; Goldberg, Y.; Dagan, I. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Trans. Assoc. Comput. Linguist. **2015**, 3, 211–225.
18. Tsvetkov, Y.; Faruqui, M.; Ling, W.; Lample, G.; Dyer, C. Evaluation of Word Vector Representations by Subspace Alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 17–21 September 2015; pp. 2049–2054.
19. Schnabel, T.; Labutov, I.; Mimno, D.; Joachims, T. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 17–21 September 2015; pp. 298–307.
20. Raunak, V. Simple and Effective Dimensionality Reduction for Word Embeddings. In Proceedings of the LLD Workshop—Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 9 December 2017.
21. Volpi, R.; Malagò, L. Natural Alpha Embeddings. arXiv **2019**, arXiv:1912.02280.
22. Volpi, R.; Malagò, L. Natural Alpha Embeddings. Inf. Geom. **2021**, in press.
23. Amari, S.I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Cambridge, MA, USA, 2000.
24. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016; Volume 194.
25. Fonarev, A.; Grinchuk, O.; Gusev, G.; Serdyukov, P.; Oseledets, I. Riemannian Optimization for Skip-Gram Negative Sampling. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 2028–2036.
26. Jawanpuria, P.; Balgovind, A.; Kunchukuttan, A.; Mishra, B. Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach. Trans. Assoc. Comput. Linguist. **2019**, 7, 107–120.
27. Nickel, M.; Kiela, D. Poincaré embeddings for learning hierarchical representations. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
28. Tifrea, A.; Becigneul, G.; Ganea, O.E. Poincaré GloVe: Hyperbolic Word Embeddings. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
29. Meng, Y.; Huang, J.; Wang, G.; Zhang, C.; Zhuang, H.; Kaplan, L.; Han, J. Spherical text embedding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019.
30. Volpi, R.; Malagò, L. Evaluating Natural Alpha Embeddings on Intrinsic and Extrinsic Tasks. In Proceedings of the 5th Workshop on Representation Learning for NLP (ACL), Online, 9 July 2020.
31. Amari, S.I. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics; Springer: New York, NY, USA, 1985; Volume 28.
32. Amari, S.I.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. Tech. Sci. **2010**, 58, 183–195.
33. Free eBooks—Project Gutenberg. Available online: https://www.gutenberg.org (accessed on 1 September 2019).
34. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
35. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. Available online: https://yknzhu.wixsite.com/mbweb (accessed on 3 September 2019).
36. Kobayashi, S. Homemade BookCorpus. Available online: https://github.com/soskek/bookcorpus (accessed on 13 September 2019).
37. WikiExtractor. Available online: https://github.com/attardi/wikiextractor (accessed on 8 October 2017).
38. Pennington, J.; Socher, R.; Manning, C. GloVe Project Page. Available online: https://nlp.stanford.edu/projects/glove/ (accessed on 26 October 2017).
39. word2vec Google Code Archive. Available online: https://code.google.com/archive/p/word2vec/ (accessed on 19 October 2017).
40. Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; Ruppin, E. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, Hong Kong, China, 1–5 May 2001; pp. 406–414.
41. Miller, G.A.; Charles, W.G. Contextual correlates of semantic similarity. Lang. Cogn. Process. **1991**, 6, 1–28.
42. Rubenstein, H.; Goodenough, J.B. Contextual correlates of synonymy. Commun. ACM **1965**, 8, 627–633.
43. Huang, E.H.; Socher, R.; Manning, C.D.; Ng, A.Y. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers—Volume 1, Jeju, Korea, 8–14 July 2012.
44. Bruni, E.; Tran, N.K.; Baroni, M. Multimodal distributional semantics. J. Artif. Intell. Res. **2014**, 49, 1–47.
45. Radinsky, K.; Agichtein, E.; Gabrilovich, E.; Markovitch, S. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th International Conference on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 337–346.
46. Luong, M.T.; Socher, R.; Manning, C.D. Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, 8–9 August 2013; pp. 104–113.
47. Hill, F.; Reichart, R.; Korhonen, A. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. **2015**, 41, 665–695.
48. Baroni, M.; Dinu, G.; Kruszewski, G. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL) (Volume 1: Long Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 238–247.
49. Almuhareb, A. Attributes in Lexical Acquisition. Ph.D. Thesis, University of Essex, Colchester, UK, 2006.
50. Baroni, M.; Lenci, A. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Edinburgh, UK, July 2011; pp. 1–10. Available online: https://www.aclweb.org/anthology/W11-2501/ (accessed on 26 February 2021).
51. Banerjee, A.; Dhillon, I.S.; Ghosh, J.; Sra, S. Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res. **2005**, 6, 1345–1382.
52. Laska, J.; Straub, D.; Sahloul, H. Spherecluster. Available online: https://github.com/jasonlaska/spherecluster (accessed on 4 December 2019).
53. Wang, B.; Wang, A.; Chen, F.; Wang, Y.; Kuo, C.C.J. Evaluating word embedding models: Methods and experimental results. APSIPA Trans. Signal Inf. Process. **2019**, 8, e19.
54. Lang, K. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995; Elsevier: Amsterdam, The Netherlands, 1995; pp. 331–339.
55. Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1, Portland, OR, USA, 19–24 June 2011; pp. 142–150.
56. Bowman, S.R.; Angeli, G.; Potts, C.; Manning, C.D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Lisbon, Portugal, 17–21 September 2015.
57. Parikh, A.P.; Täckström, O.; Das, D.; Uszkoreit, J. A decomposable attention model for natural language inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), Austin, TX, USA, 1–5 November 2016.
58. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv **2014**, arXiv:1409.0473.
59. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in PyTorch. In Proceedings of the NIPS Autodiff Workshop, Long Beach, CA, USA, 8 December 2017.
60. Kim, Y. decomp-attn. Available online: https://github.com/harvardnlp/decomp-attn (accessed on 23 October 2017).
61. Li, B. SNLI-decomposable-attention. Available online: https://github.com/libowen2121/SNLI-decomposable-attention (accessed on 11 November 2018).

**Figure 1.** The Skip-Gram model defines a joint curved model in the $(n\times n-1)$-dimensional simplex. Some faces of this model correspond to the conditional models $p(\chi \mid w)$ for some w. The conditional models are defined over the same sample space and share the sufficient statistics determined by V; in fact, they are different points on the same exponential family ${\mathcal{E}}_{V}$ embedded in ${\mathbb{P}}^{n}$. At each training step, the model ${\mathcal{E}}_{V}$ varies with V.
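For reference, the conditional model the caption refers to can be written explicitly. The following is a standard Skip-Gram-style parameterization (with $u_w$ a row of U for the center word and $v_{\chi}$ a row of V for a context word), given here as a reading aid; the precise notation in the paper may differ slightly.

```latex
% Conditional Skip-Gram-type model: an exponential family over the vocabulary,
% with sufficient statistics given by the context vectors (rows of V) and
% natural parameter u_w, the vector of the center word w.
\[
  p(\chi \mid w) \;=\;
  \frac{\exp\left(u_w^{\top} v_{\chi}\right)}
       {\sum_{\chi'} \exp\left(u_w^{\top} v_{\chi'}\right)}
  \;=\; \exp\left(u_w^{\top} v_{\chi} - \psi(u_w)\right),
  \qquad
  \psi(u_w) \;=\; \log \sum_{\chi'} \exp\left(u_w^{\top} v_{\chi'}\right).
\]
```

Varying w moves the natural parameter $u_w$ while the sufficient statistics V stay fixed, which is why all the conditional models are points of the same family ${\mathcal{E}}_{V}$.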

**Figure 2.** Word similarities expressed as Spearman correlation × 100 (**top**) and word analogy accuracies (**bottom**) for different values of $\alpha$. The left column reports experiments on enwiki, while the right column reports experiments on geb. U, U+V, and WG5-U+V are the GloVe vectors of size 300 described in the text, centered and normalized. Figure from [30].
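As a side note, the "centered and normalized" preprocessing mentioned in the caption corresponds to the usual mean-centering followed by L2 normalization of the rows of the embedding matrix. A minimal numpy sketch (the exact preprocessing pipeline used for the figure is an assumption here):

```python
import numpy as np

def center_and_normalize(E: np.ndarray) -> np.ndarray:
    """Mean-center the embedding matrix and L2-normalize each row.

    E has shape (vocab_size, dim); each row is one word vector.
    """
    E = E - E.mean(axis=0, keepdims=True)            # remove the common mean direction
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.clip(norms, 1e-12, None)           # unit-length rows
```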

**Figure 3.** Cluster purity on the concept categorization task (plotted with a 3-point average). Figure from [30].

**Figure 4.**Accuracy and AUC on 20 Newsgroups and IMDB Reviews datasets for varying $\alpha $. The metrics I and F refer to the normalization of the embeddings before training. Figure from [30].

**Figure 5.** Accuracy of the decomposable attention model on the sentence entailment task without a projection matrix. (**Top row**) Test accuracies at the best validation point during training; (**bottom row**) test accuracies averaged over the last 10 epochs of training. (**Left column**) U embeddings; (**right column**) U+V embeddings. The vectors have been normalized either with the Fisher information matrix (F) or with the identity matrix (I). The limit embeddings are represented by dashed lines of the corresponding color.

**Figure 6.** Accuracy of the decomposable attention model with an additional trainable projection matrix. (**Top row**) Test accuracies at the best validation point during training; (**bottom row**) test accuracies averaged over the last 10 epochs of training. (**Left column**) U embeddings; (**right column**) U+V embeddings. The vectors have been normalized either with the Fisher information matrix (F) or with the identity matrix (I). The limit embeddings are represented by dashed lines of the corresponding color.

**Table 1.** Spearman correlations for the similarity tasks. WG5 denotes the wikigiga5 vectors pretrained on 6B words [10], tested for comparison on the dictionaries of the smaller corpora enwiki and geb. U and U+V are the standard methods for either GloVe or Word2Vec. PSM refers to the accuracies reported by Pennington et al. [10] on enwiki, BDK is the best setup across tasks (resulting from hyperparameter tuning) reported by Baroni et al. [48], and LGD are the best methods in cross-validation with fixed window sizes of 5 and 10 (also resulting from hyperparameter tuning) reported by Levy et al. [17].

| Corpus | Method | ws353 | mc | rg | scws | ws353s | ws353r | men | mturk287 | rw | simlex999 | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| enwiki | LE-U+V-ud-F (our) | **75.5** | **83.4** | **81.5** | 63.5 | **77.8** | **69.2** | 75.6 | 60.1 | **55.6** | **41.6** | **62.6** |
| enwiki | GloVe WG5-U+V | 65.1 | 73.8 | 77.6 | 62.2 | 71.3 | 60.7 | **77.2** | 65.7 | 51.5 | 41.0 | 61.3 |
| enwiki | GloVe U | 60.2 | 69.3 | 69.8 | 58.3 | 67.1 | 56.4 | 69.2 | 67.2 | 47.1 | 31.4 | 53.6 |
| enwiki | GloVe U+V | 63.8 | 74.5 | 75.2 | 58.7 | 69.5 | 60.9 | 71.6 | **67.3** | 45.5 | 32.2 | 55.1 |
| enwiki | Word2Vec U | 64.7 | 73.5 | 78.4 | 63.6 | 73.7 | 56.1 | 72.9 | 65.4 | 47.3 | 34.5 | 59.1 |
| enwiki | Word2Vec U+V | 66.1 | 75.3 | 76.1 | **64.1** | 75.2 | 57.3 | 72.5 | 63.8 | 46.1 | 33.4 | 58.7 |
| geb | LE-U+V-ud-F (our) | **77.0** | **81.2** | **83.5** | **65.0** | **80.3** | **68.7** | **79.6** | 62.4 | **59.3** | **46.9** | **65.2** |
| geb | GloVe WG5-U+V | 65.1 | 73.8 | 77.9 | 61.8 | 71.3 | 60.7 | 77.2 | 65.7 | 53.2 | 40.6 | 60.4 |
| geb | GloVe U | 61.3 | 73.0 | 76.3 | 58.7 | 68.6 | 54.0 | 68.7 | **68.1** | 48.9 | 30.6 | 51.9 |
| geb | GloVe U+V | 64.9 | 77.4 | 79.9 | 59.1 | 71.5 | 58.8 | 71.4 | **68.1** | 48.5 | 32.5 | 53.7 |
| geb | Word2Vec U | 65.5 | 77.8 | 74.7 | 62.6 | 73.2 | 58.5 | 73.1 | 67.5 | 48.3 | 32.9 | 59.0 |
| geb | Word2Vec U+V | 69.4 | 77.4 | 78.2 | 63.5 | 76.0 | 62.5 | 73.9 | 65.3 | 49.0 | 32.9 | 59.6 |
| | GloVe PSM 6B [10] | 65.8 | 72.7 | 77.8 | 53.9 | - | - | - | - | 38.1 | - | - |
| | Word2Vec BDK [48] | 73 | - | 83 | - | 78 | 68 | 80 | - | - | - | - |
| | GloVe LGD win5 [17] | - | - | - | - | 74.5 | 61.7 | 74.6 | 63.1 | 41.6 | 38.9 | - |
| | GloVe LGD win10 [17] | - | - | - | - | 74.6 | 64.3 | 75.4 | 61.6 | 26.6 | 37.5 | - |
| | Poincaré GloVe 100D [28] | 62.3 | 80.5 | 76.0 | - | - | - | - | - | 42.8 | 31.8 | - |
| | JoSE 100D [29] | 73.9 | - | - | - | - | - | 74.8 | - | - | 33.9 | - |
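For concreteness, the similarity scores above are obtained in the standard way: cosine similarities between word vectors are ranked against human judgments with Spearman correlation (×100). Below is a minimal sketch of this protocol, with illustrative variable names (`pairs`, `gold`) rather than the authors' evaluation code:

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_spearman(emb: dict, pairs: list, gold: list) -> float:
    """Spearman correlation (x100) between cosine similarities and human scores.

    emb:   word -> vector (numpy array)
    pairs: list of (word1, word2) tuples, e.g. from ws353 or simlex999
    gold:  list of human similarity judgments aligned with pairs
    """
    cos, human = [], []
    for (w1, w2), score in zip(pairs, gold):
        if w1 in emb and w2 in emb:                  # skip out-of-vocabulary pairs
            a, b = emb[w1], emb[w2]
            cos.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            human.append(score)
    return 100 * spearmanr(cos, human).correlation
```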

**Table 2.** Accuracy on the analogy tasks for the different methods on the enwiki and geb corpora. The best $\alpha$ is selected with 3-fold cross-validation ($\alpha$ between −10 and 10, with step 0.1), unless a limit embedding performs best. The best $\alpha$ values are reported in parentheses. PSM are the accuracies reported by Pennington et al. [10] on enwiki; BDK is the best setup across tasks (resulting from hyperparameter tuning) reported by Baroni et al. [48].

| Corpus | Method | Sem | Syn | Tot |
|---|---|---|---|---|
| enwiki | E-U+V-0-I (our) | **84.5** ± 0.4 (1.8 ± 0.1) | 67.33 (−∞) | **74.4** ± 0.1 (1.7 ± 0.1) |
| enwiki | GloVe WG5-U+V | 79.4 | **67.5** | 72.6 |
| enwiki | GloVe U | 77.8 | 62.1 | 68.9 |
| enwiki | GloVe U+V | 80.9 | 63.4 | 70.9 |
| enwiki | Word2Vec U | 74.58 | 54.96 | 63.39 |
| enwiki | Word2Vec U+V | 75.44 | 55.03 | 63.81 |
| geb | E-U+V-0-I (our) | **83.8** ± 0.4 (1.7 ± 0.1) | **72.2** ± 0.4 (1.3 ± 0.1) | **76.7** ± 0.3 (1.3 ± 0.1) |
| geb | GloVe WG5-U+V | 78.7 | 65.2 | 70.7 |
| geb | GloVe U | 75.7 | 66.8 | 70.4 |
| geb | GloVe U+V | 80.0 | 68.5 | 73.2 |
| geb | Word2Vec U | 71.20 | 52.62 | 60.15 |
| geb | Word2Vec U+V | 71.59 | 51.88 | 59.87 |
| | GloVe PSM 1.6B [10] | 80.8 | 61.5 | 70.3 |
| | GloVe PSM 6B [10] | 77.4 | 67.0 | 71.7 |
| | Word2Vec BDK [48] | 80.0 | 68.5 | 73.2 |
| | Poincaré GloVe 100D [28] | 66.4 | 60.9 | 63.4 |
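The analogy accuracies above follow the usual 3CosAdd protocol: for a question a : b :: c : ?, the answer is the vocabulary word closest in cosine similarity to b − a + c, excluding the three query words, and accuracy is the fraction of questions answered exactly. A short sketch under these assumptions (not the authors' evaluation code):

```python
import numpy as np

def solve_analogy(E: np.ndarray, vocab: dict, a: str, b: str, c: str) -> str:
    """Return the word d maximizing cos(d, b - a + c), excluding the query words.

    E is a (vocab_size, dim) matrix with L2-normalized rows;
    vocab maps each word to its row index in E.
    """
    idx2word = {i: w for w, i in vocab.items()}
    q = E[vocab[b]] - E[vocab[a]] + E[vocab[c]]
    q /= np.linalg.norm(q)
    scores = E @ q                                   # cosine similarity, rows are unit length
    for w in (a, b, c):
        scores[vocab[w]] = -np.inf                   # 3CosAdd excludes the query words
    return idx2word[int(np.argmax(scores))]
```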

**Table 3.** Clustering purity (×100) with the spherical clustering method described in the main text, compared with numbers from the literature. The maximum, average, and standard deviation are computed over 10 runs. BDK is the best setup across tasks (resulting from hyperparameter tuning) reported by Baroni et al. [48].

| Dataset | Method | Max Purity | Avg Purity |
|---|---|---|---|
| AP | E-U+V-u-F ($\alpha$ = −4) | **70.9** | 66.2 ± 2.1 |
| AP | GloVe U+V | 64.3 | 61.4 ± 2.5 |
| AP | Word2Vec U+V | 63.5 | 61.0 ± 1.6 |
| AP | GloVe [53] | 61.4 | - |
| AP | Word2Vec [53] | 68.2 | - |
| AP | Word2Vec BDK [48] | **71.0** | - |
| BLESS | E-U+V-ud-I ($\alpha$ = 1.1) | **89.0** | 83.5 ± 2.6 |
| BLESS | GloVe U+V | 86.0 | 83.4 ± 2.5 |
| BLESS | Word2Vec U+V | 80.0 | 77.3 ± 2.5 |
| BLESS | GloVe [53] | 82.0 | - |
| BLESS | Word2Vec [53] | 81.0 | - |
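The purity values above are obtained by clustering the word vectors of each dataset and scoring each cluster by its majority gold category. The sketch below approximates the spherical clustering of the main text with k-means on L2-normalized vectors; the paper itself relies on the spherecluster implementations, so this is illustrative only:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_purity(X: np.ndarray, gold: np.ndarray, n_clusters: int, seed: int = 0) -> float:
    """Cluster word vectors and return purity x 100 against integer gold categories.

    Spherical k-means is approximated here by k-means on L2-normalized rows.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Xn)
    correct = 0
    for k in range(n_clusters):
        members = gold[pred == k]
        if len(members):
            correct += np.bincount(members).max()    # majority gold label in cluster k
    return 100 * correct / len(gold)
```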

**Table 4.** Test AUC and accuracy for 20 Newsgroups multiclass classification (BatchNorm + Dense), compared to baseline vectors. The best $\alpha$ and the best limit method (on validation) are reported in parentheses.

| Method | AUC | acc |
|---|---|---|
| Word2Vec U+V | 95.66 | 63.17 |
| GloVe U+V | 96.34 | 65.06 |
| E-U+V-0-F | 96.76 (0.2) | 65.86 (0.4) |
| E-U+V-u-F | **96.79** (0.2) | **66.30** (0.2) |
| E-U+V-ud-F | **96.79** (0.4) | 65.24 (0.6) |
| LE-U+V-0-F | 96.65 (t3-w) | 64.47 (t1) |
| LE-U+V-u-F | 96.65 (t3-w) | 64.54 (t1) |
| LE-U+V-ud-F | 96.38 (t5-w) | 64.76 (t3-w) |
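To make the linear classifier of this table concrete, below is a minimal Keras sketch of a BatchNorm + Dense head on top of fixed document representations (e.g., averaged $\alpha$-embeddings). The input dimension, optimizer, and training details are assumptions, not the authors' exact setup:

```python
import tensorflow as tf

def linear_head(input_dim: int, n_classes: int = 20) -> tf.keras.Model:
    """BatchNorm + Dense softmax classifier over fixed document vectors."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```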

**Table 5.** Test accuracy for IMDB Reviews binary sentiment classification, with a linear architecture (BatchNorm + Dense) and with a BiLSTM architecture (bidirectional LSTM with 32 channels, GlobalMaxPool1D, Dense 20 + Dropout 0.05, Dense), compared to baseline vectors. The best $\alpha$ and the best limit method (on validation) are reported in parentheses.

| Method | acc (linear) | acc (BiLSTM) |
|---|---|---|
| Word2Vec U+V | 82.84 | 87.61 |
| GloVe U+V | 83.76 | 88.00 |
| E-U+V-0-F | 83.58 (2.4) | 88.12 (−4.0) |
| E-U+V-u-F | 83.72 (−3.0) | 88.56 (−4.0) |
| E-U+V-ud-F | 84.23 (−3.0) | 88.48 (−2.2) |
| LE-U+V-0-F | 84.00 (t1) | 88.36 (t1) |
| LE-U+V-u-F | **84.29** (t1) | **88.66** (t1) |
| LE-U+V-ud-F | 84.00 (t3-w) | 88.49 (t3-w) |
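Following the architecture listed in the caption, a Keras sketch of the BiLSTM variant is given below. The maximum sequence length, the ReLU activation on the Dense(20) layer, and the handling of the frozen embedding matrix are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

def bilstm_model(emb_matrix: np.ndarray, max_len: int = 400) -> tf.keras.Model:
    """BiLSTM(32) -> GlobalMaxPool1D -> Dense(20) + Dropout(0.05) -> sigmoid output."""
    vocab_size, dim = emb_matrix.shape
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, dim,
                                  weights=[emb_matrix], trainable=False),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
        tf.keras.layers.GlobalMaxPool1D(),
        tf.keras.layers.Dense(20, activation="relu"),
        tf.keras.layers.Dropout(0.05),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```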

**Table 6.** Test accuracy of $\alpha$-embeddings on the Stanford Natural Language Inference (SNLI) sentence entailment task, compared to GloVe and Word2Vec baseline vectors. We report experiments both with and without a projection matrix. The best values of $\alpha$ are reported in parentheses. The values with the largest improvement over the baselines are marked in bold.

| Method | No Projection | Projection |
|---|---|---|
| GloVe U+V | 83.2 | 83.4 |
| Word2Vec U+V | 76.1 | 81.7 |
| E-U+V-0-I | 83.6 (−7) | 84.2 (−4) |
| E-U+V-0-F | 84.1 (−4) | 84.2 (−1) |
| E-U+V-u-I | 84.0 (−4) | 84.0 (−4) |
| E-U+V-u-F | **84.6** (**−8**) | **84.5** (**−8**) |
| E-U+V-ud-I | 83.8 (−1) | 84.0 (−1) |
| E-U+V-ud-F | 84.1 (−2) | **84.5** (**−1**) |
| GloVe U | 83.7 | 84.1 |
| Word2Vec U | 74.6 | 76.1 |
| E-U-0-I | 83.7 (+1) | 84.1 (+1) |
| E-U-0-F | **84.0** (**+3**) | **84.3** (**+1**) |
| E-U-u-I | 83.5 (−6) | 84.0 (+1) |
| E-U-u-F | 83.9 (−5) | 84.2 (−10) |
| E-U-ud-I | 82.8 (−6) | 84.0 (+1) |
| E-U-ud-F | 83.1 (−5) | 84.0 (+1) |
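To illustrate the "Projection" configuration of Table 6, the sketch below shows frozen ($\alpha$-)embeddings followed by a trainable linear projection, as would be fed to the downstream entailment encoder. It is written in PyTorch (the framework referenced for these experiments); the projection dimension and module interface are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Frozen pretrained embeddings followed by a trainable linear projection."""

    def __init__(self, emb_matrix: torch.Tensor, proj_dim: int = 200):
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(emb_matrix, freeze=True)
        self.proj = nn.Linear(emb_matrix.size(1), proj_dim, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> (batch, seq_len, proj_dim)
        return self.proj(self.emb(token_ids))
```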

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
