Linguistic Influence on Multidimensional Word Embeddings: Analysis of Ten Languages
Abstract
1. Introduction
- Do clear language clusters form in the MUSE embedding space?
- How close are word vectors for different languages depending on shared script, morphological complexity, and lexical overlap?
- Which factors are most strongly associated with language separation in the embedding space?
Contribution and Structure of the Paper
2. Terminology
3. Related Works
4. Data and Methodology
4.1. Data Collection
4.2. Materials and Methods
4.2.1. Random Forest Classification
4.2.2. Cosine Distance Analysis
4.2.3. UMAP Visualization
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186.
- Pires, T.; Schlinger, E.; Garrette, D. How Multilingual Is Multilingual BERT? Available online: https://api.semanticscholar.org/CorpusID:174798142 (accessed on 8 November 2025).
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. Available online: https://arxiv.org/abs/1911.02116 (accessed on 8 November 2025).
- Ruder, S.; Vulić, I.; Søgaard, A. A Survey of Cross-Lingual Word Embedding Models. Available online: https://arxiv.org/pdf/1706.04902v2 (accessed on 8 November 2025).
- Lukoyanova, T.V. Cognitive Terminology as One of the New Schools in Modern Linguistics. Ling. Mobilis 2014, 3, 75–80. Available online: https://cyberleninka.ru/article/n/kognitivnoe-terminovedenie-kak-odno-iz-napravleniy-sovremennoy-lingvistiki (accessed on 17 October 2025).
- Boldyrev, N.E.; Belyaeva, I.V. Cognitive Mechanisms for Constructing the Interpretative Meaning of Phraseological Units in the Context of Conflict-Free Communication. Bull. Russ. Univ. Peoples’ Friendsh. Ser. Linguist. Semiot. Semant. 2022, 13, 925–936. Available online: https://www.researchgate.net/publication/366650980_Cognitive_Mechanisms_of_Phraseological_Units_Interpretive_Meaning_Construction_in_Relation_to_Conflict-Free_Communication (accessed on 17 October 2025).
- Boldyrev, N.N.; Efimenko, T.N. The Influence Potential of Media Text: A Cognitive Approach. Issues Cogn. Linguist. 2025, 3, 5–18. Available online: https://vcl.ralk.info/issues/2025/vypusk-3-2025/vozdeystvuyushchiy-potentsial-mediateksta-kognitivnyy-podkhod.html (accessed on 8 November 2025).
- Konurbaev, M.E.; Ganeeva, E.R. Cognitive Basis of Speech Compression in Oral Translation. Issues Cogn. Linguist. 2024, 2, 24–32. Available online: https://vcl.ralk.info/issues/2024/vypusk-2-2024/kognitivnye-osnovy-rechevoy-kompressii-v-ustnom-perevode.html (accessed on 17 October 2025).
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. Available online: https://fasttext.cc/docs/en/crawl-vectors.html (accessed on 8 November 2025).
- Meta AI. MUSE: Multilingual Unsupervised and Supervised Embeddings. Available online: https://github.com/facebookresearch/MUSE (accessed on 17 December 2025).
- Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; Jégou, H. Word Translation Without Parallel Data. Available online: https://arxiv.org/abs/1710.04087 (accessed on 17 December 2025).
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Salton, G.; Wong, A.; Yang, C.S. A Vector Space Model for Automatic Indexing. Commun. ACM 1975, 18, 613–620.
- McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. 2018, 3, 861.
- Daniels, P.T.; Bright, W. (Eds.) The World’s Writing Systems; Oxford University Press: New York, NY, USA, 1996; ISBN 9780195079937.
- Aronoff, M.; Fudeman, K. What Is Morphology? 2nd ed.; Wiley-Blackwell: Malden, MA, USA, 2011.
- Speer, R. Rspeer/wordfreq: V3.0. Zenodo 2022.
- Brysbaert, M.; New, B. Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behav. Res. Methods 2009, 41, 977–990.
- van Heuven, W.J.B.; Mandera, P.; Keuleers, E.; Brysbaert, M. SUBTLEX-UK: A New and Improved Word Frequency Database for British English. Q. J. Exp. Psychol. 2014, 67, 1176–1190.
- Zipf, G.K. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology; Addison-Wesley Press: Cambridge, MA, USA, 1949.
- Abu-Rayyash, H.; Lacruz, I. Through the Eyes of the Viewer: The Cognitive Load of LLM-Generated vs. Professional Arabic Subtitles. J. Eye Mov. Res. 2025, 18, 29.
- Saretzki, J.; Knopf, T.; Forthmann, B.; Goecke, B.; Jaggy, A.-K.; Benedek, M.; Weiss, S. Scoring German Alternate Uses Items Applying Large Language Models. J. Intell. 2025, 13, 64.
- Shafron, E. The Accounting Tower of Babel: Language and the Translation of International Accounting Standards. SSRN Electron. J. 2023, 4394442, 1–57.
- Anderson, R.; Scala, C.; Samuel, J.; Kumar, V.; Jain, P. Are Emotions Conveyed across Machine Translations? Establishing an Analytical Process for the Effectiveness of Multilingual Sentiment Analysis with Italian Text. Available online: https://doi.org/10.2139/ssrn.5266525 (accessed on 7 January 2024).
- Ding, Q.; Cao, H.; Cao, Z.; Zhou, Y.; Zhao, T. Cross-Lingual Semantic Information Fusion for Word Translation Enhancement. Available online: https://doi.org/10.2139/ssrn.5062126 (accessed on 8 November 2025).
- Fu, B.; Brennan, R.; O’Sullivan, D. A Configurable Translation-Based Cross-Lingual Ontology Mapping System to Adjust Mapping Outcome. In Proceedings of the ESWC 2012, Heraklion, Greece, 27–31 May 2012.
- Kvapilíková, I.; Artetxe, M.; Labaka, G.; Agirre, E.; Bojar, O. Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Seattle, WA, USA, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 255–262.






| Study | Model/Representation | Languages | Main Focus | Limitation Relative to Our Work |
|---|---|---|---|---|
| Devlin et al. [1] | mBERT (contextual transformer) | 104 | Cross-lingual transfer and downstream performance | Focus on contextual embeddings; no explicit analysis of static word-type geometry or language-level distances |
| Pires et al. [2] | mBERT (contextual transformer) | 104 | Cross-lingual generalization across Wikipedia articles | Emphasis on task transfer; lacks interpretable metrics for embedding-space structure |
| Kvapilíková et al. [27] | Unsupervised multilingual sentence embeddings | Multiple European languages | Parallel corpus mining using multilingual sentence representations | Concentrates on sentence-level alignment; does not study language clustering in static word embeddings |
| Ruder et al. [4] | Survey of cross-lingual embedding models | Various | Systematic overview of cross-lingual word, sentence, and document embeddings | Review-oriented; does not provide a concrete empirical setup comparable to our MUSE-based analysis |

| Language | Arabic | German | English | Persian | Hindi | Indonesian | Lithuanian | Russian | Tajik | Chinese |
|---|---|---|---|---|---|---|---|---|---|---|
| Arabic | 0.00 | 0.95 | 1.05 | 0.98 | 1.05 | 0.96 | 0.92 | 0.98 | 0.89 | 1.10 |
| German | 0.95 | 0.00 | 0.98 | 0.84 | 0.97 | 0.90 | 0.81 | 1.02 | 0.87 | 1.00 |
| English | 1.05 | 0.98 | 0.00 | 1.02 | 1.04 | 1.01 | 0.97 | 0.99 | 1.01 | 1.04 |
| Persian | 0.98 | 0.84 | 1.02 | 0.00 | 1.04 | 1.03 | 1.02 | 1.01 | 0.91 | 1.06 |
| Hindi | 1.05 | 0.97 | 1.04 | 1.04 | 0.00 | 0.95 | 1.11 | 1.01 | 1.07 | 0.97 |
| Indonesian | 0.96 | 0.90 | 1.01 | 1.03 | 0.95 | 0.00 | 0.92 | 0.93 | 0.87 | 0.91 |
| Lithuanian | 0.92 | 0.81 | 0.97 | 1.02 | 1.11 | 0.92 | 0.00 | 1.08 | 0.89 | 1.05 |
| Russian | 0.98 | 1.02 | 0.99 | 1.01 | 1.01 | 0.93 | 1.08 | 0.00 | 0.93 | 0.89 |
| Tajik | 0.89 | 0.87 | 1.01 | 0.91 | 1.07 | 0.87 | 0.89 | 0.93 | 0.00 | 0.92 |
| Chinese | 1.10 | 1.00 | 1.04 | 1.06 | 0.97 | 0.91 | 1.05 | 0.89 | 0.92 | 0.00 |
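
The matrix above can be inspected programmatically. The following is a minimal sketch, assuming the entries are pairwise cosine distances between languages (1 minus cosine similarity, which would explain values above 1.00); how the paper aggregates word vectors per language is not stated here, so the code only reads the published values. The names `LANGS`, `D`, `is_symmetric`, and `ranked_pairs` are illustrative and not taken from the paper.

```python
from itertools import combinations

# Language order follows the rows and columns of the table above.
LANGS = ["Arabic", "German", "English", "Persian", "Hindi",
         "Indonesian", "Lithuanian", "Russian", "Tajik", "Chinese"]

# Values copied verbatim from the table (assumed to be cosine distances).
D = [
    [0.00, 0.95, 1.05, 0.98, 1.05, 0.96, 0.92, 0.98, 0.89, 1.10],
    [0.95, 0.00, 0.98, 0.84, 0.97, 0.90, 0.81, 1.02, 0.87, 1.00],
    [1.05, 0.98, 0.00, 1.02, 1.04, 1.01, 0.97, 0.99, 1.01, 1.04],
    [0.98, 0.84, 1.02, 0.00, 1.04, 1.03, 1.02, 1.01, 0.91, 1.06],
    [1.05, 0.97, 1.04, 1.04, 0.00, 0.95, 1.11, 1.01, 1.07, 0.97],
    [0.96, 0.90, 1.01, 1.03, 0.95, 0.00, 0.92, 0.93, 0.87, 0.91],
    [0.92, 0.81, 0.97, 1.02, 1.11, 0.92, 0.00, 1.08, 0.89, 1.05],
    [0.98, 1.02, 0.99, 1.01, 1.01, 0.93, 1.08, 0.00, 0.93, 0.89],
    [0.89, 0.87, 1.01, 0.91, 1.07, 0.87, 0.89, 0.93, 0.00, 0.92],
    [1.10, 1.00, 1.04, 1.06, 0.97, 0.91, 1.05, 0.89, 0.92, 0.00],
]


def is_symmetric(matrix, tol=1e-9):
    """Return True if matrix[i][j] equals matrix[j][i] for every pair."""
    n = len(matrix)
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(n))


def ranked_pairs(matrix, labels):
    """List each unordered language pair once, sorted by ascending distance."""
    return sorted((matrix[i][j], labels[i], labels[j])
                  for i, j in combinations(range(len(labels)), 2))


pairs = ranked_pairs(D, LANGS)
closest, farthest = pairs[0], pairs[-1]

print("symmetric:", is_symmetric(D))
print("closest pair:", closest)
print("farthest pair:", farthest)
```

Under these assumptions, the smallest off-diagonal value is 0.81 (German and Lithuanian) and the largest is 1.11 (Hindi and Lithuanian). Most entries lie close to 1.00, which for cosine distance corresponds to near-orthogonal directions.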
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Aleshina, A.V.; Bulgakov, A.L.; Xin, Y.; Skrebkova, L.S. Linguistic Influence on Multidimensional Word Embeddings: Analysis of Ten Languages. Computation 2026, 14, 16. https://doi.org/10.3390/computation14010016

