Investigating the Structural Properties of Linguistic Biases in Multilingual Language Models
Abstract
1. Introduction
- We provide a broad empirical comparison of multilingual LLM representations across multiple models, languages, and evaluation settings.
- We identify consistent correlations between syntactic similarity and representational similarity across layers and models.
- We evaluate how these correlations relate to cross-lingual transfer performance using probing-based analysis.
- We introduce an attention-based structural comparison that uses tree distance as an additional diagnostic signal, and we discuss its limitations as a proxy for syntax.
- We analyze how these observations vary across model architectures and layers, highlighting recurring patterns and open questions.
2. Related Work
2.1. Cross-Lingual Language Representations
2.2. Structure and Geometry of Language Models
3. Methodology
3.1. Cosine Similarity Metrics for Comparing Hidden Representations
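As a minimal sketch of the kind of metric this subsection names, the snippet below computes cosine similarity between two hidden-state vectors and a mean cosine distance over aligned rows (for example, representations of translation-equivalent words or sentences in two languages). The function names, the row-wise alignment, and the `eps` guard are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two 1-D vectors (eps guards against zero norms)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = max(np.linalg.norm(u) * np.linalg.norm(v), eps)
    return float(np.dot(u, v) / denom)


def mean_cosine_distance(reps_a, reps_b, eps=1e-12):
    """Mean of (1 - cosine similarity) over aligned rows of two matrices.

    Assumption: row k of reps_a and row k of reps_b represent the same
    item (word or sentence) in two different languages.
    """
    a = np.asarray(reps_a, dtype=float)
    b = np.asarray(reps_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("reps_a and reps_b must have the same shape")
    num = np.sum(a * b, axis=1)
    den = np.maximum(np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1), eps)
    return float(np.mean(1.0 - num / den))
```

A distance of 0 means the aligned representations point in the same direction, and 2 means they point in opposite directions.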
3.2. Typological Distances Across Languages
- geography (“geo”): geographic distance between languages on the globe;
- syntax average (“syntac”): an average score representing how distinct the syntactic paradigms observed in a given language are;
- phonology average (“phono”): an average score representing the production rules of a language's speech sounds;
- inventory average (“invent”): an average score representing features related to the phonetic inventory or the lexical patterns of a language.
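The distance types listed above can be compared against representational distances by correlating the two over all language pairs. The sketch below does this with Pearson's correlation coefficient on the upper triangles of two symmetric language-by-language matrices. The function names and the toy matrices are hypothetical and chosen only for illustration; they are not values from the paper.

```python
import numpy as np


def upper_triangle(matrix):
    """Return the strictly upper-triangular entries of a square matrix."""
    m = np.asarray(matrix, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        raise ValueError("expected a square matrix")
    rows, cols = np.triu_indices(m.shape[0], k=1)
    return m[rows, cols]


def pearson_between_distance_matrices(typological, representational):
    """Pearson r between two symmetric distance matrices.

    Each unordered language pair is counted once, and the diagonal
    (a language compared with itself) is ignored.
    """
    x = upper_triangle(typological)
    y = upper_triangle(representational)
    return float(np.corrcoef(x, y)[0, 1])


# Toy, made-up distances for three languages (not data from the paper).
typ = [[0.0, 0.2, 0.6],
       [0.2, 0.0, 0.5],
       [0.6, 0.5, 0.0]]
rep = [[0.0, 0.5, 1.3],
       [0.5, 0.0, 1.1],
       [1.3, 1.1, 0.0]]
r = pearson_between_distance_matrices(typ, rep)
```

Taking only the upper triangle avoids double-counting pairs, which would otherwise leave the coefficient unchanged but inflate the apparent sample size in any significance test.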
3.3. Structural Probing
3.4. Analyzing Dependency Structures in the Attention Layer Using Editing Tree Distance
| Algorithm 1 Syntactic Distances to Binary Constituency Trees [42]. |
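The body of Algorithm 1 is not reproduced here. As a hedged sketch of the commonly used greedy top-down procedure for turning syntactic distances into a binary constituency tree, the function below splits the sentence at the largest distance between adjacent words and recurses on both halves. The function name, the nested-tuple tree representation, and the tie-breaking rule (first maximum) are assumptions for illustration, not details confirmed by the paper.

```python
def distances_to_tree(words, dists):
    """Build a binary tree from adjacent-word syntactic distances.

    words: list of n tokens.
    dists: list of n - 1 numbers; dists[k] is the distance between
           words[k] and words[k + 1].
    Returns a token (leaf) or a (left, right) tuple (internal node).
    """
    if len(words) != len(dists) + 1:
        raise ValueError("need exactly len(words) - 1 distances")
    if not dists:
        return words[0]
    # Split at the largest distance; ties go to the leftmost position.
    split = max(range(len(dists)), key=dists.__getitem__)
    left = distances_to_tree(words[: split + 1], dists[:split])
    right = distances_to_tree(words[split + 1:], dists[split + 1:])
    return (left, right)


tree = distances_to_tree(["the", "cat", "sat"], [0.1, 0.9])
# tree == (("the", "cat"), "sat")
```

Because the split is always taken at the maximum, larger distances correspond to constituent boundaries higher in the tree.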
4. Experiments
5. Results
6. Limitations and Reproducibility
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Average Cross-Lingual Mean Cosine Distances for Word- and Sentence-Level Representations in Selected Pre-Trained LLMs








Appendix B. F1 Accuracies for All Models in the Probing Task for POS Tagging




Appendix C. Cross-Lingual Tree Distance Analysis Between Attention Maps and UD in All Models




Appendix D. Editing Tree Distance Algorithm
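| Algorithm A1 Editing Tree Distance Algorithm [45]. | |
| 1: Input: Trees $T_1$ and $T_2$ | |
| 2: Output: Edit distance between $T_1[i]$ and $T_2[j]$, where $1 \le i \le \lvert T_1 \rvert$ and $1 \le j \le \lvert T_2 \rvert$ | |
| 3: function TreeEditDistance($T_1, T_2$) | |
| 4: for all $i' = 1, \dots, \lvert \mathrm{LR\_keyroots1} \rvert$ do | |
| 5: for all $j' = 1, \dots, \lvert \mathrm{LR\_keyroots2} \rvert$ do | |
| 6: i = LR_keyroots1[i’] | |
| 7: j = LR_keyroots2[j’] | |
| 8: TreeDist($i, j$) | ▷ Compute TreeDist with dynamic programming |
| 9: end for | |
| 10: end for | |
| 11: end function | |
| 12: function TreeDist($i, j$) | |
| 13: $\mathrm{forestdist}(\emptyset, \emptyset) = 0$ | |
| 14: for $i_1 = l(i)$ to i do | |
| 15: $\mathrm{forestdist}(T_1[l(i)..i_1], \emptyset) = \mathrm{forestdist}(T_1[l(i)..i_1 - 1], \emptyset) + \gamma(T_1[i_1] \to \Lambda)$ | |
| 16: end for | |
| 17: for $j_1 = l(j)$ to j do | |
| 18: $\mathrm{forestdist}(\emptyset, T_2[l(j)..j_1]) = \mathrm{forestdist}(\emptyset, T_2[l(j)..j_1 - 1]) + \gamma(\Lambda \to T_2[j_1])$ | |
| 19: end for | |
| 20: for $i_1 = l(i)$ to i do | |
| 21: for $j_1 = l(j)$ to j do | |
| 22: if $l(i_1) = l(i)$ and $l(j_1) = l(j)$ then | |
| 23: $\mathrm{forestdist}(T_1[l(i)..i_1], T_2[l(j)..j_1]) = \min\{\mathrm{forestdist}(T_1[l(i)..i_1 - 1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda),\ \mathrm{forestdist}(T_1[l(i)..i_1], T_2[l(j)..j_1 - 1]) + \gamma(\Lambda \to T_2[j_1]),\ \mathrm{forestdist}(T_1[l(i)..i_1 - 1], T_2[l(j)..j_1 - 1]) + \gamma(T_1[i_1] \to T_2[j_1])\}$ | |
| 24: $\mathrm{treedist}(i_1, j_1) = \mathrm{forestdist}(T_1[l(i)..i_1], T_2[l(j)..j_1])$ | |
| 25: else | |
| 26: $\mathrm{forestdist}(T_1[l(i)..i_1], T_2[l(j)..j_1]) = \min\{\mathrm{forestdist}(T_1[l(i)..i_1 - 1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda),\ \mathrm{forestdist}(T_1[l(i)..i_1], T_2[l(j)..j_1 - 1]) + \gamma(\Lambda \to T_2[j_1]),\ \mathrm{forestdist}(T_1[l(i)..l(i_1) - 1], T_2[l(j)..l(j_1) - 1]) + \mathrm{treedist}(i_1, j_1)\}$ | |
| 27: end if | |
| 28: end for | |
| 29: end for | |
| 30: end function | |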
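The listing above can be sketched in Python as follows. This is a minimal illustration, assuming unit costs for insertion, deletion, and relabeling, and a tree encoded as a `(label, [children])` tuple. Both choices, and all function names, are assumptions made for this example rather than details taken from the paper. Nodes are indexed from 0 in postorder instead of from 1.

```python
def _postorder(tree):
    """Return (labels, lmd): labels in postorder, and for each node the
    postorder index of its leftmost leaf descendant."""
    labels, lmd = [], []

    def visit(node):
        label, children = node
        first = None
        for child in children:
            idx = visit(child)
            if first is None:
                first = lmd[idx]
        labels.append(label)
        i = len(labels) - 1
        lmd.append(i if first is None else first)
        return i

    visit(tree)
    return labels, lmd


def _keyroots(lmd):
    """For each distinct leftmost leaf, keep the highest node that has it."""
    last = {}
    for i, leaf in enumerate(lmd):
        last[leaf] = i
    return sorted(last.values())


def _tree_dist(i, j, a, b, l1, l2, td):
    """Fill td for all subtree pairs that share leftmost leaves with i and j."""
    li, lj = l1[i], l2[j]
    m, n = i - li + 2, j - lj + 2
    fd = [[0] * n for _ in range(m)]  # fd[0][0] is the empty-forest pair
    for x in range(1, m):
        fd[x][0] = fd[x - 1][0] + 1  # delete a node of the first tree
    for y in range(1, n):
        fd[0][y] = fd[0][y - 1] + 1  # insert a node of the second tree
    for x in range(1, m):
        for y in range(1, n):
            i1, j1 = li + x - 1, lj + y - 1
            if l1[i1] == li and l2[j1] == lj:
                # Both prefixes are whole trees: allow relabeling.
                sub = 0 if a[i1] == b[j1] else 1
                fd[x][y] = min(fd[x - 1][y] + 1,
                               fd[x][y - 1] + 1,
                               fd[x - 1][y - 1] + sub)
                td[i1][j1] = fd[x][y]
            else:
                # Reuse the stored distance between the two subtrees.
                p, q = l1[i1] - li, l2[j1] - lj
                fd[x][y] = min(fd[x - 1][y] + 1,
                               fd[x][y - 1] + 1,
                               fd[p][q] + td[i1][j1])


def tree_edit_distance(t1, t2):
    """Unit-cost edit distance between two ordered labeled trees."""
    a, l1 = _postorder(t1)
    b, l2 = _postorder(t2)
    td = [[0] * len(b) for _ in a]
    for i in _keyroots(l1):
        for j in _keyroots(l2):
            _tree_dist(i, j, a, b, l1, l2, td)
    return td[-1][-1]


t1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
t2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
distance = tree_edit_distance(t1, t2)
```

In this example the two trees differ by deleting the node `c` and reinserting it one level higher, so `distance` is 2. Visiting keyroots in ascending postorder guarantees that every `td` entry read in the `else` branch has already been filled by an earlier call.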
References
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
- Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
- Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805.
- Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.; Love, J.; et al. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295.
- Üstün, A.; Aryabumi, V.; Yong, Z.; Ko, W.; D’Souza, D.; Onilude, G.; Bhandari, N.; Singh, S.; Ooi, H.; Kayid, A.; et al. Aya Model. arXiv 2024, arXiv:2402.07827.
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020.
- Karthikeyan, K.; Wang, Z.; Mayhew, S.; Roth, D. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Philippy, F.; Guo, S.; Haddadan, S. Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space. In Proceedings of the Workshop on Computational Linguistic Typology, Dubrovnik, Croatia, 6 May 2023.
- Xu, R.; Yang, Y.; Otani, N.; Wu, Y. Unsupervised Cross-Lingual Transfer of Word Embedding Spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2465–2474.
- Wu, S.; Conneau, A.; Li, H.; Zettlemoyer, L.; Stoyanov, V. Emerging Cross-Lingual Structure in Pretrained Language Models. arXiv 2019, arXiv:1911.01464.
- Dufter, P.; Schütze, H. Identifying Necessary Elements for BERT’s Multilinguality. arXiv 2020, arXiv:2005.00396.
- Shah, C. Correlations between Multilingual Language Model Geometry and Crosslingual Transfer Performance. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), Torino, Italy, 20–25 May 2024.
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
- Artetxe, M.; Labaka, G.; Agirre, E. Learning Principled Bilingual Mappings of Word Embeddings while Preserving Monolingual Invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2289–2294.
- Artetxe, M.; Labaka, G.; Agirre, E. Learning Bilingual Word Embeddings with (Almost) No Bilingual Data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 451–462.
- Artetxe, M.; Labaka, G.; Agirre, E. A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 789–798.
- Chen, X.; Cardie, C. Unsupervised Multilingual Word Embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 261–270.
- Conneau, A.; Lample, G.; Ranzato, M.-A.; Denoyer, L.; Jégou, H. Word Translation without Parallel Data. arXiv 2017, arXiv:1710.04087.
- Hoshen, Y.; Wolf, L. Non-Adversarial Unsupervised Word Translation. arXiv 2018, arXiv:1801.06126.
- Ruder, S.; Vulić, I.; Søgaard, A. A Survey of Cross-Lingual Word Embedding Models. J. Artif. Intell. Res. 2019, 65, 569–631.
- Smith, S.; Turban, D.; Hamblin, S.; Hammerla, N. Offline Bilingual Word Vectors. arXiv 2017, arXiv:1702.03859.
- Wu, S.; Dredze, M. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. arXiv 2019, arXiv:1904.09077.
- Sabet, M.; Dufter, P.; Schütze, H. SimAlign. arXiv 2020, arXiv:2004.08728.
- Jones, A.; Wang, W.Y.; Mahowald, K. A Massively Multilingual Analysis of Cross-Linguality in Shared Embedding Space. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021.
- Blevins, T.; Levy, O.; Zettlemoyer, L. Deep RNNs Encode Soft Hierarchical Syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 14–19.
- Peters, M.; Neumann, M.; Zettlemoyer, L.; Yih, W.-t. Dissecting Contextual Word Embeddings: Architecture and Representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1499–1509.
- Tenney, I.; Das, D.; Pavlick, E. BERT Rediscovers the Classical NLP Pipeline. arXiv 2019, arXiv:1905.05950.
- LeCun, Y.; Bengio, Y. Convolutional Networks for Images, Speech, and Time-Series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1995.
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Liu, N.F.; Gardner, M.; Belinkov, Y.; Peters, M.; Smith, N.A. Linguistic Knowledge and Transferability of Contextual Representations. In Proceedings of the NAACL, Minneapolis, MN, USA, 2–7 June 2019.
- Hewitt, J.; Manning, C.D. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019.
- Coenen, A.; Reif, E.; Yuan, A.; Kim, B.; Pearce, A.; Viégas, F.; Wattenberg, M. Visualizing and Measuring the Geometry of BERT. arXiv 2019, arXiv:1906.02715.
- Littell, P.; Mortensen, D.; Lin, K.; Kairis, K.; Turner, C.; Levin, L. URIEL and lang2vec. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017.
- Sedgwick, P. Pearson’s Correlation Coefficient. BMJ 2012, 345, e4483.
- Adi, Y.; Kermany, E.; Belinkov, Y.; Lavi, O.; Goldberg, Y. Fine-Grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Conneau, A.; Kruszewski, G.; Lample, G.; Barrault, L.; Baroni, M. What You Can Cram into a Single Vector. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018.
- Köhn, A. Evaluating Embeddings using Syntax-based Classification Tasks as a Proxy for Parser Performance. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany, 7–12 August 2016.
- Veldhoen, S.; Hupkes, D.; Zuidema, W. Diagnostic Classifiers Revealing how Neural Networks Process Hierarchical Structure. In Proceedings of the CoCo Workshop at Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 69–77.
- Kim, T.; Choi, J.; Edmiston, D.; Lee, S. Are Pre-trained Language Models Aware of Phrases? In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
- Menéndez, M.; Pardo, J.; Pardo, L.; Pardo, M. The Jensen–Shannon Divergence. J. Frankl. Inst. 1997, 334, 307–318.
- Shen, Y.; Lin, Z.; Huang, C.; Courville, A. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
- Shen, Y.; Tan, S.; Sordoni, A.; Courville, A. Ordered Neurons. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
- de Marneffe, M.-C.; Manning, C.D.; Nivre, J.; Zeman, D. Universal Dependencies. Comput. Linguist. 2021, 47, 255–308.
- Zhang, K.; Shasha, D. Simple Fast Algorithms for the Editing Distance between Trees. SIAM J. Comput. 1989, 18, 1245–1262.
- Dellert, J.; Daneyko, T.; Münch, A.; Ladygina, A.; Buch, A.; Clarius, N.; Grigorjew, I.; Balabel, M.; Boga, H.; Baysarova, Z.; et al. NorthEuraLex: A Wide-Coverage Lexical Database of Northern Eurasia. Lang. Resour. Eval. 2019, 54, 273–301.
- Cettolo, M.; Girardi, C.; Federico, M. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation, Trento, Italy, 28–30 May 2012.
- Şahin, G.; Vania, C.; Kuznetsov, I.; Gurevych, I. LINSPECTOR. Comput. Linguist. 2020, 46, 335–385.
- Brinkmann, J.; Wendler, C.; Bartelt, C.; Mueller, A. Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages. arXiv 2025, arXiv:2501.06346.
- Wasserman, L. All of Statistics: A Concise Course in Statistical Inference; Springer: New York, NY, USA, 2004.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Mantri, R.; Chen, S.; Wang, Y.; Ataman, D. Investigating the Structural Properties of Linguistic Biases in Multilingual Language Models. Information 2026, 17, 498. https://doi.org/10.3390/info17050498
