Investigating the Structural Properties of Linguistic Biases in Multilingual Language Models

Mantri, Raghav; Chen, Saun; Wang, Yixuan; Ataman, Duygu

doi:10.3390/info17050498

¹ Computer Science Department, New York University, New York, NY 10012, USA
² Graduate School of Informatics, Middle East Technical University, 06800 Ankara, Türkiye
* Author to whom correspondence should be addressed.

Information 2026, 17(5), 498; https://doi.org/10.3390/info17050498

Submission received: 5 March 2026 / Revised: 15 April 2026 / Accepted: 8 May 2026 / Published: 18 May 2026

(This article belongs to the Special Issue Human and Machine Translation: Recent Trends and Foundations)

Abstract

As large language models (LLMs) scale to cover more languages, their potential to support low-resource settings becomes increasingly promising. However, the mechanisms underlying cross-lingual transfer and the factors that facilitate it remain insufficiently understood. Prior work has highlighted the role of linguistic similarity—particularly syntactic structure—in enabling transfer across languages. In this study, we present a broad empirical analysis of how multilingual LLMs encode and relate structural information across languages with varying typological properties. We combine multiple complementary methods, including hidden-state similarity analysis, typological correlation, probing for syntactic features, and attention-based structural comparisons, across four multilingual models and thirteen languages. Our findings show consistent correlations between representational similarity and syntactic relatedness, suggesting that structural properties of language influence how information is organized and shared across languages. We further observe that attention-derived structures exhibit partial alignment with gold-standard syntax, though this alignment should be interpreted as heuristic rather than direct evidence of syntactic encoding. Overall, our results provide a comparative empirical perspective on cross-lingual structural bias in multilingual LLMs and highlight the importance of careful methodological interpretation when linking representation geometry to linguistic structure.

Keywords: large language models; multilinguality; representation learning; learning bias; generalization

1. Introduction

Multilingual large language models (LLMs) remove the need to develop a separate model for each individual language, and they offer the potential to improve applicability in many under-represented languages by facilitating knowledge transfer between high-resourced and low-resourced languages. While promising results have been obtained with multilingual models, increasingly established multilingual development methodologies have also raised awareness of language diversity and enabled progress in language modeling research. Ultimately, these developments show potential to build the next generation of language models that are applicable across languages and domains. While state-of-the-art models are trained on hundreds of languages [1,2,3,4,5], their competence remains limited to the high-resourced languages present in training sets. Extending these advances to low-resourced and under-represented languages is a promising direction that has attracted the attention of researchers in recent years. The main obstacle to efficient and successful multilingual LLM development is the limited understanding of the fundamental factors that allow information to transfer across languages.

A primary conclusion of many studies is that large multilingual models ultimately suffer from the curse of multilinguality [6], according to which the performance of a multilingual model with a fixed model capacity tends to decline after a point as the number of languages in the training data increases. Thus, many studies suggest that adding more languages to a language model may hinder further improvement in low-resource languages, most likely due to incompatibility between the linguistic structures of different languages [7]. Previous works [8,9] have analyzed representations encoded in LLMs and found that similar languages are grouped together in vector space and that the distances between these representations may be significant for the sharing of information across languages. More importantly, many studies have shown that successful information sharing can be correlated with certain network parameters [10], such as vocabulary [7], hidden states, or attention heads [11]. Recently, the author of [12] reported a positive correlation between geometric distances across language representations in vector space and zero-shot cross-lingual transfer performance with respect to encoded grammatical information. Nevertheless, previous studies are limited to encoder-only models, and their scaling and extension to larger, more recent decoder-only models remain to be explored.

In this paper, we present a systematic empirical study examining how linguistic properties relate to cross-lingual representation and transfer in multilingual LLMs. Rather than proposing a unified theory, our goal is to provide a comparative analysis across multiple models, languages, and evaluation signals in order to better characterize observable patterns in multilingual representations. Specifically, we analyze how similarities in typological features—particularly syntax—relate to similarities in hidden representations and downstream transfer performance. We further investigate whether these relationships are consistently observable across different architectures and evaluation methods. While our findings reveal stable correlations between syntactic similarity and representational structure, we emphasize that these results are descriptive and do not establish causal mechanisms. Therefore, our aim is to contribute a clearer empirical basis for understanding multilingual representation structure and to highlight open questions regarding the mechanisms underlying cross-lingual generalization.

While prior work has studied cross-lingual transfer and syntactic bias primarily in encoder-based models, our contribution lies in providing a unified empirical comparison across decoder-only LLMs, combined with attention-based structural analysis using editing tree distance, which, to the best of our knowledge, has not been jointly explored in this setting. The contributions of this study are outlined as follows:

We provide a broad empirical comparison of multilingual LLM representations across multiple models, languages, and evaluation settings.
We identify consistent correlations between syntactic similarity and representational similarity across layers and models.
We evaluate how these correlations relate to cross-lingual transfer performance using probing-based analysis.
We introduce an attention-based structural comparison using tree distance as an additional diagnostic signal while discussing its limitations as a proxy for syntax.
We analyze how these observations vary across model architectures and layers, highlighting recurring patterns and open questions.

2. Related Work

2.1. Cross-Lingual Language Representations

Cross-lingual transfer of language representation spaces was initially proposed by Mikolov, who demonstrated that such alignment can be achieved through orthogonal transformations of distributed semantics in vector space [13]. Subsequent studies aimed to explain this phenomenon by adopting approaches that predict alignments between two languages via the approximation of a linear (orthogonal) mapping, often relying on minimal or no supervision [9,14,15,16,17,18,19,20,21]. Later research highlighted the effectiveness of learning language representations within a shared multilingual space. Empirical findings indicate that neural language models can implicitly learn to align cross-lingual representations, even without an explicit alignment objective, achieving superior performance, including in zero-shot learning scenarios. However, this ability has been shown to correlate with the degree of linguistic relatedness between languages [22].

To better understand how knowledge is transferred across languages, researchers investigated whether shared vocabulary or network parameters play a central role. These analyses found that shared subword units are not essential; rather, the model parameters themselves encode the information used for cross-lingual knowledge transfer. Specifically, these studies found that lower layers tend to encode more language-agnostic features, while higher layers increasingly capture language-specific characteristics [10]. This trend is especially prominent in encoder-only models, where multilinguality peaks in the middle-to-higher layers [10,23]. Additionally, analytical methods that test the orthogonal alignment hypothesis using hidden representations extracted from the model are found to be most effective for aligning languages with similar word orders (e.g., English–French) but less successful for typologically distant pairs (e.g., English–Chinese). The findings of [7] further support the conclusion that knowledge is primarily transferred via shared network parameters rather than shared subwords. That study also reveals that representations preserve word order, introducing a bias toward high-resource languages like English but facilitating transfer between languages with similar syntactic structures. Dufter and Schütze [11] highlight the importance of model capacity in shaping multilingual representations, suggesting that smaller models may be more likely to share features across languages, potentially by learning to encode patterns applicable across languages to best utilize limited capacity. Consistent with other studies, they conclude that structural similarities, particularly in word order, are crucial for BERT to construct an effective multilingual space.
A more in-depth study of the geometric properties of multilingual language models [24] also tested the usefulness of different measures of isomorphy in the vector space of learned cross-lingual representations and how they correlate with typological features. The reported analysis suggests that similarity in polysynthesis and word order does affect the amount of isomorphy in the clusters of language representations. Despite the wide range of findings on the multilingual properties of earlier encoder-only models, whether these trends generalize to contemporary large-scale multilingual language models remains an open question.

2.2. Structure and Geometry of Language Models

It has been demonstrated that language model representations form a hierarchy in encoding linguistic features across network layers [25,26,27]. Early work on RNN-based models revealed that syntactic abstraction increases in deeper layers, with deeper representations performing better on high-level syntactic tasks, suggesting a soft hierarchy in encoded information [25]. Peters et al. [26] extended this analysis to various architectures using a broad set of linguistic probing tasks, finding that lower layers primarily encode local syntactic features while higher layers capture more complex semantic information. This hierarchical organization appears consistent across different language modeling architectures (convolutional networks [28], long-short term memory [29], and Transformers [30]). Tenney et al. [27] further demonstrated that in BERT, linguistic features are processed in an order aligning with natural linguistic hierarchies—starting with Part-of-Speech (POS) tagging, followed by constituents, dependencies, semantic roles, and co-reference. Although features accumulate across layers, they tend to be used at specific depths. This has implications for transferability: lower layers encode general features, which become increasingly task-specific toward the top. Researchers have also found that Transformer-based models tend to distribute information more evenly across layers, likely due to the self-attention mechanism enabling broader feature distribution [31].

In contrast, geometric structures of vector-space representations remain less explored. Hewitt and Manning [32] showed the first quantitative evidence that the L2 distance between word embeddings in language models correlates with syntactic tree distance. Coenen et al. [33] confirmed this finding using Pythagorean embeddings, demonstrating that parse trees require a vector space with a quadratic metric relative to the token distance. These findings support the intuition that expressing grammatical structure may require at least a quadratic transformation. However, the full complexity of the vector space and whether it supports multiple structural encodings remains an open question. A deeper understanding could be achieved by analyzing phrase or sentence-level representations.

In conclusion, multilingual language model representations appear to be organized in a way that can be approximated as manifolds or language clusters, supporting the success of rotational methods in cross-lingual word alignment. However, the precise nature of these representations and whether they generalize to universal settings such as transfer between typologically diverse languages remain uncertain. The prevalent reliance on word alignment tasks as the primary probing method has limited our understanding of word representations beyond alignment and restricts insights into the learning of sentence-level representations across languages, which are essential for evaluating the effect of syntax in cross-lingual knowledge transfer. While there is a general consensus among previous studies that language representations correlate with the typological structures of languages, there remains a lack of sound and established quantitative studies characterizing the functional nature of cross-lingual alignments in vector space, particularly for the alignment of representations across languages with distinct linguistic typologies. This paper aims to provide a preliminary analysis of the non-linear geometric properties of language representations and examines how different linguistic features contribute to the encoding of shared or language-specific representations. Additionally, we explore the under-examined roles of attention and feed-forward layers in facilitating language-specific transformations, with a focus on how these findings scale to current LLMs.

3. Methodology

We present a comprehensive study of multilingual LLMs with the aim of measuring important linguistic properties that facilitate cross-lingual knowledge transfer. To make our study comparable to earlier work, we start with traditional vector-space distance metrics for comparing semantically equivalent representations of words and sentences across languages in the hidden layers of language models. We compute the correlation between the derived vector-space distances and expert-annotated linguistic distances to find typological characteristics that might be related to the efficiency of cross-lingual transfer. We also analyze the dependency structures learned in the attention layers of the network and compute the structural similarity of the learned dependencies relative to gold-standard syntax trees for the same sentences in various languages. To measure how the distance metrics correlate with the specific information encoded in the evaluated representations, we also probe each layer's representations on the POS tagging task to confirm the validity of the measured biases in easing cross-lingual transfer. We provide more details on the implementation of our analysis methods in the following sections.

3.1. Cosine Similarity Metrics for Comparing Hidden Representations

Our vector-space analysis compares hidden representations across languages using cosine similarity. For each language pair, we construct aligned sets of semantically equivalent units (words or sentences) using parallel data. For each aligned pair, we extract contextualized representations from a given model layer and compute the cosine similarity.

To obtain a language-level similarity score, we average cosine similarities across all aligned pairs. This results in a single scalar that reflects the average representational proximity between two languages in a given layer. While cosine similarity has been widely used in prior work, it primarily captures geometric proximity and does not directly encode structural equivalence; therefore, it should be interpreted as a coarse proxy for representational similarity rather than a direct measure of linguistic alignment.

The similarity is averaged over sets of semantically equivalent lexical units retrieved either from multi-way parallel dictionaries or multi-way parallel corpora of translated sentences. The average distance (D) over the set of translated pairs of words (or sentences) is computed as

D = \frac{1}{d} \sum_{i=1}^{d} \cos(x_{i}, y_{i}) \qquad (1)

where $x_{i}$ and $y_{i}$ are the hidden-state representations of the $i$-th aligned word in the two languages and $d$ is the size of the bilingual dataset. For sentences, we average the representations of all words in the sentence to obtain a single sentence vector.
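
The averaging in Equation (1) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (`cosine`, `mean_pool`, `average_similarity`) and the toy vectors are hypothetical stand-ins for layer representations, not part of the original experimental code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pool(vectors):
    """Average a list of word vectors into a single sentence vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def average_similarity(xs, ys):
    """Equation (1): mean cosine similarity over d aligned pairs."""
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and aligned")
    return sum(cosine(x, y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy vectors for two aligned word pairs in two languages.
lang_a = [[1.0, 0.0], [0.0, 2.0]]
lang_b = [[2.0, 0.0], [1.0, 0.0]]
score = average_similarity(lang_a, lang_b)
```

For sentence-level comparison, each sentence would first be reduced with `mean_pool` and the resulting vectors passed to `average_similarity` in the same way.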

3.2. Typological Distances Across Languages

In order to relate representational similarity to linguistic properties, we use typological features from the URIEL database [34]. For each feature category, we compute pairwise distances between languages and compare them with representation-based similarity scores.

The URIEL database provides a collection of language typology data, which we access via the lang2vec (https://github.com/antonisa/lang2vec, accessed on 5 March 2026) library. For our analysis, we select four categories of typological language features:

geography (“geo”)—Geographic distances between languages on the globe;
syntax average (“syntac”)—An average score representing the distinctness of the paradigms observed in a given language in terms of syntax;
phonology average (“phono”)—An average score representing the production rules of speech sounds of a language;
inventory average (“invent”)—An average score representing features related to phonetic inventories or the lexical patterns of a language.

The typological features are aggregated representations derived from multiple binary linguistic properties. While this aggregation enables a compact and comparable representation across languages, it may obscure fine-grained distinctions between individual linguistic features. We adopt this approach for consistency with prior work using URIEL, but we acknowledge that this simplification may affect the granularity of observed correlations.

We use Pearson correlation [35] as a simple measure of linear association between these quantities. We note that typological features are heterogeneous and partially correlated with each other and that linear correlation provides only a coarse approximation of their relationship with representation geometry. Therefore, our analysis should be interpreted as identifying statistical associations rather than precise functional relationships. We also perform two-dimensional cross-correlation analysis to measure the relationship between typological and metric or accuracy similarities across languages.
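
As a rough illustration of the correlation step, the sketch below computes the Pearson coefficient between two pairwise language matrices. It assumes symmetric matrices and compares only the entries above the diagonal; the helper names and the three-language toy values are hypothetical and are not taken from URIEL or from the reported results.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def upper_triangle(matrix):
    """Entries above the diagonal of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

# Hypothetical values for three languages (not real measurements).
typological_distance = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]
representation_similarity = [
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
r = pearson(upper_triangle(typological_distance),
            upper_triangle(representation_similarity))
```

In this toy case similarity falls linearly as distance grows, so `r` is strongly negative; with real data the sign depends on whether a distance or a similarity is used on each side.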

3.3. Structural Probing

Probing [36,37,38,39] is a widely used technique to analyze and interpret features encoded in hidden representations of pre-trained language models. In this approach, a simple classifier—typically, a linear model—is trained on top of frozen representations from a specific layer of the language model to predict linguistic properties such as part-of-speech tags, syntactic dependencies, or semantic roles [27,32].

Formally, given an input sentence $x$ and a pre-trained language model $M$, the representation of the $i$-th token from layer $l$ is denoted by $h_{i}^{(l)} \in \mathbb{R}^{d}$, where $d$ is the hidden dimension. A linear probe is defined as

\hat{y}_{i} = \mathrm{softmax}(W h_{i}^{(l)} + b), \qquad (2)

where $W \in \mathbb{R}^{k \times d}$ and $b \in \mathbb{R}^{k}$ are the learned weights and bias of the linear classifier and $k$ is the number of target classes. The model is trained to minimize the cross-entropy loss between $\hat{y}_{i}$ and the ground-truth labels.
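
A linear probe of this form can be sketched as follows. This is a minimal illustration rather than the training setup used in the experiments: it uses plain stochastic gradient descent on the cross-entropy loss, and the function names, hyperparameters, and toy "hidden states" are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, b, h):
    """Equation (2): softmax(W h + b) for one frozen representation h."""
    logits = [sum(w * x for w, x in zip(row, h)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

def train_probe(H, labels, k, lr=0.5, epochs=200):
    """Fit W (k x d) and b (k) by SGD on the cross-entropy loss."""
    d = len(H[0])
    W = [[0.0] * d for _ in range(k)]
    b = [0.0] * k
    for _ in range(epochs):
        for h, y in zip(H, labels):
            p = predict_proba(W, b, h)
            for c in range(k):
                # Gradient of cross-entropy w.r.t. logit c is p_c - 1[c == y].
                grad = p[c] - (1.0 if c == y else 0.0)
                for j in range(d):
                    W[c][j] -= lr * grad * h[j]
                b[c] -= lr * grad
    return W, b

def accuracy(W, b, H, labels):
    """Fraction of representations whose most probable class is correct."""
    correct = 0
    for h, y in zip(H, labels):
        p = predict_proba(W, b, h)
        correct += int(max(range(len(p)), key=p.__getitem__) == y)
    return correct / len(H)

# Hypothetical 2-dimensional "hidden states" with two classes.
H_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y_train = [0, 0, 1, 1]
W, b = train_probe(H_train, y_train, k=2)
train_acc = accuracy(W, b, H_train, y_train)
```

Only `W` and `b` are updated; the representations in `H_train` stay fixed, which mirrors the frozen-model setting described above.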

By evaluating the performance of probes across layers and tasks, one can infer the type and depth of linguistic information encoded in each layer. Despite their simplicity, probing classifiers offer insights into the representational structure of language models and the extent to which they capture general or task-specific linguistic features.

While probing provides a useful diagnostic tool for identifying information encoded in representations, it does not establish whether such information is causally used by the model during inference. High probing performance may reflect extractable information rather than functionally utilized structure. Therefore, we interpret probing results as indicative of representational capacity rather than direct evidence of model behavior.

3.4. Analyzing Dependency Structures in the Attention Layer Using Editing Tree Distance

In addition to hidden states, attention mechanisms also play a vital role in capturing relevancy between representations to combine and induce new information across layers. Previous studies have proposed methods for extracting tree structures from the attention mechanism for applications in unsupervised parsing. In our study, we rely on the method developed in [40] for unsupervised grammar induction from hidden states and attention weights of language models. This method uses a straight-forward two-stage mapping of words into syntactic distances, i.e.,

$d_{i} = f(g(w_{i}), g(w_{i+1}))$, where $f(\cdot, \cdot)$ is a distance measure function and $g(\cdot)$ is a representation extractor function used to compute syntactic distances from hidden representations using vector-space distance metrics. In our study, we use the cosine distance for computing the distance between hidden state representations in feed-forward layers, i.e.,

\cos(r, s) = \frac{1}{2} \left( \frac{r^{\top} s}{\left(\sum_{i=1}^{d} r_{i}^{2}\right)^{1/2} \cdot \left(\sum_{i=1}^{d} s_{i}^{2}\right)^{1/2}} + 1 \right),

and the Jensen–Shannon Divergence (JSD) [41] for comparison of token representations in the attention layer, i.e.,

JSD(P \| Q) = \left( \frac{D_{KL}(P \| M) + D_{KL}(Q \| M)}{2} \right)^{1/2},

where $M = \frac{P + Q}{2}$ and

D_{KL}(A \| B) = \sum_{w \in S} A(w) \log \frac{A(w)}{B(w)}.
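
To make the attention-based distance concrete, the sketch below computes the square-root form of the JSD given above between the attention distributions of adjacent tokens. It assumes each token is represented by one row of attention weights that sums to one, uses the natural logarithm, and treats terms with zero probability as contributing zero; the function names and the example rows are hypothetical.

```python
import math

def kl_divergence(a, b):
    """D_KL(A || B), using the convention 0 * log(0 / x) = 0."""
    return sum(p * math.log(p / q) for p, q in zip(a, b) if p > 0)

def jsd(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    value = (kl_divergence(p, m) + kl_divergence(q, m)) / 2
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, value))

def syntactic_distances(attention_rows):
    """d_i = JSD between the attention rows of tokens i and i + 1."""
    return [jsd(attention_rows[i], attention_rows[i + 1])
            for i in range(len(attention_rows) - 1)]

# Hypothetical attention rows for a three-token sentence.
rows = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
]
distances = syntactic_distances(rows)
```

Here the first two tokens attend similarly, so the first distance is smaller than the second, which is the kind of signal that is later turned into a tree split.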

Once the distances between tokens in an input sequence are derived, they can be converted into the target constituency tree by a simple algorithm following [42,43], as described in Algorithm 1. The procedure can also be followed through the illustration in Figure 1. The extraction is based on the estimation of the syntactic distances from vector representations of phrasal constituents taken from each position span ($h_{1}, \dots, h_{n}$) and transformed through a two-layer feed-forward network ($FF$) with a single output unit and no activation function: $d_{i} = FF(h_{i}, h_{j})$. For prediction of the constituent labels of a given word ($w$) with hidden states ($h_{i \dots n}$), the algorithm passes the same representations through a softmax output ($p(c_{ij} \mid h_{i \dots n}) = \mathrm{softmax}(FF(h_{ij}))$). We compare the dependency tree extracted from a given layer of a pre-trained model with the gold-standard dependency tree obtained from Universal Dependencies (UD) [44] to measure its accuracy and coherence with human-annotated linguistic structure. To compute the similarity between two trees, we implement the editing tree distance algorithm [45], a dynamic programming algorithm that computes the editing tree distance between two ordered, labeled trees, defined as the minimum-cost sequence of insertions, deletions, and label changes required to transform one tree into another while preserving ancestor and sibling relationships. Readers can find the full algorithm in [45]. The pseudocode for the algorithm is given in Appendix D.
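
The distance defined above can be illustrated with a short memoized recursion over forests. This is not the dynamic program of [45]; it is a simpler sketch that, assuming unit costs for insertion, deletion, and relabeling, should yield the same value for small ordered, labeled trees. The tuple-based tree encoding and the function names are hypothetical.

```python
from functools import lru_cache

def tree(label, *children):
    """Build an ordered, labeled tree as a hashable nested tuple."""
    return (label, tuple(children))

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        # Insert the rightmost root of g; its children take its place.
        _, g_kids = g[-1]
        return forest_distance(f, g[:-1] + g_kids) + 1
    if not g:
        # Delete the rightmost root of f; its children take its place.
        _, f_kids = f[-1]
        return forest_distance(f[:-1] + f_kids, g) + 1
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots, relabeling if the labels differ.
    match = (forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1])
             + (0 if f_label == g_label else 1))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    """Minimum number of unit-cost edits that turn t1 into t2."""
    return forest_distance((t1,), (t2,))

# Hypothetical toy trees: the second lacks the object NP under VP.
predicted = tree("S", tree("NP"), tree("VP", tree("V"), tree("NP")))
gold = tree("S", tree("NP"), tree("VP", tree("V")))
distance = tree_edit_distance(predicted, gold)
```

The recursion is easy to check by hand but scales poorly; for full sentences and many language pairs, a dynamic programming implementation such as the one referenced above is the more practical choice.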

Algorithm 1 Syntactic Distances to Binary Constituency Trees [42].

$S = [w_{1}, w_{2}, \dots, w_{n}]$: a sequence of words in a sentence of length $n$
$d = [d_{1}, d_{2}, \dots, d_{n-1}]$: a vector whose elements are the distances between every two adjacent words
function TREE($S, d$)
    if $d = []$ then
        node ← Leaf($S[0]$)
    else
        $i \leftarrow \arg\max_{i}(d)$
        $child_{\ell} \leftarrow$ TREE($S_{\leq i}, d_{< i}$)
        $child_{r} \leftarrow$ TREE($S_{> i}, d_{> i}$)
        node ← Node($child_{\ell}, child_{r}$)
    end if
    return node
end function
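
Algorithm 1 translates almost directly into Python. The sketch below assumes zero-based indexing, represents a leaf as the word itself and an internal node as a pair of children, and breaks ties in favor of the leftmost maximum; the function name `build_tree` is hypothetical.

```python
def build_tree(words, distances):
    """Convert adjacent-word distances into a binary constituency tree.

    words:     list of n words
    distances: list of n - 1 distances between adjacent words
    Returns a word (leaf) or a (left, right) tuple (internal node).
    """
    if len(distances) != len(words) - 1:
        raise ValueError("expected exactly len(words) - 1 distances")
    if not distances:
        return words[0]
    # Split at the largest distance; ties go to the leftmost position.
    i = max(range(len(distances)), key=distances.__getitem__)
    left = build_tree(words[:i + 1], distances[:i])
    right = build_tree(words[i + 1:], distances[i + 1:])
    return (left, right)

# Hypothetical example: the largest gap is between "cat" and "sat".
parse = build_tree(["the", "cat", "sat"], [0.2, 0.9])
```

Because the split point is always the largest remaining distance, words separated by small distances end up grouped low in the tree, while large distances produce splits near the root.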

It is important to note that attention-derived structures are indirect proxies and do not necessarily correspond to true syntactic representations. Therefore, the extracted trees should be interpreted as heuristic structural approximations rather than direct evidence of syntactic encoding.

4. Experiments

We analyze cross-lingual language representations in four multilingual LLMs: Aya Expanse 8B [5], Gemma 2 2B and Gemma 2 9B [4], and Llama-3.2 3B [2]. For the word-level analysis, we use bilingual dictionaries from the NorthEuraLex database [46]. In our experiments, we use thirteen languages: Arabic (ar), English (en), Spanish (es), French (fr), Hebrew (he), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh). For each language pair, we compute the mean cosine distance over all bilingual word pairs in the dictionaries between word vectors based on different LLMs. For sentence-level experiments, we use TED Talks [47] in the same thirteen languages. All sentences we use are multi-way parallel in all 13 languages. All word and sentence vectors are computed using the mean of the subwords in the input sequence.

For probing experiments, we use the Linspector [48] datasets for POS tagging, which are available in the same set of thirteen languages, with two mismatches relative to the previous experiments: the Chinese dataset is in Traditional Chinese (often denoted as zh-hant or zh-tw) instead of Simplified Chinese (zh-cn), and the Portuguese dataset only includes European Portuguese (often pt) and not Brazilian Portuguese (pt-br). Probing experiments are performed by extracting the hidden states at a given feed-forward layer of the LLM and using them as input to a linear classifier on the POS tagging task. To account for sentence-level information, we perform contextual POS tagging: we input the whole sentence into the LLM, take the hidden states corresponding to the subwords of each probed word, and use their mean vector for classification. The classifier is trained with Adam using a learning rate of 0.0007, a batch size of 128, and an early stopping rate of 5. After training a classifier for a given language, we test its POS tagging performance in all other languages.
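
The subword averaging used for contextual POS tagging can be sketched as follows. The sketch assumes that each subword position comes with the index of the word it belongs to (or `None` for positions that belong to no word, such as special tokens); this `word_ids` alignment, the function name, and the toy vectors are hypothetical.

```python
def pool_subwords(hidden_states, word_ids):
    """Average subword vectors into one vector per word.

    hidden_states: one vector per subword position
    word_ids:      word index per position, or None to skip the position
    Returns the word vectors ordered by word index.
    """
    sums, counts = {}, {}
    for vector, word_id in zip(hidden_states, word_ids):
        if word_id is None:
            continue
        if word_id not in sums:
            sums[word_id] = [0.0] * len(vector)
            counts[word_id] = 0
        sums[word_id] = [s + v for s, v in zip(sums[word_id], vector)]
        counts[word_id] += 1
    return [[s / counts[w] for s in sums[w]] for w in sorted(sums)]

# Hypothetical example: word 0 is split into two subwords, word 1 into one.
states = [[9.0, 9.0], [1.0, 1.0], [3.0, 3.0], [5.0, 7.0]]
ids = [None, 0, 0, 1]
word_vectors = pool_subwords(states, ids)
```

Each resulting word vector would then be passed to the linear classifier together with the POS label of that word.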

5. Results

The results of the word-level cross-correlation analysis between mean language clusters and the expert annotations of typological features performed at each layer of the selected pre-trained models are presented in Figure 2, Figure 3, Figure 4 and Figure 5, and the sentence-level experiments can be seen in Figure 6, Figure 7, Figure 8 and Figure 9.

The findings of the first set of experiments suggest that language representations correlate with typological distance, particularly syntactic similarity. This observation is consistent across models and layers. However, these results are correlational and do not imply that syntactic structure is the sole or causal factor shaping representations. Instead, they indicate that syntactic similarity is a strong predictor of representational proximity under the current evaluation setup.

We find the strongest indicator of representational similarity in the syntactic characteristics of the language, followed by the lexical or inventory overlap between languages. All models exhibit an almost constant level of correlation with linguistic features across their layers until the last layers, where the correlations generally decrease, suggesting the last layers tend to store more universally shared multilingual features. The correlations vary more significantly in sentence representations than in word representations, highlighting the impact of syntactic structures and how they may be more relevant in different layers. We present the exhaustive results of cosine similarity across languages selected from different models and layers in Appendix A, Appendix B, Appendix C and Appendix D.

To assess how syntactic information in hidden-layer representations may be shared across languages, in the second set of experiments, we evaluate the representations extracted from each layer in the zero-shot cross-lingual POS tagging task, which helps measure the universal applicability of the syntactic features encoded in the representations in each model layer. For each language, presented as a row in the results matrix of each probing experiment, we train the probing classifier on the given language and apply it in the zero-shot setting to the remaining target languages. The findings of this experiment (with overall accuracies presented in Appendix B) show the overall competence of the model in English and related languages in the Indo-European family. Cross-lingual transfer accuracies and their correlation with syntactic similarity also confirm the significance of overlap in syntactic and inventory characteristics of languages in being able to successfully transfer structural information across languages. As shown in Figure 10, Figure 11, Figure 12 and Figure 13, we also find that zero-shot transfer accuracy and the syntactic similarity features are correlated and that these correlations tend to peak in the earlier to middle layers of the network, confirming related studies suggesting syntactic information is more concentrated in the middle layers.

In an additional set of experiments, we analyze in more depth the extent of the syntactic bias in the attention mechanism by measuring the structural similarity between the dependency structures derived from the attention layers of each model and the gold-standard UD tree structures available for all thirteen languages. For each sentence, we extract dependency trees from the attention layer for each of its thirteen translations, together with the corresponding dependency trees from the UD treebanks. As presented in Figure 14, Figure 15, Figure 16 and Figure 17, as well as Figure 18, Figure 19, Figure 20 and Figure 21, we then analyze these results along two dimensions. First, we compute the tree distances between the dependency trees extracted from the attention maps for inputs in a given language and the UD trees of the translations of those sentences in different languages, and we correlate these distances with the typological distances (denoted as 1D row correlation analysis). This lets us assess which syntactic structure, among those of the thirteen languages, the dependency structures predicted by the attention layer for a given input language most closely resemble. The distances between predicted and gold-standard dependency trees correlate most strongly with syntactic proximity between languages, followed by geographic proximity. In this case, we generally find the shortest distance to the English UD tree, followed by Spanish, Dutch, Portuguese, Italian, and French. All of these languages belong to the Germanic and Romance families and share close geographic proximity, a common syntactic structure, and a high percentage of loan vocabulary, which is directly visible in the correlation analysis.

The complete cross-lingual editing tree distance values for selected models and layers can be found in Appendix C. In the second correlation analysis (denoted as 1D column correlation analysis), we repeat the procedure in the other direction: for a given UD structure, we compare the dependency trees that the attention maps predict for inputs in different languages.
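The two correlation analyses described above can be sketched as follows. This is a minimal illustration, not the released pipeline: it assumes a matrix `tree_dist[s][t]` holding the average tree distance between attention-derived trees for input language `s` and UD trees of language `t`, and a matching matrix `typo_dist[s][t]` of typological distances. The function names, the matrix layout, the inclusion of diagonal entries, and the toy numbers are all assumptions made for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def row_correlations(tree_dist, typo_dist):
    """1D row analysis: fix the input language, vary the UD language."""
    return [pearson(d_row, t_row) for d_row, t_row in zip(tree_dist, typo_dist)]

def column_correlations(tree_dist, typo_dist):
    """1D column analysis: fix the UD language, vary the input language."""
    return [pearson(d_col, t_col)
            for d_col, t_col in zip(zip(*tree_dist), zip(*typo_dist))]

# Placeholder values for three hypothetical languages L1..L3 (not real data).
typo_dist = [[0.0, 0.2, 0.6],
             [0.2, 0.0, 0.5],
             [0.6, 0.5, 0.0]]
tree_dist = [[1.0, 2.0, 4.0],
             [3.0, 1.0, 5.0],
             [6.0, 5.0, 2.0]]

row_r = row_correlations(tree_dist, typo_dist)
col_r = column_correlations(tree_dist, typo_dist)
for name, r_row, r_col in zip(["L1", "L2", "L3"], row_r, col_r):
    print(f"{name}: row r = {r_row:.3f}, column r = {r_col:.3f}")
```

Under this layout, a high row correlation for an input language means that its attention-derived trees sit closer to the UD trees of typologically closer languages, whereas the column analysis asks the same question from the side of a fixed UD structure.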

Our findings suggest slight to no differences across input languages in the tree structures predicted by the attention maps, and their similarity to the gold-standard UD trees remains almost identical. These findings indicate that attention-derived structures exhibit limited variation across input languages and show consistent similarity relative to certain dominant structural patterns. However, this observation should be interpreted cautiously, as the extraction method itself may bias the resulting structures. Therefore, rather than concluding that attention layers encode a single dominant syntax, we interpret this result as evidence that attention-based structural signals may lack sufficient sensitivity to language-specific variation under this analysis framework. At the same time, we find this to be consistent with the recent line of studies suggesting that LLMs learn universal structural patterns shared across languages [49].

While prior work has shown that attention-based tree induction methods can approximate syntactic structure, we do not explicitly evaluate per-sentence parsing accuracy in this study. Therefore, differences in tree distance may partially reflect properties of the induction method itself rather than true syntactic divergence.

We also report an additional correlation analysis between the zero-shot POS tagging accuracy of representations collected in individual layers and the average cross-lingual editing tree distances between the attention-map structures in those layers and the UD treebanks. The results presented in Figure 22, Figure 23, Figure 24 and Figure 25 show consistent correlations across models, which is consistent with a syntactic bias that constrains cross-lingual knowledge transfer. We also find maximum levels of correlation concentrated in the middle layers of the networks, consistent with earlier analyses of pre-trained language models and how they encode multilingual information [10,23]. We note that cross-lingual POS tagging performance may also be influenced by factors such as lexical overlap, subword tokenization, and script similarity between languages. These factors are not explicitly controlled in our setup and may contribute to the observed transfer patterns. Therefore, the probing results should be interpreted as indicative of representational tendencies rather than as isolated measures of syntactic transfer.

To assess the robustness of the reported correlations, we compute p-values for the Pearson correlation coefficients [50]. Across models and layers, the correlations with syntactic similarity remain statistically significant (p < 0.05) in all experiments.
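As a rough illustration of how a significance check on a Pearson correlation can be run, the sketch below estimates a two-sided p-value with a permutation test. This is a swapped-in technique chosen because it needs only the Python standard library; it is not claimed to be the procedure used in this study, which may rely on the analytic test instead. The function names, the number of permutations, and the toy values are assumptions.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y.

    The pairing between x and y is broken by shuffling y; the p-value is the
    share of shuffles whose |r| is at least as large as the observed |r|.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Placeholder values (not real data): syntactic similarity vs. transfer accuracy.
similarity = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
accuracy = [0.31, 0.35, 0.42, 0.40, 0.55, 0.58, 0.66, 0.71]

r = pearson(similarity, accuracy)
p = permutation_p_value(similarity, accuracy)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

A permutation test avoids distributional assumptions at the cost of extra computation, and its resolution is bounded by the number of permutations.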

6. Limitations and Reproducibility

Our study has several limitations. First, the analysis is restricted to four multilingual models and thirteen languages, which limits the generality of the conclusions. Second, the evaluation relies on correlational analyses between representation similarity and typological features, which do not establish causal relationships. Third, datasets used across experiments (NorthEuraLex, TED Talks, and LINSPECTOR) differ in domain and annotation schemes, which may introduce inconsistencies. Additionally, probing and attention-based methods provide indirect signals and may not reflect the actual mechanisms used by models during inference. To improve reproducibility, we have released our code, preprocessing scripts, and evaluation pipelines (Available at: https://github.com/raghavm1/Linguistic-Bias-Properties, accessed on 5 March 2026). We also provide detailed descriptions of hyperparameters, dataset splits, and evaluation protocols to facilitate replication.

7. Conclusions

In this work, we presented a broad empirical analysis of how multilingual LLMs encode and relate structural information across languages. By combining multiple evaluation methods, we identified consistent correlations between syntactic similarity and representational organization across models and layers.

At the same time, our results highlight important limitations in interpreting such findings. The observed relationships are correlational, and the extent to which they reflect underlying mechanisms of cross-lingual transfer remains an open question. Similarly, attention-based structural analyses provide useful signals but should not be interpreted as direct evidence of syntactic representation.

Overall, our study contributes a comparative empirical perspective and emphasizes the need for more controlled and causal analyses of multilingual representation learning. Future work could extend this analysis to broader language sets, alternative structural probes, and controlled interventions to better isolate the mechanisms underlying cross-lingual generalization.

Author Contributions

Author contributions are reported as follows: S.C. prepared and implemented cosine distance experiments. Y.W. prepared the tree induction algorithm from attention maps. R.M. conducted all experiments over selected models. D.A. implemented the Tree Distance algorithm and designed and supervised the research project. All authors have read and agreed to the published version of the manuscript.

Funding

We thank the Gemini Academic Research program for supporting the experiments conducted in this study with cloud credits.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Replication data and code are available in a publicly accessible repository (https://github.com/raghavm1/Linguistic-Bias-Properties, accessed on 5 March 2026).

Acknowledgments

The authors thank Hila Gonen, Kyunghyun Cho, and Sebastian Ruder for their feedback and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Average Cross-Lingual Mean Cosine Distances for Word- and Sentence-Level Representations in Selected Pre-Trained LLMs

Figure A1. Heatmaps for Aya-Expanse-8B word-level mean cosine distances between languages.

Figure A2. Heatmaps for Aya-Expanse-8B sentence-level mean cosine distances between languages.

Figure A3. Heatmaps for Gemma2-2B word-level mean cosine distances between languages.

Figure A4. Heatmaps for Gemma2-2B sentence-level mean cosine distances between languages.

Figure A5. Heatmaps for Gemma2-9B word-level mean cosine distances between languages.

Figure A6. Heatmaps for Gemma2-9B sentence-level mean cosine distances between languages.

Figure A7. Heatmaps for Llama-3.2-3B word-level mean cosine distances between languages.

Figure A8. Heatmaps for Llama-3.2-3B sentence-level mean cosine distances between languages.

Appendix B. F1 Accuracies for All Models in the Probing Task for POS Tagging

Figure A9. Heatmaps for Aya-Expanse-8B for cross-lingual transfer evaluation in the POS tagging task (first language: trained; second language: zero-shot accuracy).

Figure A10. Heatmaps for Gemma2-2B for cross-lingual transfer evaluation in the POS tagging task (first language: trained; second language: zero-shot accuracy) (cont’d).

Figure A11. Heatmaps for Gemma2-9B for cross-lingual transfer evaluation in the POS tagging task (first language: trained; second language: zero-shot accuracy).

Figure A12. Heatmaps for Llama-3.2-3B for cross-lingual transfer evaluation in the POS tagging task (first language: trained; second language: zero-shot accuracy).

Appendix C. Cross-Lingual Tree Distance Analysis Between Attention Maps and UD in All Models

Figure A13. Heatmaps for cross-lingual tree distance values across layers of Aya-Expanse-8B.

Figure A14. Heatmaps for cross-lingual tree distance values across layers of Gemma2-2B.

Figure A15. Heatmaps for cross-lingual tree distance values across layers of Gemma2-9B.

Figure A16. Heatmaps for cross-lingual tree distance values across layers of Llama-3.2-3B. (We ran the language-He tree distance computation twice and found that the algorithm did not converge in these experiments; we report all results as is.)

Appendix D. Editing Tree Distance Algorithm

Algorithm A1 Editing Tree Distance Algorithm [45].
1: Input: Trees $T_{1}$ and $T_{2}$
2: Output: $treedist (i, j)$ for $1 \leq i \leq |T_{1}|$ and $1 \leq j \leq |T_{2}|$; the edit distance between $T_{1}$ and $T_{2}$ is $treedist (|T_{1}|, |T_{2}|)$
3: function TreeEditDistance( $T_{1}, T_{2}$ )
4: for $i' = 1$ to $|LR\_keyroots (T_{1})|$ do
5: for $j' = 1$ to $|LR\_keyroots (T_{2})|$ do
6: $i = LR\_keyroots (T_{1}) [i']$
7: $j = LR\_keyroots (T_{2}) [j']$
8: TreeDist( $i, j$ )	▷ Compute TreeDist with dynamic programming
9: end for
10: end for
11: end function
12: function TreeDist( $i, j$ )
13: $forestdist (\emptyset, \emptyset) = 0$
14: for $i_{1} = l (i)$ to $i$ do
15: $forestdist (T_{1} [l (i) .. i_{1}], \emptyset) = forestdist (T_{1} [l (i) .. i_{1} - 1], \emptyset) + \gamma (T_{1} [i_{1}] \to \Lambda)$
16: end for
17: for $j_{1} = l (j)$ to $j$ do
18: $forestdist (\emptyset, T_{2} [l (j) .. j_{1}]) = forestdist (\emptyset, T_{2} [l (j) .. j_{1} - 1]) + \gamma (\Lambda \to T_{2} [j_{1}])$
19: end for
20: for $i_{1} = l (i)$ to $i$ do
21: for $j_{1} = l (j)$ to $j$ do
22: if $l (i_{1}) = l (i)$ and $l (j_{1}) = l (j)$ then
23: $forestdist (T_{1} [l (i) .. i_{1}], T_{2} [l (j) .. j_{1}]) = \min \{ forestdist (T_{1} [l (i) .. i_{1} - 1], T_{2} [l (j) .. j_{1}]) + \gamma (T_{1} [i_{1}] \to \Lambda),\ forestdist (T_{1} [l (i) .. i_{1}], T_{2} [l (j) .. j_{1} - 1]) + \gamma (\Lambda \to T_{2} [j_{1}]),\ forestdist (T_{1} [l (i) .. i_{1} - 1], T_{2} [l (j) .. j_{1} - 1]) + \gamma (T_{1} [i_{1}] \to T_{2} [j_{1}]) \}$
24: $treedist (i_{1}, j_{1}) = forestdist (T_{1} [l (i) .. i_{1}], T_{2} [l (j) .. j_{1}])$
25: else
26: $forestdist (T_{1} [l (i) .. i_{1}], T_{2} [l (j) .. j_{1}]) = \min \{ forestdist (T_{1} [l (i) .. i_{1} - 1], T_{2} [l (j) .. j_{1}]) + \gamma (T_{1} [i_{1}] \to \Lambda),\ forestdist (T_{1} [l (i) .. i_{1}], T_{2} [l (j) .. j_{1} - 1]) + \gamma (\Lambda \to T_{2} [j_{1}]),\ forestdist (T_{1} [l (i) .. l (i_{1}) - 1], T_{2} [l (j) .. l (j_{1}) - 1]) + treedist (i_{1}, j_{1}) \}$
27: end if
28: end for
29: end for
30: end function
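For readers who prefer executable code, the following is a minimal Python sketch of the dynamic program in Algorithm A1 [45]. It is not the implementation released with this study. The `Node` class, the function names, and the unit costs (1 for insertion and deletion, 1 for relabeling when labels differ, 0 otherwise) are assumptions made for illustration; here $l(i)$ is the postorder index of the leftmost leaf of the subtree rooted at node $i$, and indices are 0-based.

```python
class Node:
    """Hypothetical ordered labeled tree node used only for this sketch."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def _postorder(root):
    """Return node labels and leftmost-leaf indices l(i), both in postorder."""
    labels, lml = [], []
    def visit(node):
        first_leaf = None
        for child in node.children:
            leaf = visit(child)
            if first_leaf is None:
                first_leaf = leaf
        index = len(labels)
        labels.append(node.label)
        lml.append(index if first_leaf is None else first_leaf)
        return lml[index]
    visit(root)
    return labels, lml

def _keyroots(lml):
    """LR_keyroots: for each leftmost leaf, the highest node that shares it."""
    last = {}
    for i, leaf in enumerate(lml):
        last[leaf] = i
    return sorted(last.values())

def tree_edit_distance(t1, t2):
    """Edit distance between two ordered labeled trees with unit costs."""
    labels1, l1 = _postorder(t1)
    labels2, l2 = _postorder(t2)
    treedist = [[0] * len(labels2) for _ in labels1]
    for i in _keyroots(l1):
        for j in _keyroots(l2):
            li, lj = l1[i], l2[j]
            rows, cols = i - li + 2, j - lj + 2
            # fd[a][b]: distance between the first a nodes of T1[l(i)..i]
            # and the first b nodes of T2[l(j)..j].
            fd = [[0] * cols for _ in range(rows)]
            for a in range(1, rows):
                fd[a][0] = fd[a - 1][0] + 1  # delete
            for b in range(1, cols):
                fd[0][b] = fd[0][b - 1] + 1  # insert
            for a in range(1, rows):
                i1 = li + a - 1
                for b in range(1, cols):
                    j1 = lj + b - 1
                    delete = fd[a - 1][b] + 1
                    insert = fd[a][b - 1] + 1
                    if l1[i1] == li and l2[j1] == lj:
                        # Both forests are whole trees: allow relabeling.
                        relabel = fd[a - 1][b - 1] + (labels1[i1] != labels2[j1])
                        fd[a][b] = min(delete, insert, relabel)
                        treedist[i1][j1] = fd[a][b]
                    else:
                        # Reuse the stored distance between the two subtrees.
                        match = fd[l1[i1] - li][l2[j1] - lj] + treedist[i1][j1]
                        fd[a][b] = min(delete, insert, match)
    return treedist[-1][-1]

# Two small example trees: f(d(a, c(b)), e) and f(c(d(a, b)), e).
tree_a = Node("f", [Node("d", [Node("a"), Node("c", [Node("b")])]), Node("e")])
tree_b = Node("f", [Node("c", [Node("d", [Node("a"), Node("b")])]), Node("e")])
print(tree_edit_distance(tree_a, tree_b))
```

In the example, `tree_a` can be turned into `tree_b` by deleting the node `c` and re-inserting it between `f` and `d`, so two unit-cost operations suffice. Processing keyroots in increasing postorder guarantees that every `treedist` entry read in the `else` branch has already been computed.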

References

1. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
2. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783.
3. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.; Hauth, A.; Millican, K.; et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv 2023, arXiv:2312.11805.
4. Team, G.; Mesnard, T.; Hardin, C.; Dadashi, R.; Bhupatiraju, S.; Pathak, S.; Sifre, L.; Rivière, M.; Kale, M.; Love, J.; et al. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295.
5. Üstün, A.; Aryabumi, V.; Yong, Z.; Ko, W.; D’Souza, D.; Onilude, G.; Bhandari, N.; Singh, S.; Ooi, H.; Kayid, A.; et al. Aya Model. arXiv 2024, arXiv:2402.07827.
6. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020.
7. Karthikeyan, K.; Wang, Z.; Mayhew, S.; Roth, D. Cross-Lingual Ability of Multilingual BERT: An Empirical Study. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
8. Philippy, F.; Guo, S.; Haddadan, S. Identifying the Correlation Between Language Distance and Cross-Lingual Transfer in a Multilingual Representation Space. In Proceedings of the Workshop on Computational Linguistic Typology, Dubrovnik, Croatia, 6 May 2023.
9. Xu, R.; Yang, Y.; Otani, N.; Wu, Y. Unsupervised Cross-Lingual Transfer of Word Embedding Spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2465–2474.
10. Wu, S.; Conneau, A.; Li, H.; Zettlemoyer, L.; Stoyanov, V. Emerging Cross-Lingual Structure in Pretrained Language Models. arXiv 2019, arXiv:1911.01464.
11. Dufter, P.; Schütze, H. Identifying Necessary Elements for BERT’s Multilinguality. arXiv 2020, arXiv:2005.00396.
12. Shah, C. Correlations between Multilingual Language Model Geometry and Crosslingual Transfer Performance. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), Torino, Italy, 20–25 May 2024.
13. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
14. Artetxe, M.; Labaka, G.; Agirre, E. Learning Principled Bilingual Mappings of Word Embeddings while Preserving Monolingual Invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 2289–2294.
15. Artetxe, M.; Labaka, G.; Agirre, E. Learning Bilingual Word Embeddings with (Almost) No Bilingual Data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 451–462.
16. Artetxe, M.; Labaka, G.; Agirre, E. A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 789–798.
17. Chen, X.; Cardie, C. Unsupervised Multilingual Word Embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 261–270.
18. Conneau, A.; Lample, G.; Ranzato, M.-A.; Denoyer, L.; Jégou, H. Word Translation without Parallel Data. arXiv 2017, arXiv:1710.04087.
19. Hoshen, Y.; Wolf, L. Non-Adversarial Unsupervised Word Translation. arXiv 2018, arXiv:1801.06126.
20. Ruder, S.; Vulić, I.; Søgaard, A. A Survey of Cross-Lingual Word Embedding Models. J. Artif. Intell. Res. 2019, 65, 569–631.
21. Smith, S.; Turban, D.; Hamblin, S.; Hammerla, N. Offline Bilingual Word Vectors. arXiv 2017, arXiv:1702.03859.
22. Wu, S.; Dredze, M. Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT. arXiv 2019, arXiv:1904.09077.
23. Sabet, M.; Dufter, P.; Schütze, H. SimAlign. arXiv 2020, arXiv:2004.08728.
24. Jones, A.; Wang, W.Y.; Mahowald, K. A Massively Multilingual Analysis of Cross-Linguality in Shared Embedding Space. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021.
25. Blevins, T.; Levy, O.; Zettlemoyer, L. Deep RNNs Encode Soft Hierarchical Syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 14–19.
26. Peters, M.; Neumann, M.; Zettlemoyer, L.; Yih, W.-t. Dissecting Contextual Word Embeddings: Architecture and Representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1499–1509.
27. Tenney, I.; Das, D.; Pavlick, E. BERT Rediscovers the Classical NLP Pipeline. arXiv 2019, arXiv:1905.05950.
28. LeCun, Y.; Bengio, Y. Convolutional Networks for Images, Speech, and Time-Series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1995.
29. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
31. Liu, N.F.; Gardner, M.; Belinkov, Y.; Peters, M.; Smith, N.A. Linguistic Knowledge and Transferability of Contextual Representations. In Proceedings of the NAACL, Minneapolis, MN, USA, 2–7 June 2019.
32. Hewitt, J.; Manning, C.D. A Structural Probe for Finding Syntax in Word Representations. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019.
33. Coenen, A.; Reif, E.; Yuan, A.; Kim, B.; Pearce, A.; Viégas, F.; Wattenberg, M. Visualizing and Measuring the Geometry of BERT. arXiv 2019, arXiv:1906.02715.
34. Littell, P.; Mortensen, D.; Lin, K.; Kairis, K.; Turner, C.; Levin, L. URIEL and lang2vec. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017.
35. Sedgwick, P. Pearson’s Correlation Coefficient. BMJ 2012, 345, e4483.
36. Adi, Y.; Kermany, E.; Belinkov, Y.; Lavi, O.; Goldberg, Y. Fine-Grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
37. Conneau, A.; Kruszewski, G.; Lample, G.; Barrault, L.; Baroni, M. What You Can Cram into a Single Vector. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018.
38. Köhn, A. Evaluating Embeddings using Syntax-based Classification Tasks as a Proxy for Parser Performance. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany, 7–12 August 2016.
39. Veldhoen, S.; Hupkes, D.; Zuidema, W. Diagnostic Classifiers Revealing how Neural Networks Process Hierarchical Structure. In Proceedings of the CoCo Workshop at Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 69–77.
40. Kim, T.; Choi, J.; Edmiston, D.; Lee, S. Are Pre-trained Language Models Aware of Phrases? In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
41. Menéndez, M.; Pardo, J.; Pardo, L.; Pardo, M. The Jensen–Shannon Divergence. J. Frankl. Inst. 1997, 334, 307–318.
42. Shen, Y.; Lin, Z.; Huang, C.; Courville, A. Neural Language Modeling by Jointly Learning Syntax and Lexicon. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
43. Shen, Y.; Tan, S.; Sordoni, A.; Courville, A. Ordered Neurons. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
44. de Marneffe, M.-C.; Manning, C.D.; Nivre, J.; Zeman, D. Universal Dependencies. Comput. Linguist. 2021, 47, 255–308.
45. Zhang, K.; Shasha, D. Simple Fast Algorithms for the Editing Distance between Trees. SIAM J. Comput. 1989, 18, 1245–1262.
46. Dellert, J.; Daneyko, T.; Münch, A.; Ladygina, A.; Buch, A.; Clarius, N.; Grigorjew, I.; Balabel, M.; Boga, H.; Baysarova, Z.; et al. NorthEuraLex: A Wide-Coverage Lexical Database of Northern Eurasia. Lang. Resour. Eval. 2019, 54, 273–301.
47. Cettolo, M.; Girardi, C.; Federico, M. WIT3: Web Inventory of Transcribed and Translated Talks. In Proceedings of the 16th Conference of the European Association for Machine Translation, Trento, Italy, 28–30 May 2012.
48. Şahin, G.; Vania, C.; Kuznetsov, I.; Gurevych, I. LINSPECTOR. Comput. Linguist. 2020, 46, 335–385.
49. Brinkmann, J.; Wendler, C.; Bartelt, C.; Mueller, A. Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages. arXiv 2025, arXiv:2501.06346.
50. Wasserman, L. All of Statistics: A Concise Course in Statistical Inference; Springer: New York, NY, USA, 2004.

Figure 1. Editing distance between two trees.

Figure 2. Word-level cosine similarity correlations in Aya-Expanse-8B.

Figure 3. Word-level cosine similarity correlations in Gemma2-2B.

Figure 4. Word-level cosine similarity correlations in Gemma2-9B.

Figure 5. Word-level cosine similarity correlations in Llama-3.2-3B.

Figure 6. Sentence-level cosine similarity correlations in Aya-Expanse-8B.

Figure 7. Sentence-level cosine similarity correlations in Gemma2-2B.

Figure 8. Sentence-level cosine similarity correlations in Gemma2-9B.

Figure 9. Sentence-level cosine similarity correlations in Llama-3.2-3B.

Figure 10. Results of correlations of POS tagging accuracy with typological features in Aya-Expanse-8B.

Figure 11. Results of correlations of POS tagging accuracy with typological features in Gemma2-2B.

Figure 12. Results of correlations of POS tagging accuracy with typological features in Gemma2-9B.

Figure 13. Results of correlations of POS tagging accuracy with typological features in Llama-3.2-3B.

Figure 14. Correlation analysis of average tree distance values between attention maps for each language and the cross-lingual UD structures in Aya-Expanse-8B.

Figure 15. Correlation analysis of average tree distance values between attention maps for each language and the cross-lingual UD structures in Gemma2-2B.

Figure 16. Correlation analysis of average tree distance values between attention maps for each language and the cross-lingual UD structures in Gemma2-9B.

Figure 17. Correlation analysis of average tree distance values between attention maps for each language and the cross-lingual UD structures in Llama-3.2-3B.

Figure 18. Correlation analysis of tree distance values between attention maps across different languages and a given UD structure in Aya-Expanse-8B.

Figure 19. Correlation analysis of tree distance values between attention maps across different languages and a given UD structure in Gemma2-2B.

Figure 20. Correlation analysis of tree distance values between attention maps across different languages and a given UD structure in Gemma2-9B.

Figure 21. Correlation analysis of tree distance values between attention maps across different languages and a given UD structure in Llama-3.2-3B.

Figure 22. Correlation analysis of tree distances across attention maps in different languages and UD structures and POS tagging accuracy in Aya-Expanse-8B.

Figure 23. Correlation analysis of tree distances across attention maps in different languages and UD structures and POS tagging accuracy in Gemma2-2B.

Figure 24. Correlation analysis of tree distances across attention maps in different languages and UD structures and POS tagging accuracy in Gemma2-9B.

Figure 25. Correlation analysis of tree distances across attention maps in different languages and UD structures and POS tagging accuracy in Llama-3.2-3B.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Mantri, R.; Chen, S.; Wang, Y.; Ataman, D. Investigating the Structural Properties of Linguistic Biases in Multilingual Language Models. Information 2026, 17, 498. https://doi.org/10.3390/info17050498

