6.1. Language Similarity and Cross-Lingual Transfer
In this study, we investigated the correlation between language similarity metrics and the performance of multilingual language models trained for cross-lingual transfer learning.
The results of fine-tuning the multilingual transformer models (mBERT and XLM-R) on abusive language detection, sentiment analysis (SA), named entity recognition (NER), and dependency parsing (DEP) tasks provided valuable insights into the performance dynamics of these models across different languages and tasks. Across most tasks, XLM-R consistently outperformed mBERT, demonstrating its superior capacity for cross-lingual transfer. However, the sentiment analysis task presented a notable exception, where mBERT slightly surpassed XLM-R, suggesting that certain tasks might benefit from mBERT’s architectural features, particularly for sentiment-related nuances.
In abusive language detection and dependency parsing, XLM-R’s edge over mBERT aligned with expectations, as these tasks often require deeper syntactic and semantic representations. XLM-R’s performance in NER was particularly noteworthy, with its high zero-shot transfer scores suggesting that its robust multilingual pretraining makes it better suited to complex cross-lingual named entity recognition.
Interestingly, the results revealed that cross-lingual transfer tends to perform better between languages within the same family. For instance, Germanic and Slavic languages showed higher transfer scores when evaluated against one another, highlighting the impact of linguistic similarity on model performance. However, this pattern was less clear in the sentiment analysis task, where languages from different families also exhibited strong performance. This could indicate that sentiment analysis relies more on shared semantic or contextual features than on structural linguistic similarities.
Japanese and Korean, the only non-Indo-European languages included in the study, performed comparably to the Indo-European languages across most tasks, except for dependency parsing, where their performance was significantly lower. This finding highlights the unique challenges posed by typologically distinct languages, such as those of the Japonic and Koreanic families, especially in tasks like dependency parsing that rely strongly on syntactic alignment.
Additionally, the high performance of German, Croatian, and Russian as source languages, particularly for mBERT, suggests that certain languages may serve as better general-purpose bases for cross-lingual transfer. This aligns with previous studies, such as that of Turc et al. [116], which observed similar trends for these languages. These findings imply that leveraging specific languages as source data could enhance model transferability in multilingual contexts.
In summary, the results demonstrate that while XLM-R generally offers stronger performance in cross-lingual tasks, there are exceptions, such as sentiment analysis, where mBERT holds a slight advantage. The impact of linguistic similarity is also evident, with better transfer between languages in the same family, although this effect is task-dependent. These insights contribute to our understanding of how multilingual models perform in diverse linguistic landscapes and can guide future improvements in cross-lingual NLP applications.
The optimized WALS feature subsets (Appendix A) offer additional interpretability beyond aggregate correlation scores. Because qWALS is computed directly from typological features, optimizing the feature set per task provides a data-driven indication of which linguistic properties are most predictive of successful transfer. For example, the dependency-parsing subset emphasizes features related to syntactic configuration (e.g., word order and clause structure), consistent with DEP being strongly syntax-driven, whereas the sentiment-analysis subset is smaller and contains fewer clearly syntax-defining features, consistent with sentiment transfer being dominated by lexical and discourse cues. We therefore discuss these feature subsets as linguistic hypotheses about what matters most for transfer in each task.
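To make the discussion concrete, the following is a minimal sketch of how a qWALS-style similarity can be computed from a WALS feature table. The function name, the DataFrame layout, and the exact matching rule are illustrative assumptions on our part, not the published implementation:

```python
import pandas as pd

def qwals_similarity(wals: pd.DataFrame, lang_a: str, lang_b: str,
                     features=None) -> float:
    """qWALS-style similarity: agreement on shared WALS features.

    `wals` is a languages-by-features table with NaN for uncoded values;
    `features` optionally restricts scoring to a task-specific subset,
    e.g., one found by leave-one-feature-out optimization.
    """
    cols = list(features) if features is not None else list(wals.columns)
    a, b = wals.loc[lang_a, cols], wals.loc[lang_b, cols]
    shared = a.notna() & b.notna()        # features coded for both languages
    if shared.sum() == 0:
        return float("nan")               # no typological evidence to compare
    return float((a[shared] == b[shared]).mean())
```

An ordinal variant would replace the equality test with a normalized distance over each feature’s ordered value set (e.g., 1 − |a − b| / range), which is the distinction discussed further below.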
The analysis of the impact of linguistic similarity on zero-shot cross-lingual transfer revealed several key patterns across the tasks and models. For both Pearson’s and Spearman’s correlation coefficients, a strong relationship was generally observed between linguistic similarity, as measured by WALS and eLinguistics, and transfer performance in abusive language identification, NER, and DEP tasks. On the other hand, the correlation with the EzGlot metric was somewhat weaker, except in sentiment analysis, where it showed slightly higher correlation values than WALS and eLinguistics, particularly for mBERT.
The strongest correlation was observed in the DEP task for XLM-R, with a Spearman’s correlation coefficient of 0.897 for the eLinguistics metric, followed by abusive language identification and NER. Sentiment analysis displayed the weakest correlations overall, particularly with XLM-R, where the correlation coefficients were only moderate across all metrics. Interestingly, mBERT showed slightly higher correlations in abusive language identification and sentiment analysis, while XLM-R showed higher correlations in NER and DEP.
After removing the anchor points of monolingual source-target pairs (same-language pairs), the correlation results changed significantly for all tasks except DEP. This adjustment aimed to eliminate the bias of higher transfer scores in monolingual scenarios, which also have the highest linguistic similarity and thus automatically boost the correlation score. In abusive language identification and NER, both WALS (ordinal) and eLinguistics correlations dropped from strong to moderate, while EzGlot correlations fell close to zero and lost statistical significance. In sentiment analysis, the opposite was true: WALS (ordinal) and eLinguistics correlations degraded, while EzGlot showed only a slight drop and remained statistically significant.
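The correlation analysis itself reduces to a few lines. The sketch below, with an assumed `pairs` table, shows how Pearson’s r and Spearman’s ρ can be computed, with the monolingual anchor pairs optionally removed as described above:

```python
import pandas as pd
from scipy.stats import pearsonr, spearmanr

def similarity_transfer_correlation(pairs: pd.DataFrame,
                                    drop_monolingual: bool = False):
    """Correlate a language-similarity metric with zero-shot transfer scores.

    `pairs` holds one row per source-target pair, with columns
    'source', 'target', 'similarity', and 'transfer'.
    """
    if drop_monolingual:
        # Same-language pairs couple maximal similarity with maximal
        # transfer and therefore inflate both correlation coefficients.
        pairs = pairs[pairs["source"] != pairs["target"]]
    r, p_r = pearsonr(pairs["similarity"], pairs["transfer"])
    rho, p_rho = spearmanr(pairs["similarity"], pairs["transfer"])
    return {"pearson": (r, p_r), "spearman": (rho, p_rho)}
```

The negative coefficients reported in this section arise because the quantified metrics are oriented as distances (larger values mean less similar languages).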
Overall, the strongest correlations remained in DEP, followed by abusive language identification and NER, while sentiment analysis consistently showed the weakest correlations after removing monolingual anchor points. These findings suggest that linguistic similarity plays a more significant role in tasks like dependency parsing and abusive language identification, but its influence is less prominent in sentiment analysis, where other factors may have a greater impact on cross-lingual transfer performance.
We compared two commonly used representations of features from the World Atlas of Language Structures (WALS): ordinal and one-hot encoding. The ordinal representation retains the numeric order of feature values, reflecting underlying gradience where it exists. In contrast, one-hot encoding transforms categorical values into binary vectors, assuming no inherent structure among them.
Our evaluation revealed that ordinal encoding generally yields stronger correlations. For instance, in the dependency parsing task, which is particularly sensitive to grammatical structure, the ordinal WALS metric exhibited very strong negative correlations with model performance, outperforming the one-hot variant in all comparisons. On the other hand, one-hot features yielded better results in simpler tasks like sentiment analysis.
By treating all feature values as mutually unrelated, one-hot representations ignore the internal structure present in many WALS features. Consider, for example, the feature describing gender distinctions in pronouns. Languages range from having no gender system to distinguishing masculine and feminine, and some extend to animate/inanimate or additional noun classes. Similarly, features like the number of grammatical cases exhibit a clear ordinal structure, progressing from zero to more than ten. One-hot encoding collapses these gradations into discrete, unconnected dimensions, thereby obscuring meaningful typological similarities. Ordinal encoding, by contrast, preserves such gradience. Breaking features with inherent ordinal structure into binary variables via one-hot encoding is not only conceptually inappropriate but may also result in information loss. Such transformations obscure the gradience encoded in the original values and might diminish their capacity to capture meaningful similarities between languages.
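A toy example illustrates the difference. The case counts below are simplified stand-ins for the actual WALS codes (which distinguish more levels), but the contrast they expose is the same:

```python
import numpy as np

# Illustrative, simplified case-count levels for three languages
# on an ordered 0-4 scale (0 = no cases, 4 = many cases):
cases = {"English": 1, "German": 2, "Russian": 4}

def ordinal_distance(x: int, y: int, n_levels: int = 5) -> float:
    # Preserves gradience: German (2) is closer to English (1) than Russian (4).
    return abs(x - y) / (n_levels - 1)

def one_hot(x: int, n_levels: int = 5) -> np.ndarray:
    v = np.zeros(n_levels)
    v[x] = 1.0
    return v

print(ordinal_distance(cases["English"], cases["German"]))   # 0.25
print(ordinal_distance(cases["English"], cases["Russian"]))  # 0.75

# One-hot treats every pair of distinct values as equally dissimilar:
# the Hamming distance is 2 for *any* mismatch, so gradience is lost.
print(np.sum(one_hot(cases["English"]) != one_hot(cases["German"])))   # 2
print(np.sum(one_hot(cases["English"]) != one_hot(cases["Russian"])))  # 2
```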
We also introduced a composite metric that combined WALS-based similarity scores with a reliability factor derived from feature coverage. This score adjusted raw similarity by the extent of shared, non-missing data between each language pair, effectively down-weighting comparisons based on sparse or incomplete information. The intuition behind this was simple: even if two languages appear similar, a similarity calculated from only a handful of overlapping features is less meaningful and less reliable. By multiplying the WALS similarity (expressed as a value between 0 and 1) with the corresponding coverage values, we give more weight to comparisons grounded in richer typological evidence. However, despite this theoretical appeal, the combined similarity × reliability score did not improve correlations with zero-shot transfer performance. In several settings it even produced negative correlations. We argue that this outcome is plausible for at least three reasons. First, the reliability term is itself strongly tied to WALS coverage: when two languages share only a small fraction of features (e.g., in the most extreme cases around 30% overlap), the coverage-based reliability becomes small and the product acts like a broad penalty. This can disproportionately down-weight precisely those typologically distant or under-described language pairs for which transfer performance is both informative and noisy, changing the ranking in a way that may hurt correlation. Second, a multiplicative adjustment can distort the scale of similarities by compressing mid-range values and amplifying small differences at the extremes; if the similarity–transfer relationship is non-linear and task-dependent, such rescaling can reduce linear (Pearson) and rank (Spearman) associations. Third, because the reliability estimate is computed from the same sparse feature matrix, it can be unstable when overlap is low; multiplying by an unstable factor can inject additional noise rather than mitigate it. These negative findings suggest that reliability should be incorporated more carefully than via a single global multiplicative weight.
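For completeness, the composite score can be sketched as follows, reusing `qwals_similarity` from the earlier sketch. The coverage definition here is our assumption about what the reliability factor looks like:

```python
import pandas as pd

def coverage(wals: pd.DataFrame, lang_a: str, lang_b: str) -> float:
    """Reliability proxy: fraction of all WALS features coded for both languages."""
    a, b = wals.loc[lang_a], wals.loc[lang_b]
    return float((a.notna() & b.notna()).mean())

def weighted_similarity(wals: pd.DataFrame, lang_a: str, lang_b: str) -> float:
    # Similarity on shared features, down-weighted by the amount of
    # typological evidence behind the comparison. With ~30% overlap the
    # product is capped at 0.3, so sparsely documented pairs sink toward
    # the bottom of the ranking regardless of their true similarity --
    # the distortion discussed above.
    return qwals_similarity(wals, lang_a, lang_b) * coverage(wals, lang_a, lang_b)
```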
Additionally, the application of the leave-one-feature-out optimization method led to substantial improvements in the results, as demonstrated by the refined Quantified WALS metric. The refined metric showed notable gains in Pearson correlation coefficients across all tasks. For abusive language identification, the Pearson correlation (r) improved significantly from −0.646 with 169 features to −0.8221 with just 53 features. This considerable increase highlights the refined metric’s effectiveness in capturing the most relevant linguistic similarities for this task.
In dependency parsing, the Pearson correlation surged from −0.771 (169 features) to a nearly ideal −0.9903 (75 features). This dramatic improvement indicates that the refined metric more accurately reflects the linguistic similarities pertinent to the DEP task. Named entity recognition also saw a significant enhancement, with the Pearson correlation strengthening from −0.561 to −0.808 while reducing the feature count from 169 to 63. This indicates that the refined metric better captures the linguistic characteristics relevant to NER. Lastly, for sentiment analysis, the correlation coefficient improved from −0.3609 (169 features) to −0.8027 (21 features), demonstrating the refined metric’s improved ability to account for linguistic nuances affecting sentiment analysis.
Overall, these improvements clearly indicate that the leave-one-feature-out optimization method not only streamlined the feature set but also enhanced the metric’s effectiveness in predicting cross-lingual transfer performance across different tasks. By eliminating less important and redundant features, the refined metric became more sensitive to the linguistic characteristics that genuinely impact similarity and transfer success. This enhancement led to more accurate predictions of transfer performance between language pairs, facilitating more effective cross-lingual applications. The reduction in computational complexity, coupled with the improved reliability and validity of the similarity measurements, confirms the effectiveness of the refinement process.
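The optimization procedure can be approximated by a greedy backward elimination. This sketch reuses the helpers defined earlier and, on each pass, drops the single feature whose removal most strengthens the correlation; the exact search strategy used in our experiments may differ in detail:

```python
from scipy.stats import pearsonr

def leave_one_feature_out(wals, pairs, features):
    """Greedily drop WALS features while the Pearson correlation between
    qWALS similarity and transfer score keeps improving in magnitude."""
    def strength(feats):
        sims = [qwals_similarity(wals, s, t, feats)
                for s, t in zip(pairs["source"], pairs["target"])]
        return abs(pearsonr(sims, pairs["transfer"])[0])

    current, best = list(features), strength(features)
    while len(current) > 1:
        # Score every single-feature removal; keep the most helpful one.
        trials = {f: strength([g for g in current if g != f]) for f in current}
        f_best, s_best = max(trials.items(), key=lambda kv: kv[1])
        if s_best <= best:
            break                    # no single removal improves the fit
        current.remove(f_best)
        best = s_best
    return current, best
```

Because the subset is selected and evaluated on the same language pairs, the resulting coefficients are optimistic by construction, which is exactly why the validation discussed next matters.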
However, while these results were promising, it is essential to recognize that correlation does not imply causation. The observed improvements could be influenced by random factors, and further validation is needed to confirm these findings. Future work will involve additional tests and cross-validation to ensure that the enhanced correlations are genuinely indicative of improved metric performance and not merely statistical anomalies. This will help solidify the refined metric’s utility in both linguistic research and practical language technology applications. We also plan to test the influence of language similarity under other forms of model alignment, such as continued pretraining or probing (updating only the last linear layer), since it has been pointed out that fine-tuning can distort pretrained features and, in effect, underperform in out-of-distribution settings [122].
The comparison between the optimized qWALS features and the various Lang2vec feature groups also revealed complementary aspects and unique strengths of each approach, yielding valuable insights into their respective roles in capturing linguistic similarities.
Both the optimized qWALS features and Lang2vec’s grammatical feature group cover a broad spectrum of syntactic, morphological, and phonological characteristics. They underscore the importance of grammatical structures in understanding linguistic nuances across languages. For instance, features such as word order, negation patterns, and syntactic constructions were central to both qWALS and Lang2vec, highlighting their shared emphasis on grammatical elements.
In terms of semantic and pragmatic features, both approaches strive to capture contextual meaning and language use. Lang2vec provides a more extensive overview of semantic and pragmatic categories, encompassing a wider range of generalizable features. In contrast, qWALS focuses on specific semantic distinctions, modalities, and pragmatic markers that are particularly relevant for tasks like sentiment analysis and named entity recognition. This specificity in qWALS allows for a better understanding of language that can enhance task-specific performance.
Lang2vec also includes geographical and typological features that account for regional and typological variations in language. These features consider factors such as language families, geographical proximity, and typological traits. On the other hand, qWALS features are more concentrated on linguistic properties and their direct impact on specific tasks, with less emphasis on geographical or typological aspects. This focus makes qWALS particularly effective for analyzing linguistic traits and their relevance to targeted NLP tasks.
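As a point of reference, Lang2vec vectors can be queried directly with the lang2vec Python package. The snippet below is a usage sketch: the "syntax_knn" and "geo" feature sets are part of the library, but the cosine-similarity computation is our own illustrative choice rather than the comparison method used in this study:

```python
import numpy as np
import lang2vec.lang2vec as l2v

def l2v_cosine(lang_a: str, lang_b: str, feature_set: str = "syntax_knn") -> float:
    """Cosine similarity between two languages in one Lang2vec feature group.

    Languages are given as ISO 639-3 codes; the '_knn' sets are fully
    imputed, so no missing-value handling is needed here (unlike the
    raw '_wals' sets, which contain gaps).
    """
    feats = l2v.get_features([lang_a, lang_b], feature_set)
    a = np.asarray(feats[lang_a], dtype=float)
    b = np.asarray(feats[lang_b], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(l2v_cosine("deu", "eng"))          # Germanic pair: relatively high
print(l2v_cosine("deu", "jpn", "geo"))   # geographic features dominate here
```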
The comparison highlights the advantages of integrating both approaches for a more comprehensive understanding of linguistic similarity. While Lang2vec provides a broad overview that incorporates regional and typological variations, qWALS offers detailed insights into specific linguistic traits and their task-specific relevance. Combining these perspectives can lead to more robust and interpretable linguistic similarity metrics.
An integrated approach that leverages the strengths of both qWALS and Lang2vec could significantly enhance the accuracy and adaptability of linguistic similarity metrics. By incorporating geographical and typological insights from Lang2vec alongside the detailed linguistic features identified by qWALS, researchers can develop models that are not only more accurate but also more generalizable across different languages and tasks. This integrated perspective paves the way for improved natural language processing models and more effective cross-lingual applications, capturing both linguistic universals and language-specific variations.
Overall, the findings demonstrate strong correlations between language similarity and model performance across different tasks, supporting the proposed hypothesis that models benefit more from cross-lingual transfer when the source and target languages are similar. However, the variability in correlation across tasks indicates that certain tasks might depend more heavily on language similarity than others, which opens up an important avenue for further research.
6.2. Expert Survey
One of the key contributions of this work was the optimization of the qWALS language similarity score, which yielded correlations up to −0.99 in some cases. This suggests that language similarity can be a highly reliable predictor of model performance in cross-lingual tasks. However, the caveat remains that correlation does not imply causation. The unknown, non-measurable ideal language similarity complicates this analysis, as we must rely on proxy measurements like qWALS and cross-lingual transfer results. Thus, while we presented evidence of strong correlations, establishing a direct causal link between language similarity and model performance required additional verification.
To address this challenge, we designed an expert-based survey as a third independent variable, which helped infer causal relations between the measured variables using methods from causal inference theory. This triangulation of evidence, based on qWALS scores, cross-lingual transfer performance, and expert survey results, provided a compelling case for the causal role of language similarity in improving cross-lingual transfer learning. The results showed that expert judgments aligned with both computational similarity scores and model performance, particularly for high-agreement languages such as Serbian-Croatian and Russian. This bolstered our confidence in the hypothesis, but also revealed that some languages (e.g., Indonesian and Arabic) remain challenging to categorize consistently, reflecting the complexity of linguistic diversity.
The expert survey also highlighted the inherent difficulties in estimating language similarity, as evidenced by the variation in agreement rates among experts. This variation points to the subjectivity in human judgment, especially for languages with more complex historical and cultural relationships. Nevertheless, the moderate-to-high agreement rates between expert evaluations and computational methods reinforce the validity of using both human expertise and algorithmic measures to assess language similarity.
The survey results revealed insightful correlations between linguistic similarity metrics, transfer performance, and expert evaluations. The agreement between linguists and qWALS was notably high, indicating that qWALS’s assessments of linguistic similarity align closely with expert judgments. Linguists generally concurred with qWALS’s similarity scores for languages like Serbian-Croatian, Russian, and French, where agreement rates were highest. This suggests that qWALS provides a robust representation of linguistic similarity as perceived by experts. In contrast, the lang2vec system demonstrated lower agreement with the linguists’ choices. This discrepancy suggests that while lang2vec is a useful similarity metric, it may not align as closely with expert evaluations, particularly for languages such as Indonesian and Arabic, which showed weaker agreement.
Regarding transfer performance, the correlation with expert judgments varied across models and tasks. The best-performing language models, based on cross-lingual transfer scores, exhibited substantial agreement with Linguist 1 and Linguist 2, reflecting that high-performing models in cross-lingual tasks often align well with expert evaluations. However, Linguist 3 showed lower agreement with these models, highlighting that transfer performance may not consistently align with expert judgments across all linguists. Disagreements arose mostly in ambiguous cases, such as whether Japanese is closer to Hindi or to Portuguese. The survey also indicated that specific language tasks, such as dependency parsing and abusive language detection, demonstrated the highest levels of agreement with experts. This suggests that language models excelling in these tasks may better match expert evaluations. Conversely, tasks like sentiment analysis, which are relatively simpler and often rely on specific keyword matching even in transformer language models, showed only moderate alignment with expert judgments.
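The agreement figures discussed here reduce to a simple matching rate. The survey format assumed below (each judge, whether an expert or a metric, names the single most similar language for every target) is our reading of the setup, and the example choices are hypothetical:

```python
def agreement_rate(choices_a: dict, choices_b: dict) -> float:
    """Share of target languages for which two judges (an expert, a metric,
    or a model-based ranking) name the same most-similar language."""
    shared = choices_a.keys() & choices_b.keys()
    return sum(choices_a[t] == choices_b[t] for t in shared) / len(shared)

# Hypothetical illustration: an expert vs. a qWALS nearest-neighbour pick.
expert = {"Croatian": "Serbian", "Russian": "Croatian", "Japanese": "Korean"}
qwals  = {"Croatian": "Serbian", "Russian": "Croatian", "Japanese": "Hindi"}
print(agreement_rate(expert, qwals))   # 2/3, about 0.67
```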
Overall, the survey highlights that qWALS aligns strongly with expert evaluations of linguistic similarity. Moreover, language similarity, as determined by the experts, also accurately predicts transfer performance. High-performing models in specific language tasks tend to align better with experts, though variability exists, particularly among different linguists and tasks. This underscores the complexity of aligning computational metrics with human judgments and suggests that both linguistic similarity metrics and transfer performance need to be considered together for a comprehensive understanding of language model effectiveness.
The inclusion of the survey as a third independent variable offers a clearer view of the causal link between language similarity and transfer performance. By comparing the correlations between expert evaluations, linguistic similarity metrics, and transfer performance, we can infer with greater confidence that a causal relationship exists between language similarity and the performance of transfer learning. Specifically, the survey results provide additional validation for the proposed hypothesis that language similarity influences transfer performance. The strong alignment between expert evaluations and qWALS, combined with the observed correlation between qWALS scores and model performance, suggests that language similarity in general, and as measured by qWALS in particular, likely plays a causal role in improving transfer performance. This is supported by the fact that languages with high agreement rates among experts also tend to show better performance in transfer learning tasks.
The consistency of correlations between qWALS and expert evaluations strengthens the argument for causation, as it demonstrates that a well-established similarity measure correlates strongly with model performance. The survey data help confirm that when the similarity between languages is accurately represented (as indicated by qWALS), it correlates with better performance in cross-lingual transfer learning tasks. However, the variability in agreement with lang2vec, as well as the differences among expert linguists’ evaluations, shows that the influence might not be uniform across all metrics, tasks, or contexts.
While the survey data strengthens the evidence for a causal relationship between language similarity and transfer performance, it also highlights that this relationship is complex and influenced by various factors. The presence of strong correlations and the consistency of qWALS with expert judgments support the causation hypothesis, but variability in other metrics and evaluations indicates that further investigation is necessary to fully understand the causal dynamics.
Thus, even though qWALS can be considered an effective tool for predicting model performance in cross-lingual settings, there is still room for refinement, particularly in tasks where language similarity might play a lesser role. Furthermore, the success of the expert-based survey in enhancing the interpretation of our findings opens up possibilities for hybrid approaches that combine computational and human insights to further optimize language models for cross-lingual transfer.
In short, while our findings offer robust support for the influence of language similarity in cross-lingual transfer learning, the complexity of language relationships and task-specific dependencies highlights the need for further investigation. Future work should explore other factors, such as cultural or typological features, and their impact on transfer learning, as well as the development of even more precise language similarity metrics. Additionally, expanding the scope of expert surveys to cover more languages and refining the methodology to reduce subjectivity could further enhance our understanding of the role language similarity plays: practically, in model performance, and more generally, in language itself as a tool for everyday communication.
6.3. Limitations and Ethical Considerations
This study is subject to several limitations and ethical considerations that must be acknowledged. One significant limitation is the incomplete WALS feature coverage, which affects low-resource languages most strongly. For some language pairs, only a small fraction of the available WALS features is shared (as low as roughly 30% overlap), meaning that qWALS is computed from a limited and potentially unrepresentative subset of typological properties. In such cases, similarity estimates become more sensitive to which specific features happen to be available (and to systematic missingness), which increases variance and reduces metric stability. As a result, conclusions involving low-resource languages and, more generally, any pairs with low WALS overlap should be interpreted cautiously, and we expect stronger and more stable associations once typological coverage improves.
A related limitation is the absence of a dedicated Romance-language cluster in our experimental set. Without at least two Romance languages tested on the same task, we cannot directly verify that qWALS preserves fine-grained similarity within this well-studied family. This limits the strength of our universality claims: the current results demonstrate cross-lingual transfer trends across the included families, but they do not yet establish that the metric is equally precise within Romance languages or that similar within-family behavior would necessarily generalize.
Another inherent limitation is the reliance on existing linguistic similarity metrics, such as qWALS or lang2vec, which may not fully capture the complexities of language relationships. These metrics are based on predefined sets of linguistic features, and their ability to accurately represent language similarity across a diverse set of languages can be constrained by their inherent limitations and the quality of the data they are based on. To address these issues, future research should explore alternative resources, such as the Grambank project (https://grambank.clld.org/, accessed on 4 March 2026), which offers an updated, more comprehensive set of linguistic features and may provide a more accurate assessment of linguistic similarity, especially for low-resource languages.
Another limitation is the potential bias introduced by the choice of language models and tasks used in this study. The focus on specific models like mBERT and XLM-R, and on tasks such as abusive language detection and sentiment analysis, may not generalize to other models or linguistic tasks. Different models may exhibit varied performance landscapes, and the tasks chosen may not encompass the full range of linguistic phenomena that could affect transfer performance.
Additionally, the study’s methodology involves a relatively small sample of languages, which may not be fully representative of the global linguistic landscape. This limited sample size can impact the generalizability of the findings and may overlook important nuances present in less-represented languages or language pairs.
More concretely, our main experiments cover only eight languages drawn from three language families. Therefore, the observed patterns may not transfer to typologically distant or underrepresented language families (e.g., Afro-Asiatic, Niger–Congo, Austronesian). Likewise, because we evaluate only two multilingual encoder models (mBERT and XLM-R), the results may not extend to newer model families and paradigms (e.g., large decoder-only LLMs such as GPT-4, LLaMA, or Claude), which differ substantially in pretraining data, objectives, and inference-time behavior.
The survey-based approach to evaluating linguistic similarity and transfer performance introduces another limitation. Subjectivity is inherent in human judgments, even among experts, and can lead to variability and inconsistencies in the assessment of language similarity. While the survey provides valuable insights, the variability among experts highlights the challenge of achieving consensus on linguistic relationships and may affect the reliability of the conclusions drawn from these evaluations.
Finally, several ethical considerations arise in the context of this research. Ensuring the fair and equitable treatment of all languages, including those with fewer resources, is essential. It is important to recognize that linguistic diversity extends beyond the well-represented languages, and efforts should be made to include and accurately represent currently under-represented languages in future research. This includes being mindful of potential biases in language data and ensuring that the models do not perpetuate or exacerbate linguistic inequalities. Additionally, researchers should consider the broader implications of their work for the communities and cultures associated with the studied languages, ensuring that the outcomes of their research contribute positively to the field of natural language processing and its applications.