Article

A Chemistry-Inspired Cross-Lingual Transfer in Multi-Lingual NLP via Graph Structural Optimization

by Befekadu Bekuretsion 1,*, Wolfgang Menzel 2 and Solomon Teferra 1
1 School of Information Science, Addis Ababa University, Addis Ababa P.O. Box 1176, Ethiopia
2 Department of Informatics, University of Hamburg, Vogt-Kölln-Strasse 30, D-22527 Hamburg, Germany
* Author to whom correspondence should be addressed.
AI 2026, 7(5), 151; https://doi.org/10.3390/ai7050151
Submission received: 11 February 2026 / Revised: 4 April 2026 / Accepted: 8 April 2026 / Published: 23 April 2026
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

Multilingual learning is key in natural language processing, but it is challenged by the transfer–interference trade-off, where positive transfer benefits certain languages while negative interference affects others. Prior methods, including linguistic-based and embedding-based language clustering, have attempted to address this; yet, they remain constrained by their static design and lack of task-specific feedback. In this study, we propose a novel computational strategy inspired by molecular design, which constructs molecules with targeted properties. Languages are modeled as nodes in an undirected graph, with edges representing the transfer strength. This language molecule is optimized via Reinforcement Learning, which adjusts edge connections and weights to enhance positive transfer and minimize interference. Graph clustering is then applied to the optimized molecule, and the resulting clusters are evaluated on the Named Entity Recognition and POS tagging tasks using 25 languages from the WikiANN dataset and 12 typologically diverse languages from the UDPOS dataset. Compared to linguistic and embedding-based language clustering baselines, our method yields substantial improvements, especially for low-resource languages, with some showing an increase of over 35% in F1 score, while high-resource languages benefit moderately, confirming a reduced transfer–interference trade-off. Our atom–language model offers a novel path for multilingual learning, inspired by molecular principles from the physical sciences.

1. Introduction

In the presence of diverse languages, multilingual learning has become a core method in natural language processing, particularly for many low-resource languages that are already underrepresented in the data. Modern approaches leverage joint training among different languages to enable cross-lingual transfer between datasets of participating languages. However, such an approach suffers from the transfer–interference trade-off, i.e., although some languages benefit from the transfer, the same improvement may not occur in other languages because of negative interference due to imbalanced data availability or linguistic dissimilarity.
Several attempts have been made to address the transfer–interference trade-off problem: multilingual pretraining, such as Multilingual Bidirectional Encoder Representations from Transformers (mBERT) and Cross-lingual Language Model RoBERTa (XLM-R), together with the handling of data imbalance [1,2,3]; and cross-lingual transfer through Byte Pair Encoding (BPE), artificial languages, adapters, and alignment [4,5,6,7,8,9].
Other studies also addressed the issue using linguistic features and typology from the World Atlas of Language Structures (WALS), word order, and morphology [10,11,12,13,14,15,16,17]; clustering strategies such as clustering based on linguistic cues or embeddings [18,19,20]; and hybrid models such as the integration of typological information with embeddings [21,22].
Cross-lingual transfer has made recent advances in low-resource settings by using denoising-based multi-source transfer [23], leveraging multi-level semantic alignment [24], utilizing language adapters [25], or using retrieval and context-aware learning [26].
Our research focuses on the language-clustering approach, which subsumes linguistically informed and embedding-based approaches. Linguistically informed clustering (referred to as linguistic clustering for simplicity) uses genealogical or typological similarities to group languages into clusters. This method assumes that genealogical proximity implies transferability, which is not always the case, and it fails to capture task-specific compatibility in the cross-lingual transfer process. Its static nature and its inability to adapt to varying tasks or data lead to suboptimal grouping of languages for transfer learning.
The embedding-based approach rests on similarity at the level of abstract numerical representations used in language models. It assumes that languages that are similar in the embedding space will transfer knowledge better (or profit more) during joint training in the same cluster. Like the linguistic clustering method, the embedding-based approach ignores task-specific information. Moreover, it is influenced by pretraining data artifacts (e.g., token frequencies, shared domains, scripts) that may not directly align with the task under consideration. For both methods, no direct mechanism exists to evaluate the transfer or interference of language clusters and make the necessary corrections, i.e., they are static. In other words, no performance feedback or correction mechanism exists during training if two languages have similar embedding or linguistic features, but interfere. Embedding-based clustering learns from a flat structure and is unable to learn rich transfer–interference patterns. Language clustering is heavily influenced by the quality of clusters; i.e., negative transfer due to poor or noisy clusters may reduce overall performance.
To address the above problems, we transform the static method (used in linguistic or representation-level embedding-based approaches) into a dynamic, task-aware, and adaptive cluster learning approach that is data-aware with a performance-driven feedback mechanism. It also contains a mechanism that directly controls the transfer and interference patterns.
Inspired by molecular discovery from computational chemistry, we propose a novel approach to address the transfer–interference trade-off problem of multilingual training using a graph-based language clustering method. In our atom–language model, each language is represented as an atom, and the transfer strength between languages is represented as a chemical bond connecting them. The resulting graph is referred to as a ‘language molecule’ and used interchangeably with a language graph in the coming sections. In computational chemistry, atoms and their chemical bonds are manipulated to form molecules with desired chemical and physical characteristics. Our approach manipulates the connections between languages in the language molecule to come up with an optimized graph with desired characteristics, for instance, maximum performance in a downstream task, such as Named Entity Recognition (NER) or Part-Of-Speech (POS) tagging tasks. This structural optimization is accomplished by a Reinforcement Learning (RL) agent optimizing a non-flat network structure.
More specifically, we model languages as nodes in a graph, with edges initialized based on the typological similarity between them. The RL agent iteratively updates this graph by adding, removing, or modifying edges (representing transfer strengths) to maximize the desired performance of the downstream task across all languages (the F1 score of the NER or POS tagging task, which plays the role of the target molecule characteristic) while preserving other desirable graph properties (e.g., clustering coefficient, spectral gap). Once this training is finished, spectral clustering is applied to the optimal language molecule to obtain optimized language clusters, which are then used to train multilingual NER or POS tagging models with a reduced effect of the transfer–interference trade-off.
After experimental validation on the NER and POS tagging tasks using 25 languages from the WikiANN, and 12 typologically diverse languages from the UDPOS (Universal Dependencies Part-Of-Speech) dataset, respectively, our results demonstrate significant and consistent improvements over the linguistic-based and embedding-based baselines. More importantly, compared to these baselines, low-resource languages benefit substantially from our graphical approach, with F1 score gains exceeding 35% in some cases, suggesting that transfer is successfully promoted. Even high-resource languages show moderate improvements, better than the two baselines, indicating that our approach effectively mitigates interference.
To make our approach clearer, we describe a simple, intuitive example using a small number of languages, namely six languages chosen from the WikiANN dataset: English, German, Spanish, French, Hindi, and Pashto. The languages are chosen systematically; for instance, Spanish and French are closely related, while English and Hindi are distant. When these languages are trained jointly, positive transfer occurs between similar languages, improving performance, whereas dissimilar languages may introduce interference.
To mitigate this problem, our approach models languages as nodes in a language graph, and edges connecting the languages represent the degree of transfer, i.e., how beneficial it is to train using the merged dataset of the connected languages. After initializing the edges with linguistic similarity values between the connected languages, the RL agent iteratively manipulates the graph to strengthen connections that promote performance (transfer) and simultaneously weakens or removes interfering connections that reduce performance.
To make the demonstration more concrete, if Spanish and French, when trained together, increase performance, the RL agent makes their connection stronger. On the other hand, the connection between Spanish and French will be weakened or removed if their joint training leads to a reduction in performance. When this action is continued many times, languages that are strongly connected in the graph inherently form groups.
Groups of languages are formed when a clustering algorithm is applied to the optimized graph. These clusters are used to train a multilingual task separately, i.e., languages in the same cluster are trained together, promoting transfer and reducing interference. The method can dynamically discover task-specific clusters based on a feedback mechanism from downstream task performance instead of the previous static clustering approaches, such as linguistic or embedding-based methods.
The main research questions raised by this research are: (1) Can we learn language clusters dynamically in a task-specific manner, instead of relying on static typological or embedding methods? (2) Does modeling transfer interference as a structural optimization of a graph help to better transfer cross-lingual knowledge, especially for low-resource languages?
This work has both theoretical and practical contributions to multilingual NLP, particularly in addressing its transfer–interference trade-off. The theoretical contribution involves proposing a novel, task-aware, language clustering framework with a feedback mechanism and a graph-based formulation of cross-lingual transfer. Our atom–language model, which uses atoms and chemical bonds to represent languages and their relationships (or transfer/interference strength), establishes a novel pathway, with substantial future research implications, to apply insights from the physical sciences (such as chemistry and physics) to computational linguistics. The practical contribution includes an extensive experimental validation of the proposed framework in the NER and POS tagging tasks using 25 languages from the WikiANN dataset, and 12 typologically diverse languages from the UDPOS dataset. Our approach provides robust support for low-resource languages without degrading high-resource performance, even in multilingual systems that contain diverse languages. We show that the hidden transfer–interference characteristics of languages can be revealed through structural or network interaction from a graphical representation.

2. Related Work

In our world of diverse languages, the importance of multilingual models is growing. However, these languages are not equally rich with respect to the data used to produce such models. Hence, low-resource languages need to learn from their high-resource counterparts, whose performance may be degraded by negative transfer from low-resource data. Several attempts have been made to alleviate this transfer–interference trade-off: multilingual pretraining (mBERT, XLM-R, data imbalance), cross-lingual transfer (BPE, artificial languages, adapters, alignment), linguistic features and typology (WALS, word order, morphology), clustering strategies (linguistic vs. embedding-based approach), and hybrid models (integrating typological information with embeddings).
Large pre-trained language models, such as Multilingual BERT (mBERT) [1], and XLM-RoBERTa [2], have been developed in the past to accommodate a large number of low-resource languages alongside high-resource ones. These models perform well in several tasks that were not originally used to train them in a task-specific mode. However, because the quality of the model depends on the presence of a sufficient amount and quality of language-specific data, only a small number of high-resource languages benefit, while several low-resource languages remain weakly represented [3].
Language diversification is also addressed by cross-lingual transfer, which allows knowledge represented in high-resource languages to be transferred to low-resource languages. In this regard, multilingual sentence embeddings were employed [4], while subword modeling using Byte-Pair Encoding [5] was applied to facilitate transfer [6]. Furthermore, artificial languages were used to analyze where and why cross-lingual transfer is successful [7]. Other studies enhanced models’ generalization to unseen languages by employing alignment-based methods, such as few-shot and zero-shot methods, using parallel data [8], or adapter-based transfer [9]. However, these methods are not effective without a shared representation at the subword or semantic level.
Beyond direct transfer, for typologically different languages, explicit language features, such as typology or phylogenetic data, were used to augment the model’s training. This improves the generalization of the model to unseen languages with overlapping linguistic features. For example, multilingual fine-tuning is improved using morphology and word order [10,11], while cross-lingual transfer models were enhanced through guidance from structural typology [12,13]. Furthermore, the effect of typological priors was investigated on different tasks such as machine translation [14,15], dependency parsing [16], and pre-training objectives [17].
Cross-lingual transfer, particularly for low-resource languages, has seen recent advances. For instance, denoising-based multi-source transfer has been proposed to increase performance by reducing noisy supervision [23]. Similarly, multi-level semantic alignment has been leveraged to improve zero-shot cross-lingual transfer at different linguistic levels [24]. Language adapters have been used to promote cross-lingual transfer without modifying the full model parameters, resulting in a parameter-efficient method [25]. An analysis investigating the robustness of cross-lingual transfer stated that performance can be improved when tested on low-resource settings, suggesting a more reliable mechanism. A more recent work utilizes retrieval and context-aware learning to improve cross-lingual performance [26]. Despite these advances, most of the existing approaches rely on static or implicitly learned cross-lingual relationships between languages.
Language clustering partitions languages into groups based on similarity, with each group used to train the multilingual model separately, enabling effective and efficient knowledge transfer. The embedding-based clustering approach outperforms the linguistic-based approach in a machine translation task [18]. In another study, embedding-based clustering was found to perform better than typological-based clustering in NER tasks [19]. This embedding-based approach is outperformed by a recent linguistic-based clustering method [20]. Hence, we used both approaches as a baseline for our experiment, which was conducted in the same setting.
Generally, language clustering has emerged as a promising technique to mitigate the transfer–interference trade-off and to enhance the ability of multilingual models to generalize across languages. Language clustering allows high intra-cluster transfer while blocking negative transfer from the languages in other clusters. This is especially useful for low-resource languages with sparse data. However, the main limitation of language clustering is that it is heavily influenced by the quality of the chosen clusters, i.e., negative transfer due to poor or noisy clusters may reduce overall performance. Furthermore, the decision on the similarity metric and the grouping method of the clustering that should be taken before the clustering begins may not be generalizable across tasks and domains. Despite these challenges, language clustering is particularly attractive, as it provides an opportunity to directly inspect and interpret model details to feed these insights back into further improvements.
Although classifications based on linguistic features are explainable and can capture etymological insights, they are sometimes outperformed by the embedding approach, which has a superior ability to capture nuanced data-driven features not found in linguistic classification. Another approach recommends the use of hybrid methods [27]. The hybridization of contextual embedding with the syntactic features of WALS [21] and the combination of typological embedding with pretrained multilingual models [22] were investigated. This indicates that linguistic insights remain promising for enhancing modern machine learning and deep learning models.

3. Materials and Methods

Existing language clustering approaches use an embedding (data-driven), linguistic, or hybrid approach. Inspired by computational chemistry, particularly molecular discovery, which generates molecules that have target properties, our design, as shown in Figure 1, considers participating languages as nodes represented as a language molecule, whose edges are instantiated with linguistic similarity and can be modified using RL optimization to a specified metric in an NLP task (chemical property). A high value of edge weight between languages in the language molecule represents high transfer and vice versa. The final language molecule is then broken down to find clusters with high transfer and low interference. The main motivation is that the transfer–interference trade-off can be better addressed by leveraging a connected language graph, which allows direct manipulation of its edges toward high transfer or low interference to a greater degree than the previous approach, which relied on embedding or linguistic typology.
First, the language molecule is instantiated as a fully connected graph whose edge values are initialized with the linguistic similarity value calculated using the language distance between the connected languages. This language molecule is then passed through a generative process by making it part of RL, i.e., the language molecule is considered as an environment in which the agent takes action on it, such as adding and removing edges, and modifying edge values. We limit the actions only to edges, not to the languages, so that the final language molecule will contain all the participating languages.
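The edge-level actions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `apply_action`, the string action kinds, and the clamping of weights to [0, 1] are all assumptions made for the example. Note that the set of languages is never touched, only the edges.

```python
from typing import Dict, FrozenSet, Optional

Edge = FrozenSet[str]
Weights = Dict[Edge, float]


def edge(a: str, b: str) -> Edge:
    """Undirected edge key, so edge('es', 'fr') == edge('fr', 'es')."""
    if a == b:
        raise ValueError("self-loops are not part of the language graph")
    return frozenset((a, b))


def apply_action(weights: Weights, kind: str, a: str, b: str,
                 value: Optional[float] = None) -> Weights:
    """Return a new edge-weight map after one edge-level action.

    kind is 'add', 'remove', or 'modify' (hypothetical names).
    Weights are clamped to [0, 1], which is an assumption of this sketch.
    """
    key = edge(a, b)
    updated = dict(weights)  # leave the caller's graph untouched
    if kind == "remove":
        updated.pop(key, None)
        return updated
    if kind not in ("add", "modify"):
        raise ValueError(f"unknown action kind: {kind}")
    if value is None:
        raise ValueError(f"'{kind}' needs a weight value")
    clamped = min(1.0, max(0.0, value))
    if kind == "add" and key not in updated:
        updated[key] = clamped
    elif kind == "modify" and key in updated:
        updated[key] = clamped
    return updated


# Hypothetical initial weights for three languages.
graph = {edge("es", "fr"): 0.8, edge("es", "hi"): 0.3, edge("fr", "hi"): 0.3}
graph = apply_action(graph, "remove", "hi", "es")        # drop an interfering edge
graph = apply_action(graph, "modify", "fr", "es", 0.95)  # strengthen a helpful edge
```

Returning a fresh dictionary instead of mutating in place keeps earlier states available, which is convenient when an RL environment needs to compare the graph before and after a step.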
The actions change the language molecule towards an optimum value, high transfer but low interference, using reward values calculated from the molecule itself and used to guide the RL. The reward function involves values calculated from the specific NLP task (such as NER or POS tagging F1 score) after being trained and validated using a small portion of a multilingual dataset. Although all participating languages are included, only a small amount of each language’s data is used because this step is a preliminary one, which serves only the purpose of finding the optimal clusters that later on will be trained and tested using the full dataset. In addition, measures that encode the target properties of the language molecule can be calculated from the graph structure and combined to represent the target property of the language molecule.
The reward value in the above RL algorithm is calculated as a weighted sum of three normalized values. The first value captures the empirical characteristics of the graph. The transfer strength of an edge between two languages in a language molecule is measured by the performance of the considered NLP task after training and evaluating on the merged dataset of the two languages. The average performance score of the edges is calculated on a portion of the validation dataset, weighted by the current edge value (initialized from linguistic proximity) of every language pair whose edge is present in the current language molecule. Every such score is stored and reused when the same edge evaluation reappears, significantly reducing computational overhead. This reward enables the RL algorithm to produce a graph in which languages with high transfer are connected by edges with high values, while those that interfere end up with smaller edge values or disconnected nodes. Hence, transfer is promoted between similar languages, while interference is penalized among incompatible languages.
The second (spectral transfer of the graph) and third (average clustering coefficient) values are related to structural characteristics of the graph. Spectral transfer (the difference between the largest and second-largest eigenvalues of the Laplacian) measures the global ease of information propagation. Higher values promote smooth information flow, so transfer is rewarded while uneven connections or interference spots are penalized. The third value, the average clustering coefficient (for a single node, the ratio of the actual number of edges among its neighbors to the maximum possible number of such edges, averaged across all nodes), measures the potential to form local groups in the language molecule. High local clustering promotes the grouping of typologically similar, compatible languages (high transfer) into local language communities within the language molecule, while pushing out dissimilar ones (high interference).
Generally, the reward values balance the empirical value (e.g., the F1 score for the NER or POS tagging task) with global and local structural rewards (spectral transfer and clustering coefficients, respectively). Using these reward values, the RL agent is able to learn which graphs combine a high degree of transfer with a low degree of interference.
In addition to the rewards, the agent requires the current state of the environment to decide on the next action. The degree of the language molecule (the total number of edges connected to each node) is such a parameter. The mean value of the performance scores of each connected language in the language molecule also serves the same purpose. Other state attributes are also used to measure how closely the neighbors of a language are connected (the clustering coefficient for each language) and how strongly the graph is connected, with few weak or disconnected spots (algebraic connectivity). All these values for all the languages in the language molecule are concatenated into a single vector to represent the state of the graph. The RL agent observes the current state and iteratively takes actions until a preset maximum number of steps has been reached, ending in an optimal language graph with high transfer and low interference.
We aim to produce better clusters that, if trained separately, provide better results than those formed by previous baseline methods. Therefore, we need to choose a clustering method that operates on the edge values of the language molecule and its global connectivity. Spectral clustering is a suitable method that fulfills this requirement, i.e., the resulting language clusters facilitate a high degree of transfer combined with a low degree of interference among the participating languages.
Generally, after linguistic priors are used to instantiate the edge values of the language molecule represented as an undirected graph, we leverage generative AI driven by RL to create a language molecule that is optimized structurally (by adding and removing its edges) and parametrically (by modifying the corresponding edge values) towards high transfer and low interference. This is then followed by graph clustering to find optimal language clusters to be trained and used separately in a downstream task. An undirected graph is chosen because the relationship among languages is correlational rather than causal.
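To illustrate how clusters can be read off a weighted language graph, the sketch below implements the simplest two-cluster case of spectral clustering: nodes are split by the sign of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). This is a simplification chosen for the example. The general k-way case would instead cluster the rows of the first k eigenvectors, and the language codes and weights shown are hypothetical.

```python
import numpy as np


def spectral_bipartition(weights: np.ndarray) -> np.ndarray:
    """Split a connected weighted undirected graph into two clusters.

    weights is a symmetric matrix of non-negative edge weights with a
    zero diagonal. Returns one 0/1 label per node. Which cluster gets
    label 0 is arbitrary, because eigenvector signs are arbitrary.
    """
    W = np.asarray(weights, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W    # unnormalized Laplacian L = D - W
    _, eigvecs = np.linalg.eigh(laplacian)    # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                   # second-smallest eigenvalue
    return (fiedler > 0).astype(int)


# Hypothetical optimized graph: two tightly bonded groups, one weak bridge.
langs = ["es", "fr", "it", "hi", "mr", "ps"]
W = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                W[i, j] = 0.9
W[2, 3] = W[3, 2] = 0.05

labels = spectral_bipartition(W)
clusters = {}
for lang, label in zip(langs, labels):
    clusters.setdefault(int(label), []).append(lang)
```

Because the split depends on both the edge weights and the global connectivity captured by the Laplacian, a weak bridge between two dense groups is where the cut falls.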
The reward function combines empirical task performance and structural graph properties and needs a mathematical formulation for precise explanation, as follows.
Let $G = (V, E, W)$ denote the language molecule (or graph), where $V$ is the set of languages, $E$ the set of edges, and $W = \{w_{ij}\}$ the edge weights representing transfer strength between languages $i$ and $j$, instantiated with the linguistic similarity between $i$ and $j$.
The reward function at step t is defined as a weighted combination of three normalized components:
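$$R_t = \lambda_1 \cdot \hat{R}_{\text{emp}}(G_t) + \lambda_2 \cdot \hat{R}_{\text{spec}}(G_t) + \lambda_3 \cdot \hat{R}_{\text{clust}}(G_t),$$
where $\lambda_1, \lambda_2, \lambda_3 \in [0, 1]$ are weighting coefficients such that $\lambda_1 + \lambda_2 + \lambda_3 = 1$.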
  • Empirical Performance Reward.
The empirical reward measures the average task performance across edges:
$$R_{\text{emp}}(G) = \frac{1}{|E|} \sum_{(i,j) \in E} w_{ij} \cdot \mathrm{Perf}(i, j),$$
where $\mathrm{Perf}(i, j)$ is the task-specific performance (e.g., F1 score) obtained by training on the merged dataset of languages $i$ and $j$, and $w_{ij}$ is the current edge weight.
  • Spectral Transfer Reward.
The spectral reward is defined using the Laplacian matrix L of the graph:
$$R_{\text{spec}}(G) = \lambda_2(L) - \lambda_1(L),$$
where $\lambda_1(L)$ and $\lambda_2(L)$ are the largest and second-largest eigenvalues of $L$, respectively. This term captures the global connectivity and information flow in the graph.
  • Clustering Reward.
The clustering reward is defined as the average clustering coefficient:
$$R_{\text{clust}}(G) = \frac{1}{|V|} \sum_{i \in V} C_i,$$
where $C_i$ is the local clustering coefficient of node $i$.
  • Normalization.
Each reward component is normalized using min–max normalization:
$$\hat{R}_k(G) = \frac{R_k(G) - \min(R_k)}{\max(R_k) - \min(R_k)}, \quad k \in \{\text{emp}, \text{spec}, \text{clust}\}.$$
  • Weight Selection.
In our experiments, we use $\lambda_1 = \lambda_2 = \lambda_3 = 1/3$, giving equal importance to empirical task performance and to the global and local structural properties of the graph. The effect of different settings can be investigated in the future.
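The three reward components and their combination can be sketched with NumPy as follows. This is an illustrative reading of the definitions above under stated assumptions, not the authors' code: the spectral term is taken as the non-negative gap between the two largest Laplacian eigenvalues (the sign convention is an assumption), the clustering coefficient is the unweighted one, and the min–max bounds are passed in by the caller because the text does not say over which set of values they are computed. All function names are hypothetical.

```python
from typing import Dict, Tuple

import numpy as np


def empirical_reward(W: np.ndarray, perf: np.ndarray) -> float:
    """Mean of w_ij * Perf(i, j) over the edges present in the graph."""
    i, j = np.triu_indices_from(W, k=1)   # each undirected pair once
    present = W[i, j] > 0
    if not present.any():
        return 0.0
    return float(np.mean(W[i, j][present] * perf[i, j][present]))


def spectral_reward(W: np.ndarray) -> float:
    """Gap between the two largest eigenvalues of L = D - W (assumed sign)."""
    laplacian = np.diag(W.sum(axis=1)) - W
    eig = np.linalg.eigvalsh(laplacian)   # ascending order
    return float(eig[-1] - eig[-2])


def clustering_reward(W: np.ndarray) -> float:
    """Average unweighted local clustering coefficient."""
    A = (W > 0).astype(float)
    np.fill_diagonal(A, 0.0)
    degree = A.sum(axis=1)
    triangles = np.diag(A @ A @ A) / 2.0          # triangles through each node
    possible = degree * (degree - 1) / 2.0        # neighbor pairs per node
    coeff = np.divide(triangles, possible,
                      out=np.zeros_like(degree), where=possible > 0)
    return float(coeff.mean())


def min_max(value: float, lo: float, hi: float) -> float:
    """Min-max normalization; returns 0 when the range is degenerate."""
    return 0.0 if hi <= lo else (value - lo) / (hi - lo)


def total_reward(W: np.ndarray, perf: np.ndarray,
                 bounds: Dict[str, Tuple[float, float]],
                 lambdas: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three normalized components."""
    W = np.asarray(W, dtype=float)
    perf = np.asarray(perf, dtype=float)
    raw = {
        "emp": empirical_reward(W, perf),
        "spec": spectral_reward(W),
        "clust": clustering_reward(W),
    }
    return sum(lam * min_max(raw[name], *bounds[name])
               for lam, name in zip(lambdas, ("emp", "spec", "clust")))
```

For a three-node path graph with unit weights and a uniform pairwise score of 0.6, the raw components are 0.6 (empirical), 2.0 (spectral gap between eigenvalues 3 and 1), and 0.0 (no triangles).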
To increase the reproducibility and clarity of our approach, in addition to the reward functions, the state vector observed by the RL agent requires a mathematical formulation. At each step, the RL agent observes the current value of the state vector and takes action accordingly. For each language (node), the following features are computed: (1) node degree (number of connected edges), (2) average empirical transfer performance with its neighbors, (3) local clustering coefficient, and (4) contribution to graph connectivity (approximated by algebraic connectivity). The state vector is constructed from node-level and graph-level properties, which capture both local and global structural properties.
These features are computed for all nodes and concatenated into a single vector as follows:
$$s_t = \operatorname{concat}\left[ d_i, \bar{p}_i, C_i, \kappa_i \right]_{i=1}^{N},$$
where $d_i$ is the degree, $\bar{p}_i$ is the mean transfer performance, $C_i$ is the clustering coefficient, and $\kappa_i$ represents connectivity-related features for node $i$. This state representation captures both local and global structural properties of the language graph.
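A minimal sketch of how such a state vector could be assembled is given below. The data layout (an adjacency dictionary and a pair-keyed score dictionary) and the function name are assumptions made for the example. The connectivity feature is supplied by the caller, since the text only says it is approximated by algebraic connectivity.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Set


def state_vector(languages: Sequence[str],
                 neighbors: Dict[str, Set[str]],
                 perf: Dict[FrozenSet[str], float],
                 kappa: Dict[str, float]) -> List[float]:
    """Concatenate [d_i, p_bar_i, C_i, kappa_i] for every language i.

    neighbors maps each language to its adjacent languages, perf maps an
    unordered language pair to its task score, and kappa holds the
    connectivity-related feature per language (hypothetical inputs).
    """
    state: List[float] = []
    for lang in languages:
        nbrs = neighbors[lang]
        degree = len(nbrs)
        # Mean task score over the edges incident to this language.
        mean_perf = (sum(perf[frozenset((lang, n))] for n in nbrs) / degree
                     if degree else 0.0)
        # Local clustering coefficient: linked neighbor pairs / possible pairs.
        linked = sum(1 for a, b in combinations(sorted(nbrs), 2)
                     if b in neighbors[a])
        possible = degree * (degree - 1) / 2
        clustering = linked / possible if possible else 0.0
        state.extend([float(degree), mean_perf, clustering, kappa[lang]])
    return state


# Hypothetical three-language graph: es is linked to fr and hi.
langs = ["es", "fr", "hi"]
nbrs = {"es": {"fr", "hi"}, "fr": {"es"}, "hi": {"es"}}
scores = {frozenset(("es", "fr")): 0.9, frozenset(("es", "hi")): 0.5}
kappa = {lang: 1.0 for lang in langs}  # algebraic connectivity of this path graph
s = state_vector(langs, nbrs, scores, kappa)  # length 4 * N = 12
```

The vector has a fixed length of 4N for N languages, which keeps the observation shape constant even as edges are added or removed.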
The computational burden of RL optimization raises concerns regarding the scalability and applicability of the proposed method in large multilingual settings. However, the RL optimization is applied only once as an offline procedure to construct the language graph, which can further be reused across multiple downstream experiments, without incurring additional RL costs. This RL cost is one-time, comparable to other meta optimization steps in hyperparameter optimization or architecture search that are commonly used in modern NLP systems. Moreover, the caching and reusing of evaluations for the previously computed edges avoids redundant computation, improving the scalability of our approach for multilingual settings, which involve a larger number of languages. Also, during optimization of the language graph, the RL agent takes actions to remove edges with weaker transfer relationships, reducing the effective edges and implicitly reducing the computational overhead as the number of languages increases.
The use of an RL agent introduces computational overhead, which calls for a complexity analysis. The number of language pairs and the number of reinforcement learning steps are the primary determinants of the cost. In the worst case, for our undirected graph, the number of possible edges grows quadratically, rather than exponentially, with the number of languages: if $N$ is the number of languages, the number of edges grows as $O(N^2)$. However, in practice, edge removal during the optimization process and the caching of previously computed edge values significantly reduce the effective search space, as the RL policy focuses on connections with high transfer strength. The overall complexity can be approximated as $O(T \cdot E_{\text{eff}} \cdot C_{\text{eval}})$, where $T$ is the number of RL steps, $E_{\text{eff}}$ is the number of effectively explored edges, and $C_{\text{eval}}$ is the cost of model evaluation.
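The effect of caching on the number of costly evaluations can be sketched as follows. The class name, the choice to key the cache by unordered language pair, and the dummy scoring function are assumptions for illustration. In practice, the wrapped function would train and validate a model on the merged data of the two languages.

```python
from typing import Callable, Dict, FrozenSet


def max_edges(n_languages: int) -> int:
    """Number of possible undirected edges, N(N - 1) / 2."""
    return n_languages * (n_languages - 1) // 2


class CachedPairEvaluator:
    """Memoize an expensive pairwise evaluation by unordered language pair."""

    def __init__(self, evaluate: Callable[[str, str], float]) -> None:
        self._evaluate = evaluate
        self._cache: Dict[FrozenSet[str], float] = {}
        self.calls = 0  # number of real (uncached) evaluations

    def __call__(self, a: str, b: str) -> float:
        key = frozenset((a, b))
        if key not in self._cache:
            self.calls += 1
            first, second = sorted(key)   # canonical order for the backend
            self._cache[key] = self._evaluate(first, second)
        return self._cache[key]


# Dummy stand-in for training and validating on a merged pair (hypothetical).
perf = CachedPairEvaluator(lambda a, b: 0.5)
for _ in range(100):          # many RL steps revisit the same edge
    perf("es", "fr")
    perf("fr", "es")
# perf.calls is 1: the pair was evaluated once and reused afterwards.
```

With this scheme, the number of real evaluations is bounded by the number of distinct pairs ever touched, at most 300 for 25 languages and 66 for 12, regardless of the number of RL steps.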
The number of languages may be in the hundreds in large (or very large) multilingual settings. In such cases, because the edges in the graph are initialized with linguistic similarity, our approach can be further enhanced by filtering out less valuable edges before the main optimization starts. Additionally, each edge evaluation process is inherently independent, allowing parallel execution that can be exploited in modern multi-GPU environments. This makes our approach still scalable even in extreme cases, despite the quadratic growth in the number of possible connections.
As in most existing studies on language clustering, we test the effectiveness of our method on the NER task. A comparison on the POS tagging task is also performed using a more diverse set of languages. Furthermore, our approach is compared with recent language clustering baselines, namely the linguistic-based approach [20] and the embedding-based approach [19] (which has already been outperformed by the linguistic-based approach), within the same experimental setup used to evaluate the effectiveness of our method.
To instantiate the language molecule with a linguistic prior, the distances between any two connected languages are calculated using five feature vectors from the URIEL typological database [28] and one learned data-driven language vector [29]. The values of these vectors are queried, and the corresponding language distance is calculated using cosine similarity, with the Lang2vec library (https://github.com/antonisa/lang2vec, accessed on 7 February 2026). The five typological features (‘genetic’, ‘geographic’, ‘syntactic’, ‘inventory’, ‘phonological’) and one learned vector (‘featural’) are utilized. The average complement of the language distance over these features is used as the linguistic similarity (the initial edge value) for each language pair in the initial language molecule. The complement is taken because small distances, as returned directly by the library, should translate into high rewards, in keeping with RL maximization.
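The "average complement" step can be sketched as below. The sketch deliberately takes the six per-feature distances as plain input rather than calling the Lang2vec library, and it assumes the distances lie in [0, 1]. The function name and the example distance values are hypothetical.

```python
from statistics import mean
from typing import Mapping

FEATURES = ("genetic", "geographic", "syntactic",
            "inventory", "phonological", "featural")


def initial_edge_weight(distances: Mapping[str, float]) -> float:
    """Average of (1 - distance) over the six feature types.

    distances maps each feature name to the precomputed distance between
    two languages, assumed to lie in [0, 1].
    """
    missing = [f for f in FEATURES if f not in distances]
    if missing:
        raise KeyError(f"missing distances for: {missing}")
    return mean(1.0 - distances[f] for f in FEATURES)


# Hypothetical distances for one language pair (not real URIEL values).
example = {"genetic": 0.2, "geographic": 0.4, "syntactic": 0.6,
           "inventory": 0.0, "phonological": 1.0, "featural": 0.8}
weight = initial_edge_weight(example)  # mean of 0.8, 0.6, 0.4, 1.0, 0.0, 0.2
```

Averaging the complements with equal weight treats all six feature types as equally informative priors. The RL stage is then free to revise the resulting edge values.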
As in the baseline papers, which implement both the linguistic and the embedding-based approaches on the NER task, we use the same setup for a fair comparison. The NER dataset covers three named-entity types, LOC (location), PER (person), and ORG (organization), for the selected 25 Indo-European languages from the WikiANN multilingual dataset https://huggingface.co/datasets/unimelb-nlp/wikiann (accessed on 20 January 2026) [30]. A pre-trained XLM-RoBERTa-base https://huggingface.co/xlm-roberta-base (accessed on 2 February 2026) model [2] is employed. The average F1 score of the three named entities is calculated using the Seqeval framework [31] and used to evaluate the performance of the NER task, both during the clustering and evaluation steps, with the random seed set to 42. The same set of languages as in the baseline studies is used, i.e., 25 languages from 6 families of Indo-European languages: Romance (ro, fr, es, pt, it, scn), Germanic (af, nl, de, is, en, da, no, fo), Greek (el), Slavic (bg, pl, ru, sl, hr), Indo-Iranian (ps, mr, hi), and Celtic (cy, ga). A list of language codes and the corresponding language names can be found in the ISO 639-1 standard https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes (accessed on 2 March 2026).
The standard approach in data-driven language clustering approaches, particularly the embedding-based method, uses a portion of the training and validation data for language clustering, followed by the use of the full dataset to train and evaluate the resulting clusters. Following this common practice, we used 1000 and 10,000 samples from each language to train our language clustering method. If for a language, the size of the available data is smaller than the limits (1000 or 10,000 samples), the full data is used.
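The per-language sample limit can be illustrated with a short sketch. The helper name `cap_samples` and the use of a fixed seed for subsampling are assumptions made for illustration; the text only states that the full data is used when a language has fewer samples than the limit.

```python
import random

def cap_samples(samples, limit, seed=42):
    """Return at most `limit` samples; keep everything if the language is smaller."""
    if len(samples) <= limit:
        return list(samples)
    # Assumption: a seeded random subsample without replacement.
    return random.Random(seed).sample(samples, limit)

# Stand-ins for a low-resource and a high-resource language.
low_resource = cap_samples(list(range(100)), limit=1000)
high_resource = cap_samples(list(range(20000)), limit=1000)
```

Under this scheme, a language with 100 sentences contributes all of them in both the 1000 and the 10,000 settings, while larger languages are truncated to the respective limit.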
As in the baseline study, the experiments were run three times, and the average F1 score is reported to reduce the effect of randomness. The F1 score is the standard metric for NER tasks because it balances precision and recall. This balance matters when both false positives and false negatives occur, as in sequence labeling models like ours. Hence, the F1 score alone provides a sufficient measure for our work, which evaluates cross-lingual transfer across different numbers of clusters. It is also consistently adopted in prior NER works.
While additional metrics such as precision and recall can be reported separately, they are directly reflected in the F1 score and typically follow the same trends. Since our primary focus is on comparing cross-lingual transfer effectiveness across different clustering strategies, the F1 score alone provides a sufficient and consistent basis for evaluation, as also adopted in prior multilingual NER studies.
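For reference, the relation between precision, recall, and F1 can be written as a small sketch. The entity counts below are hypothetical, and averaging the three per-entity scores with equal weight is an assumption about how the "average F1 score of the three named entities" is formed; in practice the Seqeval framework computes these scores from predicted and gold label sequences.

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical (true positive, false positive, false negative) counts per entity type.
counts = {"LOC": (80, 20, 20), "PER": (90, 10, 30), "ORG": (50, 50, 0)}
per_entity = {name: f1_score(*c) for name, c in counts.items()}
average_f1 = sum(per_entity.values()) / len(per_entity)
```

Because F1 drops whenever either precision or recall drops, a change in one of the two is visible in the single reported score.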
In addition, the batch size, maximum input sequence length, learning rate, and number of epochs are set to 32, 512, 5 × 10⁻⁵, and 3, respectively. Different experiments were conducted by varying the number of clusters, i.e., 2, 3, 4, and 5. The experiments are conducted on an NVIDIA GPU, taking approximately 1.5 days for the RL optimization.
The NetworkX library [32] for Python 3.12 is used to implement the language molecule as a graph. It is also used to calculate the reward and the structural characteristics of the language molecule. The state-of-the-art RL algorithm, Proximal Policy Optimization (PPO), is taken from the Stable-Baselines package [33]. The optimization is run for a total of 20,000 steps with a policy update every 512 steps. The implementation code will be made publicly available upon acceptance of the manuscript to ensure full reproducibility.
For most of the hyperparameters, we used the default PPO configuration provided by Stable-Baselines [33], as it is robust and works in different tasks and configurations. The main hyperparameters used in our experiments are summarized as follows:
  • Learning rate: 3 × 10⁻⁵;
  • Discount factor ( γ ): 0.99;
  • Clipping parameter ( ϵ ): 0.2;
  • Total number of optimization steps: 20,000;
  • Number of steps per update: 512;
  • Batch size: 32;
  • Number of epochs: 3.
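The step budget implied by these settings can be worked out with a few lines of arithmetic. This sketch assumes a single environment and the usual PPO scheme in which each rollout of 512 steps is split into minibatches of 32 and reused for 3 epochs; the variable names are illustrative, and a given library may round the final partial rollout differently.

```python
TOTAL_STEPS = 20_000      # total optimization steps stated in the text
STEPS_PER_UPDATE = 512    # environment steps collected before each policy update
BATCH_SIZE = 32           # minibatch size
EPOCHS = 3                # passes over each rollout

# Complete rollouts that fit into the step budget.
full_updates = TOTAL_STEPS // STEPS_PER_UPDATE
# Minibatches per pass over one rollout, and gradient steps per policy update.
minibatches_per_epoch = STEPS_PER_UPDATE // BATCH_SIZE
gradient_steps_per_update = minibatches_per_epoch * EPOCHS
```

Under these assumptions, the budget corresponds to 39 complete policy updates, each consisting of 48 gradient steps.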
Unlike previous studies that used agglomerative clustering, spectral clustering is employed in this study for its advantage in clustering graphs. Agglomerative clustering relies on pairwise distances between local nodes or languages calculated greedily, and assumes spherical or compact graph shapes. In contrast, spectral clustering aligns with graph partitioning theory and does not assume any shape of the graph. It works well even if the graph is noisy or sparse. Moreover, it considers the global connectivity of the graph, enabling the discovery of language communities even when their members are located far apart in the graph. Again, the NetworkX package is utilized to cluster the optimal language molecule.
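To make the idea of spectral partitioning concrete, the following sketch splits a small weighted graph into two groups by the sign of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the graph Laplacian). It is a simplified two-way variant written in plain Python with power iteration, not the k-way clustering pipeline used in the experiments; the toy graph, its weights, and the function name are hypothetical, and the graph is assumed to be connected.

```python
import math

def spectral_bipartition(n, weighted_edges, iterations=200):
    """Split nodes 0..n-1 into two groups by the sign of the Fiedler vector."""
    # Weighted adjacency matrix W and degrees; the Laplacian is L = D - W.
    W = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        W[i][j] = W[j][i] = w
    degree = [sum(row) for row in W]
    # Laplacian eigenvalues lie in [0, 2 * max degree]. Power iteration on
    # (shift * I - L), with the constant vector projected out, therefore
    # converges to the eigenvector of the second-smallest eigenvalue of L.
    shift = 2.0 * max(degree)
    v = [float(i) for i in range(n)]
    for _ in range(iterations):
        mean = sum(v) / n
        v = [x - mean for x in v]
        Lv = [degree[i] * v[i] - sum(W[i][j] * v[j] for j in range(n))
              for i in range(n)]
        v = [shift * v[i] - Lv[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    group_a = frozenset(i for i in range(n) if v[i] >= 0)
    group_b = frozenset(range(n)) - group_a
    return {group_a, group_b}

# Toy graph: two tightly connected triples joined by one weak edge.
toy_edges = [(0, 1, 0.9), (0, 2, 0.9), (1, 2, 0.9),
             (3, 4, 0.9), (3, 5, 0.9), (4, 5, 0.9),
             (2, 3, 0.1)]
clusters = spectral_bipartition(6, toy_edges)
```

The partition depends on the global structure of the weighted graph rather than on greedy pairwise merges, which is the property the text contrasts with agglomerative clustering.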
Clarification on how the dataset is allocated among the different phases of the proposed method, namely graph optimization, validation, and final testing, is important. First, a subset of the training data (e.g., 1000 or 10,000 samples) is used to construct the language graph using RL-based optimization. Then, after spectral clustering is applied on the language graph, the resulting clusters are evaluated by training a model where data for languages in each cluster is merged from a held-out validation set to assess transfer. Finally, the full test set is utilized separately for evaluating the model and calculating the task performance.
Generally, the experiment is conducted on six cases: (1) full dataset (or all samples) used for evaluating clusters formed using 10,000 samples (‘full;cls-10,000’); (2) full dataset used for evaluating clusters formed using 1000 samples (‘full;cls-1000’); (3) 10,000 samples used for evaluating clusters formed using 10,000 independently sampled samples (‘10,000;cls-10,000’); (4) 1000 samples used for evaluating clusters formed using 1000 independently sampled samples (‘1000;cls-1000’); and (5, 6) the two cases of the baseline study [20], in which the full dataset is used to evaluate clusters formed linguistically based on morpho-syntactic features in the nominal parameter (‘baseline_noun’) and the head-directionality parameter (‘baseline_head’). The cases ‘10,000;cls-10,000’ and ‘1000;cls-1000’ are introduced to compare the methods under the relatively high-resource and the low-resource conditions, respectively, and to be consistent with the evaluation setup of the data-driven embedding-based clustering method [19]. ‘cls’ stands for clustering in the aforementioned abbreviated cases.
In addition to the linguistic-based baseline, the result from the above experiments is compared with the embedding-based language clustering approach that incorporates the emb_1000 (which uses 1000 samples) and emb_10,000 (which uses 10,000 samples) settings. The proposed method is also validated on the POS tagging task, using the UDPOS dataset, to test its effectiveness on other tasks and typologically diverse languages.

4. Results and Discussion

A comparison is made between our proposed RL-driven, graph-based clustering method and the baselines using the WikiANN dataset, i.e., the linguistic clustering [20] and the embedding-based methods [19]. All approaches are evaluated from the perspective of cross-lingual transfer quality by their ability to improve transfer and reduce interference under varying multilingual setups. Although the linguistic-based approach is found to outperform the embedding approach in most languages [20], we compare our approach with both approaches under the same setup. The full list of evaluation results is shown in Appendix A (for the linguistic-based baseline) and Appendix B (for the embedding-based baseline). The results of the comparison on the POS tagging task are presented at the end.

4.1. Graphical Analysis of the Language Molecules

As mentioned above, considering languages as atoms and chemical bonds as the strength of the relationship, the proposed method attempts to obtain language molecules with desired characteristics using RL. More specifically, two language molecules are constructed, one using 1000 samples (Figure 2) and one using 10,000 samples (Figure 3). The language molecule of 10,000 samples has a larger number of connections than the one constructed from 1000 samples. In addition to the visual information, this is confirmed by graph metrics such as the total number of edges (233 for the 1000-sample graph and 292 for the 10,000-sample graph), the density, i.e., the proportion of edges out of the possible total (0.78 and 0.97, respectively), and the average number of edges per language (or node), i.e., the average degree of the molecule (18.64 vs. 23.36). See Table 1 for a detailed analysis.
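The density and average degree reported above follow directly from the number of nodes (25 languages) and the edge counts, as the short sketch below shows. The function names are illustrative; the formulas are the standard ones for an undirected graph without self-loops.

```python
def density(num_nodes, num_edges):
    """Fraction of the possible undirected edges that are present."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

def average_degree(num_nodes, num_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * num_edges / num_nodes

NUM_LANGUAGES = 25
metrics = {
    "1000-sample graph": (density(NUM_LANGUAGES, 233), average_degree(NUM_LANGUAGES, 233)),
    "10,000-sample graph": (density(NUM_LANGUAGES, 292), average_degree(NUM_LANGUAGES, 292)),
}
```

With 25 nodes there are 300 possible edges, so 233 and 292 edges correspond to densities of about 0.78 and 0.97 and average degrees of 18.64 and 23.36.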
In our approach, edge weights are formed based on the performance of the joint dataset for the connected pair of languages. Since more samples can improve this performance and thus raise the edge weight, a larger sample size encourages the retention or addition of more edges. Moreover, noisy or random connections are smoothed out as more data is used, resulting in a more accurate representation. Furthermore, more data allows for a higher transfer of knowledge from neighbors, resulting in a more confident and viable connection. In this section, a graphical analysis of the language molecule is conducted from the perspective of transfer and interference. All metrics are calculated in an edge-weighted manner to incorporate the effect of edges, our primary graph attribute, on the analysis output.
The average edge or weight value indicates the average transfer quality among languages. Larger values indicate high transfer strength (because the RL algorithm rewards higher values during optimization). Our approach shows significant transfer support, with the 10,000 sample slightly improving the transfer quality. In addition to the average edge weights, the average standard deviation indicates a consistent pattern among languages (showing consistent transfer across all languages). This is more pronounced in the 10,000-sample graph. Entropy calculates the variability (or diversity) of edge weights in the graph, i.e., how varied the strength of transfer relationships is between pairs of languages. Higher entropy indicates a uniform spread of edge weights or transfer of knowledge with no dominance, while a low value shows that a few edges dominate the transfer, while the rest are weak or not participating actively. Both graphs have moderate entropy, indicating good coverage, especially important for low-resource languages, which is again slightly more pronounced in the 10,000-sample graph.
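One common way to quantify how evenly transfer strength is spread over the edges is the normalized Shannon entropy of the edge weights. The sketch below is an assumption about the form of the measure, since the exact formula is not given in this section; the function name and the example weights are hypothetical.

```python
import math

def normalized_weight_entropy(weights):
    """Shannon entropy of the positive edge weights, scaled to the range [0, 1]."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

# Evenly spread transfer versus a single dominating edge (illustrative values).
uniform = normalized_weight_entropy([0.5, 0.5, 0.5, 0.5])
dominated = normalized_weight_entropy([0.97, 0.01, 0.01, 0.01])
```

A value near 1 corresponds to the "no dominance" case described above, whereas a value near 0 indicates that a few edges carry almost all of the transfer.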
The clustering coefficient measures the tendency to form local triangles in language neighborhoods. A value of 0 indicates a tree-like structure with no triangle formation (graph is too sparse; low-resource languages may be isolated), whereas a value of 1 indicates a perfect clustering (possible redundancy or over-specialization) in which every neighbor of a node is connected. Higher values indicate formation of local transfer communities or, more specifically, a significant transfer across related languages or clusters. Our graph has a moderate (0.40 for the 1000-sample graph) and better (0.56 for a 10,000-sample graph) value for the clustering coefficient.
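The two extreme values of the clustering coefficient can be reproduced with a minimal sketch. For brevity it uses the unweighted definition, whereas the analysis in this section is edge-weighted; the adjacency dictionaries and function names are illustrative.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, node) for node in adj) / len(adj)

# A triangle (every neighbor pair connected) and a star (tree-like, no triangles).
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
star = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
triangle_cc = average_clustering(triangle)
star_cc = average_clustering(star)
```

The triangle yields the maximum value and the star yields 0, matching the interpretation of the two extremes given in the text.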
The betweenness of a node measures how often it lies on paths between other pairs of nodes. A value of 0 shows that the language is non-bridging or peripheral, which makes it less important for transferring knowledge from one part of the graph to the other. A value of 1 shows that the language is a central hub or bridge passage for information among the other languages. If it were removed, it would highly disconnect the rest of the graph. In both graphs, ‘ps’ serves as the main connector (with a betweenness value of 0.1304 that is significantly greater than all the others). When tested on actual data (Section 4.2), ‘ps’ shows a significant performance gain (35% difference from the baseline) over others because, being a bridge passage or a hub, it receives and transfers useful representations from multiple directions, exposing itself to diverse and rich contexts. This avoids overfitting and enables high performance on the downstream task. This is a proper example that the graph centrality metric can predict the data-driven downstream success. The 1000-sample graph has a few other bridge passage languages, but with a lower betweenness value. In the case of the 10,000-sample graph, most languages have a 0 value because of the high connectivity of the graph.
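The notion of a bridge language can be illustrated with a compact implementation of Brandes' algorithm for normalized betweenness. This sketch treats the graph as unweighted for simplicity, whereas the reported values are edge-weighted; the example graphs and names are hypothetical.

```python
from collections import deque

def betweenness(adj):
    """Normalized shortest-path betweenness for an unweighted, undirected graph."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    n = len(adj)
    # Every undirected pair is visited from both endpoints, so dividing by
    # (n - 1)(n - 2) maps the scores to the range [0, 1].
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: score[v] * scale for v in adj}

# A path a-b-c (b is the only bridge) and a fully connected graph (no bridges).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
complete = {i: {j for j in range(4) if j != i} for i in range(4)}
path_scores = betweenness(path)
complete_scores = betweenness(complete)
```

In the fully connected example every node scores 0, which mirrors the observation that most languages in the dense 10,000-sample graph have zero betweenness.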
Assortativity uses some attribute of a node to measure how much a node connects to other similar nodes. The data size of the languages is used as an attribute to measure its resource level. The value 1 indicates that languages are connected to similar languages (high resource to high resource, low resource to low resource), 0 indicates no or random preference, and −1 indicates that languages are connected to dissimilar counterparts (high to low and vice versa). Both graphs show slightly negative assortativity (−0.057 for the 1000-sample graph, −0.037 for the 10,000-sample graph), indicating a desirable property: high-resource languages help low-resource languages. This asserts that knowledge transfer can be enabled by the graph structure.
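Numeric attribute assortativity can be sketched as the Pearson correlation between the attribute values at the two ends of every edge, with each undirected edge counted in both directions. The example graphs and data sizes below are hypothetical, and the function name is illustrative.

```python
import math

def attribute_assortativity(edges, attr):
    """Pearson correlation of a numeric node attribute across edge endpoints."""
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected edge in both directions.
        xs += [attr[u], attr[v]]
        ys += [attr[v], attr[u]]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when all endpoints share one value
    return cov / math.sqrt(var_x * var_y)

# One high-resource hub linked to three low-resource languages (disassortative).
star_edges = [("hub", "l1"), ("hub", "l2"), ("hub", "l3")]
star_sizes = {"hub": 20000, "l1": 100, "l2": 100, "l3": 100}
mixed = attribute_assortativity(star_edges, star_sizes)

# Languages linked only to partners of the same size (assortative).
pair_edges = [("a", "b"), ("c", "d")]
pair_sizes = {"a": 100, "b": 100, "c": 20000, "d": 20000}
matched = attribute_assortativity(pair_edges, pair_sizes)
```

The hub example gives a strongly negative value because every edge joins a large language to a small one, which is the high-to-low connection pattern that a negative assortativity signals.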

4.2. Performance Analysis on the NER Task

4.2.1. Comparison with Linguistic-Based Clustering Baseline

Clusters are formed from the two language molecules using spectral graph clustering, and their performance is evaluated in the NER task. The performance for each number of clusters of our method and of the linguistic-based baseline is shown in Figure 4, Figure 5, Figure 6 and Figure 7. The figures and Appendix A indicate that the performance of the clusters formed by our graphical approach is significantly better than the performance of the clusters formed by the linguistic baseline in all languages, across different numbers of clusters. The language with the highest performance gain is ‘ps’, an extremely low-resource language with only 100 samples. In the ‘full;cls-1000’ condition with two clusters, it achieved an F1 improvement of more than 35 percentage points over the baseline_noun model (from 53.97% to 91.55%, a gain of 37.58 points).
Additionally, the performance gain from our approach is more pronounced for low-resource languages than for high-resource ones. Compared to the linguistic baseline, there is a 17.8% improvement in F1 score for low-resource languages (each having 100 samples), whereas, for the other extreme, the relatively high-resource languages (20,000 samples), an improvement of 5.6% and 5.4% has been observed for ‘full;cls-10,000’ and ‘full;cls-1000’ conditions, respectively, with five clusters. This is clearly indicated by Figure 8, Figure 9, Figure 10 and Figure 11. The plots also show a negative correlation between the performance gain from our approach and the data size, i.e., the performance gain increases as the data size decreases. Even the high-resource languages benefit from the cross-lingual transfer in our approach. This relationship holds consistently, for both low-resource and high-resource conditions, across the different numbers of clusters considered in our experiment.
We also compare the improvement (over the baseline) obtained by the four experimental setups. A minor improvement is seen between the ‘full;cls-1000’ and ‘full;cls-10,000’ conditions, with the ‘full;cls-10,000’ condition showing a slightly better improvement most of the time. Furthermore, the improvement by the ‘full;cls-10,000’ condition consistently outperforms the ‘10,000;cls-10,000’ condition (except when clusters are 2 and 3, and data size is 0.1k), which by itself shows a consistent and better improvement across the different clusters and data sizes (except when clusters are 5 and data size is 0.1k). This shows the importance of using a larger number of training samples during both the formation of clusters and the downstream task. These variations in improvement become more visible as the data size increases (from left to right in the bar charts across all clusters) because the effect of threshold values (1000 and 10,000 limit) becomes more visible.

4.2.2. Comparison with Embedding-Based Clustering Baseline

The previous discussion compares our approach with the linguistics-based clustering baseline. Regarding the embedding-based baseline, two additional settings are used to compare with our graphical approach, emb_1000 and emb_10,000. The former uses 1000 samples to form clusters, while the latter uses 10,000 samples. As in the linguistic-based comparison, the performance for each cluster from our method is compared with the embedding-based clustering baseline. This is shown in Figure 12 when the number of clusters is 2, along with the detailed output shown in Appendix B for all numbers of clusters. The results show that our graphical approach also outperforms embedding-based clustering, and ‘ps’, an extremely low-resource language with only 100 samples, again achieves the greatest improvement. For instance, in our ‘full;cls-1000’ condition, this language shows an F1 score improvement of over 35 percentage points when the number of clusters is 2 (from 55.31% to 91.55%, a gain of 36.24 points, for emb_1000, and from 55.92% to 91.55%, a gain of 35.63 points, for emb_10,000).
Also, in the case of the embedding-based baseline, the performance gain is much larger in low-resource languages than in their high-resource counterparts. For instance, in both ‘full;cls-10,000’ and ‘full;cls-1000’ conditions, when the number of clusters is 5, there are 18.8% and 17.9% improvements (for emb_1000 and emb_10,000 settings, respectively) in F1 score for low-resource languages (each having 100 samples). On the other hand, for the relatively high-resource languages (20,000 samples), a smaller improvement is recorded, i.e., 5.8% and 5.7% for ‘full;cls-10,000’ and ‘full;cls-1000’ conditions, respectively, with five clusters in the emb_1000 baseline. This is shown in Figure 13 for the emb_1000 baseline when the number of clusters is 2, and in Appendix B for all cluster numbers.
The same trend is observed for the other emb_10,000 setting, as demonstrated by Figure 14 and Appendix B across all clusters. Also, the performance gain increases as the data size decreases, indicating a negative correlation between the performance gain of our approach and the data size. Similarly to the improvement obtained for the linguistic-based clustering baseline, our approach consistently benefits both low-resource and high-resource languages through cross-lingual transfer across different conditions and cluster numbers considered in our experiment. Generally, the same pattern of improvement is observed by our approach over the linguistic-based and embedding-based baselines.
Generally, the comparison in improvement over the two baselines across the different data sizes shows that our approach benefits low-resource languages significantly more than high-resource ones. The transfer behavior in low-resource languages is enhanced without negative interference (smaller but consistent gains are seen in high-resource languages), suggesting cleaner knowledge transfer. Furthermore, significant improvement over the baselines is observed, with minor variations across the different experimental setups (data limits in training data sizes), indicating that the clusters generated by our approach are robust even when a smaller amount of data is used.
Our graphical approach is dynamic and, unlike the embedding-based clustering method, can adapt to different tasks by learning from task-specific datasets. The linguistic clustering is naturally static, not adaptable to varying tasks or data, and assumes that transferability is implied by genealogical proximity. However, this may not always work in practice. For example, in the baseline linguistic-based cluster [20], ‘ps’ and ‘hi’ are grouped under Indo-Iranian languages despite their linguistic differences, such as different scripts and token structures. As shown in Table 2, the proposed graphical clustering approach has successfully learned these differences and therefore groups ‘ps’ with more related groups (such as ‘cy’ and ‘ga’), not with ‘hi’, to avoid script-related interference while still improving transfer, resulting in improved performance. Furthermore, in the linguistic approach, some low-resource languages are diminished by the presence of many other high-resource languages in the same group, for example, ‘scn’ in the Romance group. Again, the graphical cluster systematically re-allocates this language into appropriate groups to improve its performance. The next section analyzes the proposed graphical approach from the perspective of cluster quality.

4.2.3. Statistical Analysis of Results

The graphical clustering achieved over 35% score gain for a low-resource language, specifically for the language ‘ps’. This substantial gain needs statistical validation to assess the robustness of the stated improvements. The mean and standard deviation across the three independent runs are 91.55 and 0.66, respectively. The low variance indicates that the method is robust and provides consistent results across the different runs. Furthermore, the 95% confidence interval is [89.92, 93.17], which is still substantially higher than the baseline (53 to 55, depending on the different baselines and settings). A two-sample t-test yields a p value of 6.15 × 10⁻⁸ (p < 0.01), indicating that the result is statistically significant and consistent and not due to random variation. The same trend is followed by the other languages, confirming that the proposed approach provides stable and reliable improvements.
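The reported interval can be approximately reproduced from the mean, standard deviation, and number of runs using a Student's t interval. This sketch assumes that 0.66 is the sample standard deviation and hard-codes the two-sided 95% critical value for 2 degrees of freedom (about 4.3027); the function name is illustrative.

```python
import math

def t_confidence_interval(mean, std, n, t_critical):
    """Two-sided confidence interval for the mean of n runs."""
    half_width = t_critical * std / math.sqrt(n)
    return mean - half_width, mean + half_width

# Two-sided 95% critical value of Student's t with n - 1 = 2 degrees of freedom.
T_CRITICAL_95_DF2 = 4.3027
low, high = t_confidence_interval(91.55, 0.66, 3, T_CRITICAL_95_DF2)
```

The result is within a few hundredths of the reported bounds; the small difference is consistent with the mean and standard deviation being rounded to two decimals. The interval is wide relative to the standard deviation because only three runs are available, yet its lower end remains far above the baseline range of 53 to 55.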
Moreover, we earlier stated that graph centrality (such as betweenness) and downstream performance are related, highlighting that the substantial gain for ‘ps’ is due to its high centrality value in the language molecule. This also requires a more quantitative and statistical validation for ‘ps’ and the remaining languages. Hence, we compute Pearson and Spearman correlations and make the corresponding scatter plot between the average improvement across the four clusters and the corresponding graph betweenness value. Figure 15 (for improvement over linguistic-based baseline) shows a consistent and high value of the Pearson coefficient for most of the experimental settings in the linguistic-based baseline. This is also confirmed by the trend line approximated by linear regression in the scatter plots.
The same statistical analysis is performed for the improvements over the embedding-based baselines. Figure 16 (for improvement over embedding-based baseline, emb_1000) and Figure 17 (for improvement over embedding-based baseline, emb_10,000) show a similar correlational behavior with linguistic-based baseline improvement.
Generally, as indicated by the high Pearson correlation and linear regression line, the statistical analysis indicates there is a strong positive relationship between graph centrality and downstream performance. Notably, ‘ps’ exhibits the highest betweenness and largest F1 improvement, consistent with our qualitative observation. However, this strong linear relationship does not translate into a consistent ranking across all languages, as indicated by the low Spearman coefficient, which measures rank agreement. This is because, except for some languages such as ‘ps’, many languages have similar betweenness and simultaneously small variation in improvement. This shows that only a small number of languages act as bridging nodes in our graphs, and hence the remaining languages show smaller improvements than ‘ps’.
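The combination of a high Pearson and a low Spearman coefficient can be reproduced on toy data in which one point dominates. The numbers below are entirely hypothetical and chosen only to show the effect; the rank helper assumes there are no tied values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def ranks(values):
    """Ranks starting at 1 (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(xs, ys):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Four near-identical nodes and one strong bridge with a large gain.
toy_betweenness = [0.01, 0.02, 0.03, 0.04, 0.50]
toy_f1_gain = [4.0, 3.0, 2.0, 1.0, 35.0]
r = pearson(toy_betweenness, toy_f1_gain)
rho = spearman(toy_betweenness, toy_f1_gain)
```

Here the single bridging point drives the linear correlation close to 1, while the ordering among the remaining points is unrelated to betweenness, so the rank correlation stays near 0.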

4.3. Cluster Quality Analysis

Language clustering is an interpretable method that allows us to infer (or interpret) why some sets of clusters are better suited than others. The clusters formed by our approach, and those from the baseline linguistic-based clustering, are indicated in Table 2. As mentioned earlier, we observed that linguistic clusters have some weaknesses related to cross-lingual transfer. For instance, some languages that vary in script or orthography are allocated together (e.g., ‘ps’ vs. ‘hi’). Furthermore, languages with different rules of capitalization and compounding are grouped in the same cluster (e.g., the Germanic family).
In the case of ‘full;cls-1000’, the low-resource languages have an extremely small number of sentences (‘scn’, ‘ps’, ‘fo’, ‘ga’), but the high-resource ones are not allowed to pass the 1000 sample limit. This simulates the low-resource setup. When the number of clusters is 2, cluster 1 contains languages that are mixtures of Romance, Germanic, Slavic, and Indo-Iranian families. Cluster 2, on the other hand, contains [‘ro’, ‘fr’, ‘es’, ‘pt’, ‘el’, ‘sl’, ‘ps’, ‘cy’, ‘ga’]. Comparing these clusters with a linguistic one, ‘scn’ (100 samples) is still kept with its closest transfer partner, but ‘ps’ (100 samples) is detached from ‘hi’ to avoid negative interference because of a dissimilar script. ‘cy’ and ‘ga’ are moved from the Celtic family, which contains only two of them, and reallocated to the Romance, where they receive more support from the other languages in their new group. Furthermore, each of these clusters is composed of low- and high-resource languages, enabling a balanced transfer. Our experimental results confirm that the F1 score for ‘ps’ increased from 55.9% to 91.5% (+35.6), ‘scn’: from 76.5% to 86.1% (+9.6), and ‘fo’: from 87.0% to 93.3% (+6.3).
For the ‘full;cls-10,000’ case, there is no difference in dataset sizes for the extremely low-resource case, but for the relatively high-resource languages, dataset sizes are allowed to expand to the 10,000 limit. Still, even for samples below 10,000, interference is mitigated, indicating that low-resource languages are not significantly penalized by the presence of high-resource counterparts. When compared to the linguistic clusters, our method never groups ‘scn’ with Romance languages alone; it is always placed in the same cluster as Germanic languages (e.g., ‘af’ and ‘nl’). Beyond this, ‘ga’ and ‘cy’ are reallocated from the Celtic family (which contains only these two languages) into larger groups, and ‘ps’ is never grouped with ‘hi’. The experimental result reveals that ‘scn’ receives an F1 score of 88.5% (baseline score: 76.5% in the nominal parameter setting), while ‘ps’ scores 91.5% (baseline score: 55.9%).
We can also observe that some language pairs recur in both data size conditions (‘it’ + ‘scn’, ‘fo’ + ‘nl’, ‘cy’ + ‘ga’ + ‘ps’), indicating a robust optimization that keeps consistent language relationships. High-resource languages benefited moderately from our approach under both sample size conditions. This indicates that the graphical representation of the cross-lingual transfer and its structural optimization using RL results in an optimized graph that models cross-lingual transfer in a better way, exposing hidden relationships, especially for low-resource languages. We have mentioned that, when addressing the transfer–interference trade-off, language clustering is heavily influenced by the quality of clusters. The enhancements contributed by our approach are due to the quality of clusters it discovered, such as new support from similar but high-resource languages, or not being overloaded by irrelevant but high-resource languages (‘scn’ vs. ‘fr’ and ‘ps’ vs. ‘hi’ are reallocated).
We also measure the sensitivity of the results obtained from the proposed method to the number of clusters. Hence, the analysis is extended to calculating the average F1 score across all languages for the different numbers of clusters, as depicted in Figure 18. The result shows that the performance improvement remains relatively stable across the different cluster configurations and experimental conditions (with a very minor variation in the ‘10,000;cls-10,000’ condition). The method is robust across a reasonable range of the number of clusters and differences in the construction of the graphs. The proposed RL edge optimization method leads to consistent performance across clusters.

4.4. Generalizability and Ablation Analysis

The above analysis utilizes the WikiANN dataset for the Indo-European languages under the NER task. However, this limits the generalizability of our approach to one language family and one task, motivating us to test the proposed approach on another task with diverse languages. Hence, additional experiments were conducted using 12 languages from the UDPOS dataset https://huggingface.co/datasets/commul/universal_dependencies (accessed on 30 March 2026), on the POS tagging task. This dataset covers more diverse language families, namely Romance (‘fr’, ‘es’, ‘pt’, ‘it’), Germanic (‘en’, ‘de’, ‘nl’), Slavic (‘ru’), Semitic (‘ar’), Sino-Tibetan (‘zh’), Japonic + Koreanic (‘ja’, ‘ko’). These languages include non–Indo-European languages, different scripts, and morphologically rich languages, fulfilling our objective of validation.
We keep an experimental setup similar to that of the NER task (or WikiANN dataset), except for minor changes that simplify the setting for the purpose of validation. Only the simplest case, the ‘1000;cls-1000’ condition, is employed, and for the RL agent, the total number of steps is reduced to 5000, with the number of steps per update set to 128.
The result indicates that the proposed method shows a consistent trend in performance across different tasks, particularly NER and POS (Figure 19). This reveals that the proposed language clustering mechanism is not static, but can adapt itself to the task at hand and is able to capture generalizable cross-lingual transfer patterns.
Furthermore, Figure 20 analyzes performance across languages from diverse families. The proposed method shows strong performance even though the languages are linguistically diverse. This confirms that our method is robust and can generalize across different tasks and languages.
We performed an ablation study on the reward functions used by the RL agent during the optimization of the language graph. We considered three cases for our ablation analysis: (1) the empirical reward, (2) the graph-based reward (the sum of spectral transfer reward and clustering reward), and (3) the combined reward. As shown in Figure 21, although there are minor variations across the different cluster settings, all reward configurations achieved comparable performance. The empirical reward achieved a slightly higher performance, which is expected, as this reward directly optimizes task performance. When this reward is incorporated with the graph-based reward, the performance remains very close but is slightly decreased, especially in some cluster configurations. This has happened because, unlike the empirical reward, the integrated graph-based reward now optimizes not only task performance, but also includes rewards that optimize global and local structural properties, which affect performance indirectly.
The competitive performance from the graph-based reward alone is an indication that structural behaviors alone can guide transfer patterns, which opens up future investigation. The graph-based reward can be considered as a regularizer, allowing a more interpretable and well-structured learning process without a significant effect on the task performance. The narrow gap in performance among the different rewards is an indication that all rewards can encode cross-lingual transfer.

5. Conclusions and Future Work

This study addressed a core issue of multilingual learning: enhancing positive transfer while simultaneously reducing negative interference. The literature has tackled this problem in several ways. Among them, language clustering, which includes linguistic and embedding-based methods, is a promising approach. Linguistic clustering assumes that genealogical proximity implies transferability, while embedding-based clustering presupposes that languages that are close in embedding space transfer knowledge better; neither assumption always holds. Furthermore, these methods are static and not task-aware, so they are difficult to adapt to varying tasks or to correct through feedback, for example when two languages have similar embeddings but still interfere. Moreover, they learn the language clusters from flat-structured knowledge, which leads to suboptimal groupings and therefore to lower performance, because the effectiveness of language clustering for cross-lingual transfer depends heavily on the quality of the resulting clusters.
Inspired by computational chemistry, particularly molecule discovery, we propose a novel approach to language clustering based on a graph representation in which languages are nodes (atoms) and edges are chemical bonds whose weights represent the transfer strength obtained through joint training of the connected languages. After the edge weights are initialized with linguistically informed similarity values, the resulting graph (the language molecule) is optimized by an RL agent, which directly manipulates the edges and weights of the language molecule under a reward function that increases transfer and minimizes interference. The final language clusters are then obtained by graph clustering. The approach is task-aware and dynamic: it operates on a non-flat network structure and corrects itself through a performance-driven feedback mechanism that reflects interference and transfer patterns.
An analysis from three perspectives (graphical, performance-based, and cluster quality), using 25 languages from the WikiANN dataset for the NER task and 12 typologically diverse languages from the UDPOS dataset for the POS tagging task, consistently indicates that our approach outperforms both the linguistic-based and embedding-based clustering baselines. Low-resource languages improved substantially (some by more than 35% in F1 score), and high-resource languages also improved, though to a smaller extent. This indicates that our method mitigates the transfer–interference trade-off by enhancing positive transfer while reducing negative interference. The POS tagging results on the UDPOS dataset show that the improvements are consistent across typologically diverse languages and varying experimental setups. Future experiments can cover other natural language processing tasks, a much larger number of languages, and larger and more varied data sizes to test the scalability of our approach. Our atom–language model, which treats languages as atoms and edges as chemical bonds, opens a research direction that can bring the many related concepts of the physical sciences, such as chemistry and physics, into computational linguistics.
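The pipeline summarized in this paragraph (similarity-initialized graph, edge manipulation, graph clustering) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the similarity values are invented toy numbers, a single hand-picked edge update stands in for the RL agent, and thresholded connected components stand in for the graph clustering step. All function names are hypothetical.

```python
from itertools import combinations


def build_language_graph(languages, similarity):
    """Map each unordered language pair to an initial edge weight."""
    return {frozenset(pair): similarity(*pair) for pair in combinations(languages, 2)}


def apply_edge_action(graph, lang_a, lang_b, delta):
    """One edge-level action: shift a weight, clip to [0, 1], drop the edge at 0."""
    key = frozenset((lang_a, lang_b))
    weight = min(1.0, max(0.0, graph.get(key, 0.0) + delta))
    updated = dict(graph)
    if weight > 0.0:
        updated[key] = weight
    else:
        updated.pop(key, None)
    return updated


def threshold_clusters(languages, graph, threshold):
    """Stand-in clustering: connected components over edges with weight >= threshold."""
    parent = {lang: lang for lang in languages}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for edge, weight in graph.items():
        if weight >= threshold:
            a, b = tuple(edge)
            parent[find(a)] = find(b)
    groups = {}
    for lang in languages:
        groups.setdefault(find(lang), []).append(lang)
    return sorted(sorted(group) for group in groups.values())


# Toy, invented similarity values (not taken from the paper).
TOY_SIMILARITY = {
    frozenset(("da", "no")): 0.9,
    frozenset(("hi", "mr")): 0.8,
    frozenset(("da", "hi")): 0.6,
}


def toy_similarity(a, b):
    return TOY_SIMILARITY.get(frozenset((a, b)), 0.1)


languages = ["da", "no", "hi", "mr"]
graph = build_language_graph(languages, toy_similarity)
before = threshold_clusters(languages, graph, threshold=0.5)
# Pretend feedback showed that da and hi interfere, so their bond is weakened.
graph = apply_edge_action(graph, "da", "hi", delta=-0.4)
after = threshold_clusters(languages, graph, threshold=0.5)
print(before)
print(after)
```

In this toy run, weakening one bond splits a single four-language cluster into two pairs, which is the kind of feedback-driven correction a static clustering cannot make.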

Author Contributions

Conceptualization, B.B., W.M. and S.T.; methodology, B.B. and W.M.; software, B.B.; validation, W.M. and S.T.; formal analysis, B.B. and S.T.; investigation, B.B. and W.M.; resources, B.B.; data curation, B.B. and S.T.; writing—original draft B.B.; writing—review and editing, B.B., W.M. and S.T.; visualization, B.B.; supervision, W.M. and S.T.; project administration, B.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are publicly available. Detailed information and links are provided in the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. F1 Score of NER Evaluations (Linguistic-Based Comparison)

Table A1. F1 score, averaged over three runs, of our method and the linguistic-based baselines across all clusters and experimental setups. Column ‘X, Y’ denotes the setting in which X samples are used to evaluate clusters formed using Y samples. ‘Noun’ and ‘Head’ are the baseline values [20]. NA indicates that a value is not available.
| | | Number of Clusters = 2 | | | | | | Number of Clusters = 3 | | | | | |
| Language | #Train | Full,1k | Full,10k | 1k,1k | 10k,10k | Noun | Head | Full,1k | Full,10k | 1k,1k | 10k,10k | Noun | Head |
| fo | 0.1k | 93.31 (0.20) | 94.35 (0.64) | 92.82 (0.00) | 93.46 (0.94) | 87.01 | 88.21 | 93.57 (0.88) | 92.76 (0.17) | 92.04 (0.16) | 94.17 (0.16) | 86.61 | NA |
| ps | 0.1k | 91.55 (0.66) | 91.27 (0.08) | 90.62 (0.43) | 91.12 (0.68) | 53.97 | 55.92 | 90.65 (0.36) | 90.98 (0.24) | 91.22 (0.13) | 90.91 (0.13) | 55.92 | NA |
| scn | 0.1k | 86.14 (1.41) | 85.29 (0.39) | 87.66 (0.00) | 88.27 (0.89) | 76.54 | 77.04 | 87.05 (0.90) | 87.27 (0.33) | 88.31 (0.36) | 87.45 (0.36) | 80.08 | NA |
| ga | 1k | 93.23 (0.25) | 93.35 (0.08) | 92.94 (0.07) | 93.41 (0.07) | 85.72 | 85.37 | 93.42 (0.13) | 92.70 (0.24) | 92.77 (0.17) | 93.03 (0.17) | 85.72 | NA |
| is | 1k | 95.93 (0.15) | 95.81 (0.12) | 95.57 (0.00) | 95.74 (0.17) | 87.65 | 88.04 | 96.18 (0.13) | 95.84 (0.22) | 95.45 (0.10) | 95.86 (0.10) | 87.54 | NA |
| af | 5k | 97.43 (0.05) | 97.32 (0.10) | 96.09 (0.00) | 97.28 (0.05) | 91.16 | 91.70 | 97.24 (0.12) | 97.37 (0.05) | 95.86 (0.08) | 97.07 (0.08) | 91.19 | NA |
| hi | 5k | 95.48 (0.20) | 94.97 (0.26) | 92.61 (0.00) | 94.67 (0.06) | 90.09 | 86.89 | 94.88 (0.27) | 94.80 (0.48) | 91.58 (0.00) | 94.73 (0.00) | 86.89 | NA |
| mr | 5k | 95.83 (0.14) | 95.79 (0.19) | 93.82 (0.00) | 95.62 (0.09) | 88.34 | 86.96 | 95.69 (0.23) | 95.59 (0.27) | 93.58 (0.21) | 95.57 (0.21) | 86.96 | NA |
| cy | 10k | 97.17 (0.08) | 96.68 (0.09) | 94.62 (0.11) | 97.11 (0.18) | 91.57 | 93.15 | 96.82 (0.12) | 96.93 (0.07) | 94.68 (0.15) | 96.71 (0.15) | 91.57 | NA |
| sl | 15k | 97.71 (0.02) | 97.89 (0.03) | 95.40 (0.14) | 97.62 (0.07) | 93.79 | 93.97 | 97.83 (0.03) | 97.82 (0.06) | 96.08 (0.06) | 97.52 (0.06) | 93.89 | NA |
| bg | 20k | 97.32 (0.02) | 97.33 (0.04) | 94.83 (0.00) | 96.96 (0.02) | 93.44 | 93.60 | 97.27 (0.03) | 97.28 (0.03) | 94.64 (0.00) | 96.86 (0.00) | 93.25 | NA |
| da | 20k | 97.78 (0.02) | 97.65 (0.02) | 95.79 (0.00) | 97.28 (0.02) | 93.10 | 93.39 | 97.60 (0.02) | 97.63 (0.05) | 95.14 (0.00) | 97.22 (0.00) | 93.15 | NA |
| de | 20k | 96.81 (0.04) | 96.77 (0.01) | 95.04 (0.00) | 96.41 (0.03) | 88.51 | 89.06 | 96.65 (0.01) | 96.70 (0.03) | 94.87 (0.00) | 96.31 (0.00) | 88.59 | NA |
| el | 20k | 97.39 (0.01) | 97.55 (0.06) | 95.00 (0.17) | 97.12 (0.03) | 91.18 | 91.49 | 97.50 (0.01) | 97.50 (0.04) | 95.01 (0.02) | 97.10 (0.02) | 91.21 | NA |
| en | 20k | 93.44 (0.08) | 93.51 (0.02) | 91.52 (0.00) | 93.06 (0.02) | 84.11 | 84.37 | 93.35 (0.02) | 93.50 (0.07) | 91.34 (0.00) | 92.97 (0.00) | 84.12 | NA |
| es | 20k | 95.99 (0.07) | 96.04 (0.03) | 92.27 (0.29) | 95.57 (0.09) | 91.38 | 91.66 | 95.94 (0.02) | 95.99 (0.07) | 92.33 (0.16) | 95.45 (0.16) | 91.51 | NA |
| fr | 20k | 95.44 (0.06) | 95.67 (0.07) | 93.20 (0.20) | 95.13 (0.05) | 91.01 | 91.10 | 95.39 (0.04) | 95.55 (0.11) | 92.85 (0.05) | 95.11 (0.05) | 91.04 | NA |
| hr | 20k | 97.52 (0.03) | 97.49 (0.01) | 94.88 (0.00) | 97.18 (0.04) | 92.12 | 92.27 | 97.45 (0.02) | 97.45 (0.06) | 94.53 (0.13) | 97.05 (0.13) | 92.05 | NA |
| it | 20k | 96.88 (0.04) | 96.90 (0.05) | 95.36 (0.00) | 96.53 (0.04) | 92.16 | 92.03 | 96.88 (0.02) | 96.85 (0.03) | 95.00 (0.08) | 96.28 (0.08) | 92.22 | NA |
| nl | 20k | 97.41 (0.02) | 97.30 (0.03) | 96.03 (0.00) | 96.81 (0.06) | 92.62 | 92.56 | 97.19 (0.03) | 97.17 (0.02) | 95.17 (0.12) | 96.73 (0.12) | 92.59 | NA |
| no | 20k | 97.79 (0.05) | 97.80 (0.03) | 95.54 (0.00) | 97.40 (0.02) | 93.48 | 93.46 | 97.72 (0.04) | 97.74 (0.03) | 95.22 (0.03) | 97.36 (0.03) | 93.32 | NA |
| pl | 20k | 96.87 (0.03) | 96.89 (0.03) | 94.40 (0.00) | 96.41 (0.07) | 91.45 | 91.38 | 96.73 (0.03) | 96.80 (0.01) | 93.86 (0.00) | 96.34 (0.00) | 91.34 | NA |
| pt | 20k | 96.06 (0.01) | 95.97 (0.03) | 92.55 (0.39) | 95.22 (0.08) | 92.14 | 92.00 | 96.04 (0.01) | 95.97 (0.07) | 92.09 (0.11) | 95.28 (0.11) | 92.11 | NA |
| ro | 20k | 97.17 (0.02) | 97.32 (0.03) | 93.89 (0.29) | 96.91 (0.03) | 94.32 | 94.43 | 97.05 (0.04) | 97.30 (0.06) | 94.10 (0.21) | 96.79 (0.21) | 94.32 | NA |
| ru | 20k | 95.90 (0.03) | 95.92 (0.03) | 92.51 (0.00) | 95.56 (0.04) | 90.01 | 89.86 | 95.89 (0.03) | 95.86 (0.04) | 92.12 (0.00) | 95.47 (0.00) | 89.96 | NA |

| | | Number of Clusters = 4 | | | | | | Number of Clusters = 5 | | | | | |
| Language | #Train | Full,1k | Full,10k | 1k,1k | 10k,10k | Noun | Head | Full,1k | Full,10k | 1k,1k | 10k,10k | Noun | Head |
| fo | 0.1k | 93.46 (0.40) | 94.04 (0.46) | 91.98 (0.00) | 92.96 (0.79) | 88.70 | NA | 94.62 (0.08) | 94.28 (0.47) | 93.59 (0.00) | 93.24 (0.18) | 86.78 | NA |
| ps | 0.1k | 91.57 (0.33) | 91.51 (0.89) | 90.46 (0.98) | 90.95 (0.22) | 55.92 | NA | 91.96 (0.77) | 90.69 (0.71) | 91.02 (0.52) | 90.71 (0.29) | 55.92 | NA |
| scn | 0.1k | 85.91 (0.52) | 88.49 (1.06) | 86.63 (0.00) | 87.51 (0.55) | 76.77 | NA | 86.21 (0.63) | 87.80 (0.79) | 87.83 (0.00) | 86.50 (0.77) | 76.77 | NA |
| ga | 1k | 92.89 (0.24) | 93.66 (0.17) | 92.20 (0.00) | 93.28 (0.06) | 85.72 | NA | 92.89 (0.03) | 93.50 (0.15) | 91.66 (0.00) | 93.09 (0.32) | 85.72 | NA |
| is | 1k | 95.80 (0.11) | 95.35 (0.00) | 95.69 (0.00) | 95.12 (0.00) | 87.92 | NA | 95.91 (0.06) | 95.93 (0.08) | 95.73 (0.00) | 96.03 (0.15) | 87.51 | NA |
| af | 5k | 97.35 (0.05) | 97.30 (0.10) | 95.70 (0.00) | 97.25 (0.10) | 91.46 | NA | 97.24 (0.02) | 97.20 (0.04) | 95.24 (0.00) | 97.17 (0.04) | 91.37 | NA |
| hi | 5k | 94.85 (0.13) | 94.79 (0.13) | 90.87 (0.00) | 94.66 (0.23) | 86.89 | NA | 95.12 (0.08) | 95.23 (0.06) | 91.51 (0.00) | 95.19 (0.10) | 86.89 | NA |
| mr | 5k | 95.61 (0.20) | 95.52 (0.06) | 93.57 (0.00) | 95.37 (0.11) | 86.96 | NA | 95.71 (0.17) | 95.99 (0.14) | 93.17 (0.25) | 95.78 (0.16) | 86.96 | NA |
| cy | 10k | 96.63 (0.24) | 97.11 (0.02) | 94.26 (0.16) | 96.94 (0.25) | 91.57 | NA | 96.76 (0.04) | 97.15 (0.04) | 94.16 (0.00) | 96.94 (0.10) | 91.57 | NA |
| sl | 15k | 97.69 (0.01) | 97.87 (0.03) | 95.74 (0.02) | 97.56 (0.04) | 93.93 | NA | 97.67 (0.04) | 97.76 (0.04) | 95.63 (0.08) | 97.41 (0.02) | 93.78 | NA |
| bg | 20k | 97.26 (0.06) | 97.35 (0.06) | 94.17 (0.00) | 96.79 (0.03) | 93.18 | NA | 97.27 (0.05) | 97.23 (0.03) | 94.17 (0.00) | 96.71 (0.05) | 93.19 | NA |
| da | 20k | 97.58 (0.05) | 97.62 (0.02) | 95.19 (0.00) | 97.07 (0.04) | 93.00 | NA | 97.51 (0.02) | 97.58 (0.05) | 94.81 (0.00) | 97.11 (0.03) | 92.92 | NA |
| de | 20k | 96.73 (0.02) | 96.70 (0.00) | 94.87 (0.00) | 96.16 (0.00) | 88.25 | NA | 96.75 (0.01) | 96.64 (0.05) | 94.87 (0.00) | 96.32 (0.03) | 88.25 | NA |
| el | 20k | 97.49 (0.05) | 97.53 (0.03) | 95.10 (0.16) | 97.08 (0.04) | 91.18 | NA | 97.42 (0.02) | 97.50 (0.03) | 94.85 (0.08) | 96.96 (0.05) | 90.07 | NA |
| en | 20k | 93.37 (0.02) | 93.40 (0.04) | 91.21 (0.00) | 92.96 (0.07) | 83.75 | NA | 93.36 (0.07) | 93.53 (0.04) | 91.21 (0.00) | 92.97 (0.01) | 83.83 | NA |
| es | 20k | 95.83 (0.03) | 96.01 (0.03) | 92.16 (0.41) | 95.38 (0.07) | 90.96 | NA | 95.71 (0.13) | 96.06 (0.01) | 91.91 (0.25) | 95.62 (0.07) | 90.96 | NA |
| fr | 20k | 95.21 (0.06) | 95.57 (0.04) | 93.05 (0.16) | 95.02 (0.04) | 90.39 | NA | 95.37 (0.05) | 95.51 (0.04) | 92.28 (0.00) | 94.98 (0.04) | 90.39 | NA |
| hr | 20k | 97.39 (0.02) | 97.54 (0.05) | 94.51 (0.00) | 96.97 (0.01) | 91.91 | NA | 97.39 (0.03) | 97.38 (0.02) | 94.32 (0.00) | 96.96 (0.03) | 91.97 | NA |
| it | 20k | 96.76 (0.02) | 96.83 (0.00) | 94.78 (0.00) | 96.18 (0.02) | 91.54 | NA | 96.87 (0.01) | 96.84 (0.01) | 94.48 (0.00) | 96.19 (0.03) | 91.54 | NA |
| nl | 20k | 97.22 (0.04) | 97.05 (0.00) | 95.39 (0.00) | 96.63 (0.00) | 92.26 | NA | 97.19 (0.02) | 97.11 (0.02) | 95.39 (0.00) | 96.73 (0.06) | 92.14 | NA |
| no | 20k | 97.66 (0.00) | 97.71 (0.02) | 94.78 (0.27) | 97.24 (0.04) | 93.31 | NA | 97.62 (0.01) | 97.69 (0.03) | 94.51 (0.00) | 97.24 (0.02) | 93.24 | NA |
| pl | 20k | 96.68 (0.06) | 96.79 (0.06) | 93.65 (0.00) | 96.31 (0.06) | 91.19 | NA | 96.72 (0.06) | 96.54 (0.03) | 93.65 (0.00) | 96.01 (0.09) | 91.18 | NA |
| pt | 20k | 95.83 (0.07) | 95.87 (0.01) | 91.87 (0.00) | 95.25 (0.01) | 91.57 | NA | 95.96 (0.03) | 96.08 (0.07) | 92.37 (0.13) | 95.41 (0.06) | 91.57 | NA |
| ro | 20k | 97.06 (0.10) | 97.37 (0.06) | 92.82 (0.00) | 96.81 (0.07) | 93.69 | NA | 96.89 (0.01) | 97.22 (0.03) | 92.34 (0.00) | 96.63 (0.01) | 93.69 | NA |
| ru | 20k | 95.79 (0.06) | 95.89 (0.01) | 92.19 (0.00) | 95.38 (0.09) | 89.97 | NA | 95.82 (0.004) | 95.71 (0.05) | 92.19 (0.00) | 95.25 (0.07) | 89.81 | NA |
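As a reading aid for Table A1, the gain of our method over a baseline is the difference between two cells in the same row. The short Python sketch below computes it for a handful of rows, using the ‘Full,1k’ and ‘Noun’ columns at Number of Clusters = 2. It assumes the scores are on a 0 to 100 scale, so the differences are absolute F1 points; the variable and function names are hypothetical.

```python
# (language, our F1 at "Full,1k", baseline "Noun" F1), Number of Clusters = 2,
# copied from Table A1.
rows = [
    ("fo", 93.31, 87.01),
    ("ps", 91.55, 53.97),
    ("scn", 86.14, 76.54),
    ("en", 93.44, 84.11),
    ("ro", 97.17, 94.32),
]


def gain(ours: float, baseline: float) -> float:
    """Absolute F1 difference, rounded to the table's two decimals."""
    return round(ours - baseline, 2)


gains = {lang: gain(ours, base) for lang, ours, base in rows}
for lang, value in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: +{value:.2f}")
```

Among these rows, the largest gain is for ps, a language with 0.1k training samples, and the smallest is for ro, which has 20k, in line with the pattern that low-resource languages benefit most.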

Appendix B. F1 Score of NER Evaluation (Embedding-Based Comparison)

Table A2. F1 score, averaged over three runs, of our method and the embedding-based baselines across all clusters and experimental setups. Column ‘X, Y’ denotes the setting in which X samples are used to evaluate clusters formed using Y samples. ‘e1k’ and ‘e10k’ represent the embedding-based baseline [19] tested with 1k and 10k samples [20]. NA indicates that a value is not available.
| | | Number of Clusters = 2 | | | | | | Number of Clusters = 3 | | | | | |
| Language | #Train | Full,1k | Full,10k | 1k,1k | 10k,10k | e1k | e10k | Full,1k | Full,10k | 1k,1k | 10k,10k | e1k | e10k |
| fo | 0.1k | 93.31 (0.20) | 94.35 (0.64) | 92.82 (0.00) | 93.46 (0.94) | 87.58 | 88.70 | 93.57 (0.88) | 92.76 (0.17) | 92.04 (0.16) | 94.17 (0.16) | 86.35 | 87.44 |
| ps | 0.1k | 91.55 (0.66) | 91.27 (0.08) | 90.62 (0.43) | 91.12 (0.68) | 55.31 | 55.92 | 90.65 (0.36) | 90.98 (0.24) | 91.22 (0.13) | 90.91 (0.13) | 54.68 | 53.32 |
| scn | 0.1k | 86.14 (1.41) | 85.29 (0.39) | 87.66 (0.00) | 88.27 (0.89) | 75.58 | 77.12 | 87.05 (0.90) | 87.27 (0.33) | 88.31 (0.36) | 87.45 (0.36) | 75.58 | 77.12 |
| ga | 1k | 93.23 (0.25) | 93.35 (0.08) | 92.94 (0.07) | 93.41 (0.07) | 84.11 | 84.38 | 93.42 (0.13) | 92.70 (0.24) | 92.77 (0.17) | 93.03 (0.17) | 84.11 | 84.43 |
| is | 1k | 95.93 (0.15) | 95.81 (0.12) | 95.57 (0.00) | 95.74 (0.17) | 87.28 | 87.63 | 96.18 (0.13) | 95.84 (0.22) | 95.45 (0.10) | 95.86 (0.10) | 86.75 | 87.44 |
| af | 5k | 97.43 (0.05) | 97.32 (0.10) | 96.09 (0.00) | 97.28 (0.05) | 91.30 | 91.73 | 97.24 (0.12) | 97.37 (0.05) | 95.86 (0.08) | 97.07 (0.08) | 91.51 | 90.73 |
| hi | 5k | 95.48 (0.20) | 94.97 (0.26) | 92.61 (0.00) | 94.67 (0.06) | 89.48 | 89.42 | 94.88 (0.27) | 94.80 (0.48) | 91.58 (0.00) | 94.73 (0.00) | 89.18 | 89.90 |
| mr | 5k | 95.83 (0.14) | 95.79 (0.19) | 93.82 (0.00) | 95.62 (0.09) | 88.71 | 88.29 | 95.69 (0.23) | 95.59 (0.27) | 93.58 (0.21) | 95.57 (0.21) | 87.93 | 88.58 |
| cy | 10k | 97.17 (0.08) | 96.68 (0.09) | 94.62 (0.11) | 97.11 (0.18) | 92.22 | 91.88 | 96.82 (0.12) | 96.93 (0.07) | 94.68 (0.15) | 96.71 (0.15) | 91.73 | 92.42 |
| sl | 15k | 97.71 (0.02) | 97.89 (0.03) | 95.40 (0.14) | 97.62 (0.07) | 93.65 | 93.95 | 97.83 (0.03) | 97.82 (0.06) | 96.08 (0.06) | 97.52 (0.06) | 93.65 | 93.88 |
| bg | 20k | 97.32 (0.02) | 97.33 (0.04) | 94.83 (0.00) | 96.96 (0.02) | 93.22 | 93.34 | 97.27 (0.03) | 97.28 (0.03) | 94.64 (0.00) | 96.86 (0.00) | 92.64 | 93.34 |
| da | 20k | 97.78 (0.02) | 97.65 (0.02) | 95.79 (0.00) | 97.28 (0.02) | 92.76 | 92.91 | 97.60 (0.02) | 97.63 (0.05) | 95.14 (0.00) | 97.22 (0.00) | 92.59 | 93.03 |
| de | 20k | 96.81 (0.04) | 96.77 (0.01) | 95.04 (0.00) | 96.41 (0.03) | 88.13 | 88.61 | 96.65 (0.01) | 96.70 (0.03) | 94.87 (0.00) | 96.31 (0.00) | 88.13 | 88.31 |
| el | 20k | 97.39 (0.01) | 97.55 (0.06) | 95.00 (0.17) | 97.12 (0.03) | 90.91 | 91.28 | 97.50 (0.01) | 97.50 (0.04) | 95.01 (0.02) | 97.10 (0.02) | 90.40 | 90.07 |
| en | 20k | 93.44 (0.08) | 93.51 (0.02) | 91.52 (0.00) | 93.06 (0.02) | 84.22 | 84.02 | 93.35 (0.02) | 93.50 (0.07) | 91.34 (0.00) | 92.97 (0.00) | 84.22 | 83.97 |
| es | 20k | 95.99 (0.07) | 96.04 (0.03) | 92.27 (0.29) | 95.57 (0.09) | 91.34 | 90.52 | 95.94 (0.02) | 95.99 (0.07) | 92.33 (0.16) | 95.45 (0.16) | 91.34 | 90.52 |
| fr | 20k | 95.44 (0.06) | 95.67 (0.07) | 93.20 (0.20) | 95.13 (0.05) | 90.74 | 90.56 | 95.39 (0.04) | 95.55 (0.11) | 92.85 (0.05) | 95.11 (0.05) | 90.74 | 90.53 |
| hr | 20k | 97.52 (0.03) | 97.49 (0.01) | 94.88 (0.00) | 97.18 (0.04) | 91.88 | 92.07 | 97.45 (0.02) | 97.45 (0.06) | 94.53 (0.13) | 97.05 (0.13) | 91.88 | 92.06 |
| it | 20k | 96.88 (0.04) | 96.90 (0.05) | 95.36 (0.00) | 96.53 (0.04) | 91.93 | 91.52 | 96.88 (0.02) | 96.85 (0.03) | 95.00 (0.08) | 96.28 (0.08) | 91.93 | 91.52 |
| nl | 20k | 97.41 (0.02) | 97.30 (0.03) | 96.03 (0.00) | 96.81 (0.06) | 91.90 | 92.23 | 97.19 (0.03) | 97.17 (0.02) | 95.17 (0.12) | 96.73 (0.12) | 91.74 | 92.17 |
| no | 20k | 97.79 (0.05) | 97.80 (0.03) | 95.54 (0.00) | 97.40 (0.02) | 93.05 | 93.34 | 97.72 (0.04) | 97.74 (0.03) | 95.22 (0.03) | 97.36 (0.03) | 93.14 | 93.24 |
| pl | 20k | 96.87 (0.03) | 96.89 (0.03) | 94.40 (0.00) | 96.41 (0.07) | 91.12 | 91.33 | 96.73 (0.03) | 96.80 (0.01) | 93.86 (0.00) | 96.34 (0.00) | 91.12 | 91.22 |
| pt | 20k | 96.06 (0.01) | 95.97 (0.03) | 92.55 (0.39) | 95.22 (0.08) | 91.79 | 91.43 | 96.04 (0.01) | 95.97 (0.07) | 92.09 (0.11) | 95.28 (0.11) | 91.79 | 91.43 |
| ro | 20k | 97.17 (0.02) | 97.32 (0.03) | 93.89 (0.29) | 96.91 (0.03) | 94.04 | 94.17 | 97.05 (0.04) | 97.30 (0.06) | 94.10 (0.21) | 96.79 (0.21) | 94.04 | 93.98 |
| ru | 20k | 95.90 (0.03) | 95.92 (0.03) | 92.51 (0.00) | 95.56 (0.04) | 89.77 | 89.98 | 95.89 (0.03) | 95.86 (0.04) | 92.12 (0.00) | 95.47 (0.00) | 89.32 | 90.02 |

| | | Number of Clusters = 4 | | | | | | Number of Clusters = 5 | | | | | |
| Language | #Train | Full,1k | Full,10k | 1k,1k | 10k,10k | e1k | e10k | Full,1k | Full,10k | 1k,1k | 10k,10k | e1k | e10k |
| fo | 0.1k | 93.46 (0.40) | 94.04 (0.46) | 91.98 (0.00) | 92.96 (0.79) | 87.72 | 87.76 | 94.62 (0.08) | 94.28 (0.47) | 93.59 (0.00) | 93.24 (0.18) | 86.78 | NA |
| ps | 0.1k | 91.57 (0.33) | 91.51 (0.89) | 90.46 (0.98) | 90.95 (0.22) | 54.68 | 55.37 | 91.96 (0.77) | 90.69 (0.71) | 91.02 (0.52) | 90.71 (0.29) | 52.97 | 53.54 |
| scn | 0.1k | 85.91 (0.52) | 88.49 (1.06) | 86.63 (0.00) | 87.51 (0.55) | 75.58 | 77.12 | 86.21 (0.63) | 87.80 (0.79) | 87.83 (0.00) | 86.50 (0.77) | 75.58 | 77.12 |
| ga | 1k | 92.89 (0.24) | 93.66 (0.17) | 92.20 (0.00) | 93.28 (0.06) | 84.11 | 84.53 | 92.89 (0.03) | 93.50 (0.15) | 91.66 (0.00) | 93.09 (0.32) | 84.11 | 85.13 |
| is | 1k | 95.80 (0.11) | 95.35 (0.00) | 95.69 (0.00) | 95.12 (0.00) | 86.51 | 87.77 | 95.91 (0.06) | 95.93 (0.08) | 95.73 (0.00) | 96.03 (0.15) | 96.51 | 87.71 |
| af | 5k | 97.35 (0.05) | 97.30 (0.10) | 95.70 (0.00) | 97.25 (0.10) | 90.73 | 91.18 | 97.24 (0.02) | 97.20 (0.04) | 95.24 (0.00) | 97.17 (0.04) | 90.73 | 91.14 |
| hi | 5k | 94.85 (0.13) | 94.79 (0.13) | 90.87 (0.00) | 94.66 (0.23) | 88.66 | 89.70 | 95.12 (0.08) | 95.23 (0.06) | 91.51 (0.00) | 95.19 (0.10) | 88.66 | 88.98 |
| mr | 5k | 95.61 (0.20) | 95.52 (0.06) | 93.57 (0.00) | 95.37 (0.11) | 87.38 | 88.09 | 95.71 (0.17) | 95.99 (0.14) | 93.17 (0.25) | 95.78 (0.16) | 87.38 | 88.13 |
| cy | 10k | 96.63 (0.24) | 97.11 (0.02) | 94.26 (0.16) | 96.94 (0.25) | 91.73 | 91.98 | 96.76 (0.04) | 97.15 (0.04) | 94.16 (0.00) | 96.94 (0.10) | 91.27 | 92.64 |
| sl | 15k | 97.69 (0.01) | 97.87 (0.03) | 95.74 (0.02) | 97.56 (0.04) | 93.65 | 93.61 | 97.67 (0.04) | 97.76 (0.04) | 95.63 (0.08) | 97.41 (0.02) | 93.65 | 93.81 |
| bg | 20k | 97.26 (0.06) | 97.35 (0.06) | 94.17 (0.00) | 96.79 (0.03) | 92.64 | 92.48 | 97.27 (0.05) | 97.23 (0.03) | 94.17 (0.00) | 96.71 (0.05) | 92.58 | 92.48 |
| da | 20k | 97.58 (0.05) | 97.62 (0.02) | 95.19 (0.00) | 97.07 (0.04) | 92.43 | 92.78 | 97.51 (0.02) | 97.58 (0.05) | 94.81 (0.00) | 97.11 (0.03) | 92.43 | 92.99 |
| de | 20k | 96.73 (0.02) | 96.70 (0.00) | 94.87 (0.00) | 96.16 (0.00) | 88.13 | 88.33 | 96.75 (0.01) | 96.64 (0.05) | 94.87 (0.00) | 96.32 (0.03) | 88.13 | 88.38 |
| el | 20k | 97.49 (0.05) | 97.53 (0.03) | 95.10 (0.16) | 97.08 (0.04) | 90.40 | 90.07 | 97.42 (0.02) | 97.50 (0.03) | 94.85 (0.08) | 96.96 (0.05) | 90.07 | 90.07 |
| en | 20k | 93.37 (0.02) | 93.40 (0.04) | 91.21 (0.00) | 92.96 (0.07) | 84.22 | 83.89 | 93.36 (0.07) | 93.53 (0.04) | 91.21 (0.00) | 92.97 (0.01) | 84.22 | 83.89 |
| es | 20k | 95.83 (0.03) | 96.01 (0.03) | 92.16 (0.41) | 95.38 (0.07) | 91.34 | 90.52 | 95.71 (0.13) | 96.06 (0.01) | 91.91 (0.25) | 95.62 (0.07) | 91.34 | 90.52 |
| fr | 20k | 95.21 (0.06) | 95.57 (0.04) | 93.05 (0.16) | 95.02 (0.04) | 90.74 | 90.52 | 95.37 (0.05) | 95.51 (0.04) | 92.28 (0.00) | 94.98 (0.04) | 90.74 | 90.32 |
| hr | 20k | 97.39 (0.02) | 97.54 (0.05) | 94.51 (0.00) | 96.97 (0.01) | 91.88 | 92.14 | 97.39 (0.03) | 97.38 (0.02) | 94.32 (0.00) | 96.96 (0.03) | 91.88 | 91.91 |
| it | 20k | 96.76 (0.02) | 96.83 (0.00) | 94.78 (0.00) | 96.18 (0.02) | 91.93 | 91.52 | 96.87 (0.01) | 96.84 (0.01) | 94.48 (0.00) | 96.19 (0.03) | 91.93 | 91.52 |
| nl | 20k | 97.22 (0.04) | 97.05 (0.00) | 95.39 (0.00) | 96.63 (0.00) | 90.86 | 92.14 | 97.19 (0.02) | 97.11 (0.02) | 95.39 (0.00) | 96.73 (0.06) | 90.86 | 92.20 |
| no | 20k | 97.66 (0.00) | 97.71 (0.02) | 94.78 (0.27) | 97.24 (0.04) | 92.79 | 93.27 | 97.62 (0.01) | 97.69 (0.03) | 94.51 (0.00) | 97.24 (0.02) | 92.79 | 93.17 |
| pl | 20k | 96.68 (0.06) | 96.79 (0.06) | 93.65 (0.00) | 96.31 (0.06) | 91.12 | 91.23 | 96.72 (0.06) | 96.54 (0.03) | 93.65 (0.00) | 96.01 (0.09) | 91.12 | 91.24 |
| pt | 20k | 95.83 (0.07) | 95.87 (0.01) | 91.87 (0.00) | 95.25 (0.01) | 91.79 | 91.43 | 95.96 (0.03) | 96.08 (0.07) | 92.37 (0.13) | 95.41 (0.06) | 91.79 | 91.43 |
| ro | 20k | 97.06 (0.10) | 97.37 (0.06) | 92.82 (0.00) | 96.81 (0.07) | 94.04 | 94.02 | 96.89 (0.01) | 97.22 (0.03) | 92.34 (0.00) | 96.63 (0.01) | 94.04 | 94.06 |
| ru | 20k | 95.79 (0.06) | 95.89 (0.01) | 92.19 (0.00) | 95.38 (0.09) | 89.18 | 89.66 | 95.82 (0.004) | 95.71 (0.05) | 92.19 (0.00) | 95.25 (0.07) | 89.18 | 88.52 |

References

  1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 3–5 June 2019; pp. 4171–4186.
  2. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451.
  3. Joshi, M.; Chen, D.; Liu, Y.; Weld, D.S.; Zettlemoyer, L.; Levy, O. SpanBERT: Improving pre-training by representing and predicting spans. Trans. Assoc. Comput. Linguist. 2020, 8, 64–77.
  4. Artetxe, M.; Schwenk, H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 2019, 7, 597–610.
  5. Sennrich, R.; Haddow, B.; Birch, A. Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 371–376.
  6. Patil, V.; Talukdar, P.; Sarawagi, S. Overlap-based Vocabulary Generation Improves Cross-lingual Transfer Among Related Languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 219–233.
  7. Ri, R.; Tsuruoka, Y. Pretraining with Artificial Language: Studying Transferable Knowledge in Language Models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 7302–7315.
  8. Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; Johnson, M. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the International Conference on Machine Learning, Online, 13–18 July 2020; PMLR: Cambridge, MA, USA, 2020; pp. 4411–4421.
  9. Pfeiffer, J.; Vulić, I.; Gurevych, I.; Ruder, S. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7654–7673.
  10. Lin, Y.; Chen, C.; Lee, J.; Li, Z.; Zhang, Y.; Xia, M.; Rijhwani, S.; He, J.; Zhang, Z.; Ma, X.; et al. Choosing Transfer Languages for Cross-Lingual Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 3125–3135.
  11. Pires, T.; Schlinger, E.; Garrette, D. How Multilingual is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4996–5001.
  12. Ponti, E.M.; O’Horan, H.; Berzak, Y.; Vulić, I.; Reichart, R.; Poibeau, T.; Shutova, E.; Korhonen, A. Modeling language variation and universals: A survey on typological linguistics for natural language processing. Comput. Linguist. 2019, 45, 559–601.
  13. Dhamecha, T.I.; Murthy, R.; Bharadwaj, S.; Sankaranarayanan, K.; Bhattacharyya, P. Role of Language Relatedness in Multilingual Fine-tuning of Language Models: A Case Study in Indo-Aryan Languages. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 8584–8595.
  14. Saleh, F.; Buntine, W.; Haffari, G.; Du, L. Multilingual Neural Machine Translation: Can Linguistic Hierarchies Help? In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 1313–1330.
  15. Chronopoulou, A.; Stojanovski, D.; Fraser, A. Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation. In Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 59–72.
  16. Ammar, W.; Mulcaire, G.; Ballesteros, M.; Dyer, C.; Smith, N.A. Many languages, one parser. Trans. Assoc. Comput. Linguist. 2016, 4, 431–444.
  17. Fujinuma, Y.; Boyd-Graber, J.; Kann, K. Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1500–1512.
  18. Tan, X.; Chen, J.; He, D.; Xia, Y.; Qin, T.; Liu, T.Y. Multilingual Neural Machine Translation with Language Clustering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 963–973.
  19. Shaffer, K. Language clustering for multilingual named entity recognition. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 40–45.
  20. Imai, S.; Kawahara, D.; Orita, N.; Oda, H. Theoretical Linguistics Rivals Embeddings in Language Clustering for Multilingual Named Entity Recognition. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop); Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 139–151.
  21. Oncevay, A.; Haddow, B.; Birch, A. Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2391–2406.
  22. de Vries, W.; Bartelds, M.; Nissim, M.; Wieling, M. Adapting Monolingual Models: Data Can Be Scarce When Language Similarity Is High. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 4901–4907.
  23. Ge, L.; Hu, C.; Ma, G.; Zhang, H.; Liu, J. Den-ML: Multi-source cross-lingual transfer via denoising mutual learning. Inf. Process. Manag. 2024, 61, 103834.
  24. Chen, L.; Zhao, Y.; Li, Q. Multi-level multilingual semantic alignment for zero-shot cross-lingual transfer learning. Neural Netw. 2024, 172, 106123.
  25. Pfeiffer, J.; Ruder, S.; Vulić, I. Adapter-Based Methods for Parameter-Efficient Transfer Learning: A Survey. Trans. Assoc. Comput. Linguist. 2024, 12, 1–28.
  26. Garcia, M.; Lopez, J.; Kim, S. Cross-Lingual Summarization Using Retrieval-Based In-Context Learning. Appl. Sci. 2025, 15, 7800.
  27. Fekete, M.R.; Bjerva, J. Gradual language model adaptation using fine-grained typology. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 153–158.
  28. Littell, P.; Mortensen, D.R.; Lin, K.; Kairis, K.; Turner, C.; Levin, L. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, 3–7 April 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 8–14.
  29. Malaviya, C.; Neubig, G.; Littell, P. Learning Language Representations for Typology Prediction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 2529–2535.
  30. Rahimi, A.; Li, Y.; Cohn, T. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 151–164.
  31. Nakayama, H. Seqeval: A Python Framework for Sequence Labeling Evaluation. Software, GitHub. 2018. Available online: https://github.com/chakki-works/seqeval (accessed on 17 April 2025).
  32. Hagberg, A.A.; Schult, D.A.; Swart, P.J. Exploring Network Structure, Dynamics, and Function Using NetworkX. In Proceedings of the 7th Python in Science Conference, Pasadena, CA, USA, 19–24 August 2008; pp. 11–15.
  33. Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Dormann, N.; Traore, R. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Software, GitHub. 2021. Available online: https://github.com/DLR-RM/stable-baselines3 (accessed on 18 July 2025).
Figure 1. Inspired by computational chemistry, particularly molecule discovery, we introduce the first generative, task-driven, and graph-based approach to model cross-lingual transfer as RL-based structural optimization to expose hidden relationships, followed by graph clustering.
Figure 2. Language molecule, consisting of 25 languages, constructed by our method (using 1000 samples), resembling a large or complex chemical compound, such as a polymer.
Figure 3. The same language molecule as in Figure 2 but constructed using 10,000 samples.
Figure 4. F1 score for each language averaged across three runs (number of clusters = 2, with linguistic-based comparison).
Figure 5. F1 score for each language averaged across three runs (number of clusters = 3, with linguistic-based comparison).
Figure 6. F1 score for each language averaged across three runs (number of clusters = 4, with linguistic-based comparison).
Figure 7. F1 score for each language averaged across three runs (number of clusters = 5, with linguistic-based comparison).
Figure 8. Average F1 score differences with the linguistic-based baseline (baseline_noun) after the datasets are categorized based on their sample size, and sorted in ascending order (number of clusters = 2).
Figure 9. Average F1 score differences with the linguistic-based baseline (baseline_noun) after the datasets are categorized based on their sample size, and sorted in ascending order (number of clusters = 3).
Figure 10. Average F1 score differences with the linguistic-based baseline (baseline_noun) after the datasets are categorized based on their sample size, and sorted in ascending order (number of clusters = 4).
Figure 11. Average F1 score differences with the linguistic-based baseline (baseline_noun) after the datasets are categorized based on their sample size, and sorted in ascending order (number of clusters = 5).
Figure 12. F1 score for each language averaged across three runs (number of clusters = 2, with embedding-based comparison).
Figure 13. Average F1 score differences with the embedding-based baseline (emb_1000) after the datasets are categorized based on their sample size, and sorted in ascending order (number of clusters = 2).
Figure 14. Average F1 score differences with the embedding-based baseline (emb_10,000) after the datasets are categorized based on their sample size, and sorted in ascending order (number of clusters = 2).
Figure 15. Correlation between F1 score improvement over the linguistic-based baseline and betweenness values of our graph.
Figure 16. Correlation between F1 score improvement over the embedding-based baseline (emb_1000) and betweenness value of our graph.
Figure 17. Correlation between F1 score improvement over the embedding-based baseline (emb_10,000) and betweenness value of our graph.
Figure 18. Cluster sensitivity to performance improvements averaged across all languages.
Figure 19. Performance trend across tasks (NER and POS).
Figure 20. Performance trend across language families.
Figure 21. Ablation results on reward functions.
Table 1. Overview of graphical analysis of the language molecules.
| Metric | Graph_1k | Graph_10k |
|---|---|---|
| AvgEdgeWt | 0.478 | 0.481 |
| EdgeWtStd | 0.124 | 0.099 |
| Min | 0.059 | 0.221 |
| Entropy | 5.42 | 5.66 |
| Clustering | 0.402 | 0.560 |
| Assortativity | −0.057 | −0.037 |
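As a rough illustration of how summary statistics like those in Table 1 might be derived from a weighted language graph, the sketch below computes edge-weight mean, standard deviation, minimum, a Shannon entropy over normalized edge weights, and an unweighted average clustering coefficient. The exact definitions used in the paper are not stated in this section, so the entropy formula (natural log over normalized weights), the population standard deviation, the unweighted clustering coefficient, the function names, and the toy edge list are all assumptions made for illustration. Assortativity is omitted for brevity.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def edge_weight_stats(edges):
    """Summarize edge weights of an undirected graph.

    `edges` is a list of (node_u, node_v, weight) tuples.
    The entropy definition here is an assumption: Shannon entropy
    (natural log) of the weights normalized to sum to 1.
    """
    weights = [w for _, _, w in edges]
    total = sum(weights)
    entropy = -sum((w / total) * math.log(w / total) for w in weights if w > 0)
    return {
        "AvgEdgeWt": mean(weights),
        "EdgeWtStd": pstdev(weights),  # population std; an assumption
        "Min": min(weights),
        "Entropy": entropy,
    }


def average_clustering(edges):
    """Unweighted average local clustering coefficient (weights ignored)."""
    neighbours = {}
    for u, v, _ in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    coeffs = []
    for nbrs in neighbours.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # nodes with fewer than two neighbours contribute 0
            continue
        # Count edges that exist between pairs of this node's neighbours.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbours[a])
        coeffs.append(2 * links / (k * (k - 1)))
    return mean(coeffs)


# Hypothetical toy graph; the weights are NOT taken from the paper.
toy_edges = [
    ("en", "de", 0.2),
    ("de", "nl", 0.4),
    ("en", "nl", 0.6),
    ("nl", "af", 0.8),
]
stats = edge_weight_stats(toy_edges)
clustering = average_clustering(toy_edges)
```

Under this definition the entropy is bounded above by the log of the number of edges, which it reaches only when all edge weights are equal, so a higher value indicates a more uniform spread of transfer strength across edges.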
Table 2. Clusters formed by our method (from the 1000-sample and 10,000-sample graphs) when the number of clusters is 2, 3, 4, and 5. The clusters for the embedding-based baseline can be found in [20]. The clusters from the linguistic-based baseline are listed at the end of the table.
1000 Samples (from Graph_1k)

| Cluster index | 2 clusters | 3 clusters | 4 clusters | 5 clusters |
|---|---|---|---|---|
| 1 | ['it', 'scn', 'af', 'bg', 'nl', 'pl', 'de', 'ru', 'is', 'hr', 'en', 'da', 'mr', 'no', 'fo', 'hi'] | ['de', 'en', 'da', 'bg', 'pl', 'ru', 'hi'] | ['fr', 'es', 'no', 'el', 'sl', 'ps', 'cy'] | ['es', 'pt', 'el', 'sl', 'ps', 'mr'] |
| 2 | ['ro', 'fr', 'es', 'pt', 'el', 'sl', 'ps', 'cy', 'ga'] | ['fr', 'it', 'scn', 'af', 'is', 'no', 'fo', 'el', 'sl', 'hr', 'mr'] | ['nl', 'de', 'en', 'bg', 'pl', 'ru'] | ['nl', 'de', 'en', 'bg', 'pl', 'ru'] |
| 3 | - | ['ro', 'es', 'pt', 'nl', 'ps', 'cy', 'ga'] | ['it', 'scn', 'af', 'is', 'da', 'fo', 'hr', 'mr'] | ['scn', 'af', 'is', 'da', 'fo', 'hr', 'hi'] |
| 4 | - | - | ['ro', 'pt', 'hi', 'ga'] | ['fr', 'it', 'no', 'cy'] |
| 5 | - | - | - | ['ro', 'ga'] |

10,000 Samples (from Graph_10k)

| Cluster index | 2 clusters | 3 clusters | 4 clusters | 5 clusters |
|---|---|---|---|---|
| 1 | ['ro', 'fr', 'es', 'it', 'de', 'en', 'el', 'bg', 'pl', 'ru', 'sl', 'hr', 'hi'] | ['pt', 'it', 'af', 'sl', 'hr', 'mr', 'ga'] | ['scn', 'da', 'no', 'fo', 'ps', 'cy', 'ga'] | ['bg', 'pl', 'ru'] |
| 2 | ['pt', 'scn', 'af', 'ps', 'nl', 'mr', 'is', 'cy', 'da', 'ga', 'no', 'fo'] | ['scn', 'nl', 'is', 'da', 'no', 'fo', 'ps', 'cy'] | ['ro', 'en', 'fr', 'el', 'es', 'bg', 'pl', 'sl', 'ru', 'hr', 'hi'] | ['scn', 'da', 'ps', 'no', 'cy', 'fo', 'ga'] |
| 3 | - | ['ro', 'fr', 'es', 'de', 'en', 'el', 'bg', 'pl', 'ru', 'hi'] | ['pt', 'it', 'af', 'mr'] | ['ro', 'de', 'fr', 'en', 'es', 'el', 'pt'] |
| 4 | - | - | ['nl', 'de', 'is'] | ['af', 'nl', 'is', 'sl', 'hr', 'mr', 'hi'] |
| 5 | - | - | - | ['it'] |

Baseline Linguistic Clusters [20]

| Family | Languages |
|---|---|
| Germanic | ['af', 'nl', 'de', 'is', 'en', 'da', 'no', 'fo'] |
| Romance | ['ro', 'fr', 'es', 'pt', 'it', 'scn'] |
| Indo-Iranian | ['ps', 'mr', 'hi'] |
| Hellenic | ['el'] |
| Slavic | ['bg', 'pl', 'ru', 'sl', 'hr'] |
| Celtic | ['cy', 'ga'] |
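One simple way to read Table 2 against the linguistic baseline is to check that a clustering is a valid partition of the 25 languages and to measure how strongly each cluster is dominated by a single language family. The sketch below does this for the two-cluster solution from Graph_1k. The language codes and family assignments are copied from Table 2; the purity measure and the helper names `is_partition` and `family_purity` are illustrative assumptions and are not part of the paper's evaluation.

```python
from collections import Counter

# Family assignments copied from the baseline row of Table 2.
FAMILIES = {
    "Germanic": ["af", "nl", "de", "is", "en", "da", "no", "fo"],
    "Romance": ["ro", "fr", "es", "pt", "it", "scn"],
    "Indo-Iranian": ["ps", "mr", "hi"],
    "Hellenic": ["el"],
    "Slavic": ["bg", "pl", "ru", "sl", "hr"],
    "Celtic": ["cy", "ga"],
}
LANG_TO_FAMILY = {lang: fam for fam, langs in FAMILIES.items() for lang in langs}

# Two-cluster solution from Graph_1k, copied from Table 2.
GRAPH_1K_TWO_CLUSTERS = [
    ["it", "scn", "af", "bg", "nl", "pl", "de", "ru",
     "is", "hr", "en", "da", "mr", "no", "fo", "hi"],
    ["ro", "fr", "es", "pt", "el", "sl", "ps", "cy", "ga"],
]


def is_partition(clusters, universe):
    """True if every language appears in exactly one cluster."""
    members = [lang for cluster in clusters for lang in cluster]
    return len(members) == len(set(members)) and set(members) == set(universe)


def family_purity(clusters):
    """Fraction of languages that belong to their cluster's most common family.

    Purity is an illustrative measure chosen for this sketch,
    not a metric reported in the paper.
    """
    total = sum(len(cluster) for cluster in clusters)
    dominant = sum(
        Counter(LANG_TO_FAMILY[lang] for lang in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return dominant / total


valid = is_partition(GRAPH_1K_TWO_CLUSTERS, LANG_TO_FAMILY)
purity = family_purity(GRAPH_1K_TWO_CLUSTERS)
```

A purity well below 1 indicates that the learned clusters cut across family boundaries rather than reproducing the linguistic grouping.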

Share and Cite

MDPI and ACS Style

Bekuretsion, B.; Menzel, W.; Teferra, S. A Chemistry-Inspired Cross-Lingual Transfer in Multi-Lingual NLP via Graph Structural Optimization. AI 2026, 7, 151. https://doi.org/10.3390/ai7050151
