Article

Heterogeneous Graph Neural Network with Multi-View Contrastive Learning for Cross-Lingual Text Classification

School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 3454; https://doi.org/10.3390/app15073454
Submission received: 18 February 2025 / Revised: 18 March 2025 / Accepted: 20 March 2025 / Published: 21 March 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The cross-lingual text classification task remains a long-standing challenge that aims to train a classifier on high-resource source languages and apply it to classify texts in low-resource target languages, bridging linguistic gaps while maintaining accuracy. Most existing methods achieve exceptional performance by relying on multilingual pretrained language models to transfer knowledge across languages. However, little attention has been paid to factors beyond semantic similarity, which leads to the degradation of classification performance in the target languages. This study proposes a novel framework that integrates a heterogeneous graph architecture with multi-view contrastive learning for the cross-lingual text classification task. This study constructs a heterogeneous graph to capture both syntactic and semantic knowledge by connecting document and word nodes using different types of edges, including Part-of-Speech tagging, dependency, similarity, and translation edges. A Graph Attention Network is applied to aggregate information from neighboring nodes. Furthermore, this study devises a multi-view contrastive learning strategy to enhance model performance by pulling positive examples closer together and pushing negative examples further apart. Extensive experiments show that the framework outperforms the previous state-of-the-art model, achieving improvements of 2.20% in accuracy and 1.96% in F1-score on the XGLUE and Amazon Review datasets, respectively. These findings demonstrate that the proposed model makes a positive impact on the cross-lingual text classification task overall.

1. Introduction

Deep learning [1], a prominent subfield of artificial intelligence, leverages neural networks to identify complex patterns in large datasets, driving significant progress across multiple fields [2,3]. In particular, it has made substantial strides in numerous Natural Language Processing (NLP) tasks, such as text classification [4], machine translation [5], and information retrieval [6]. However, a significant limitation of deep-learning models is their heavy reliance on large amounts of labeled data for optimal performance [7]. Such labeled data is often scarce for low-resource languages. As a result, an increasing number of recent studies are focusing on the field of cross-lingual knowledge transfer [8,9].
Cross-Lingual Text Classification (CLTC) [10,11,12] is a branch of the cross-lingual knowledge transfer task, aiming to classify texts in target languages with limited or no annotated data by training a classifier on source language data. The main challenge in CLTC is overcoming the inherent semantic and syntactic disparities between languages, with the ultimate goal of enabling effective knowledge transfer from the source language to the target language. Existing methods mainly exploit the semantic similarity among languages to bridge cross-lingual gaps [13], including effective multilingual pretrained language models [9,14], which leverage transformer-based neural networks trained on large-scale multilingual datasets. However, the performance of multilingual pretrained language models is often constrained by their limited ability to capture syntactic information and handle complex structural dependencies across languages [15]. Furthermore, their failure to explicitly model interlingual syntactic variations may ultimately compromise generalization performance in the target language.
To overcome the limitations in learning representations of textual syntactic structures, recent researchers have focused on leveraging Graph Neural Networks (GNNs). These models capture relational information inherent in graphs by aggregating and transforming information from neighboring nodes [16], thereby enabling them to model both local and global dependencies, including long-range word relations and syntactic structures.
Motivated by recent breakthroughs in GNN-based CLTC and driven by the need to address existing challenges [17], Wang et al. [18] introduced a heterogeneous graph construction method to encode both source and target texts. However, this model neglects the dependency relationships between words, and the use of machine translation often introduces substantial inaccuracies, compromising the reliability of results. On this basis, this study proposes a heterogeneous GNN framework integrated with multi-view contrastive learning, named XCLHG, to effectively capture structured information by combining semantic and syntactic knowledge. In this framework, a heterogeneous graph is constructed in which words and documents serve as nodes, connected by various types of edges, including Part-of-Speech (POS) tagging, dependency, translation, and similarity edges.
Additionally, in order to improve the quality of language representations by bringing semantically equivalent samples from different languages closer together in the embedding space, this study devises a multi-view contrastive learning strategy; the two views are Translation-Level Contrastive Learning (TL-CL) and Label-Level Contrastive Learning (LL-CL), respectively.
This study presents a novel framework, XCLHG, that integrates a heterogeneous GNN with multi-view contrastive learning to address the challenges of CLTC. The main contributions are as follows:
  • Theoretical advances: The proposed framework introduces an innovative approach that captures both syntactic and semantic knowledge by constructing a heterogeneous graph, where documents and words are interconnected through multiple edge types. The integration of multi-view contrastive learning facilitates the alignment of cross-lingual features, thereby advancing the theoretical understanding of structured representation learning.
  • Practical implications: Accurate classification of text across multiple languages provides significant real-world benefits. Enhanced text analysis enabled by the proposed framework facilitates more effective evaluation of customer feedback, targeted advertising, and refined product recommendation systems. Consequently, these advancements contribute to global market research and public policy development by ensuring that diverse linguistic perspectives are integrated into decision-making processes.
  • Empirical performance: Extensive experiments demonstrate that the proposed approach substantially outperforms state-of-the-art methods. Notably, a 2.20% improvement in accuracy on the XGLUE dataset confirms the effectiveness of the framework in facilitating knowledge transfer from high-resource to low-resource languages.
The remainder of this paper is organized as follows. Section 2 reviews related work in CLTC. Section 3 describes the proposed methodology in detail, including the formal problem formulation, the construction of a heterogeneous graph, and the design of a multi-view contrastive learning strategy. Section 4 presents the datasets, experimental setup, and analysis of the results. Finally, Section 5 concludes the paper by discussing the limitations of the study and potential directions for future research.

2. Related Work

In recent years, the rapid advancement of deep-learning architectures in NLP has led to substantial progress in a wide range of fundamental tasks in text analysis and mining [19]. Jovanovic et al. [20] proposed an improved variant of the popular particle swarm optimization algorithm, integrating decomposition techniques to enable networks to capture and interpret underlying trends in the input data. This refined approach demonstrated outstanding performance in predicting Amazon sales. Alohali et al. [21] developed an enhanced deep-learning model that integrates metaheuristic optimization, specifically designed for intelligent systems to accurately identify and classify textual emotions. Recent innovations in text classification have not only enhanced the understanding of textual nuances but have also paved the way for rapid and transformative growth in the field of CLTC. Early approaches in CLTC are mainly divided into two groups: linear transformation methods and multilingual pre-trained methods.

2.1. Linear Transformation Methods

Prior work has explored many approaches to cross-lingual representation learning through linear transformation [22,23,24]. These methods represent semantic features across different languages by mapping words to continuous vector spaces. Subsequently, linear transformations are employed to project these language-specific representations into a shared, unified vector space [25]. Zhou et al. [22] proposed a strategy of bilingual sentiment word embeddings to bridge the semantic gap between English and Chinese. Wang et al. [23] employed structural correspondence learning to effectively exploit target-language domain knowledge, minimizing knowledge degradation during semi-supervised transfer. Kuriyozov et al. [24] proposed a method that maps languages into a common cross-lingual space through pairwise alignment. Although linear transformation methods are widely applied, they still have significant limitations. In particular, these methods struggle to address lexical gaps between languages and fail to effectively handle polysemous words, since alignment at the word or sentence level is often insufficient for culturally specific vocabulary and context-dependent meanings. Additionally, these methods exhibit limited effectiveness in handling languages with substantial differences in word order or syntactic structures.

2.2. Pre-Trained Methods

Pre-trained models transfer knowledge during the pre-training phase and perform classification tasks during the fine-tuning phase [26]. Based on the central content of this study, pre-trained models are subdivided into three categories: transformer-based models, GNN models, and contrastive learning models. Transformer-based models offer significant advantages over traditional linear mapping methods, primarily in the following aspects [12]. They apply neural network architectures to capture the dynamic semantics of words and sentences, which enhances contextual awareness. Furthermore, they employ a shared multilingual vocabulary and joint pre-training to facilitate implicit semantic space alignment. Xia et al. [27] proposed a meta-learning-based framework to push the source and target language representations closer together. Vo et al. [28] integrated topic modeling with a graph attention network to enhance the semantic information of textual representations in multilingual contexts. Maurya et al. [29] categorized languages into three groups based on central languages and applied meta-learning to optimize parameter initialization. Zheng et al. [26] incorporated a code-switching technique and a syntactic-based mBERT model to capture syntactic and lexical information for cross-lingual knowledge transfer. Miah et al. [30] proposed a framework that preprocesses target language data by translating it into the base language, with classification results determined by a majority vote from three distinct models. Typically, transformer-based methods outperform those relying on linear transformations. However, these approaches often overlook the structural information inherent in sentences, which is crucial for accurate text classification [17].
Compared to transformer-based approaches, GNNs offer distinct advantages for cross-lingual text classification by capturing the complex relational structures inherent in text [18]. GNNs excel at modeling the intricate syntactic and semantic dependencies critical to understanding multilingual data. Moreover, they naturally facilitate multi-scale representation learning by aggregating information from local neighborhoods, capturing fine-grained, word-level co-occurrences and syntactic patterns, as well as those from broader global contexts that reflect document-level semantic coherence [31]. In the field of text mining, numerous studies employing GNNs, such as TensorGCN [32] and TG Transformer [33], have consistently surpassed transformer-based methods in performance. Nonetheless, these GNN-based techniques have yet to fully capture the global semantic information embedded in text documents, nor have they provided sufficiently structured representations. Recent studies have focused on modeling long-term heterogeneous dependencies among documents to enhance text representation quality [12]. Extensive experiments in these studies have demonstrated that heterogeneous text graph modeling and representation learning, realized through various GNN architectures, can yield more robust and informative representations [18]. However, a notable limitation of GNNs is that they assign equal weight to all neighboring nodes, which does not reflect real-world scenarios. To solve this issue, a Graph Attention Network (GAT) was introduced to adjust and assign different weights to neighbors using a self-attention mechanism [34]. GATs offer inherent flexibility in capturing complex, node-dependent relationships within graph-structured data, effectively addressing the key limitations of traditional GNNs, such as the over-smoothing phenomenon [35]. With this novel attention mechanism, GATs consistently outperform conventional methods across a wide range of graph-based tasks. Building on these strengths, Wang et al. [18] introduced a heterogeneous GAT to encode both source and target texts.
The fundamental principle of contrastive learning lies in constructing sample pairs to learn more discriminative feature representations. This is achieved by maximizing the similarity between “positive” samples and minimizing the similarity between “negative” samples [36]. Specifically, anchor samples serve as the reference samples for similarity computation in contrastive learning. Positive samples are semantically similar to or belong to the same category as the anchor, and the model is trained to bring their representations as close as possible to that of the anchor. In contrast, negative samples are semantically dissimilar or belong to different categories. The model learns to push their representations farther away from the anchor. Contrastive learning has demonstrated promising capabilities in cross-lingual NLP tasks for enhancing discriminative feature learning, effectively separating differently labeled instances in the latent representation space [37]. Recent studies on text representation learning have demonstrated the remarkable flexibility and scalability of contrastive learning [36]. By integrating contrastive strategies with various neural architectures, including autoencoders, transformers, and GNNs, researchers have achieved robust performance across a wide range of domains and tasks [37]. Wang et al. [38] proposed a self-supervised heterogeneous graph neural network with co-contrastive learning, which leverages meta-path and network schema views to learn robust node embeddings. In contrast, this study diverges significantly from their approach. First, our graph construction method is fundamentally different; we design our heterogeneous graph to specifically capture the syntactic and grammatical structures of text, thereby providing a more robust representation of its intrinsic semantic and syntactic properties. Second, our contrastive learning framework employs a novel dual-view strategy comprising Translation-Level Contrastive Learning (TL-CL) and Label-Level Contrastive Learning (LL-CL). Compared with the method in [38], our dual-view strategy exhibits superior effectiveness in cross-lingual alignment, thereby offering a more precise solution for CLTC.

3. The Proposed Method

This section introduces the framework of the proposed model, Heterogeneous Graph Neural Network with Multi-View Contrastive Learning for Cross-Lingual Text Classification (XCLHG), including details on the construction of the heterogeneous graph and the design of a multi-view contrastive learning strategy for the CLTC task.

3.1. Problem Definition

Let $x_{src} \in D_{src}$ and $x_{tgt} \in D_{tgt}$ represent samples from the source and target languages, respectively, with $y_{src}$ denoting the set of labels associated with the source language dataset $D_{src}$. The goal of the CLTC task is to build a model that transfers knowledge from the source language to the target language while utilizing only the labeled data from the source language for training:
$f(X; \theta) \rightarrow y$
where $\theta$ represents the model parameters, which are trained on the source language data and applied directly to unlabeled data from the target language, $X$ refers to a text sample from either the source or target language, $y$ is the predicted label, and $f(\cdot)$ represents the model. Additionally, this study assumes that the source and target languages share the same class types.

3.2. The Proposed Framework Structure

This study proposes an XCLHG framework, which consists of four key components: node embedding, heterogeneous graph construction, information propagation, and multi-view contrastive learning, as illustrated in Figure 1. Specifically, the input texts are preprocessed using a machine translation system and language-specific taggers. Subsequently, node embeddings are generated by encoding source and target language words and documents using a multilingual pre-trained language model. To capture both syntactic and semantic information from the source and target languages, this study constructs a heterogeneous text graph, where documents and words are the nodes, connected by multiple types of edges. A GAT is employed to aggregate information from neighboring nodes. After the GAT encoders integrate information from each type of edge to produce comprehensive node embeddings, a multi-view contrastive learning module, which consists of translation-level and label-level views, is applied to align these embeddings. In this step, distances between corresponding feature representations from the two views (positive samples) are minimized, while distances between non-corresponding feature representations (negative samples) are maximized. This collaborative integration mechanism ensures that the complementary features from both views are mutually refined, resulting in robust and discriminative representations for CLTC. The detailed algorithm of XCLHG is presented in Algorithm 1. Based on this framework, three loss functions are formulated as follows:
- $\mathcal{L}_{CE}$ encourages the GAT to output predictions similar to the ground truth, which encodes structured semantic and syntactic information of the source and target language document nodes.
- $\mathcal{L}_{TLCL}$ encourages the model to minimize the distance between the original documents and their translated counterparts via contrastive learning, thereby reducing errors introduced during translation.
- $\mathcal{L}_{LLCL}$ reduces the distance between samples with the same label, while increasing the distance between those with different labels.
In the following sections, the construction of a heterogeneous graph is first presented (Section 3.3), followed by an introduction to the application of a GAT for cross-lingual text classification (Section 3.4). Additionally, a detailed description of the employed multi-view contrastive learning strategies, including the formulation of the loss functions, is provided (Section 3.5).

3.3. Heterogeneous Graph Construction

To leverage both semantic and syntactic structured information, this study constructs a heterogeneous graph. Following previous work [18,28], documents and words from both source and target languages are represented as nodes, which are connected by different edge types. Specifically, syntactic relationships are captured by connecting nodes via POS and dependency edges, and semantic relationships are captured by connecting nodes via translation and similarity edges. Details of the heterogeneous graph construction are shown in the left part of Figure 2. The edge types are as follows:
  • POS edges: POS edges connect documents with words based on their co-occurrence relationships. To capture syntactic structure and contextual relationships between documents and words, words are linked with co-occurrence documents via their POS edges, categorized as Noun, Verb, and ADJ edges.
  • Dependency edges: To capture syntactic structural information of source and target documents, a parsing toolkit (detailed in the experimental setup) is used to identify dependency relations between words, subsequently connecting the words using dependency edges.
  • Translation edges: To establish stronger connections between source and target languages, each document is translated and linked to its corresponding translation.
  • Similarity edges: To facilitate knowledge transfer from the source to the target language, cosine similarity is employed to identify the top-K most similar documents for a specific document in the corpus, and these documents are connected using similarity edges; a sketch of this construction is given below.
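As an illustration of the similarity edges, the following sketch connects each document to its top-K most similar documents by cosine similarity over their encoder embeddings. It is a minimal sketch written against PyTorch; the function and variable names (e.g., build_similarity_edges, doc_embs) are illustrative assumptions rather than part of the released implementation.
```python
# Illustrative sketch of similarity-edge construction; names are assumptions,
# not taken from the authors' implementation.
import torch
import torch.nn.functional as F

def build_similarity_edges(doc_embs: torch.Tensor, top_k: int = 3) -> torch.Tensor:
    """Connect every document to its top-K most similar documents (cosine similarity)."""
    normed = F.normalize(doc_embs, dim=-1)             # (num_docs, dim)
    sim = normed @ normed.t()                          # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                  # exclude self-similarity
    topk = sim.topk(top_k, dim=-1).indices             # (num_docs, top_k)
    src = torch.arange(doc_embs.size(0)).repeat_interleave(top_k)
    dst = topk.reshape(-1)
    return torch.stack([src, dst])                     # edge index of similarity edges
```
Setting top_k = 3 would mirror the value of K reported in the experimental setup (Section 4.2); POS, dependency, and translation edges are built analogously from tagger, parser, and machine-translation outputs.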
Algorithm 1 XCLHG Training Algorithm
Input: anchor node $H_S$, translated-language node $H_T$, non-translated node $H_N$, positive node $H_{pos}$, negative node $H_{neg}$, temperature coefficient $\gamma$, batch size $B$, number of epochs $e$, the set of edge types $\tau$, adjacency matrix $A$, weight matrix $W$, edge-type attention $\beta$, the number of categories $C$, the source language labels $y$, and hyper-parameters $\lambda$ and $\mu$
Output: The model $M$
1:  Initialize the parameters of $M$
2:  for $epoch$ in range$(1, e+1)$ do
3:      for $i$ in range$(1, B+1)$ do
4:          # Compute the node feature representations of the heterogeneous GAT
5:          $h_S = \sigma\big(\sum_{\tau_i \in \tau} \beta_{\tau_i} A_{\tau_i} H_S^{\tau_i} W_{\tau_i}\big)$
6:          $h_T = \sigma\big(\sum_{\tau_i \in \tau} \beta_{\tau_i} A_{\tau_i} H_T^{\tau_i} W_{\tau_i}\big)$
7:          $h_N = \sigma\big(\sum_{\tau_i \in \tau} \beta_{\tau_i} A_{\tau_i} H_N^{\tau_i} W_{\tau_i}\big)$
8:          $h_{pos} = \sigma\big(\sum_{\tau_i \in \tau} \beta_{\tau_i} A_{\tau_i} H_{pos}^{\tau_i} W_{\tau_i}\big)$
9:          $h_{neg} = \sigma\big(\sum_{\tau_i \in \tau} \beta_{\tau_i} A_{\tau_i} H_{neg}^{\tau_i} W_{\tau_i}\big)$
10:         # Define the heterogeneous GAT loss
11:         $\mathcal{L}_{ce} = -\sum_{i} \sum_{v=1}^{C} y \cdot \log \mathrm{softmax}(h)$
12:         # Compute the similarity between the source and translated samples
13:         $\mathrm{Sim}(h_S, h_T) = \frac{h_S \cdot h_T}{\lVert h_S \rVert \, \lVert h_T \rVert}$
14:         # Define the translation-level contrastive learning loss
15:         $\mathcal{L}_{tlcl} = -\log \frac{\exp(\mathrm{Sim}(h_S, h_T)/\gamma)}{\sum \exp(\mathrm{Sim}(h_S, h_N)/\gamma)}$
16:         # Compute the similarity between the anchor and positive samples
17:         $\mathrm{Sim}(h_S, h_{pos}) = \frac{h_S \cdot h_{pos}}{\lVert h_S \rVert \, \lVert h_{pos} \rVert}$
18:         # Define the label-level contrastive learning loss
19:         $\mathcal{L}_{llcl} = \sum_{i \in B} -\log \frac{1}{|C_i|} \sum_{y_i = y_c,\, c \neq i} \frac{\exp(\mathrm{sim}(h_S, h_{pos})/\gamma)}{\sum_{h_{neg} \in B} \exp(\mathrm{sim}(h_S, h_{neg})/\gamma)}$
20:         # Compute the total loss
21:         $\mathcal{L}_{XCLHG} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{tlcl} + \mu \mathcal{L}_{llcl}$
22:     end for
23:     # Back-propagation and optimization
24:     Update model $M$ using gradient descent to minimize $\mathcal{L}_{XCLHG}$
25: end for
26: return $M$

3.4. Heterogeneous Graph Attention Network

Based on the constructed heterogeneous graph, as shown on the right in Figure 2, this study employs a multi-layer GAT to aggregate information from nodes connected by each type of edge. Specifically, the GAT framework consists of three components: node encoding, graph structure representation, and a classifier.
First, the node embeddings are generated using a multilingual pre-trained model. These embeddings are then fused using a dual-level graph attention mechanism, which first weighs the importance of different relation types at the edge-type level, and subsequently adjusts the contribution of each neighboring node at the node level. The integration of different edge types enables the model to comprehensively leverage complementary information. This integration process effectively captures heterogeneous graph information, resulting in robust and discriminative node embeddings. Specifically, the Graph Attention Network (GAT) individually encodes each type of edge (e.g., POS, dependency, translation, similarity), producing embeddings that encapsulate the diverse semantic and syntactic perspectives of the heterogeneous graph.

Node Encoding

Let $x = (w_1, w_2, \ldots, w_n)$ denote an input document with $n$ words, and let $D = \{x_1, x_2, \ldots, x_m\}$ represent a dataset of source and target languages, where $m$ is the number of documents. The encoder is a multilingual pre-trained language model, such as multilingual BERT [14,39], XLM [9], XLM-RoBERTa [40], or mT5 [41], which can obtain rich contextual information. In this study, XLM-RoBERTa is specifically chosen to encode words and documents.
$h_i = f_{XLM-R}(x_i)$
$z_i = f_{XLM-R}(w_i)$
where $z_i$ and $h_i$ denote the word and document node features, respectively, and $f_{XLM-R}$ is the model used to encode each node. Notably, the node features remain fixed during model training.
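A minimal sketch of this encoding step is given below, assuming the Hugging Face transformers implementation of XLM-RoBERTa; the checkpoint name xlm-roberta-base and the use of mean pooling over token states are illustrative assumptions rather than details confirmed by the paper.
```python
# Hedged sketch of node encoding with XLM-RoBERTa; the checkpoint and the
# mean-pooling strategy are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
encoder.eval()  # node features remain fixed during XCLHG training

@torch.no_grad()
def encode_nodes(texts):
    """Encode documents (or single words) into fixed node features h_i / z_i."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state          # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pooled embeddings

doc_feats = encode_nodes(["This book was wonderful.", "Ce livre était merveilleux."])
```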

Graph Representation Structure

After encoding all nodes with the multilingual pretrained language model, the resulting features serve as input to the GAT, enabling the model to capture correlations between nodes more effectively.
Inspired by [42], this study observes that in heterogeneous graphs, node importance varies not only across different edge types but also among nodes connected by the same edge type. Therefore, this study employs a GAT with a dual-level attention mechanism, comprising type-level attention and node-level attention, to aggregate neighborhood information and generate node representations.
The formulation of the type-level attention is defined as follows:
$\alpha_{\tau} = \frac{\exp(\mathrm{LeakyReLU}(a^{T}[W\tau_i \,\|\, W\tau_j]))}{\sum_{\tau_k \in \tau}\exp(\mathrm{LeakyReLU}(a^{T}[W\tau_i \,\|\, W\tau_k]))}$
where $W$ and $a$ are the weight matrix and weight vector, respectively, $\mathrm{LeakyReLU}$ is the activation function, $\|$ denotes the concatenation operator, and $\tau$ represents the set of all neighboring nodes of different types. $\alpha_{\tau}$ is the type-level attention coefficient between node $i$ and node $j$.
Node importance varies across both different node types and among nodes of the same type. Here, the node-level attention can be defined as:
$\beta_{ij} = \frac{\exp(\mathrm{LeakyReLU}(b^{T}\alpha_{\tau}[Wz_i \,\|\, Wz_j]))}{\sum_{z_k \in \mathcal{N}_i}\exp(\mathrm{LeakyReLU}(b^{T}[Wz_i \,\|\, Wz_k]))}$
where $\mathcal{N}_i$ is the set of neighboring nodes $z_k$, $\beta_{ij}$ is the node-level attention coefficient between neighboring node $i$ and node $j$, and $b$ represents the weight vector.
Then, the representation of the nodes $h = \{h_i\}_{i=1}^{n}$ can be expressed as follows:
$h^{l+1} = \sigma\Big(\sum_{\tau_i \in \tau} \beta_{\tau_i} A_{\tau_i} h^{l}_{\tau_i} W^{l}_{\tau_i}\Big)$
where $A_{\tau_i}$ is a submatrix of the aforementioned adjacency matrix of the graph structure, $\tau$ denotes the set of edge types, $\sigma$ represents the activation function, and $l$ indicates the layer in the GAT architecture. When $l = 0$, $h^{0} = f_{XLM-R}(x_i)$, which is computed by the multilingual model. Thus, the above computations yield global graph representations, capturing both semantic and structural information.
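For concreteness, a simplified sketch of one propagation layer is shown below. It implements the aggregation $h^{l+1} = \sigma(\sum_{\tau_i \in \tau} \beta_{\tau_i} A_{\tau_i} h^{l} W_{\tau_i})$ with dense per-type adjacency matrices and a learned softmax over edge-type scores; the node-level attention is omitted for brevity, and all class and variable names are illustrative assumptions rather than the authors' implementation.
```python
# Simplified sketch of one heterogeneous propagation layer
#   h^{l+1} = sigma( sum_tau  beta_tau * A_tau * h^l * W_tau ).
# Node-level attention is omitted; names are illustrative assumptions.
import torch
import torch.nn as nn

class HeteroTypeAttentionLayer(nn.Module):
    def __init__(self, edge_types, in_dim, out_dim):
        super().__init__()
        self.edge_types = list(edge_types)            # e.g. ["pos", "dep", "trans", "sim"]
        self.W = nn.ModuleDict({t: nn.Linear(in_dim, out_dim, bias=False)
                                for t in self.edge_types})
        self.type_scores = nn.Parameter(torch.zeros(len(self.edge_types)))

    def forward(self, h, adj):
        """h: (N, in_dim) node features; adj: dict mapping edge type -> (N, N) adjacency."""
        beta = torch.softmax(self.type_scores, dim=0)          # type-level attention weights
        out = 0
        for i, t in enumerate(self.edge_types):
            out = out + beta[i] * (adj[t] @ self.W[t](h))      # aggregate neighbours per edge type
        return torch.relu(out)                                 # sigma = ReLU (assumed)
```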

Classifier

After obtaining the representation of node features, h, a softmax function is applied to compute the final classification.
$Z = \mathrm{softmax}(h^{l})$
Setting l = 2 indicates that the GAT consists of two layers. Information from second-order neighbors is aggregated using the two-layer GAT, followed by a linear transformation of the document node representations to produce predictions.
Then, cross-entropy loss is employed during model training.
$\mathcal{L}_{ce} = -\sum_{u \in D_{train}} \sum_{v=1}^{C} y_{uv} \cdot \log Z_{uv}$
where $D_{train}$ is the training dataset, $C$ denotes the number of categories, $y_{uv}$ represents the label matrix, and $Z_{uv}$ represents the prediction probability. Specifically, gradient descent is employed to optimize the model.
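Continuing the sketch above (and reusing the illustrative HeteroTypeAttentionLayer), a two-layer stack with a linear classifier and cross-entropy over the labeled source-language document nodes could look roughly as follows; the masking convention and all names are assumptions for illustration.
```python
# Illustrative two-layer stack with a classifier over document nodes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class XCLHGEncoder(nn.Module):
    def __init__(self, edge_types, in_dim, hidden_dim, num_classes):
        super().__init__()
        self.gat1 = HeteroTypeAttentionLayer(edge_types, in_dim, hidden_dim)
        self.gat2 = HeteroTypeAttentionLayer(edge_types, hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, h0, adj):
        h = self.gat2(self.gat1(h0, adj), adj)   # aggregate second-order neighbours
        return h, self.classifier(h)             # node embeddings and class logits

# Cross-entropy is computed only on labelled source-language document nodes;
# F.cross_entropy applies log-softmax internally, so raw logits are passed in.
def ce_loss(logits, labels, train_mask):
    return F.cross_entropy(logits[train_mask], labels[train_mask])
```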

3.5. Contrastive Learning

This study devises a multi-view contrastive learning strategy that leverages two complementary views to enhance cross-lingual alignment. As shown in Figure 3, multi-view contrastive learning, by simultaneously considering feature representations from multiple perspectives, effectively captures the rich semantic information in the data, thereby enhancing model performance and generalization capabilities. In the translation-level view, for each anchor document in the source language, its corresponding translated document is treated as the positive sample. This view aims to minimize discrepancies introduced by machine translation, ensuring that the semantic representations of the original and translated texts are closely aligned. In contrast, in the label-level view, documents sharing the same class label, regardless of their languages, are considered positive pairs, while those with differing labels serve as negative pairs. This view reinforces the discriminative features among classes by pulling semantically similar documents closer in the latent space and pushing dissimilar ones apart.
By jointly optimizing these two views through their respective loss functions, the model benefits from complementary supervision signals derived from both views. The translation-level view primarily addresses cross-lingual semantic alignment, while the label-level view focuses on enhancing intra-class compactness and inter-class separability. The integration of these views results in more robust and discriminative node embeddings, thereby improving overall cross-lingual classification performance.

3.5.1. Translation-Level Contrastive Learning

In CLTC tasks, the Translation-Level Contrastive Learning (TL-CL) strategy effectively reduces translation errors by minimizing the distance between anchor samples and their translated counterparts, thereby enhancing model performance and generalization. Figure 3b shows how TL-CL applies contrastive learning within the proposed XCLHG framework. Specifically, a source text $x_{src}$ serves as the anchor, and its translated counterpart, $x_{trans}$, is treated as the positive sample. Subsequently, negative samples are randomly selected from the dataset, which comprises source, translated, and target documents. Using the model described above, feature representations for each example are obtained. Thus, the loss function of TL-CL can be defined as:
$\mathcal{L}_{TLCL} = -\sum_{x_i, x_t \in D} \log \frac{\exp(\mathrm{sim}^{+}(h(x_i), h(x_t))/\gamma)}{\sum_{x_e \in D,\, x_e \neq x_i}\exp(\mathrm{sim}^{-}(h(x_i), h(x_e))/\gamma)}$
where $\mathrm{sim}$ represents the similarity between examples, $h(\cdot)$ represents the heterogeneous graph neural network, and $+$ and $-$ denote positive and negative examples, respectively. $x_t$ represents the translated sample of $x_i$, and $x_e$ denotes a document in dataset $D$. Here, this study simply employs the cosine similarity to compute the similarity. $\gamma$ is the temperature coefficient, which regulates the challenge of distinguishing between positive and negative examples. Intuitively, the contrastive learning loss function aims to maximize the similarity of positive examples while minimizing the similarity of negative examples.
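A compact sketch of this objective is given below, written as an InfoNCE-style loss in PyTorch. It assumes one translated positive per anchor and a fixed set of randomly drawn negatives per anchor, and it includes the positive term in the denominator (a common variant of the formula above); tensor names are illustrative.
```python
# Hedged sketch of translation-level contrastive learning (TL-CL).
import torch
import torch.nn.functional as F

def tlcl_loss(h_src, h_trans, h_neg, gamma: float = 0.1):
    """h_src: (B, d) anchors; h_trans: (B, d) translated positives;
    h_neg: (B, M, d) randomly sampled negative documents."""
    h_s = F.normalize(h_src, dim=-1)
    h_t = F.normalize(h_trans, dim=-1)
    h_n = F.normalize(h_neg, dim=-1)
    pos = (h_s * h_t).sum(dim=-1) / gamma                  # (B,) cosine / temperature
    neg = torch.einsum("bd,bmd->bm", h_s, h_n) / gamma     # (B, M)
    logits = torch.cat([pos.unsqueeze(1), neg], dim=1)     # positive sits in column 0
    targets = torch.zeros(h_src.size(0), dtype=torch.long)
    return F.cross_entropy(logits, targets)                # -log( exp(pos) / sum(exp) )
```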

3.5.2. Label-Level Contrastive Learning

The Label-Level Contrastive Learning (LL-CL) strategy narrows the distance between samples with the same labels, as shown in Figure 3c. Specifically, let $B = \{x_1, x_2, \ldots, x_b\}$ denote a collection of samples, where $b$ is the batch size, and let $\{x_i, y_i\}$ represent a sample–label pair in $B$. The LL-CL loss on the batch $B$ is defined as:
$P_B(i, c) = \frac{\exp(\mathrm{sim}^{+}(h(x_i), h(x_c))/\gamma)}{\sum_{x_k \in B,\, x_k \neq x_i}\exp(\mathrm{sim}^{-}(h(x_i), h(x_k))/\gamma)}$
$\mathcal{L}_{LLCL} = \sum_{i \in B} -\log \frac{1}{|C_i|} \sum_{y_i = y_c,\, c \neq i} P_B(i, c)$
where $P_B(i, c)$ represents the likelihood that $x_c$ is the most similar sample to $x_i$, $h(\cdot)$ represents the heterogeneous graph neural network, $x_c$ is a document in $B$ that belongs to the same category as $x_i$, $x_k$ is a sample distinct from $x_i$ in the same batch, thereby serving as a negative example for comparative analysis during the contrastive learning process, and $C_i$ represents the set of samples in $B$ that share the same label as $x_i$. The LL-CL loss function is designed to maximize the similarity between examples within the same category while minimizing the similarity between examples from different categories.
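A hedged sketch of this label-level objective over a batch is shown below, in the spirit of supervised contrastive learning. For numerical convenience it averages the log-probabilities of same-label pairs (a common variant) rather than taking the log of their mean; the masking details and names are illustrative assumptions.
```python
# Hedged sketch of label-level contrastive learning (LL-CL) over a batch.
import torch
import torch.nn.functional as F

def llcl_loss(h, labels, gamma: float = 0.1):
    """h: (B, d) node embeddings for a batch; labels: (B,) class labels."""
    h = F.normalize(h, dim=-1)
    sim = (h @ h.t()) / gamma                                    # pairwise cosine / temperature
    self_mask = torch.eye(len(labels), dtype=torch.bool, device=h.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))              # exclude the anchor itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)   # log P_B(i, .)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    valid = pos_mask.sum(dim=1) > 0                              # anchors with at least one positive
    return -(pos_log_prob[valid] / pos_mask.sum(dim=1)[valid]).mean()
```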
The three loss functions mentioned above are combined and jointly trained in XCLHG. Specifically, the overall training loss, $\mathcal{L}_{XCLHG}$, is calculated as a combination of the heterogeneous graph neural network loss, $\mathcal{L}_{CE}$, and the multi-view contrastive learning losses, $\mathcal{L}_{TLCL}$ and $\mathcal{L}_{LLCL}$, on batch $B$; the coefficients $\lambda$ and $\mu$ are introduced to balance the optimization objectives:
$\mathcal{L}_{XCLHG} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{TLCL} + \mu \mathcal{L}_{LLCL}$
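Putting the three objectives together, one training step could be sketched as follows; the sketch reuses the illustrative loss helpers defined above, and the coefficients and AdamW learning rate simply mirror the settings reported in Section 4.2.
```python
# Hedged sketch of the joint objective and a single optimization step;
# `model`, `logits`, and the loss helpers come from the sketches above.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
lam, mu = 1.0, 1.0   # lambda and mu from the experimental setup

loss = (ce_loss(logits, labels, train_mask)
        + lam * tlcl_loss(h_src, h_trans, h_neg)
        + mu * llcl_loss(h_batch, batch_labels))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```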

4. Experiments and Discussion

This section demonstrates the significant performance improvements of the XCLHG framework on various multilingual datasets for the CLTC task. Subsequently, the datasets and experimental setup are detailed, followed by a comprehensive analysis of the results.

4.1. Dataset

This study conducts experiments on three datasets, including the Amazon Review dataset [43], the multilingual SLU dataset [44], and the XGLUE news classification dataset [45]. These datasets were selected based on criteria including the diversity of language pairs, the variety of application domains, and their extensive usage in previous studies, which collectively enable direct comparisons with state-of-the-art methods. Table 1 summarizes the statistical characteristics of these datasets. The CLTC task is then performed on each dataset.
Amazon Review Dataset (Amazon Reviews: https://webis.de/data/webis-cls-10.html accessed on 12 May 2023). This multilingual sentiment classification dataset comprises customer reviews in four languages (English, German, French, and Japanese) collected across three domains, Books, DVDs, and Music, thereby reflecting authentic real-world customer feedback scenarios. For the sentiment classification task, the original 5-point rating scale has been transformed into binary classification labels (positive/negative) using a threshold of 3.
Multilingual SLU Dataset (Multilingual SLU: https://nlpforthai.com/tasks/slu/ accessed on 12 May 2023). This intent classification dataset comprises twelve distinct intent types across three languages (English, Spanish, and Thai). It provides multilingual data that highlight the inherent challenges in accurately interpreting user intent within diverse linguistic contexts.
XGLUE News Classification Dataset (XGLUE News: https://microsoft.github.io/XGLUE/ accessed on 12 May 2023). This multilingual news text dataset comprises articles in five languages (English, French, Russian, German, and Spanish), with each article assigned to one of ten distinct content categories. It serves as a challenging benchmark for cross-lingual news classification by encompassing a wide array of languages and diverse topical domains.
It is worth noting that this study uses only the source language English as training data and evaluates the model on target languages, including German, Spanish, Russian, Thai, Japanese, and French.

4.2. Experiment Setting

XLM-R is fine-tuned with a learning rate of $4 \times 10^{-5}$ on each dataset and is then used to encode all nodes. A two-layer GAT with a hidden size of 512 and a batch size of 32 is employed in the experiments. For optimization, the AdamW optimizer [46] with a learning rate of $3 \times 10^{-5}$ is employed. The number of the most similar examples (K) in similarity edges is set to 3. The temperature coefficient of contrastive learning, $\gamma$, is set to 0.1, and $\lambda = 1$, $\mu = 1$ in Equation (10). The maximum sequence lengths for intent classification and other classification tasks are set to 128 and 512, respectively, consistent with [18]. This study implements the XCLHG model with the Python (3.9.2) programming language under the support of the well-known PyTorch (1.10.0) machine-learning framework. All experimental evaluations are executed on a GPU (Geforce RTX 3090, 24 GB).
For part-of-speech tagging, the Stanford POS Tagger [47] is used for English text; SpaCy (SpaCy: https://spacy.io accessed on 12 May 2023) for German, French, and Spanish; MeCab (MeCab: https://taku910.github.io/mecab/ accessed on 12 May 2023) for Japanese; TLTK (TLTK: https://pypi.org/project/tltk/ accessed on 12 May 2023) for Thai; and NLTK (NLTK: https://www.nltk.org/ accessed on 12 May 2023) for Russian. For identifying dependency relations, the Stanford Dependency Parser is used for English text; SpaCy for German, French, and Spanish; CaboCha (CaboCha: http://chasen.org/taku/software/cabocha/ accessed on 12 May 2023) for Japanese; and Stanza (Stanza: https://stanfordnlp.github.io/stanza/ accessed on 12 May 2023) for Thai and Russian. In data preprocessing, the Tencent Translation API (Tencent translation API: https://cloud.tencent.com/document/api/551/73920 accessed on 12 May 2023) is employed to translate both source and target language documents. The proposed XCLHG framework is compared with various baselines, including multilingual BERT, XLM, XLM-R, and mT5. Additionally, it is evaluated against state-of-the-art models such as CLHG [18], TG-CTC [28], Meta-XNLG [29], the Ensemble model [30], and LS-mBERT [26].
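For reference, extracting the POS tags and dependency arcs that drive the POS and dependency edges could look roughly like the sketch below, using two of the listed toolkits (spaCy for German, Stanza for Russian). The specific pipeline and model names are illustrative assumptions and may differ from the exact configurations used in the experiments.
```python
# Rough sketch of POS/dependency extraction; pipeline and model names are
# illustrative assumptions (e.g., "de_core_news_sm").
import spacy
import stanza

# German: spaCy provides both POS tags and dependency arcs.
nlp_de = spacy.load("de_core_news_sm")
doc = nlp_de("Das Buch war wirklich gut.")
pos_edges = [(tok.text, tok.pos_) for tok in doc]                  # word -> POS tag
dep_edges = [(tok.head.text, tok.text, tok.dep_) for tok in doc]   # head -> dependent

# Russian: Stanza pipeline for tagging and dependency parsing.
# stanza.download("ru") may be required the first time.
nlp_ru = stanza.Pipeline(lang="ru", processors="tokenize,pos,lemma,depparse")
parsed = nlp_ru("Книга была действительно хорошей.")
for sent in parsed.sentences:
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(word.text, word.upos, head, word.deprel)
```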

4.3. Experiment Results and Analysis

The proposed model is compared with several baseline and multilingual models. Accuracy and F1-scores are adopted as the evaluation metrics to ensure a consistent and fair comparison. The experimental results of classification accuracy and F1-score on the Amazon Review dataset are shown in Table 2 and Table 3, and the results of classification accuracy and F1-score on the XGLUE and SLU datasets are shown in Table 4 and Table 5, respectively.
Table 2 presents the accuracy results for the Amazon Review dataset; the proposed model achieves excellent performance across the three language datasets, outperforming not only the baselines but also state-of-the-art models in all domains and languages. Notably, the highest accuracy of 93.22% is reached on the German book review dataset. Furthermore, all models exhibit higher accuracy on the French and German datasets compared to Japanese, suggesting that linguistic similarity contributes to improved model performance.
Table 4 presents the results on two datasets, including the tasks of news and intent classification. The model XCLHG consistently outperforms all other models listed in the table. Notably, the model achieves an accuracy of up to 97.67% on the SLU Spanish dataset. This excellent performance can be attributed to two key factors: (1) the linguistic similarity of the languages involved, as discussed earlier, and (2) the shorter text lengths in the SLU dataset, which facilitate the construction of a more stable GAT. Additionally, both CLHG [18] and TG-CTC [28] employ graph-based approaches and achieve significantly higher performance compared to models evaluated on other datasets. The most significant improvement, compared to the suboptimal model, is observed on the XGLUE dataset, where a 2.20% increase in performance is achieved for Russian. As Russian is the most dissimilar language to English in the XGLUE dataset, this demonstrates that the proposed contrastive learning strategy effectively reduces the linguistic discrepancy between languages.
Table 3 and Table 5 present the F1-score results; the predicted F1-scores closely reflect the trends observed in the accuracy results. For the Amazon Review dataset, XCLHG achieves the highest F1-scores across all language subsets. Similarly, on the SLU and XGLUE datasets, XCLHG consistently outperforms competing models, achieving notably high F1-scores, particularly on the SLU Spanish subset, where it reaches approximately 93.52%. These results indicate that XCLHG not only delivers high accuracy but also maintains a balanced precision–recall performance, thereby reinforcing its effectiveness for CLTC.
Overall, the alignment between the accuracy and F1-scores further substantiates the robustness of the approach in capturing discriminative features across diverse multilingual contexts and clearly indicates that the XCLHG model significantly outperforms all competing multilingual models. The success of the model can be attributed to two key factors: (1) The design of a heterogeneous graph enables the effective integration of structural features, enhancing the model’s ability to capture complex relationships within the data. (2) The incorporation of multi-view contrastive learning facilitates the alignment of semantic representations by bringing together documents with the same label and their translated pairs, thereby enhancing cross-lingual consistency.

4.4. Visualization

To systematically investigate the behavior of the framework, the Amazon Reviews (Book) dataset was visualized using t-SNE [48], and the resulting two-dimensional embeddings are presented in Figure 4. For the English->French task in Figure 4a, the high-dimensional feature representations produced by the multilingual pre-trained language model are subsequently mapped to a two-dimensional space via the t-SNE algorithm. The visualization clearly illustrates that the embedding representations of source and target language documents exhibit significant divergence in feature space, forming distinct clusters aligned with linguistic boundaries. This observation is consistent with the cross-lingual alignment hypothesis in representation learning. Additionally, while the multilingual pretrained language model demonstrates strong document embedding capabilities, the visualization reveals that this model still results in a considerable number of misclassifications.
To facilitate visualization, the feature points of the source and target languages in the model without the contrastive learning module (referred to as the Heterogeneous Graph (HG) model) are presented as two separate images in Figure 4b. The distributions of the source and target languages exhibit fundamental consistency, indicating effective knowledge transfer from the source language to the target language. In Figure 4c, the visualization of the model shows tightly clustered representations of positive and negative documents, respectively. A comparative analysis of subfigures (b) and (c) demonstrates that the XCLHG framework achieves enhanced cluster separation in target language document embeddings compared to the HG architecture. The t-SNE visualizations empirically validate the effectiveness of multi-view contrastive learning for cross-lingual knowledge transfer by improving latent space alignment.
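Projections of this kind can in principle be reproduced with scikit-learn's t-SNE implementation; the sketch below is a minimal example, and the perplexity, initialization, and color mapping are assumptions rather than the exact settings behind Figure 4.
```python
# Minimal sketch of a t-SNE projection for Figure 4-style plots.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(features, labels, title):
    """features: (N, d) NumPy array of document embeddings; labels: (N,) class or language ids."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="coolwarm")
    plt.title(title)
    plt.show()
```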

4.5. Ablation Study

To evaluate the contribution of each component in the model, this study designs seven variant experiments and tests them on the Amazon Review music dataset. The results are shown in Table 6 and clearly demonstrate that removing any single component leads to a performance drop, highlighting the importance of each part in the model.
Here, Variant 1, which removes both POS tags and dependency edges from the graph structures, exhibits significant performance degradation. This indicates that POS tags and dependency edges play a crucial role in providing syntactic information, which enhances the construction of the graph. Moreover, Variant 2, which retains POS information, demonstrates a measurable performance improvement over Variant 1, particularly when applied to language pairs with significant linguistic divergence. Variant 4 demonstrates the importance of the similarity edge in the model, since it facilitates more effective semantic representation learning. Variant 3, which removes both the translation and multi-view contrastive learning modules, achieves the lowest performance, while adding the translation module yields better results, as shown in Variant 5. However, Variant 5, which lacks multi-view contrastive learning, remains the second-worst-performing variant; this indicates that the contrastive learning module is the most crucial among all added components. Furthermore, the results of Variants 6 and 7, which remove translation-level and label-level contrastive learning, respectively, show a decline in performance. This further emphasizes the effectiveness of the contrastive learning modules from different views.
In conclusion, the results of the ablation study strongly demonstrate the effectiveness of the XCLHG framework. The components work synergistically to enhance the model’s performance in multilingual environments.

4.6. Case Study

In this section, a case study is presented to illustrate the capability of the model to transfer knowledge from the source to the target language. As shown in Table 7, the experimental results indicate that, by integrating a heterogeneous graph with contrastive learning, the XCLHG model effectively captures both syntactic and semantic structural information across texts in different languages.
Three distinct languages are selected for analysis. The models yield correct predictions for both German and Spanish texts; however, for longer Russian texts, the CLHG method produces prediction errors. This can be attributed to the tendency of longer texts to lose structured information, thereby weakening the dependency relationships. To address this challenge, the proposed model employs a heterogeneous graph structure to fuse syntactic and semantic information, while also incorporating a multi-view contrastive learning strategy. This strategy both rectifies translation errors and pulls together samples within the same category, thereby mitigating the loss of dependencies in long texts and enhancing the overall accuracy and robustness of the model in CLTC tasks.

5. Conclusions

This study investigated the potential of CLTC, a critical task for transferring knowledge from high-resource to low-resource languages. Cross-lingual classification is inherently challenging due to the complex syntactic and semantic variations that exist across different languages, which traditional approaches have often struggled to address effectively. To overcome these challenges, a novel framework, XCLHG, was developed, integrating a heterogeneous graph structure with multi-view contrastive learning. Specifically, the heterogeneous graph captures syntactic–semantic relationships across languages, while multi-view contrastive learning bridges the gap between representations of different languages by maximizing similarity scores for positive samples and minimizing similarity distances between negative samples. The experimental results revealed exceptional performance in CLTC tasks. Evaluated using both F1-score and accuracy, the XCLHG framework consistently outperformed competing approaches. Notably, on the XGLUE dataset, the proposed model achieved up to a 2.20% improvement in accuracy, providing compelling evidence that integrating heterogeneous graph construction with contrastive learning markedly enhances classification precision in CLTC.
The limitation of XCLHG is that the construction of the heterogeneous graph depends on various language-specific features, such as POS tagging, dependency parsing, similarity metrics, and translation edges. The performance of these linguistic tools may vary significantly across different languages, potentially leading to errors or inconsistencies in graph construction. Future work could address these pressing challenges by fine-tuning or developing domain- and language-specific models for tasks such as POS tagging and dependency parsing, which can reduce errors in the extracted linguistic features and lead to more accurate graph construction. In addition, future work will incorporate a comprehensive time complexity analysis to provide a more complete evaluation of the model’s performance and efficiency.

Author Contributions

Methodology, X.L. and K.Z.; software, X.L.; validation, K.Z.; data curation, K.Z.; writing—original draft preparation, X.L.; writing—review and editing, X.L.; visualization, X.L.; supervision, K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  2. Zhao, T.; Wang, S.; Ouyang, C.; Chen, M.; Liu, C.; Zhang, J.; Yu, L.; Wang, F.; Xie, Y.; Li, J.; et al. Artificial intelligence for geoscience: Progress, challenges and perspectives. Innovation 2024, 5, 100691. [Google Scholar] [CrossRef] [PubMed]
  3. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A review of convolutional neural networks in computer vision. Artif. Intell. Rev. 2024, 57, 99. [Google Scholar] [CrossRef]
  4. Yadav, A.; Vishwakarma, D.K. Sentiment analysis using deep learning architectures: A review. Artif. Intell. Rev. 2020, 53, 4335–4385. [Google Scholar] [CrossRef]
  5. Cho, K.; van Merriënboer, B.; Gulçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014. [Google Scholar]
  6. Wang, S.; Jiang, J. Learning Natural Language Inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1442–1451. [Google Scholar]
  7. Conneau, A.; Schwenk, H.; Le Cun, Y.; Barrault, L. Very deep convolutional networks for text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL, Valencia, Spain, 3–7 April 2017; pp. 1107–1116. [Google Scholar]
  8. Liu, Z.; Winata, G.I.; Fung, P. Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation. In Findings of the Association for Computational Linguistics, ACL-IJCNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2706–2718. [Google Scholar]
  9. Conneau, A.; Lample, G. Cross-lingual language model pretraining. Adv. Neural Inf. Process. Syst. 2019, 32, 7059–7069. [Google Scholar]
  10. Wan, X. Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2–7 August 2009; pp. 235–243. [Google Scholar]
  11. Xu, R.; Yang, Y. Cross-lingual Distillation for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1415–1425. [Google Scholar]
  12. Přibáň, P.; Šmíd, J.; Steinberger, J.; Mitera, A. A comparative study of cross-lingual sentiment analysis. Expert Syst. Appl. 2024, 247, 123247. [Google Scholar] [CrossRef]
  13. Eronen, J.; Ptaszynski, M.; Masui, F. Zero-shot cross-lingual transfer language selection using linguistic similarity. Inf. Process. Manag. 2023, 60, 103250. [Google Scholar] [CrossRef]
  14. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proc. NAACL-HLT 2019, 1, 4171–4186. [Google Scholar] [CrossRef]
  15. Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; Johnson, M. Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 13–18 July 2020; pp. 4411–4421. [Google Scholar]
  16. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The graph neural network model. IEEE Trans. Neural Netw. 2008, 20, 61–80. [Google Scholar] [CrossRef] [PubMed]
  17. Yao, L.; Mao, C.; Luo, Y. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7370–7377. [Google Scholar]
  18. Wang, Z.; Liu, X.; Yang, P.; Liu, S.; Wang, Z. Cross-lingual Text Classification with Heterogeneous Graph Neural Network. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Virtual, 1–6 August 2021; pp. 612–620.
  19. Fields, J.; Chovanec, K.; Madiraju, P. A survey of text classification with transformers: How wide? How large? How long? How accurate? How expensive? How safe? IEEE Access 2024, 12, 6518–6531.
  20. Jovanovic, A.; Bacanin, N.; Jovanovic, L.; Damasevicius, R.; Antonijevic, M.; Zivkovic, M.; Dobrojevic, M. Performance evaluation of metaheuristics-tuned recurrent networks with VMD decomposition for Amazon sales prediction. Int. J. Data Sci. Anal. 2024, 1–19.
  21. Alohali, M.A.; Alasmari, N.; Almalki, N.S.; Khalid, M.; Alnfiai, M.M.; Assiri, M.; Abdelbagi, S. Textual emotion analysis using improved metaheuristics with deep learning model for intelligent systems. Trans. Emerg. Telecommun. Technol. 2024, 35, e4846.
  22. Zhou, H.; Chen, L.; Shi, F.; Huang, D. Learning bilingual sentiment word embeddings for cross-language sentiment classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 26–31 July 2015; pp. 430–440.
  23. Wang, D.; Wu, J.; Yang, J.; Jing, B.; Zhang, W.; He, X.; Zhang, H. Cross-lingual knowledge transferring by structural correspondence and space transfer. IEEE Trans. Cybern. 2021, 52, 6555–6566.
  24. Kuriyozov, E.; Doval, Y.; Gómez-Rodríguez, C. Cross-Lingual Word Embeddings for Turkic Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4054–4062.
  25. Abdalla, M.; Hirst, G. Cross-Lingual Sentiment Analysis Without (Good) Translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, 7 July 2017; pp. 506–515.
  26. Zheng, J.; Fan, F.; Li, J. Incorporating Lexical and Syntactic Knowledge for Unsupervised Cross-Lingual Transfer. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), Turin, Italy, 20–25 May 2024; pp. 8986–8997.
  27. Xia, M.; Zheng, G.; Mukherjee, S.; Shokouhi, M.; Neubig, G.; Hassan, A. MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 499–511.
  28. Vo, T. An integrated topic modelling and graph neural network for improving cross-lingual text classification. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 22, 1–18.
  29. Maurya, K.; Desarkar, M. Meta-XNLG: A Meta-Learning Approach Based on Language Clustering for Zero-Shot Cross-Lingual Transfer and Generation. In Findings of the Association for Computational Linguistics; ACL: Stroudsburg, PA, USA, 2022; pp. 269–284.
  30. Miah, M.S.U.; Kabir, M.M.; Sarwar, T.B.; Safran, M.; Alfarhood, S.; Mridha, M.F. A multimodal approach to cross-lingual sentiment analysis with ensemble of transformer and LLM. Sci. Rep. 2024, 14, 9603.
  31. Wu, L.; Chen, Y.; Ji, H.; Li, Y. Deep Learning on Graphs for Natural Language Processing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials, Online, 6–11 June 2021; pp. 11–14.
  32. Liu, X.; You, X.; Zhang, X.; Wu, J.; Lv, P. Tensor graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8409–8416.
  33. Zhang, H.; Zhang, J. Text graph transformer for document classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020.
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  35. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  36. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2.
  37. Hu, H.; Wang, X.; Zhang, Y.; Chen, Q.; Guan, Q. A comprehensive survey on contrastive learning. Neurocomputing 2024, 610, 128645.
  38. Wang, X.; Liu, N.; Han, H.; Shi, C. Self-supervised heterogeneous graph neural network with co-contrastive learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual, 14–18 August 2021; pp. 1726–1736.
  39. Pires, T.; Schlinger, E.; Garrette, D. How Multilingual is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4996–5001.
  40. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F. Unsupervised Cross-Lingual Representation Learning at Scale; ACL: Stroudsburg, PA, USA, 2020.
  41. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 483–498.
  42. Linmei, H.; Yang, T.; Shi, C.; Ji, H.; Li, X. Heterogeneous Graph Attention Networks for Semi-supervised Short Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019.
  43. Prettenhofer, P.; Stein, B. Cross-Language Text Classification Using Structural Correspondence Learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010.
  44. Schuster, S.; Gupta, S.; Shah, R.; Lewis, M. Cross-Lingual Transfer Learning for Multilingual Task Oriented Dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019.
  45. Liang, Y.; Duan, N.; Gong, Y.; Wu, N.; Guo, F.; Qi, W.; Gong, M.; Shou, L.; Jiang, D.; Cao, G.; et al. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020.
  46. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2019, arXiv:1711.05101.
  47. Manning, C.D.; Surdeanu, M.; Bauer, J.; Finkel, J.R.; Bethard, S.; McClosky, D. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014.
  48. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Figure 1. Architecture of the proposed model, XCLHG. For clarity, the diagram omits the visualization of data preprocessing steps.
Figure 2. Architecture of the proposed heterogeneous graph. For simplicity, only noun, verb, and adjective (ADJ) edges are shown among the POS edges.
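To make the edge typing in Figure 2 concrete, the sketch below stores a small document–word heterogeneous graph as typed adjacency lists, one list per relation (POS, dependency, similarity, translation). It is an illustrative reconstruction only: the node-naming scheme, the helper names (e.g., HeteroTextGraph, add_edge), and the example edges are assumptions, not the paper's implementation.

```python
from collections import defaultdict

# Illustrative sketch of a heterogeneous text graph with typed edges.
# Node ids are strings such as "doc:en:0" or "word:en:book"; the four edge
# types mirror the relations named in Figure 2. All names are hypothetical.
EDGE_TYPES = ("pos", "dependency", "similarity", "translation")

class HeteroTextGraph:
    def __init__(self):
        # one undirected adjacency list per edge type
        self.adj = {etype: defaultdict(set) for etype in EDGE_TYPES}

    def add_edge(self, etype, u, v):
        """Add an undirected edge of type `etype` between nodes u and v."""
        self.adj[etype][u].add(v)
        self.adj[etype][v].add(u)

    def neighbors(self, etype, u):
        return self.adj[etype][u]

# Toy usage: one English document and a (hypothetical) German counterpart.
g = HeteroTextGraph()
g.add_edge("pos", "doc:en:0", "word:en:wonderful")              # ADJ occurring in the document
g.add_edge("dependency", "word:en:wonderful", "word:en:book")   # amod(book, wonderful)
g.add_edge("translation", "doc:en:0", "doc:de:0")               # parallel documents
g.add_edge("similarity", "doc:en:0", "doc:en:7")                # embedding similarity above a threshold
print(g.neighbors("translation", "doc:en:0"))
```

In practice such adjacency lists would be converted to the edge-index tensors consumed by a Graph Attention Network layer; the conversion step is omitted here.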
Figure 3. Illustration of multi-view contrastive learning. Distinct shapes represent different stance labels; blue and yellow indicate source and target languages, respectively. (a) Document feature representations. (b) Example of training for Translation-Level Contrastive Learning (TL-CL). (c) Example of training for Label-Level Contrastive Learning (LL-CL). (d) Result of contrastive learning and Classification Alignment (CA).
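As a reference point for the contrastive views in Figure 3, the snippet below implements a generic InfoNCE-style loss in PyTorch: each anchor is pulled toward its paired positive (e.g., its translation for TL-CL, or a same-label document for LL-CL) and pushed away from the other items in the batch. This is a minimal sketch of the general technique, not the paper's exact loss; the temperature value and the index-aligned pairing scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """Generic InfoNCE loss.

    anchors, positives: (N, d) embeddings where row i of `positives` is the
    positive example for row i of `anchors`; all other rows in the batch act
    as negatives.
    """
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature                 # (N, N) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)          # diagonal entries are the positives

# Toy usage with random document embeddings.
src = torch.randn(8, 128, requires_grad=True)   # source-language documents
tgt = torch.randn(8, 128)                       # their translations, aligned by index
loss = info_nce(src, tgt)
loss.backward()
```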
Figure 4. Visualizations of document features, with t-SNE applied for dimensionality reduction. (a) Feature representations of the original datasets. In (b,c), the left subfigure shows source-language features and the right subfigure shows target-language features.
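For reference, a projection of the kind shown in Figure 4 can be produced with scikit-learn's t-SNE [48] as sketched below; the feature file names, perplexity, and colouring scheme are placeholders rather than the settings used in this work.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Hypothetical inputs: document embeddings and their class labels exported
# from the trained model (file names are placeholders).
feats = np.load("doc_features.npy")     # shape (n_docs, d)
labels = np.load("doc_labels.npy")      # shape (n_docs,)

# Reduce the features to 2-D; perplexity and random_state are illustrative.
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(feats)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
plt.axis("off")
plt.savefig("tsne_documents.png", dpi=300)
```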
Table 1. The datasets used to evaluate the proposed model.

Dataset                | Categories | Language | Train  | Test
Amazon Review (Books)  | 2          | English  | 2000   | 2000
                       |            | German   | 2000   | 2000
                       |            | French   | 2000   | 2000
                       |            | Japanese | 2000   | 2000
Amazon Review (DVD)    | 2          | English  | 2000   | 2000
                       |            | German   | 2000   | 2000
                       |            | French   | 2000   | 2000
                       |            | Japanese | 2000   | 2000
Amazon Review (Music)  | 2          | English  | 2000   | 2000
                       |            | German   | 2000   | 2000
                       |            | French   | 2000   | 2000
                       |            | Japanese | 2000   | 2000
XGLUE                  | 10         | English  | 10,000 | 10,000
                       |            | German   | 10,000 | 10,000
                       |            | French   | 10,000 | 10,000
                       |            | Spanish  | 10,000 | 10,000
                       |            | Russian  | 10,000 | 10,000
Multilingual SLU       | 12         | English  | 30,521 | 8621
                       |            | Spanish  | 3617   | 3043
                       |            | Thai     | 2156   | 1692
Table 2. Classification accuracy (%) on the Amazon Review dataset.

           | German                | French                | Japanese
Model      | Books | DVD   | Music | Books | DVD   | Music | Books | DVD   | Music
mBERT      | 84.35 | 82.85 | 83.85 | 84.55 | 85.85 | 83.65 | 73.35 | 74.80 | 76.10
XLM        | 86.85 | 84.20 | 85.90 | 88.10 | 86.95 | 86.20 | 80.95 | 79.20 | 78.02
XLM-R      | 91.65 | 87.60 | 90.97 | 89.33 | 90.07 | 89.15 | 85.26 | 86.77 | 86.95
mT5        | 91.83 | 88.82 | 91.98 | 91.02 | 91.42 | 90.36 | 87.41 | 88.25 | 87.82
CLHG       | 92.70 | 88.60 | 91.62 | 90.67 | 91.38 | 90.45 | 87.21 | 87.33 | 88.08
TGCTC      | 92.63 | 87.91 | 92.54 | 90.03 | 91.35 | 88.68 | 86.50 | 88.27 | 89.07
Meta-XNLG  | 90.34 | 88.37 | 89.64 | 90.33 | 90.38 | 90.26 | 87.25 | 88.91 | 87.79
EM         | 92.73 | 90.06 | 91.83 | 91.55 | 92.03 | 91.47 | 88.90 | 89.06 | 89.14
LS-mBERT   | 92.43 | 90.57 | 91.85 | 92.30 | 90.57 | 92.52 | 87.62 | 89.41 | 88.37
XCLHG      | 93.22 | 91.31 | 92.91 | 92.98 | 92.64 | 93.13 | 89.16 | 90.31 | 89.73
Table 3. F1-score (%) for text classification on the Amazon Review dataset.

           | German                | French                | Japanese
Model      | Books | DVD   | Music | Books | DVD   | Music | Books | DVD   | Music
mBERT      | 82.71 | 81.26 | 82.11 | 82.97 | 84.35 | 82.10 | 71.83 | 73.26 | 74.39
XLM        | 85.14 | 82.56 | 84.48 | 86.62 | 85.41 | 84.73 | 79.24 | 77.20 | 76.52
XLM-R      | 90.33 | 86.12 | 89.38 | 86.36 | 87.91 | 86.28 | 83.13 | 84.77 | 85.34
mT5        | 90.72 | 86.90 | 89.98 | 89.14 | 89.74 | 88.02 | 85.71 | 86.53 | 85.90
CLHG       | 90.70 | 87.15 | 90.04 | 88.72 | 89.83 | 88.17 | 85.86 | 86.11 | 86.36
Meta-XNLG  | 88.59 | 86.75 | 80.49 | 88.23 | 88.12 | 88.67 | 85.42 | 86.73 | 85.80
EM         | 90.03 | 88.14 | 89.47 | 89.53 | 90.11 | 89.24 | 85.37 | 86.16 | 86.28
LS-mBERT   | 90.17 | 88.95 | 90.17 | 90.58 | 90.36 | 89.74 | 86.22 | 87.13 | 86.97
XCLHG      | 91.18 | 89.94 | 91.17 | 91.22 | 90.93 | 91.28 | 88.16 | 87.91 | 88.23
Table 4. Classification accuracy (%) on the XGLUE and SLU datasets.

           | XGLUE                               | SLU
Model      | German | French | Spanish | Russian | Spanish | Thai
mBERT      | 82.53  | 78.46  | 81.83   | 79.06   | 74.91   | 72.97
XLM        | 83.52  | 78.83  | 82.49   | 79.47   | 62.30   | 81.60
XLM-R      | 84.61  | 78.96  | 83.51   | 79.94   | 94.38   | 85.17
mT5        | 85.03  | 79.62  | 84.53   | 81.21   | 95.68   | 87.32
CLHG       | 85.00  | 79.58  | 84.80   | 79.91   | 96.81   | 89.71
Meta-XNLG  | 83.41  | 79.04  | 84.77   | 78.02   | 87.12   | 84.33
EM         | 84.52  | 79.31  | 85.13   | 79.70   | 85.64   | 79.29
LS-mBERT   | 85.78  | 80.24  | 85.61   | 80.14   | 82.07   | 63.43
TGCTC      | 85.36  | 80.45  | 85.65   | 80.32   | 97.23   | 90.85
XCLHG      | 86.43  | 82.43  | 87.31   | 82.52   | 97.67   | 92.86
Table 5. F1-score (%) for text classification on the XGLUE and SLU datasets.

           | XGLUE                               | SLU
Model      | German | French | Spanish | Russian | Spanish | Thai
mBERT      | 80.17  | 76.52  | 79.98   | 76.81   | 72.40   | 70.22
XLM        | 81.29  | 76.17  | 80.23   | 76.91   | 60.24   | 89.93
XLM-R      | 82.93  | 76.77  | 81.75   | 77.96   | 88.57   | 81.49
mT5        | 83.14  | 77.82  | 82.13   | 78.42   | 92.29   | 84.90
CLHG       | 83.21  | 77.64  | 82.56   | 77.98   | 92.56   | 85.13
Meta-XNLG  | 81.28  | 77.22  | 81.58   | 75.40   | 83.24   | 81.77
LS-mBERT   | 83.54  | 78.19  | 83.20   | 77.72   | 79.93   | 60.72
XCLHG      | 84.82  | 79.17  | 85.16   | 79.14   | 93.52   | 90.73
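For completeness, accuracy and F1 figures of the kind reported in Tables 2–5 can be computed from model predictions with scikit-learn as sketched below; the toy labels are placeholders, and whether macro- or micro-averaged F1 is used is not restated here, so the average="macro" choice is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score

# y_true / y_pred: gold and predicted label ids for a target-language test set
# (toy values for illustration only).
y_true = [0, 1, 1, 2, 0, 2]
y_pred = [0, 1, 0, 2, 0, 2]

accuracy = accuracy_score(y_true, y_pred) * 100        # percentage, as in the tables
f1 = f1_score(y_true, y_pred, average="macro") * 100   # macro averaging is an assumption
print(f"accuracy = {accuracy:.2f}%, F1 = {f1:.2f}%")
```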
Table 6. Ablation study on the Amazon Review dataset.

            | XLM-R | 1     | 2     | 3     | 4     | 5     | 6     | 7     | Full Model
POS         |       |       |       |       |       |       |       |       |
dependency  |       |       |       |       |       |       |       |       |
translation |       |       |       |       |       |       |       |       |
similarity  |       |       |       |       |       |       |       |       |
TL-CL       |       |       |       |       |       |       |       |       |
LL-CL       |       |       |       |       |       |       |       |       |
DE          | 90.97 | 91.93 | 92.37 | 91.22 | 92.34 | 91.95 | 92.58 | 92.39 | 92.91
FR          | 89.15 | 92.05 | 92.11 | 91.70 | 92.57 | 92.46 | 92.99 | 92.70 | 93.13
JA          | 86.95 | 88.12 | 88.92 | 87.42 | 89.02 | 88.78 | 89.63 | 89.24 | 89.93
DE represents German, FR represents French, and JA represents Japanese.
Table 7. Case study of the proposed model.

Case 1 (DE, Amazon Review), gold label: positive
Sample: Ein wunderbares Buch. Hab es im Urlaub an einem Tag durchgelesen und bin begeistert. Genau die richtige Mischung aus Witz, Phantasie, Trauer und Nachdenklichkeit. Freue mich schon sehr auf das nächste Buch von Adena Halpern!
English translation: A wonderful book. I read it on vacation in one day and I'm thrilled. Just the right mix of wit, imagination, grief and thoughtfulness. Looking forward to the next book by Adena Halpern!
Predictions: CLHG ✓, XCLHG ✓

Case 2 (RU, XGLUE), gold label: culture
Sample: В этoм гoду за приз в размере oкoлo 66 тысяч дoлларoв пoбoрются шесть автoрoв рoманoв из Великoбритании, США и Канады. Председатель жюри премии уже заявил, чтo все книги, пoпавшие в кoрoткий списoк, - этo чудеса стилистическoй изoбретательнoсти, и выбрать пoбедителя будет непрoстo. Крoме денежнoгo приза, лауреат Букерoвскoй премии автoматически пoлучает приятный бoнус - резкий рoст прoдаж свoегo прoизведения пo всему миру, передает “Рoссия 24”.
English translation: This year, six authors of novels from the United Kingdom, the United States and Canada will compete for a prize of about $66,000. The chairman of the jury of the award has already stated that all the books on the short list are miracles of stylistic ingenuity, and it will not be easy to choose the winner. In addition to the cash prize, the winner of the Booker Prize automatically receives a pleasant bonus: a sharp increase in sales of his work around the world, reports “Russia 24”.
Predictions: CLHG ✕, XCLHG ✓

Case 3 (ES, SLU), gold label: modify alarm
Sample: cambia el dia para la alarma de la mañana
English translation: Change the day for the morning alarm.
Predictions: CLHG ✓, XCLHG ✓

DE: German, RU: Russian, ES: Spanish. CLHG is the model proposed by Wang et al. [18].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
