1. Introduction
The rise of social networks has changed the way people acquire information, leading to a surge in online users [1]. For example, X (formerly known as Twitter) is one of the world’s largest social media platforms. Users can share short text messages in posts commonly known as “tweets” (officially “posts”) and repost or comment on other users’ content. Similarly, Sina Weibo is one of the most popular microblog platforms in China, similar to X but with some localized features. Users on such social platforms publish a large amount of text to express emotions, intentions, and opinions, and their interaction behaviors, such as reposting and commenting, create complex graph structures.
Text plays a crucial role in information propagation within the graph, while the graph structure provides abundant context for interpreting the text’s meaning. Therefore, analyzing text based on the graph structure can provide valuable information for social network tasks, such as user classification [2], personalized recommendations [3], and community detection [4].
Due to the rich textual information and complex graph structures present on social networks, recent research in this field focuses on combining natural language processing (NLP) techniques and network science [5,6,7]. Contemporary researchers tend to represent such social network data as text-attributed graphs (TAGs). A TAG is a graph whose nodes or edges are associated with text, commonly found in domains such as citation networks [8,9], web page hyperlink networks [10], and social networks [5]. Unlike traditional NLP methods that transform text attributes into shallow or hand-crafted features, such as bag-of-words [11] or skip-gram [12], the core of TAG representation learning lies in the integration of graph structure and textual information. Given that graph neural networks (GNNs) excel at capturing graph structures and large language models (LLMs) perform well in various natural language processing tasks, most existing TAG-based methods focus on frameworks that convey graph structures to LLMs in order to generate semantic embeddings for GNNs. For example, SimTeG [13] fine-tunes an LLM to generate semantic embeddings for a GNN through a consistent downstream task loss, while LMaaS [14] utilizes a pre-trained language model (PLM) as an interpreter that transforms the explainable texts generated by the LLM into embedding vectors for the GNN. DGTL [15] inputs the text information encoded by an LLM into a disentangled GNN to capture the graph structural information, then feeds the learned representation vectors into the LLM predictor.
However, these methods only provide text attributes for nodes, which fails to capture the multilevel context on social networks and leads to the loss of valuable information from the outset. The neglected context is crucial for analyzing users’ posts. First, online accounts typically include supplementary personal descriptions, such as occupation, hobbies, and education, which are essential for understanding their posts. Meanwhile, interactions such as reposts and comments provide important local context to clarify the meaning of individual texts. Furthermore, users’ current topics offer significant global context, as the same expression may have different meanings under different topics. As shown in Figure 1, when a user describes something as “unconventional” on the social network, it could have opposite meanings depending on the context: if it is used to describe a piece of art, it could express admiration; if used to describe food, it is more likely a subtle criticism. These different levels of context, i.e., the personal context, the local context, and the global context, provide important information for clarifying meaning. Therefore, it is essential to capture and utilize the multilevel context for semantic embeddings of users’ posts.
To capture the complex multilevel context and generate semantic embeddings for downstream tasks, the following challenges must be addressed: (1) How to model both multilevel context and graph structures in TAGs.
TAGs that only provide text attributes for nodes fail to capture multilevel contexts, which in turn affects the quality of the semantic embeddings. Firstly, providing only text attributes for user nodes fails to differentiate between personal descriptions and posts. Additionally, interactions such as reposts and comments between users form text pairs, but classical TAGs fail to capture the semantic relationships based on these text pairs. For example, when two users engage in interactions based on different text pairs, classical TAGs struggle to distinguish between them. Therefore, it is essential to design TAGs that capture more detailed information, enabling the simultaneous modeling of multilevel context and graph structure. (2) How to leverage the multilevel context to generate semantic embeddings for downstream tasks. Enabling LLMs to leverage multilevel context is also challenging because the multilevel context on social networks is distinguished not only in terms of granularity but also in form. The personal context is reflected at the node level and may vary significantly in form across different users. The local context is represented through the edges, and the complex semantic relationships are difficult to describe using natural language. Additionally, the global context is embodied throughout the overall graph and does not have predefined descriptions. Therefore, directly applying the multilevel context as prompts for LLMs is not practical. This highlights the importance of developing a more effective approach to utilize the multilevel context based on the designed TAG.
To solve these challenges, we propose the Multilevel Context Learner (MCL) model, which can leverage multilevel context on social networks to enhance the semantic embedding capabilities of LLMs for downstream tasks. First, we model the social network as a multilevel context textual-edge graph (MC-TEG), with personal descriptions as node attributes and interaction texts as edge attributes. Node attributes capture personal context, while edge attributes provide the basis for utilizing local context. Second, the proposed MCL leverages LLMs’ reasoning abilities to infer and update global context from posted text. It then combines the local context from relevant edges with the personal and global context to generate semantic embeddings for each node. Moreover, a group of tailored bidirectional dynamic graph attention network (GAT) layers [16] is developed to further distinguish the weight information on social networks. Two types of attention are trained separately to collectively represent the relationships among nodes. To demonstrate the effectiveness of our proposed model, we evaluate it on the fundamental graph representation learning task of node classification. Extensive experiments on real social network datasets demonstrate the effectiveness of MCL.
Our main contributions are summarized as follows:
We model the social network as a multilevel context textual-edge graph (MC-TEG). Personal descriptions are regarded as node attributes, while interaction texts are treated as edge attributes, effectively capturing both graph structure and semantic relationships.
We propose the Multilevel Context Learner (MCL) model, which utilizes LLMs’ reasoning abilities to leverage multilevel context for generating semantic embeddings. The proposed bidirectional dynamic graph attention layers further distinguish the weight information.
Experimental evaluations on six social network datasets demonstrate the effectiveness of the proposed MCL model, which consistently outperforms all baseline methods across all datasets.
3. Problem Formulation and Preliminaries
In this section, we introduce notation and formalize some concepts related to textual-edge graphs, graph neural networks, and large language models.
3.1. Textual-Edge Graphs
A textual-edge graph can be formulated as $G = (V, E, y)$, where $V$ denotes the set of nodes, $E$ denotes the set of edges, and $y$ denotes the labels of nodes. In a TEG, each node is associated with a text description, and each edge also contains its own text description, which is absent in traditional TAGs. These textual descriptions provide rich contextual information about the complex relationships between nodes, enabling a more detailed and comprehensive representation of data relations than traditional TAGs.
In this paper, we focus on node classification, one of the most typical tasks on graphs. We adopt the semi-supervised setting, where all the text information and the adjacency matrix are given during the training procedure, while only the labels $\{y_v \mid v \in V_{\mathrm{train}}\}$ of a part of the nodes are provided, where $V_{\mathrm{train}} \subset V$ is the training node set. The task aims at predicting the labels of $V_{\mathrm{test}} = V \setminus V_{\mathrm{train}}$, where $V_{\mathrm{test}}$ is the set of test nodes.
3.2. Large Language Models
LLMs have introduced a new paradigm for task adaptation known as “pre-train, prompt, and predict”. In this paradigm, the LLM is first pre-trained on a large corpus of text data to learn general language representations. Instead of fine-tuning the model, a natural language prompt that defines the task and context is then provided to the model. The prompt can be presented in various formats, ranging from a concise sentence to a more extensive passage, and may incorporate supplementary details or constraints to direct the model’s behavior accordingly. Based on the prompt and input tokens, the model generates the output directly. Formally, for the sequence of input tokens $x$ and the prompt $p$, we can concatenate them into a new sequence $s = [p; x]$. Then, the probability of the output sequence $o = (o_1, \ldots, o_m)$ given $s$ is
$$P(o \mid s) = \prod_{i=1}^{m} P(o_i \mid s, o_{<i}),$$
where $o_{<i}$ represents the prefix of the output sequence up to position $i-1$, and $P(o_i \mid s, o_{<i})$ represents the probability of generating token $o_i$ given $s$ and $o_{<i}$.
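As a concrete illustration, this factorization can be computed directly once a per-token conditional probability is available; the `token_prob` callable below is a hypothetical stand-in for a real model’s next-token distribution:

```python
from typing import Callable, List

def sequence_prob(
    token_prob: Callable[[List[str], str], float],  # P(next token | context)
    s: List[str],   # concatenated prompt + input tokens
    o: List[str],   # candidate output tokens
) -> float:
    """Probability of output sequence o given s, factored autoregressively."""
    prob = 1.0
    for i, token in enumerate(o):
        # Condition on the prompt/input s plus the output prefix o_{<i}.
        prob *= token_prob(s + o[:i], token)
    return prob
```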
3.3. Graph Neural Networks
Graph neural networks are a class of deep learning models specifically designed to handle graph-structured data [32]. GNNs extend the capabilities of traditional neural networks, enabling direct operation on graph structures, thereby capturing complex relationships and dependencies between nodes. GNNs typically follow a message-passing scheme where nodes aggregate information from their neighbors in each layer, formulated as
$$h_i^{(l)} = \mathrm{Update}\Big(h_i^{(l-1)}, \mathrm{Agg}\big(\{h_j^{(l-1)} : j \in \mathcal{N}(i)\}\big)\Big),$$
where $h_i^{(l)}$ is the representation vector of node $i$ at the $l$-th layer, $\mathcal{N}(i)$ is the set of neighbors of node $i$, $\mathrm{Agg}(\cdot)$ is the aggregation function, and $\mathrm{Update}(\cdot)$ is an updating function that typically includes linear functions and activation functions. For node classification, the output $\hat{y}$ of GNNs is a normalized vector, whose dimension corresponds to the number of node categories and whose values represent the probabilities of the node belonging to the corresponding categories.
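As a minimal PyTorch sketch of this scheme (mean aggregation for Agg(·) and a linear-plus-ReLU Update(·) are illustrative choices, not the specific functions used later in this paper):

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)  # Update(·): linear + activation

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: [N, dim] node representations; adj: [N, N] adjacency matrix.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        agg = adj @ h / deg                    # Agg(·): mean over neighbors
        return torch.relu(self.update(torch.cat([h, agg], dim=-1)))
```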
4. Method
In this section, we describe our proposed MCL model for node classification on social networks. An overall framework of our method is shown in Figure 2; it involves three main steps: (1) constructing an MC-TEG based on social networks; (2) utilizing LLMs to extract multilevel contexts and generate embeddings; and (3) training a bidirectional dynamic graph attention network for prediction.
4.1. MC-TEG Construction Based on Social Networks
The first step in our method is to construct an MC-TEG that captures both graph structure and semantic relationships from social network data. We collected textual data from real social networks based on tags and keywords and obtained the personal descriptions of the corresponding accounts. The dataset may include considerable noise, as users attach irrelevant tags to their posts in an effort to increase visibility and engagement. Additionally, some keywords may appear in discussions across multiple topics and carry different meanings. Given the scale of the dataset, manually filtering out this noise is not feasible, which highlights the necessity of using LLMs to leverage multilevel context to obtain accurate semantic embeddings.
We treat users as nodes in the MC-TEG, with the personal description $t_i$ serving as the text attribute of node $i$. The interaction texts between node $i$ and node $j$ are represented as pairs of original and repost or comment texts, serving as the corresponding edge attribute $d_{ij}$. Notably, $d_{ij}$ and $d_{ji}$ are distinct, as the edges are directed. Moreover, if there are multiple interactions between users, the corresponding $d_{ij}$ will include multiple text pairs.
As shown in Figure 3, our MC-TEG provides textual attributes for both nodes and edges. The user descriptions serve as the textual attributes of the nodes, while the textual pairs representing interactions between users act as the textual attributes of the edges. The textual pairs on the edges not only effectively handle multiple interactions but also distinguish the directionality of these interactions. Additionally, this structure facilitates the utilization of personal contexts through textual attributes on nodes and enhances the updating and utilization of global contexts.
In summary, the constructed MC-TEG preserves the graph structure while providing a foundation for LLMs to leverage multilevel context.
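A minimal sketch of this construction using a networkx MultiDiGraph (the attribute names `description` and `pair` are our illustrative choices):

```python
import networkx as nx

# MultiDiGraph: directed edges, multiple interactions per user pair.
mcteg = nx.MultiDiGraph()

def add_user(g: nx.MultiDiGraph, user_id: str, description: str) -> None:
    g.add_node(user_id, description=description)  # source of personal context

def add_interaction(g: nx.MultiDiGraph, src: str, dst: str,
                    original_text: str, response_text: str) -> None:
    # Edge attribute is an (original, repost/comment) text pair; edges are
    # directed, so (i, j) and (j, i) carry distinct attributes, and repeated
    # interactions simply add parallel edges with their own pairs.
    g.add_edge(src, dst, pair=(original_text, response_text))
```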
4.2. LLM-Based Multilevel Context Extraction and Embedding Generation
In this subsection, we utilize the powerful reasoning and comprehension abilities of LLMs to leverage personal context, local context, and global context, and then generate semantic embeddings based on the multilevel context for downstream GNNs.
4.2.1. Personal Context
The challenge in extracting personal context lies in the fact that personal descriptions are unstructured. Style and content differ across the personal descriptions of different users, making them difficult to unify. This issue can be effectively addressed by designing specific prompts for the LLM. We design a fixed template and set key information as tokens that the LLM needs to predict based on personal descriptions. At the same time, we determine a set of default values in the prompt to handle cases where some personal descriptions lack the key information. This structured output standardizes the format of the personal context for each node, facilitating subsequent utilization. The personal context of node $i$ is
$$c_i^{p} = f_{\mathrm{LLM}}^{p}(t_i),$$
where $c_i^{p}$ is the structured personal context of node $i$, and $f_{\mathrm{LLM}}^{p}(\cdot)$ is the LLM prediction function for personal context.
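A sketch of such a fixed template with default values; the field names and the `call_llm` helper are hypothetical placeholders rather than the exact prompt used in our experiments:

```python
PERSONAL_CONTEXT_TEMPLATE = """Extract the following fields from the user's
personal description. If a field is not mentioned, output the default value.

Description: {description}

occupation (default: unknown):
hobbies (default: unknown):
education (default: unknown):
"""

def extract_personal_context(description: str, call_llm) -> str:
    # call_llm is a placeholder for any chat/completion API; the fixed
    # template yields a structured personal context in a uniform format.
    return call_llm(PERSONAL_CONTEXT_TEMPLATE.format(description=description))
```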
4.2.2. Local Context
We treat the 1-hop interaction texts among users as the local context. This is reasonable because they reflect the most direct semantic relationships, and increasing the number of hops would lead to an exponential increase in complexity. In the designed MC-TEG, interaction texts are already treated as edge attributes, which effectively reflect the local context. At the same time, the directionality of the edges also introduces distinctions in the local context. We denote $c_i^{\mathrm{in}}$ and $c_i^{\mathrm{out}}$ as the local contexts when node $i$ acts as the target node and the source node, respectively, which is reflected in the differing positions of node $i$’s text within the text pairs $d_{ji}$ or $d_{ij}$.
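Continuing the illustrative networkx sketch from Section 4.1, the two directional local contexts can be read directly off a node’s incoming and outgoing edges:

```python
def local_context(g, i):
    # c_i^in: node i is the target (incoming edges, pairs d_ji);
    # c_i^out: node i is the source (outgoing edges, pairs d_ij).
    c_in = [d["pair"] for _, _, d in g.in_edges(i, data=True)]
    c_out = [d["pair"] for _, _, d in g.out_edges(i, data=True)]
    return c_in, c_out
```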
4.2.3. Global Context
Online users often engage in discussions based on specific topics, which provides valuable global context. However, these topics are not explicitly predefined and are accompanied by significant noise. We fully utilize the powerful reasoning capabilities of LLMs by constructing prompts from the text pairs on the edges to infer the topics within the graph. Since the input consists of text pairs, some important texts may be repeatedly included. LLMs can reduce the impact of noise in the topic-update process by effectively leveraging patterns identified in repeated texts. The global context is inferred as follows:
$$c^{g} = f_{\mathrm{LLM}}^{g}\Big(\bigcup_{(i,j) \in E} d_{ij}\Big),$$
where $c^{g}$ is the inferred global context, $f_{\mathrm{LLM}}^{g}(\cdot)$ is the LLM-specific function for the global context, $\bigcup$ represents the iterating progress, and $d_{ij}$ is the text pair associated with edge $(i, j)$.
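A sketch of this iterative inference, assuming a generic `call_llm` helper; the batching scheme and prompt wording are illustrative:

```python
def infer_global_context(edge_pairs: list, call_llm, batch_size: int = 32) -> str:
    topics = "none yet"
    for start in range(0, len(edge_pairs), batch_size):
        batch = edge_pairs[start:start + batch_size]
        prompt = (
            f"Current inferred topics: {topics}\n"
            f"New interaction text pairs: {batch}\n"
            "Update the list of discussion topics, ignoring noisy or "
            "off-topic posts (repeated texts indicate important patterns)."
        )
        topics = call_llm(prompt)  # refined summary replaces the old one
    return topics
```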
4.2.4. Semantic Embedding
To fully leverage the LLM’s capability to understand and model complex patterns and semantics in the MC-TEG, we inject the associated multilevel context into the LLM, which includes $c_i^{p}$ (personal context), $c_i^{\mathrm{in}}$ and $c_i^{\mathrm{out}}$ (local context), and $c^{g}$ (global context). Each context level provides valuable information that helps the LLM form a comprehensive understanding of the node’s semantics within the MC-TEG. Specifically, we reserve a set of token positions for placing the multilevel context in the prompt input. Even though the different levels of context are not unified in form, this input format still allows the LLM to treat them as aligned with a natural, human-understandable semantic space, as shown in Figure 4. Through this approach, we enable the LLM to benefit from a comprehensive understanding of both the graph structure and textual information, generating semantic embeddings for downstream GNN tasks. These semantic embeddings facilitate a direct gradient flow to the GNNs, resulting in more accurate and informative gradient updates. This fusion of language modeling and graph representation learning enables our MCL model to leverage the multilevel context captured by the LLM alongside the structural patterns learned by the GNNs, driving effective learning and enhanced performance.
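One possible realization of the reserved token positions is a fixed prompt skeleton whose slots are filled per node before requesting an embedding; the slot names and the `embed` helper are assumptions, not our exact prompt:

```python
NODE_PROMPT = """[PERSONAL CONTEXT] {personal}
[LOCAL CONTEXT: as target] {c_in}
[LOCAL CONTEXT: as source] {c_out}
[GLOBAL CONTEXT] {global_ctx}
[POSTS] {posts}
"""

def node_embedding(node_ctx: dict, embed) -> list:
    # embed is a placeholder for an LLM embedding endpoint; the fixed slot
    # layout keeps heterogeneous context levels in consistent positions.
    return embed(NODE_PROMPT.format(**node_ctx))
```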
4.3. Prediction by Bidirectional Dynamic Graph Attention Layers
As a classic GNN, the graph convolution mechanism uses a uniform weight matrix to aggregate features, which is unsuitable for social networks because different neighbors have distinctly varying impacts on different users. The attention mechanism can effectively address this issue; it was initially employed in computer vision [33] and then in NLP [22]. The attention mechanism has subsequently proven to be competitive in graph analysis and has led to the popularity of graph attention networks [34]. An attention score is denoted as $\alpha_{ij}$, which indicates the importance of neighbor $j$ to node $i$. The unnormalized attention score for edge $(i, j)$ in layer $m$ is computed as follows:
$$s_{ij}^{(m)} = \mathrm{LeakyReLU}\big(\mathbf{a}^{\top}\big[\mathbf{W} h_i^{(m)} \,\|\, \mathbf{W} h_j^{(m)}\big]\big),$$
where $\mathbf{a}$ and $\mathbf{W}$ are learned in the training process, and $\|$ represents vector concatenation. After computing all $s_{ij}^{(m)}$, a softmax layer is used to normalize them and obtain the attention score $\alpha_{ij}^{(m)}$:
$$\alpha_{ij}^{(m)} = \frac{\exp\big(s_{ij}^{(m)}\big)}{\sum_{k \in \mathcal{N}(i)} \exp\big(s_{ik}^{(m)}\big)}.$$
Then, the attention-weighted average over $\mathcal{N}(i)$ is used to update the representation of node $i$ in layer $m+1$:
$$h_i^{(m+1)} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(m)} \mathbf{W} h_j^{(m)}\Big),$$
where $\sigma$ is a nonlinear function. Although this attention mechanism can distinguish the weight matrix, it still has a limitation: the ranking of attention over neighbors is the same for all nodes, differing only in absolute values. This significant limitation diverges from the nature of social networks, because the importance ranking of neighbors varies across users. To address this limitation, we use dynamic attention [16], where the order of operations in the scoring function is modified as follows:
$$s_{ij}^{(m)} = \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\big(\mathbf{W}\big[h_i^{(m)} \,\|\, h_j^{(m)}\big]\big),$$
where this simple modification makes a significant difference in the expressiveness of the attention function.
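The contrast between the two scoring functions is easy to see side by side; in this sketch, `hi` and `hj` are node feature vectors, `W` and `a` are the static (GAT) parameters, while `W2` and `a2` play the corresponding roles in the dynamic (GATv2-style) variant:

```python
import torch
import torch.nn.functional as F

def static_score(hi, hj, W, a):
    # GAT: LeakyReLU(a^T [W hi || W hj]); W: [d_out, d_in], a: [2*d_out].
    # The neighbor ranking it induces is the same for every query node i.
    return F.leaky_relu(torch.cat([W @ hi, W @ hj]) @ a)

def dynamic_score(hi, hj, W2, a2):
    # GATv2: a^T LeakyReLU(W [hi || hj]); W2: [d_out, 2*d_in], a2: [d_out].
    # Applying the nonlinearity before a2 lets the neighbor ranking
    # depend on the query node i.
    return F.leaky_relu(W2 @ torch.cat([hi, hj])) @ a2
```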
However, simply using dynamic attention is still not adequate to simulate the characteristics of social networks. In Equation (10), the attention scores are normalized over all neighbors of the node without distinguishing between out-degree and in-degree. But on social networks, retweeting and being retweeted not only represent the direction of edges but also reflect the active and passive nature of user behavior, thus requiring attention to be distinguished accordingly. Therefore, we propose bidirectional dynamic graph attention layers. The out-degree attention and in-degree attention are trained separately during the normalization process as follows:
$$\alpha_{ij}^{\mathrm{out}} = \frac{\exp\big(\mathbf{a}_{\mathrm{out}}^{\top}\,\mathrm{LeakyReLU}\big(\mathbf{W}_{\mathrm{out}}[h_i \,\|\, h_j]\big)\big)}{\sum_{k \in \mathcal{N}_{\mathrm{out}}(i)} \exp\big(\mathbf{a}_{\mathrm{out}}^{\top}\,\mathrm{LeakyReLU}\big(\mathbf{W}_{\mathrm{out}}[h_i \,\|\, h_k]\big)\big)},$$
$$\alpha_{ij}^{\mathrm{in}} = \frac{\exp\big(\mathbf{a}_{\mathrm{in}}^{\top}\,\mathrm{LeakyReLU}\big(\mathbf{W}_{\mathrm{in}}[h_i \,\|\, h_j]\big)\big)}{\sum_{k \in \mathcal{N}_{\mathrm{in}}(i)} \exp\big(\mathbf{a}_{\mathrm{in}}^{\top}\,\mathrm{LeakyReLU}\big(\mathbf{W}_{\mathrm{in}}[h_i \,\|\, h_k]\big)\big)},$$
where $\mathcal{N}_{\mathrm{out}}(i)$ and $\mathcal{N}_{\mathrm{in}}(i)$ represent the out-degree neighbors and in-degree neighbors, respectively; $\mathbf{a}_{\mathrm{out}}$, $\mathbf{a}_{\mathrm{in}}$, $\mathbf{W}_{\mathrm{out}}$, and $\mathbf{W}_{\mathrm{in}}$ denote different training parameters, while $h$ is shared between the out-degree layer and the in-degree layer. The representation vector is updated to the weighted sum of the two types of attention as follows:
$$h_i^{(m+1)} = \sigma\Big(\lambda \sum_{j \in \mathcal{N}_{\mathrm{out}}(i)} \alpha_{ij}^{\mathrm{out}} \mathbf{W}_{\mathrm{out}} h_j^{(m)} + (1-\lambda) \sum_{j \in \mathcal{N}_{\mathrm{in}}(i)} \alpha_{ij}^{\mathrm{in}} \mathbf{W}_{\mathrm{in}} h_j^{(m)}\Big),$$
where $\lambda$ is a hyperparameter. $h_i^{(m+1)}$ is shared in the next layer, and the node representation is iteratively updated according to Equations (10)–(14). Due to the presence of multiple categories, we use the cross-entropy loss function:
$$\mathcal{L}_{0} = -\frac{1}{|V_{\mathrm{train}}|}\sum_{v \in V_{\mathrm{train}}}\sum_{j=1}^{K} y_{vj} \ln \hat{y}_{vj},$$
where $V_{\mathrm{train}}$ is the training node set, $|V_{\mathrm{train}}|$ denotes the size of this set, $K$ is the number of categories, $y_{vj}$ is the true label vector element for node $v$ and category $j$, and $\hat{y}_{vj}$ is the predicted probability of node $v$ belonging to category $j$. To prevent overfitting, we add a regularization term. The final loss function is as follows:
$$\mathcal{L} = \mathcal{L}_{0} + \eta \sum_{\mathbf{W}} \|\mathbf{W}\|_{2}^{2},$$
where $\|\mathbf{W}\|_{2}$ is the L2 norm of each weight matrix $\mathbf{W}$, and $\eta$ is the corresponding coefficient.
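A minimal single-head PyTorch sketch of one bidirectional dynamic attention layer, simplifying Equations (10)–(14) (dense adjacency; the projection is omitted in the final aggregation for brevity; parameter names are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDynamicAttentionLayer(nn.Module):
    """Separate dynamic (GATv2-style) attention over out- and in-neighbors,
    combined with weight lam; h is shared between the two directions."""

    def __init__(self, dim: int, lam: float = 0.5):
        super().__init__()
        self.W_out = nn.Linear(2 * dim, dim, bias=False)
        self.W_in = nn.Linear(2 * dim, dim, bias=False)
        self.a_out = nn.Linear(dim, 1, bias=False)
        self.a_in = nn.Linear(dim, 1, bias=False)
        self.lam = lam

    def attend(self, h, adj, W, a):
        N = h.size(0)
        # pairs[i, j] = [h_i || h_j] for every candidate edge.
        pairs = torch.cat([h.unsqueeze(1).expand(N, N, -1),
                           h.unsqueeze(0).expand(N, N, -1)], dim=-1)
        scores = a(F.leaky_relu(W(pairs))).squeeze(-1)      # dynamic scoring
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(scores, dim=1)                # normalize per node
        alpha = torch.nan_to_num(alpha)                     # isolated nodes
        return alpha @ h  # weighted average (projection omitted for brevity)

    def forward(self, h, adj):
        # adj[i, j] = 1 for a directed edge i -> j, so row i of adj gives
        # out-neighbors of i and row i of adj.T gives in-neighbors.
        out_part = self.attend(h, adj, self.W_out, self.a_out)
        in_part = self.attend(h, adj.t(), self.W_in, self.a_in)
        return torch.relu(self.lam * out_part + (1 - self.lam) * in_part)
```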
5. Experiment
In this section, we compare the MCL model with several baseline methods and demonstrate its effectiveness for node classification on social networks. The experiments were conducted on a server running Ubuntu 22.04 LTS (Canonical Ltd., London, UK) with 96 cores at a clock speed of 2.5 GHz. The MCL model was run on an NVIDIA GeForce RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
5.1. Datasets
To verify the validity and robustness of our model, we collected data related to political elections from X (formerly Twitter) and data related to school food safety from Sina Weibo. X is one of the world’s largest social media platforms, while Sina Weibo is one of the most popular microblog platforms in China, similar to X but with localized features. Users on such social media platforms can add words or phrases starting with “#” (also known as hashtags) to categorize their posts and make them discoverable to a wider audience. To collect the data required for this study, we utilized publicly available APIs provided by the social media platforms (https://developer.twitter.com (accessed on 10 January 2024) and https://open.weibo.com (accessed on 3 June 2023)). These APIs enabled us to systematically retrieve relevant data, including posts, comments, and metadata, based on specific hashtags, all in a structured and compliant manner. All data were carefully de-identified to ensure that no personal information was leaked. Subsequently, the data underwent preprocessing, which included handling missing values and excluding posts in languages other than the target language. We constructed the network structure based on symbols in the text that reflect reposting and replying relationships, such as repost markers and ‘@’ mentions. The detailed descriptions of the datasets are as follows:
Datasets on X. These three English datasets record discussions on three topics related to the 2024 United States Presidential Election on X. They include information about the posting users, the content of the posts, posting times, retweet counts, and content relationships. The recording period covers discussions from 10th January 2024 to 10th February 2024. Users in these datasets are categorized into three classes: supporters of Trump, neutral users, and supporters of Harris. The three datasets are arranged in ascending order of size as X_A, X_B, and X_C, each collected under a different set of election-related hashtags.
Datasets on Sina Weibo. The three Chinese datasets record discussions on three topics related to school food safety on Sina Weibo. They also include information about the posting users, the content of the posts, posting times, retweet counts, and content relationships. The recording period spans 3rd June 2023 to 30th June 2023. Users in these datasets are categorized into three classes based on their level of support for students: supporters, neutral users, and opponents. The three datasets are arranged in ascending order of size as Weibo_A, Weibo_B, and Weibo_C. Weibo_A contains data related to hashtags about media coverage, Weibo_B contains data related to hashtags about school statements, and Weibo_C contains data related to hashtags about official statements.
The statistics of all datasets are shown in Table 1.
5.2. Experimental Setup
In the proposed MCL model, we used GPT-4 as the LLM to generate semantic embeddings. We compared our model with baselines from the following two categories:
GNN Predictors. We considered different GNN-based models that enhance the node features on TAGs. Our baselines include mT5 [35] and DeBERTa [36]. We selected the most suitable GNN backbones based on the descriptions of their methods.
LLM Predictors. We also considered using different LLMs as baselines, where the text is directly input into the models for prediction. We performed predictions using Llama [37], ChatGLM [38], QwenLM [39], and ERNIE Bot [40].
For all methods, we adopted the classical semi-supervised learning setting, randomly selecting 10% of the data from each category to form the training set. We directly used classification accuracy as the evaluation metric.
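A sketch of this stratified 10% split and the accuracy metric, using scikit-learn for convenience:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def make_split(labels: np.ndarray, train_frac: float = 0.1, seed: int = 0):
    idx = np.arange(len(labels))
    train_idx, test_idx = train_test_split(
        idx, train_size=train_frac, stratify=labels, random_state=seed
    )  # stratify draws 10% from each category
    return train_idx, test_idx

# Evaluation: accuracy_score(y_true[test_idx], y_pred[test_idx])
```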
5.3. Overall Evaluation
The results of our comparison with the baselines are shown in Table 2. From the table, we find that LLM predictors are generally better than GNN predictors, indicating that text on social networks cannot be adequately represented by pre-trained models alone; the reasoning ability of LLMs allows them to adapt to and better understand these complexities. Our MCL model outperforms all baselines on all six datasets, demonstrating its effectiveness on both Chinese and English datasets. Unlike LLM-based predictors, our approach leverages the multilevel context on social networks, which is crucial for obtaining accurate text embeddings and enhancing the comprehensibility of the model’s decision-making process. In addition, we propose tailored bidirectional dynamic graph attention layers to further distinguish the weight information among nodes, which aligns more closely with the structural characteristics of social networks. Our MCL model excels by fully leveraging multilevel context and graph structure within social networks. As the dataset expands, its performance remains robust and relatively stable, whereas the baselines typically encounter a decline in effectiveness.
5.4. Ablation Study
In this section, we designed ablation experiments to separately analyze the contributions of different levels of context and the bidirectional dynamic attention to the final results.
5.4.1. The Multilevel Context Layers
To analyze the impact of different contexts on semantic embeddings, we conducted ablation experiments by removing the token positions for personal, local, and global context from the input. As shown in Figure 5, removing context at any level affects the final results. The removal of local context has the largest impact, as user interactions in social networks are closely tied to text semantics. In contrast, removing personal context has the least impact, which we attribute to the fact that most user descriptions are either irrelevant or missing, so the personal context falls back to default values, resulting in minimal impact.
5.4.2. The Bidirectional Dynamic Attention Layers
We further analyzed the effect of the GNN backbone in the MCL model by replacing the bidirectional dynamic attention layers with multilayer perceptron (MLP), graph convolutional network (GCN), and GAT layers, respectively. The results of the ablation experiment are shown in Figure 6.
Our observations are as follows: (1) Our MCL model demonstrates the best performance. This indicates that the bidirectional dynamic attention layer indeed captures the structural information and is better suited for the directed nature of social networks. (2) The MLP backbone performs the worst. This is to be expected, as it cannot model graph-based dependencies and relationships, limiting its capacity to capture the complex structure of social networks. (3) The performance of the GAT backbone and GCN backbone is moderate. Although they leverage structural information, they do not distinguish the importance ranking of neighboring nodes, which diverges from the nature of social networks. (4) As the dataset size increases and the structure becomes more complex, the performance of our MCL model is not significantly affected, while the performance of the GAT and GCN architectures tends to deteriorate.
To quantify the contribution of our bidirectional dynamic graph attention layer, we sampled a network containing eight nodes and compared the weight matrices of the traditional GAT layer with those of our bidirectional dynamic attention. As shown in Figure 7, the traditional GAT assigns the highest weight to the second node for all nodes, indicating that the representation of each node is overly influenced by the second node. This does not capture the diversity of user attention in social networks. In contrast, our bidirectional dynamic attention layer produces unique weight rankings for each node, highlighting its effectiveness in capturing the complex structure of social networks.