Popularity Prediction of Online Contents via Cascade Graph and Temporal Information

: Predicting the popularity of online content is an important task for content recommendation, social inﬂuence prediction and so on. Recent deep learning models generally utilize graph neural networks to model the complex relationship between information cascade graph and future popularity, and have shown better prediction results compared with traditional methods. However, existing models adopt simple graph pooling strategies, e.g., summation or average, which prone to generate inefﬁcient cascade graph representation and lead to unsatisfactory prediction results. Meanwhile, they often overlook the temporal information in the diffusion process which has been proved to be a salient predictor for popularity prediction. To focus attention on the important users and exclude noises caused by other less relevant users when generating cascade graph representation, we learn the importance coefﬁcient of users and adopt sample mechanism in graph pooling process. In order to capture the temporal features in the diffusion process, we incorporate the inter-infection duration time information into our model by using LSTM neural network. The results show that temporal information rather than cascade graph information is a better predictor for popularity. The experimental results on real datasets show that our model signiﬁcantly improves the prediction accuracy compared with other state-of-the-art methods.


Introduction
Online social platforms such as Facebook, Twitter, Sina Weibo and Tiktok, promote and widen the spread of online contents by attracting an increasing number of active users. A cascade of information diffusion is formed when an online content (e.g., a tweet, a microblog, or a video) is retweeted and propagated among users. The popularity of an online content is usually measured by the number of users participating in its spreading process, i.e., the number of users in the cascade graph. Predicting the future popularity of online content is greatly valuable and practically meaningful in many fields. For example, predicting the occurrence of large cascades as early as possible can create great commercial value in the field of commodity promotion and prevent the wide spread of fake news in the field of social security. Therefore, the popularity prediction task has been a hot research topic and widely used in many real-world applications, e.g., campaign strategy [1], social media recommendation [2] and viral marketing [3].
However, predicting the future popularity of online content is a challenging task since the information diffusion process is affected by many complicated factors and fraught with uncertainty. Researchers [4,5] have demonstrated that popularity prediction can be made with higher accuracy with the observation of early spread of online content, thus existing popularity prediction methods generally try to seek out the predictive features in the early state of information cascades and model the relationship between these predictive features and future popularity.
Inspired by recent successful application of deep learning in many fields, some researchers try to predict popularity through end-to-end deep learning methods. Compared with traditional approaches, deep learning based approaches for popularity prediction task [6,7] can automatically learn the complex factors affecting information diffusion without time consuming feature engineering and strong assumptions of underlying diffusion mechanism, and have achieved significant improvement of prediction accuracy. As a powerful approach for representation learning on graph data, graph neural networks [8] have been successfully applied to the field of social network analysis [9][10][11]. Early users involved in the cascade graph and early popularity have been proven to be highly related to future popularity [4,5,12]. Therefore, the efficient modeling of cascade graph characteristics is crucial to popularity prediction task. Recent approaches generally utilize graph neural networks to obtain low dimensional representation of information cascade graph, and model the non-linear relationship between the learned low dimensional representation and future popularity. However, existing graph neural network based popularity prediction methods confront some critical shortcomings: Firstly, existing simple graph pooling strategies, e.g., average or summation operations treat all users in the information cascade graph equally [7,13,14], but in reality, some users in the information diffusion process are more important than others, while some users are less relevant and have little effect on further cascade propagation. Therefore, existing graph pooling strategies normally generate inefficient cascade graph representation which have limited ability to express the information spread process, and lead to poor accuracy for popularity prediction.
Secondly, existing graph neural network based methods [7,[13][14][15][16] generally overlook the temporal information such as the time interval between two infections or reshares in the propagation process. However, temporal information [5,12,17] is a key predictive feature for popularity prediction task. Intuitively, a smaller mean time interval between two reshare event usually means that the information item itself is more popular in nature.
To address the aforementioned questions, we propose a novel framework integrating the cascade graph information and temporal information to predict the future popularity of online contents. We generate an effective cascade graph representation by integrating graph neural network with a novel graph pooling strategy. We focus on the top k important users and exclude the noise caused by other irrelevant users during graph pooling process. Meanwhile, we incorporate the inter-infection duration time information into our model by using Long Short Term Memory (LSTM) network [18]. By comparison of model variants, we come to a conclusion that temporal information is a better predictor for popularity prediction than cascade graph information. Our main contributions are listed as follows: • We integrate the graph neural network framework-GraphSage [19] with a novel graph pooling methods-top-K pooling [20,21], and obtained a more effective cascade graph representation which improve the general capabilities of our model to capture the underlying diffusion characteristic. • We incorporate the inter-infection duration time information into our model by using Long Short Term Memory (LSTM) network, and make up for the deficiencies of existing graph neural network based approaches. • The experimental results on two publicly available real-world datasets show that our proposed method can significantly improve the cascade prediction accuracy compared to several state-of-the-art competitive baselines.

Background and Related Works
Predicting the popularity of online content is a hot research topic in recent years. Lots of researchers have studied the changing popularity of online content, such as short texts [22], photos [23], videos [24,25], etc. Traditional popularity prediction methods mainly include feature-based approaches and generative approaches. Feature-based approaches identify and extract hand-crafted features, including content features [26,27], structural features [22,28], temporal features [12,29], etc., and use machine learning algorithms to make predictions. This type of approaches usually require heavy feature-engineering and their performance strongly depends on the effectiveness of extracted features. Generative approaches devote to model the diffusion process by probabilistic statistical generative approaches, e.g., epidemic models [30,31] and point processes [17,[32][33][34], but the performance is normally limited by its strong assumptions of underlying diffusion mechanisms.
With the rapid development of deep learning and its successful applications in many fields, many researchers have applied deep learning technology to social network analysis [35][36][37], such as rumor detection, spam detection, popularity prediction, etc. Cao [38] uses coupled graph neural networks to model the global social network which refers to friendship graph and the sequence of early infected users, while assuming that information propagation usually occurs along the social network structure between users. However, in many scenarios, due to the privacy protection policies of social network platforms, it is difficult for us to obtain the global social network. Meanwhile, affected by the recommendation mechanism of the platform, users may also participate in the dissemination of information that is not posted by their friends. Therefore, some recent deep learning based popularity prediction works [6,7,14] resort to the observed early social responses, i.e., early cascade graph characteristics and early temporal information in the diffusion process, in order to improve the prediction accuracy. Information cascade graph is formed in the process of information diffusion, in which nodes represent users participating in the propagation and edges represent the propagation relations among users. Cascade graph reflects the diffusion status under the influence of many complicated factors, and it is important to model the cascade graph characteristics in popularity prediction task. Below we introduce the recent deep learning based approaches which utilize cascade graph and temporal information to make popularity prediction respectively.

Cascade Graph Representation
Compared with traditional methods, deep learning based approaches can learn the complicated underlying diffusion patterns in an automatic way without heavy feature engineering and strong assumption of underlying diffusion mechanisms. As a pioneer of deep learning based approaches in popularity prediction, DeepCas [39] is an end-to-end deep learning method which utilize random walk method to sample paths from cascade graph in the context of global graph structure. It also use attention mechanism to integrate the cascade path embedding into cascade graph embedding for the cascade size prediction task. DeepHawkes [6] improves the performance of popularity prediction by taking time decay effect into account when integrating the cascade path embedding into cascade embedding. Liu [40] adopts random walk processes similar with DeepCas, and generates cascade graph representation from propagation path embedding. This kind of approaches learn the cascade graph representation based on the level of propagation paths rather than the cascade graph level. Due to inefficient local structural embedding, they can not generate the effective cascade graph embedding which lead to poor prediction accuracy.
Recent studies [7,[13][14][15][16] focus on learning the cascade graph level representation using graph neural networks. Recurrent cascades convolutional network-CasCN [7] propose to model the cascade graph with Graph Convolutional Network [41] and use LSTM neural network to model the dynamic changing characteristic from the sequence of cascade graph snapshots. Similarly, Cascade2vec [13] use Graph Residual block to learn the cascade graph representation and employ recurrent neural network to learn the temporal dependencies between cascade graph snapshots. Graph Attention Network (GAT) [42] is used in DMT-LIC [15] which makes micro and macro prediction simultaneously, DMT-LIC captures both the information of cascade graph and infected user sequence in the diffusion process. However, cascade graph representation in these graph neural network based methods is usually generated by a simple graph pooling strategy such as average or sum operation, which cannot distinguish user importance and results in inefficient cascade graph representation.

Temporal Representation
Existing deep learning based approaches often focus on learning the changing representation of cascade which grows over time but overlook the time series information such as the time interval between two infections during propagation. In DeepHawkes [6], temporal information is implicitly taken into account when integrating the cascade path embedding into cascade embedding by taking consideration of time decay effect. Researches [7,13] focus on modeling the dynamic changing characteristic from the sequence of cascade graph snapshots while neglecting the temporal information. Zhou [14] utilizes the inter-infection duration as the edge weight of cascade graph to incorporate temporal information with structural information simultaneously, but it is an artificially formulated manner with an implication that shorter infection time means closer relationship between users, but in reality, the infection-time duration does not only depend on the strength of the relationship between users but also may be caused by random factors. The methods mentioned above ignore the innate temporal pattern in the diffusion process which is a key predictive feature for popularity prediction tasks. Consequently, we exploit the explicitly temporal information in our model and demonstrate the importance of temporal information experimentally.

Problem Definition
We formulate the cascade prediction task as a regression problem which aims at predicting the number of users who will participate in the propagation of an online content, i.e., the size of information cascade. For an online content such as a tweet c i , we denote its cascade graph at observation time t o as: is the user set who have taken part in the cascade (e.g., retweet the tweet), E t o i denotes the interact relationship (e.g., retweet, comment) between users. We use T t o i = {t 1 , ..., t |V to i | } to represent the temporal information when users take part in the cascade, |V t o i | represents the number of users who have taken part in the cascade at observation time t o .
Given cascade graph information G t o i and temporal information T t o i of cascade c i at observation time t o (e.g., 1 h), our goal is to make prediction of the incremental popularity P i after a fixed time interval ∆t (e.g., 1 day), where Figure 1 shows a simple illustration of our problem, given early diffusion information including cascade graph and temporal information, we want to learn a regression function that maps the early diffusion information to its incremental popularity:

Methods
In this section, we demonstrate the details of our model and explain the way we utilize the cascade graph information and temporal information to make popularity prediction. Existing studies [7,13] generally generate cascade graph representation by taking the average or summation value of all node representations, which fails to discriminate between different users. We adopt a novel graph pooling strategy to distinguish the importance of users in the graph pooling process and improve the ability to represent the characteristic of cascade graph. Besides, we model the temporal information in the early stage of information diffusion which is often neglected by existing deep learning methods. Figure 2 gives the overall framework of our proposed model. Our model consists of three main components: (a) Cascade graph representation: we incorporate the graph neural network GraphSage [19] with a novel graph pooling method to get the integrated cascade graph representation. In the graph pooling process, we emphasize the important users and exclude the noise brought by less relevant users by supervised learning the importance coefficient for each user. (b) Temporal representation: we use the widely adopted LSTM model and sample mechanism to generate the temporal representation from the inter-infection duration time information. (c) Predictor: as a regression problem, we concatenate the cascade graph representation and temporal representation, and feed the embedding into multi-layer perceptrons (MLPs) to make the popularity prediction.

Cascade Graph Representation
Given an online information item c i , and its cascade graph we want to learn an effective cascade graph representation which reflect the actual characteristics of information diffusion. Graph neural networks are an effective deep learning framework for representation learning of graph data, we utilize the graph neural network-GraphSage as the graph convolutional layer to learn the representation of each node and a novel graph pooling method to generate cascade graph representation from node representation.
Each node v in cascade graph is initially represented as a one-hot vector q v ∈ R N , where N is the total number of users in the dataset. Then all nodes are converted to a low-dimensional dense representation by a randomly initialized embedding matrix D is an adjustable dimension. Note the node embedding is supervised learned during the training of the model. The node embedding learning process in GraphSage has two steps: aggregate information from neighborhood and update its own embedding. Firstly, each node v ∈ V t o i aggregates the representation of its immediate neighborhood. We employ the max pooling aggregation strategy which can effectively capture different aspects from its neighborhood, N (v) is the neighborhood set of node v, h k−1 u refers to the embedding of node u at the k − 1 graph convolutional step, W k p , b k p are learnable parameters of neural networks which define how to agrregate information from neighborhood in the k-th convolutional layer, σ refers to the sigmoid function. In the original work of GraphSage, the author uniformly samples a fixed-size set of neighbors instead of using all neighbors in order to keep the training batch size fixed. However, in our scenario, the number of node's neighbor generally obey power-law distribution. Using fixed-size set of neighbors is not a rational choice which might lead to severe information loss. Thus we use the full set of neighborhood by taking advantage of an effective GNN framework pytorch-geometric [43].
Then, we update the node embedding h k v using equation below: Concat() means the concatenation of embeddings, W k is the weight matrix used to update node embedding in the k-th convolutional layer. Figure 3 shows us a simple example of graph pooling process. Cascade graph representation are usually generated from node representation through graph pooling process such as average or summation pooling operations. Existing graph pooling strategies for popularity prediction treat all users in the information cascade graph equally and cannot distinguish user importance. Some users are more important than others in the information diffusion process, this reality motivates us to adopt a different graph pooling strategy for the generation of cascade graph representation. Top-K pooling is one of the graph pooling strategy [20,21] which has been demonstrated to be effective on many graph classification benchmarks. The downsampling method proposed in top-K pooling within the pipeline of graph neural network is analogous to image downsampling used in CNNs, helps us emphasize the important users and excludes the noise brought by irrelevant users. In order to decide which nodes should be dropped, a learnable vector p is introduced in and an importance coefficient is calculated across the node set by the equation: Figure 3. An example of graph pooling process. Cascade graph representation h g is generated from node representation by graph pooling process.
We select the top-ranked nodes from the cascade graph according to the importance coefficient s v and drop the nodes with low coefficient by a predefined downsample ratio r. We denote the induced subgraph as G . We use the learned importance coefficient s v as user weight and perform global max pooling on the induced subgraph to generate the cascade graph representation h g :

Temporal Representation
Temporal information in the early stage of diffusion process has been proven to be a salient predictor of future popularity in traditional approaches [5,12]. However, recent graph neural network based methods normally overlook this type of information and lead to suboptimal prediction results.The time interval between two propagation events reflects the diffusion speed of information. Shorter propagation time interval usually means that the information itself is more attractive and able to trigger wider diffusion. For each online content, we extract the inter-infection duration time where x t degenerates into a scalar value, h t−1 is the hidden state of previous time step, max is the max time step of all cascades and o i ∈ R d , d refers to the dimension of hidden states. We can use the final hidden states as our temporal information embedding, however, the output at different time step also contain useful information. The temporal information of first several infections are important features for popularity prediction [14], therefore, we should pay more attention on the temporal information of these early participants. Since the cascade size of online content usually follows a pow-law distribution, we sample the output sequence using the index sequence generated by {n * 10 m , n ∈ {1, 2, . . . 10}, m = {0, 1, 2}}, and use a weighted sum to get the temporal representation from the outputs of LSTM neural network: where N is our sample size, α is the attention vector learned automatically during training. The weighted sum mechanism can better model time decay effect during the information propagation than predefined time decay functions like exponential functions.

Predictor
The final component of our model is the predictor layer, which maps the learned predictive features in the early cascade data to its future popularity. Specifically, it takes the cascade graph embedding and temporal embedding learned in the previous step as the input, and output the future popularity of the cascade. We concatenate the cascade graph embedding h g and temporal embedding h t as the representation for cascade C i and feed the representation into a two-layer Multiple Layer Perception (MLP): Similar as previous work, we use the mean square log-transformed error (MSLE) as the loss function: N is the total number of cascades,P i is the predicted increment popularity, P i is the actual increment popularity for cascade C i . We take log-transformation for the cascade size following previous work to avoid the situation where the training process is dominated by large cascades. Model parameters are trained by minimizing the loss function and optimized using the Adam algorithm given its efficiency and ability to avoid overfitting.

Datasets
We evaluated the effectiveness and generalizability of our model on popularity prediction task using two real-world datasets. Following many previous works [6,7,14], we adopted mean squared log-transformed error (MSLE) as our evaluation metric to measure the discrepancy between the actual popularity values and predicted values. Sina Weibo is a popular twitter-like microblog platform, the dataset provided in paper [6] is a publicly available real-world dataset, which has been widely used in many recent related works [7,13,15]. We used this dataset to predict the final retweet number of microblog. HEP-PH [7] is a paper citation dataset and was used to predict the citation count of paper.

•
Sina Weibo: The dataset contained all microblogs posted on 1 June 2016, all their retweets and the corresponding retweet time within 24 h were recorded. The node in the cascade graph was the user who retweeted the microblog, and the edge between users represented their retweet relationship. Following previous works, we filtered out tweets posted in the midnight since they usually gained less attention due to less active users online. We also dropped microblogs whose retweet number was less than 10 or more than 1000 within the observation time window, because large cascades were rarely few in number and might have dominated the training process. • HEP-PH: The dataset included paper citation relationship and paper publication time from January 1993 to April 2003. The node in the cascade graph represented the paper, and the edges referred to the corresponding citation relationship.
The detail statistics of datasets are shown in Table 1. For different observation time window T, we list out the total number of cascades, nodes, and edges in our datasets. As the observation time window T enlarged, we got more qualified cascades whose size was more than 10 and less than 1000. We also calculated the average cascade size representing average number of users participating in each cascade within the observation time. As we can see from Figure 4, the distribution of cascade size followed a power-law distribution, and most information cascades' size was small. Figure 5 shows the average cascade size at different time normalized by the final average cascade size. As we can see, the microblog's popularity already reached 80% after 10 h since it was posted, so we set our observation time to be 1 h, 2 h, and 3 h when the normalized popularity was less than 60% to avoid saturation within the observation time window. We chose the observation time for HEP-PH dataset to be 3 years, 5 years and 7 years for the same reason.  We calculated the distribution of the mean inter-infection time and cascade size for the microblog dataset. Mean inter-infection time was defined as 1 M ∑ M j=1 (t j − t j−1 ) , where M is the number of users in the cascade. Figure 6a illustrate the distribution of mean inter-infection time. As shown in Figure 6b, most large cascades (cascade size was large than 100) had a quite small mean inter-infection time-less than 4 min. Furthermore, we calculated the actual ratio of large cascades based on mean inter-infection time in Figure 6c. More than 90% large cascades in the microblog dataset were propagated with a mean inter-infection time less than 300 s. The statistical results showed that temporal information such as mean inter-infection time was highly correlated with future popularity.

Baselines
We compared the performance of our method with several state-of-art deep learning based approaches including the following methods: • DeepCas [39]: is an end-to-end deep learning method which extracts structural information of cascade graph by taking random walk in the context of global graph, and use bi-GRU neural network for the cascade size prediction task. • DeepHawkes [6]: bridges the gap between deep learning and self-exciting point process by learning the cascade graph structural representation based on the level of propagation paths and takes time decay effect into consideration when integrating path representation into cascade representation. • CasCN [7]: demonstrates the effectiveness in applying the graph neural network framework to generate the representation of cascade graph. It claims to exploit both the temporal and structural information by extracting cascade subgraphs from cascade graph and using LSTM neural network to model the dynamic change of cascade graphs.
The hyper-parameters of baselines kept the same as those used in model CasCN [7]. As for our model, the embedding dimensionality of nodes was 32, the batch size was set to be 16. The dimension of cascade graph embedding was 32, and temporal embedding was 16. The learning rate for our model was 5 × 10 −3 . We used two graph convolutional layers in the component of cascade graph representation.

Variants
We provided three variants of our model to investigate the effectiveness of each component. The variants were generated by removing some part from the proposed model. We demonstrated the contribution of each component by comparing their prediction accuracy experimentally.  Table 2 shows the experimental results of baselines and our proposed model on two datasets. Following previous work [6,7,13], we chose mean square log-transformed error (MSLE) as the evaluation metric. In general, the performance of our model was better than the baselines on the evaluation of MSLE, surpassing the sub-optimal CasCN by around 11%. When comparing different observation time window T, we can see that as the observation time window enlarged, the MSLE error decreased, because more information was revealed to the model. The results demonstrated that our model could effectively exploit both temporal information and cascade graph information to predict the future popularity. As early models which apply deep learning to cascade prediction tasks, DeepCas and DeepHawkes focused on the representation learning of the propagation path, and the ability of learning global cascade information was weak, the prediction results were not very satisfactory compared to recent graph neural network based approach CasCN.
CasCN modeled the cascade graph with a graph convolutional network and used LSTM neural network to model the dynamic changing characteristic from the sequence of cascade subgraphs, but also overlooked the importance of temporal information. Our proposed model explicitly modeled the inter-infection duration time information by using LSTM neural network, and reduced the prediction error by around 11% comparing to CasCN.
We list the performance of our model variants on the evaluation of MSLE in Table 3. In order to compare the performance of baselines and our model variants more intuitively, we also present a line chart in Figure 7. As we can see, our variants VGraph which utilized graph neural network to capture cascade graph information achieved a better prediction result compared with DeepCas and DeepHawkes, demonstrating the advantage of graph neural network based approaches.  Figure 7. the line chart of performance comparison between baselines and model variants.
As shown in Table 3, the performance of our variant VGraph had a sub-optimal prediction result compared with CasCN, demonstrating the importance of modeling the changing characteristic of cascade graph. For model simplicity, we used the representation of cascade graph instead of dynamic cascade subgraphs and left modeling the dynamics as future work. We leveraged the temporal information in our model and made up for the deficiencies of existing methods. The model variant-VTemporal outperformed CasCN with an approximately 9% drop of MSLE error. The comparison of model variant-VTemporal and CasCN demonstrated that temporal information rather than cascade graph information was a better indicator for predicting future popularity, and confirmed the notably importance of temporal information in popularity prediction task. By jointly model the cascade graph information and temporal information, the prediction accuracy of our proposed model was significantly improved compared with state-of-art deep learning based methods.

Variants Comparison
The model variant VGraph (mean pool) adopted mean pooling strategy to generate the cascade graph representation, using the whole set of nodes in the cascade without differentiating node importance. In comparison of VGraph (mean pool) and VGraph, we demonstrated the effectiveness of top-k graph pooling strategy with a performance improvement of around 2%. We also compared the experiment results on different downsample ratio r and found its best value was around 0.9 which makes intuitive sense, since noisy nodes were much less than useful nodes in reality, getting rid of too many nodes could lead to severe information loss.
The model variant VTemporal proved the significant importance of temporal information in cascade popularity prediction task. The results were consistent with previous study [44] which only utilized temporal convolutional networks to capture the temporal information for popularity prediction of messages on social medias. Further on, as Figure 7 shows, model variant VTemporal achieved a much better prediction accuracy compared with variant VGraph, and confirmed that temporal information was a better predictor of popularity than cascade graph information. The results were consistent with the past work of Shulman [5], who proved features of early adopters were weak predictors of popularity compared to temporal features.

Latent Representation
End-to-end deep learning approaches on popularity prediction are often not interpretable. In order to have a more intuitive understanding of the learned representation in our model, we used t-SNE [45] to visualize the relationship between the learned representation and cascade size in Figure 8. T-SNE was mostly used to visualize high-dimensional data and project it into low-dimensional space. Cascade graph representation and temporal representation of cascades in test dataset generated by our model were projected into points in a 2D plane. Different colors were used to distinguish cascade size, the darker color point represented the larger cascade size. The distribution of color indicated the strength of connection between the learned representation and the cascade size, i.e., if points spatially close to each other shared the same color, it meant the representation was well correlated with the cascade size. A clustering phenomenon could be seen from both cascade graph representation and temporal representation in Figure 8, indicating that both cascade graph information and temporal information were useful for cascade popularity prediction. We also notice that the clustering phenomenon of temporal information was clearer than that of cascade graph information, confirming that temporal information was more relevant with popularity and a better predictor for popularity than cascade graph information.

Conclusions and Future Work
In this article, we propose a novel deep learning model to predict the popularity of online content by modeling the early cascade graph information and temporal information. To distinguish the importance of users and improve the ability to represent the cascade graph characteristics, we utilize the graph neural network with a novel graph pooling method. We also incorporate temporal information into our model by using LSTM neural network and make up for the deficiency of existing graph neural networks based popularity prediction methods. Experimental evaluations on two real-world datasets show that our model significantly improves the accuracy of popularity prediction comparing with other state-of-the-art methods. The experimental results demonstrate the notable importance of temporal information in popularity prediction and provide us an intuitive guidance for the subsequent popularity prediction work. In the future, it may be an interesting research point to explore the relationship between the micro-level prediction task which aims at predicting the next infected user and the popularity prediction task. Additionally, analyzing the diffusion pattern of information on different topics and predicting the popularity based on topic information may be a possible research direction.