Applied Sciences
  • Article
  • Open Access

5 November 2022

Retweet Prediction Based on Heterogeneous Data Sources: The Combination of Text and Multilayer Network Features

1 Faculty for Informatics and Digital Technologies, University of Rijeka, 51000 Rijeka, Croatia
2 Center for Artificial Intelligence and Cybersecurity, University of Rijeka, 51000 Rijeka, Croatia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Natural Language Processing (NLP) and Applications

Abstract

Retweet prediction is an important task in the context of various problems, such as information spreading analysis, automatic fake news detection, social media monitoring, etc. In this study, we explore retweet prediction based on heterogeneous data sources. In order to classify a tweet according to the number of retweets, we combine features extracted from the multilayer network and text. More specifically, we introduce a multilayer framework for the multilayer network representation of Twitter. This formalism captures different users’ actions and complex relationships, as well as other key properties of communication on Twitter. Next, we select a set of local network measures from each layer and construct a set of multilayer network features. We also adopt a BERT-based language model, namely Cro-CoV-cseBERT, to capture the high-level semantics and structure of tweets as a set of text features. We then trained six machine learning (ML) algorithms: random forest, multilayer perceptron, light gradient boosting machine, category-embedding model, neural oblivious decision ensembles, and an attentive interpretable tabular learning model for the retweet-prediction task. We compared the performance of all six algorithms in three different setups: with text features only, with multilayer network features only, and with both feature sets. We evaluated all the setups in terms of standard evaluation measures. For this task, we first prepared an empirical dataset of 199,431 tweets in Croatian posted between 1 January 2020 and 31 May 2021. Our results indicate that the prediction model performs better by integrating multilayer network features with text features than by using only one set of features.

1. Introduction

Nowadays, social media platforms and online social networks have become an important source of information. Users tend to rely on online communication platforms for getting information, publishing posts that reflect their interests, views, and activities [1]. At the same time, users can express their opinions about other posts via different forms of feedback, such as reposts, quotes, mentions, replies, likes, etc. All these activities affect information spreading on social media [2]. In the last two decades, online social networks have increased the spread of information, but also misinformation and disinformation, which can lead to an infodemic and other negative side effects [3,4]. Therefore, exploring patterns of information spreading on social networks is a significant aspect of research in the domain of disinformation detection and infodemic. The primary motivation for this research was the analysis of crisis-related communication on social networks. However, the proposed approach can be applied in other domains for tasks related to retweet prediction.
Because social media platforms such as Twitter, Facebook, Instagram, and Weibo play an increasingly important role, people are more likely to turn to them in search of information during a global crisis. They can serve as an essential communication platform in real-world crises, emergencies, or disasters [5,6]. Social media may even influence the course of global crises, particularly in the context of epidemics, climate change, migration crises, economic crises, or wars. For example, the outbreak of the COVID-19 disease caused a significant increase in social media usage among the public, and it seriously affected the public’s understanding of the COVID-19 risk [7]. In some countries, there were many negative attitudes toward vaccines and anti-pandemic measures promoted on social networks [8]. Therefore, information-spreading analysis during a global crisis is of great importance as one of the steps of social media monitoring (infoveillance).
Twitter is one of the largest social networks with around 330 million monthly active users [9]. Consequently, it is one of the most studied social networks. Recently, it has frequently been used for monitoring and tracking different aspects of healthcare information and public disease [8,10,11]. Among all user behaviors on social media, retweets are considered to be one of the primary ways of spreading information on Twitter [12,13]. A large number of studies deal with the prediction of information spreading on Twitter and other social networks. Many complex factors may influence the patterns of information spreading. Thus, different studies propose different sets of features for retweet prediction. Previous methods studied the problem by using various linguistic features, personal information of users, or network properties [12].
In some recent studies, authors combined heterogeneous data sources such as information content, network structure, dynamics of spreading, information metadata, and other properties [1,13,14,15,16,17]. However, there are properties that have not been fully explored in the task of retweet prediction. One less-studied approach is the use of multilayer network properties as features, especially in combination with other features from heterogeneous data sources. The multilayer network is a formalism that captures various sorts of relationships over network data [18,19]. Used in the context of a social network, a multilayer network can represent different actions within the social network, such as follow, share, quote, mention, or reply, as separate network layers. Because each action has a different impact on information spreading, it is possible in this way to make a fine-grained differentiation between layers and to include all this information as predictors of retweeting. We have already shown that a multilayer network structure is fundamentally more expressive than individual layers in examples of modeling a multilayer language network [20] and a multidimensional knowledge network [21]. In [22], the authors use multilayer network features for disinformation detection in US and Italian news spreading on Twitter.
Inspired by these results, we decided to employ multilayer network features in the more general task of information-spreading prediction. However, in our approach, we construct a different multilayer network of Twitter and select different network measures to construct a multilayer network set of features. In addition, we combine multilayer network features with text features. To the best of our knowledge, this is the first time that anyone has attempted to use this set of multilayer network features in the task of retweet prediction and the first time anyone has attempted to combine multilayer network features with text features. We formalized our approach by introducing a multilayer framework for the representation of key elements of communications on social networks.
The main objective of this study is to explore the potential of the multilayer network measures as the set of features in the task of retweet prediction. Additionally, we investigate whether the multilayer network features combined with text features perform better than just one set of features. Therefore, this study explores how message features extracted from heterogeneous data sources may affect tweet spreading in terms of retweeting.
Multilayer network features are extracted from the multilayer model of the social network within which a message is spreading. For the purpose of retweet prediction, we propose and construct a multilayer network representation with four layers representing the actions of following, mentioning, and replying, together with a layer of tweets, and select several network measures from each layer. The text of each tweet is represented as a low-dimensional vector (embedding) that captures its semantics and structure. More specifically, we adopt a BERT-based language model, namely Cro-CoV-cseBERT [8], to represent tweets as embeddings, which we use as a set of text features.
We model the prediction problem as a binary classification task in which the first class contains tweets with just one retweet and the second class contains tweets with more than one retweet. Next, we explore the performance of different feature sets by conducting an extensive set of experiments in which we train six machine learning models in three different setups: (i) classification based on text features, (ii) classification based on multilayer network features, and (iii) classification based on text and multilayer network features. More precisely, we trained the following classifiers: random forest (RF), multilayer perceptron (MLP), light gradient boosting machine (LGBM), category-embedding model (CEM), neural oblivious decision ensembles (NODE), and attentive interpretable tabular learning (TabNet model). We evaluated the performance of trained classifiers on three different sets of features in terms of standard evaluation measures: accuracy, precision, recall, and F1 score on a large dataset of tweets. For this purpose, we prepared an empirical dataset of 199,431 tweets in the Croatian language posted during the pandemic period between 1 January 2020 and 31 May 2021.
Our main research question is whether the use of multilayer network features in combination with features from heterogeneous data sources yields better results in terms of classification evaluation measures over just text features. Additionally, we are interested in understanding which of the above features are most effective in the classification task, and we analysed this by using the SHAP approach.
To summarize, the main contributions of this study are as follows.
  • We propose a multilayer framework formalism for Twitter representation based on a multilayer network and select a set of measures from each layer to be extracted and combined with the metadata as the set of multilayer network features.
  • We conducted a set of experiments on a dataset of tweets using separate text features and multilayer network features and their combination and evaluated the performance of six machine learning classifiers.
  • We performed an analysis of feature importance to determine the impact of two sets of features for the task of retweet prediction and studied various multilayer network features chosen by using SHAP.
The rest of the paper is organized as follows. Section 2 discusses some of the existing research in the prediction of retweeting. Section 3 describes datasets, machine learning classifiers and the methods utilized in this study. Section 4 presents the results and analysis of our proposed approach. Section 5 discusses the proposed approach. Finally, Section 6 concludes our work.

3. Materials and Methods

3.1. Multilayer Framework Definition

In this section, we introduce a multilayer framework as a formalism that can capture various aspects of message spreading on Twitter. This is an extension of our previous work published in [4], in which we proposed a communication multilayer framework for representing communication on social media. Here, we define a framework based on the multilayer network that we use to represent different users' actions on Twitter. Within the multilayer framework, we aggregate the multilayer social network with a set of metadata corresponding to text messages published on social media. Next, we select a set of network measures from each layer and define a set of multilayer network features used to train six ML models for retweet prediction.
According to [18], a multilayer network is defined as a pair:
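$$M = (G, C), \tag{1}$$
where
$$G = \{ G_\alpha ;\ \alpha \in \{1, \ldots, m\} \} \tag{2}$$
is a family of networks (graphs) $G_\alpha = (V_\alpha, E_\alpha)$ called network layers of $M$ and $C = \{ E_{\alpha\beta} \subseteq V_\alpha \times V_\beta ;\ \alpha, \beta \in \{1, \ldots, m\},\ \alpha \neq \beta \}$ is the set of interconnections between nodes of different layers $G_\alpha$ and $G_\beta$ where $\alpha \neq \beta$.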
Similar to work presented in [4], layers are annotated as numbers from the set $\{1, \ldots, m\}$, where $m$ is the number of layers. Like one-layer networks, multilayer networks can be directed or undirected, weighted or unweighted. Note that communication in social networks is best described by using a weighted and directed multilayer network.
Next, we introduce and consider a set $T$ of metadata related to text messages posted on social networks. Generally, set $T$ includes all messaging metadata that is available; however, the concrete metadata represented within the framework may vary depending on the task. In the case of Twitter and the retweet-prediction task, this metadata includes information such as the number of retweets, quotes, mentions, etc. In the context of network analysis, these metadata may be stored as attribute vectors of the nodes that represent messages. Finally, the multilayer framework is defined as a tuple:
$$MF = (M, T). \tag{3}$$
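The definitions in Equations (1)–(3) can be mirrored by a small data structure. The following is only a minimal sketch under our own assumptions: the class names, the dictionary-based edge storage, and the example node identifiers are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]            # (source node, target node)
Layer = Dict[Edge, float]         # weighted, directed layer G_alpha


@dataclass
class MultilayerNetwork:
    """M = (G, C): layers G_alpha plus inter-layer link sets E_alpha_beta."""
    layers: Dict[int, Layer] = field(default_factory=dict)
    # Keyed by (alpha, beta) with alpha != beta.
    interlinks: Dict[Tuple[int, int], Set[Edge]] = field(default_factory=dict)

    def add_edge(self, alpha, u, v, weight=1.0):
        # Repeated actions between the same pair accumulate into the weight.
        layer = self.layers.setdefault(alpha, {})
        layer[(u, v)] = layer.get((u, v), 0.0) + weight

    def add_interlink(self, alpha, beta, u, v):
        if alpha == beta:
            raise ValueError("interconnections must join different layers")
        self.interlinks.setdefault((alpha, beta), set()).add((u, v))


@dataclass
class MultilayerFramework:
    """MF = (M, T): the multilayer network plus per-message metadata T."""
    network: MultilayerNetwork
    metadata: Dict[str, Dict[str, float]] = field(default_factory=dict)


mf = MultilayerFramework(MultilayerNetwork())
mf.network.add_edge(2, "user_a", "user_b")            # one link in layer 2
mf.network.add_edge(3, "user_a", "user_b")            # one link in layer 3
mf.network.add_edge(3, "user_a", "user_b")            # repeated -> weight 2
mf.network.add_interlink(2, 1, "user_a", "tweet_1")   # link across layers
mf.metadata["tweet_1"] = {"retweet_count": 3, "mention_count": 1}
```

Storing each layer as a mapping from node pairs to weights keeps the sketch directed and weighted, which is the kind of multilayer network the text recommends for social-network communication.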

3.2. Twitter Communication Represented by Using Multilayer Network

Given the framework $MF$ defined in Equation (3), we model the Twitter network as a multilayer framework consisting of four layers; thus, $m = 4$. Each layer represents one aspect of communication on Twitter, as follows.
The first layer is the tweet layer, $G_1 = (V_1, E_1)$, in which nodes represent Twitter messages and two nodes $i$ and $j$ are connected with a directed link if messages $i$ and $j$ have at least three words and/or hashtags in common. The direction of the link follows the timeline, from the earlier tweet to the later tweet. The link weight is defined as the number of common words/hashtags. The second layer is the follower layer, $G_2 = (V_2, E_2)$, in which nodes represent Twitter users and two nodes $i$ and $j$ are connected with a directed link if user $j$ follows user $i$. This layer is inherently unweighted; because the multilayer network as a whole is weighted, all weights in this layer are set to 1. The third layer is the reply layer, $G_3 = (V_3, E_3)$, in which nodes represent Twitter users and two nodes $i$ and $j$ are connected with a directed link if user $j$ replies to user $i$. The weight is defined as the number of replies. The fourth layer is the mention layer, $G_4 = (V_4, E_4)$, in which nodes represent Twitter users and two nodes $i$ and $j$ are connected with a directed link if user $j$ mentions user $i$. The weight is defined as the number of mentions.
Next, we define a set of interconnections between nodes of different layers. The tweet layer is connected with the follower layer in such a way that there is a directed link from the user (follower) to every tweet that he/she posts. The reply layer is connected with the tweet layer in such a way that there is a directed link from the user to the tweet if the user replies to this tweet. Analogously, the mention layer is connected with the tweet layer in such a way that there is a directed link from the user to the tweet if the user mentions this tweet. The remaining layers, $G_2$, $G_3$, and $G_4$, form a multiplex network in which interconnections are established between the same nodes.
A multiplex network is a special case of the multilayer network in which interlayer links can only connect nodes that represent the same node in different layers.
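The construction rule of the tweet layer (a directed link from the earlier to the later tweet when both share at least three words and/or hashtags, weighted by the number of shared items) can be sketched as follows. This is an illustration only: the function name, the whitespace tokenisation, the lower-casing, and the example tweets are our assumptions, and the actual preprocessing used in the study may differ.

```python
from itertools import combinations


def build_tweet_layer(tweets, min_common=3):
    """Return {(earlier_id, later_id): weight} for tweets sharing enough tokens.

    `tweets` is a list of (tweet_id, timestamp, text) tuples. Splitting on
    whitespace and lower-casing are simplifying assumptions.
    """
    tokenised = [
        (tweet_id, ts, set(text.lower().split()))
        for tweet_id, ts, text in tweets
    ]
    # Sort by time so that every pair is visited as (earlier, later).
    tokenised.sort(key=lambda item: item[1])
    edges = {}
    for (id_a, _, tok_a), (id_b, _, tok_b) in combinations(tokenised, 2):
        common = len(tok_a & tok_b)
        if common >= min_common:
            edges[(id_a, id_b)] = common   # weight = number of shared tokens
    return edges


tweets = [
    ("t2", 20, "new measures announced #covid19 from monday"),
    ("t1", 10, "announced new measures #covid19"),
    ("t3", 30, "sunny weather today"),
]
edges = build_tweet_layer(tweets)
```

Comparing all pairs is quadratic in the number of tweets, so a dataset of the size used here would likely need an inverted index from tokens to tweets; the sketch keeps the pairwise loop for readability.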
The model of Twitter represented as the multilayer network is illustrated in Figure 1. The figure is taken from [4] and adapted to the experiment of this study.
Figure 1. Twitter represented via multilayer network. Communication on Twitter captured via four layers of multilayer network: $G_1$, tweet layer; $G_2$, follower layer; $G_3$, reply layer; and $G_4$, mention layer. The interconnections between nodes of the tweet layer and the follower layer are established if the user (node) from $G_2$ posts a tweet (node) from $G_1$. Other interlayer links are omitted from this figure for better visibility; however, $G_2$, $G_3$, and $G_4$ are connected as a multiplex network, as explained in the text above.
Further explanations and details of the multiplex, which is illustrated in Figure 1, the connections between the nodes of the same or different layers, and the weights can be found in [4].

3.3. Multilayer Network Features

Next, we select a set of local network measures: degree (in/out), strength (in/out), eigenvector centrality (in/out), Katz centrality (in/out), average clustering coefficient and number of communities.
In general, local network measures are based on the number of node links, node position within the network, and relationship with other nodes. These are centrality measures, and they help in the identification of the most influential individuals (nodes) in the network. These measures can give an insight into how nodes communicate with each other, which nodes are the most popular (hubs), how close nodes are to each other, and which nodes control the network (in terms of information flow). In the context of retweet prediction, node centrality measures can reveal the nodes with the largest potential to be retweeted. It is important to emphasize that the appropriate usage of centrality measures depends on the understanding of the type of links in the network and network flow [44].
Degree centrality of a node is the measure that takes into account the total number of links incident with a node. In the context of the Twitter network, the node with the highest degree centrality can be interpreted as the node with the largest number of followers or friends. However, if we capture more than one layer, degree centrality may also indicate the node with the largest number of mentions or replies. A higher degree implies popularity and a higher possibility of gaining information that is flowing through the network. According to [45], for a node $i$ with $k_i$ links to other nodes, degree centrality is usually normalized by dividing it by the maximum possible degree $N - 1$:
$$dc_i = \frac{k_i}{N - 1}. \tag{4}$$
In weighted networks, the weighted degree is referred to as node strength. The strength of a node $i$ is defined as the sum of all weights attached to links belonging to this node [45]:
$$s_i = \sum_{j \in \mathcal{N}(i)} w_{ij}, \tag{5}$$
where $\mathcal{N}(i)$ denotes the set of neighbouring nodes of a node $i$.
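As an illustration of Equations (4) and (5), the in/out variants of degree and strength for a single directed, weighted layer can be computed with a few lines of plain Python. This is a sketch under our own assumptions (the function name, the dictionary-based edge format, and the toy reply layer are hypothetical); the study itself reports using NetworkX for these calculations.

```python
def degree_and_strength(edges, nodes):
    """Per-node in/out degree and strength of one directed, weighted layer.

    `edges` maps (source, target) pairs to weights; `nodes` lists all nodes,
    including isolated ones, so that N is the full layer size (N > 1).
    """
    n = len(nodes)
    stats = {
        v: {"in_degree": 0, "out_degree": 0,
            "in_strength": 0.0, "out_strength": 0.0}
        for v in nodes
    }
    for (u, v), w in edges.items():
        stats[u]["out_degree"] += 1
        stats[u]["out_strength"] += w
        stats[v]["in_degree"] += 1
        stats[v]["in_strength"] += w
    for s in stats.values():
        # Normalise by the maximum possible degree N - 1, as in Equation (4).
        s["in_degree_centrality"] = s["in_degree"] / (n - 1)
        s["out_degree_centrality"] = s["out_degree"] / (n - 1)
    return stats


# Toy reply layer: the weight is the number of replies between two users.
reply_edges = {("a", "b"): 2, ("c", "b"): 1, ("b", "a"): 4}
stats = degree_and_strength(reply_edges, ["a", "b", "c"])
```

In this toy layer, node b receives links from both other nodes, so its normalised in-degree is 1, while its in-strength sums the two incoming weights.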
Eigenvector centrality was introduced by Bonacich [46]. It takes into account the centrality of the adjacent nodes. It can be interpreted as a measure of the influence of a node in a network. A high eigenvector score means that a node is connected to many nodes that themselves have high scores. Relative scores are assigned to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score than equal connections to low-scoring nodes. For a node $i$ and a constant $\lambda$, the centrality $ce_i$ of node $i$ is defined as [46]:
$$ce_i = \frac{1}{\lambda} \sum_{j \in \mathcal{N}(i)} ce_j. \tag{6}$$
Eigenvector centrality computes the centrality for a node based on the centrality of its neighbors. The eigenvector centrality for node i is the ith element of the vector x defined by the equation [45]
$$A x = \lambda x, \tag{7}$$
where $A$ is the adjacency matrix of graph $G$ and $\lambda$ is an eigenvalue of $A$. There is a unique solution $x$, all of whose entries are positive, if $\lambda$ is the largest eigenvalue of the adjacency matrix [45].
For directed graphs, Equation (6) calculates the “left” eigenvector centrality which corresponds to the in-edges in the graph. For calculating out-edges’ eigenvector centrality, it is necessary to reverse the graph G.
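The in/out distinction described above can be sketched with a simple power iteration. The code below is a minimal illustration, not the routine used in the study: the function names and the four-edge toy graph are hypothetical, links are treated as unweighted as in Equation (6), and the iteration is applied to $A + I$ rather than $A$ (an assumption made here so that periodic graphs also converge; the dominant eigenvector is unchanged by this shift).

```python
def eigenvector_centrality(edges, nodes, max_iter=1000, tol=1e-10):
    """In-edge ("left") eigenvector centrality by shifted power iteration.

    `edges` is an iterable of (source, target) pairs of a directed graph.
    """
    x = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(max_iter):
        new = dict(x)                      # the "+ I" shift
        for u, v in edges:
            new[v] += x[u]                 # v inherits score from in-neighbour u
        norm = sum(val * val for val in new.values()) ** 0.5
        new = {v: val / norm for v, val in new.items()}
        if sum(abs(new[v] - x[v]) for v in nodes) < tol:
            return new
        x = new
    raise RuntimeError("power iteration did not converge")


def reverse(edges):
    """Flip every link so that out-edges become in-edges."""
    return [(v, u) for u, v in edges]


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
in_centrality = eigenvector_centrality(edges, nodes)
out_centrality = eigenvector_centrality(reverse(edges), nodes)
```

In the toy graph, node c has the most incoming links and therefore the highest in-centrality, whereas node a has the most outgoing links and ranks first once the graph is reversed.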
Katz centrality introduced by Leo Katz [47] calculates topological centrality that helps to discover the relative influence of each node on the network. It is a generalization of the eigenvector centrality. Katz centrality computes the centrality for a node based on the centrality of its neighbors. The general equation for calculating Katz centrality for node i is [47]:
$$kc_i = \alpha \sum_{j \in \mathcal{N}(i)} kc_j + \beta, \tag{8}$$
where the parameter $\beta$ controls the initial centrality and $\alpha < \frac{1}{\lambda_{max}}$.
Katz centrality computes the relative influence of a node within a network by measuring the number of the immediate neighbors (first degree nodes) and also all other nodes in the network that connect to the node under consideration through these immediate neighbors. For directed graphs, it is possible to calculate in- and out-Katz centrality by taking into account that Equation (8) can find “left” eigenvectors which correspond to the in-edges in the graph. For out-edge Katz centrality, it is necessary to use the reverse graph G.
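Equation (8) can be solved by fixed-point iteration, which converges when $\alpha$ is below the reciprocal of the largest eigenvalue. The sketch below is illustrative only: the function name, the default values $\alpha = 0.1$ and $\beta = 1$, and the toy graph are our assumptions, and no normalisation of the resulting scores is applied.

```python
def katz_centrality(edges, nodes, alpha=0.1, beta=1.0,
                    max_iter=1000, tol=1e-12):
    """In-edge Katz centrality: kc_i = alpha * sum of kc_j over in-neighbours + beta.

    `edges` is an iterable of (source, target) pairs; alpha must stay below
    1 / lambda_max of the adjacency matrix for the iteration to converge.
    """
    x = {v: 0.0 for v in nodes}
    for _ in range(max_iter):
        new = {v: beta for v in nodes}
        for u, v in edges:
            new[v] += alpha * x[u]
        if max(abs(new[v] - x[v]) for v in nodes) < tol:
            return new
        x = new
    raise RuntimeError("Katz iteration did not converge; try a smaller alpha")


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c"), ("d", "a")]
katz_in = katz_centrality(edges, nodes)
# Out-edge variant: run the same routine on the reversed graph.
katz_out = katz_centrality([(v, u) for u, v in edges], nodes)
```

Unlike eigenvector centrality, a node without incoming links (node d here) still receives the baseline score $\beta$, which is one reason Katz centrality is often preferred for directed graphs with source nodes.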
The clustering coefficient of a node measures how well its neighbors are interconnected and quantifies how close they are to forming a clique. The local clustering coefficient is calculated as the number of links between the nodes within its neighborhood divided by the number of links that could possibly exist between them. Real-world networks (and in particular social networks) have on average a higher clustering coefficient than random networks of the same size. The clustering coefficient of a node $i$ is defined as [48]
$$C_i = \frac{e_{ij}}{k_i (k_i - 1)}, \tag{9}$$
where $e_{ij}$ represents the number of pairs of neighbours of a node $i$ that are connected.
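A small worked example may help with the normalisation in Equation (9). In the sketch below, the graph is treated as undirected and connected neighbour pairs are counted as ordered pairs, so that a fully connected neighbourhood yields a coefficient of 1; this counting convention, the function name, and the toy graph are our assumptions rather than details given in the text.

```python
def local_clustering(adjacency, node):
    """Local clustering coefficient of `node` in an undirected graph.

    `adjacency` maps each node to the set of its neighbours.
    """
    neighbours = adjacency[node]
    k = len(neighbours)
    if k < 2:
        return 0.0                       # no pair of neighbours exists
    # Count ordered pairs (u, v) of distinct neighbours that are linked.
    linked = sum(1 for u in neighbours for v in neighbours
                 if u != v and v in adjacency[u])
    return linked / (k * (k - 1))


# Triangle a-b-c with a pendant node d attached to a.
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
coefficients = {v: local_clustering(adjacency, v) for v in adjacency}
average = sum(coefficients.values()) / len(coefficients)
```

Node a has three neighbours of which only one pair (b, c) is linked, giving 1/3, whereas b and c each sit in a closed triangle and obtain 1; averaging over all nodes gives the average clustering coefficient of the layer.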
For each layer, we compute a set of network features separately and quantify different aspects of the information-spreading process. Based on these five centrality measures, we calculate values for in- and out-centrality measure (except clustering coefficient which is undirected) according to the equations defined in (4)–(6), (8), (9). As a result, we have nine features for each layer which makes 36 features in total.
In addition, we integrate the network measures with the Twitter metadata from $MF$. We use the following information as additional features for each tweet: number of user followers, number of user friends, number of mentions, number of hashtags, number of user statuses, indicator whether the tweet contains a URL, indicator whether the tweet contains media, indicator whether the tweet contains COVID-19 related keywords, etc. We also add some auxiliary variables, such as whether the user is in the follower network. Overall, 13 features are extracted from the set $T$ of Twitter metadata.
The result is a 49-dimensional vector (36 network features and 13 metadata features) as the representation of each tweet extracted from the multilayer framework $MF$.
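The feature count can be checked with a short enumeration: four directed measures with in and out variants plus the clustering coefficient give nine features per layer. The feature names generated below are hypothetical placeholders chosen for this sketch and do not correspond to the names used in the published dataset.

```python
LAYERS = ["tweet", "follower", "reply", "mention"]
DIRECTED_MEASURES = ["degree", "strength", "eigenvector", "katz"]
N_METADATA_FEATURES = 13   # stated in the text; not all names are listed there


def network_feature_names():
    """Enumerate placeholder names for the per-layer network features."""
    names = []
    for layer in LAYERS:
        for measure in DIRECTED_MEASURES:
            names.append(f"{layer}_in_{measure}")
            names.append(f"{layer}_out_{measure}")
        names.append(f"{layer}_clustering")   # undirected, so a single value
    return names


names = network_feature_names()
total_dimensions = len(names) + N_METADATA_FEATURES
```

With four layers this yields 36 network features, and adding the 13 metadata features gives the 49 dimensions reported above.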

3.4. Text Features

Solving a natural language processing problem requires an appropriate language model: either a new model developed for the task or an existing one that captures, e.g., semantic, syntactic, and other linguistic features of the text. The seminal work of [49] contributed to the emergence of numerous variants of text representation models based on low-dimensional vectors in continuous space (embeddings), which allow semantically related linguistic units to have similar vector representations. As described in [8], the first generation was characterised by shallow language models, such as Word2Vec [49], Doc2Vec [50], GloVe [51], and fastText [52]. They have some shortcomings, such as static embeddings in which multiple concepts (i.e., different meanings of the same entity, polysemy) are not represented by different embedding vectors, or poor performance in new domains. Due to such shortcomings, the next generation of deep language models has been developed, namely ELMo [53], GPT/GPT-2 [54], GPT-3 [55], and BERT [56]. They replace static embeddings with contextualized representations and successfully address the mentioned shortcomings. Moreover, they enable learning of context- and task-independent representations, which has yielded an improvement in performance on various NLP tasks [57,58].
To represent tweets in this study, we used the Cro-CoV-cseBERT language model from [8]. Cro-CoV-cseBERT is based on CroSloEngualBERT [59], a trilingual language model that was pre-trained on a large volume of texts from online news articles in Croatian, Slovenian, and English, and additionally fine-tuned on a large corpus of texts related to COVID-19 in Croatian (dataset Cro-CoV-Texts). Cro-CoV-Texts contains 186,738 news articles and 500,504 user comments related to COVID-19 published on Croatian online news portals, as well as 28,208 COVID-19 tweets in Croatian (excluding tweets from the Senti-Cro-CoV-Tweets dataset) [8].

3.5. Classification Models

Here, we describe six ML models that we trained for binary classification of tweets in our research.
Random forest (RF) is well known for taking care of data imbalances in different classes [60,61], especially for large datasets [62].
Multilayer perceptron (MLP) is another relatively simple model that can be used to perform classification [63].
The light gradient boosting machine (LGBM) classifier is based on decision trees and is designed to increase the efficiency of the model and reduce memory usage. It is described in [64].
The category-embedding model (CEM) is a basic model with a relatively simple architecture: a feed-forward network with categorical features passed through a learnable embedding layer. It is similar to MLP, but with learned embeddings for categorical variables.
The neural oblivious decision ensembles (NODE) model for deep learning on tabular data was presented at ICLR 2020 [65]. According to the authors, it has beaten well-tuned gradient-boosting models on many datasets. It uses a neural equivalent of oblivious trees (the kind of trees that CatBoost uses) as the basic building blocks of the architecture.
Attentive interpretable tabular learning (TabNet) is another model coming out of Google Research that uses sparse attention in multiple steps of decision making to model the output [66].

3.6. Data Collection and Experiment Setup

To perform the experiments, data were collected from the social network Twitter. The data were collected automatically by using a pipeline for a continuous collection of tweets over a long period of time, with the data structure organized so that there are records of users and their friends, their followers, and all their posts (i.e., published tweets) for a given period of time. The data collection pipeline is organized in such a way that it first collects accounts located in Croatia, and then it collects all their friends and followers, as well as the published tweets of all the previously mentioned profiles.
The collected Twitter dataset (Cro-Tweets2021) captures tweets posted in the Croatian language during the period between 1 January 2020 and 31 May 2021. The data were collected by using tweepy [67], a Python library for accessing the Twitter API. After preprocessing the tweets and removing tweets without retweets, the final dataset consists of 199,431 tweets. Next, we performed cleaning and processing of tweets following the same procedure as proposed in [68]. This includes several steps: replacing usernames, replacing URLs, and translating emojis to ASCII code.
In the next step, we constructed the corresponding multilayer network $M$ and multilayer framework $MF$. Network measures were calculated with the Python package NetworkX [69]. Then, we extracted the multilayer network features and text features. Before feature selection, we performed a detailed analysis of the feature sets (The results of the feature analysis are available at: https://github.com/InfoCoV/Multi-Cro-CoV-cseBERT/blob/main/notebooks/exploration/features_analysis.ipynb, accessed on 20 September 2022), including mutual information analysis. The whole procedure of collecting and analysing tweets is described in Figure 2, and the Cro-Tweets2021 dataset (https://github.com/InfoCoV/InfoCoV/blob/main/Cro-Vect-Twitter.csv?fbclid=IwAR0m1Ahk6Jui200DQozGp4eeLa7n8AaBaf53ROLmMOUsSYCMaAvS2LTfwuc, accessed on 20 September 2022) is publicly available.
Figure 2. Tweets processing procedure.
In the next step, we train six ML classifiers in the task of binary retweet classification: random forest, multilayer perceptron, light gradient boosting machine, category-embedding model, neural oblivious decision ensembles, and attentive interpretable tabular learning model. For the purpose of training the classification models, we split the initial set of tweets $T$ into training, validation, and test sets with an 80:10:10 ratio. It is important to mention that we split the tweets according to their time stamps.
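A time-stamp-based 80:10:10 split can be sketched as below. The text does not spell out the exact procedure, so this is an assumption on our part: tweets are sorted chronologically, the oldest 80% form the training set, and the most recent 10% form the test set. The function name and the toy data are hypothetical.

```python
def chronological_split(items, timestamp_key, percentages=(80, 10, 10)):
    """Split items into train/validation/test sets by ascending timestamp."""
    ordered = sorted(items, key=timestamp_key)
    n = len(ordered)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = n * percentages[0] // 100
    n_val = n * percentages[1] // 100
    train = ordered[:n_train]
    validation = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]      # remainder goes to the test set
    return train, validation, test


# Toy tweets as (tweet_id, timestamp) pairs in arbitrary order.
toy_tweets = [(f"t{i}", (i * 7) % 20) for i in range(20)]
train, validation, test = chronological_split(toy_tweets, lambda t: t[1])
```

Splitting by time rather than at random means that the classifier is always evaluated on tweets posted after those it was trained on, which avoids leaking future information into training.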
After training and testing all classifiers, we perform the SHAP analysis [70] to identify the features that have the most impact on the classification.

4. Results

In this section, we present the comparison results of the performance of six trained models by using three different sets of features. These are features from the text, features from the multilayer network, and their combination. In the next step, we perform SHAP values analysis.

4.1. Evaluation Results

We trained and compared six ML models, namely RF, MLP, LGBM, CEM, NODE, and TabNet. The evaluation was performed in terms of standard machine learning classification metrics: accuracy ($Acc$), precision ($P$), recall ($R$), and $F_1$-score ($F_1$). Model performance was measured in a macro-averaged setting so that both classes are weighted equally.
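Macro-averaging computes each measure per class and then takes the unweighted mean over classes. The following plain-Python sketch illustrates this for the binary task; the function name and the toy labels are hypothetical, and macro $F_1$ is taken here as the mean of the per-class $F_1$ scores, which is one common convention and an assumption about the one used in the study.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Class 0: exactly one retweet; class 1: more than one retweet.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
scores = macro_scores(y_true, y_pred)
```

In the toy example accuracy is 4/6, but the macro-averaged scores are 0.625 because the minority class, with per-class scores of 0.5, counts as much as the majority class.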
Based on the results presented in Table 1, several important observations can be highlighted.
Table 1. Comparison of results for six trained models in combination with three different sets of features.
The first observation is that classifiers regularly achieve better results on network features than on text features in terms of all considered performance measures ($Acc$, $P$, $R$, and $F_1$).
Another observation concerns combined features (the union of text and network features), which provide classifiers with even more fruitful ground for inducing classification models. With respect to accuracy ($Acc$), the classifiers induced from the combined features show a meaningful improvement over those induced from the text features, ranging from 3.9% to 7.7%, whereas with respect to the $F_1$ score, the improvement ranges from 3.7% to 8.2%. Compared with network features alone, the improvement from combined features is at most 1.4% for $Acc$ and at most 1.7% for the $F_1$ score. There are two exceptions: for the LGBM classifier, performance is the same with network features and with combined features, and for the RF classifier, combined features do not improve performance. In short, the results suggest that the network features complement the text features well and that the combined set achieves better classification performance.
Considering only the results obtained with the set of combined features, which are the best overall, CEM is the most successful classification model in terms of $F_1$ score, with 67.9%. The lowest performance is achieved by the TabNet model, with 66.6%. The MLP and NODE models perform well compared to the CEM model, as their performance is only one percentage point lower.
Based on these results, we decided to integrate the CEM model as part of the Multi-Cro-CoV-cseBERT model for retweet prediction based on multilayer network and text features. We will further use this model for information-spreading analysis in the domain related to the COVID-19 pandemic.

4.2. Feature Analysis

Shapley additive explanations (SHAP), introduced in [70], is used to show the contribution or importance of each feature to the prediction of the model. SHAP values analysis, in the case of the RF model, was performed on a sample of 1000 examples from the test set. The absolute SHAP value indicates how much a single feature affected the prediction.
To understand the importance or contribution of the features for the whole dataset, a bee swarm plot is shown in Figure 3. In this plot, the features are ordered by their effect on the prediction, with the most important feature at the top and the rest of the list in descending order. Feature importance is determined according to SHAP values, which are calculated with a unified framework for interpreting predictions [70] and summarized with a mean value for each feature. Features are sorted by the sum of the SHAP value magnitudes across all samples. In addition, the plot illustrates how higher and lower values of a feature affect the outcome. Each small dot on the plot represents a single observation. The horizontal axis represents the SHAP value, whereas the color of the dot shows whether this observation has a higher (red) or lower (blue) value compared to other observations.
Figure 3. SHAP values analysis on bee swarm summary plot illustrating impact on model output.
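The ordering used in such a plot can be sketched as follows. This is a minimal illustration, assuming a matrix of SHAP values with one row per example and one column per feature; the numbers are hypothetical and the helper name rank_features is introduced here only for the example. Sorting by the mean magnitude gives the same order as sorting by the sum of magnitudes, since both differ only by a constant factor.

```python
def rank_features(shap_matrix, feature_names):
    """Order features by mean absolute SHAP value, most important first."""
    n_samples = len(shap_matrix)
    scores = {}
    for j, name in enumerate(feature_names):
        scores[name] = sum(abs(row[j]) for row in shap_matrix) / n_samples
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical SHAP values: rows are examples, columns are features.
names = ["log1p_followers_count", "entities.media", "log1p_statuses_count"]
shap_matrix = [
    [0.30, 0.10, -0.05],
    [-0.20, 0.25, 0.10],
    [0.40, -0.05, -0.15],
]
ranking = rank_features(shap_matrix, names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```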
The features listed in Figure 3 are in order of global importance, with the first feature being the most important and the last being the least important. The most important feature, log1p_followers_count, is found to have a strongly positive contribution when its values are high, and a negative contribution when its values are low. The same applies to the variable entities.media, which is second in the order of feature importance. For the third most important feature (log1p_statuses_count), high values of the variable were found to make a strongly negative contribution to the prediction, whereas low values made a strongly positive contribution. Similar conclusions can be drawn from the plot for all other features. Moreover, some features, such as tweets_keywords_3_out_strength, contribute little or nothing to the prediction, regardless of whether their values are high or low. It is interesting to note that some properties of the network are reflected in features that have a stronger impact than features reflecting other properties of the network. The most important feature is the number of followers (1. log1p_followers_count), and the most important feature from the group of centrality measures is follower in-degree (4. fallowing_users_graph_in_degree). It is fair to say that the concept of followers plays an important role among the contributing features. Apart from that, keywords are also a strong contributing factor, especially in the form of features resulting from the centrality measures in/out-degree, in-strength and clustering coefficient (in Figure 3, these are features 5, 9, 11 and 12). A detail to note is that the in-degree centrality of keywords has a greater impact than the out-degree. In terms of layers/graph types, the reply network has produced a larger number of features with valuable impact than the mention layer/graph.
Last but not least, network metadata also make a satisfactory contribution to the retweet prediction: for example, the number of followers (follower count), the number of changed statuses of the user (statuses count), the presence of media in the tweet (entities media), and the presence of a URL in the tweet (entities URLs) are important features positioned at the top of the list.
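As an illustration of the local centrality measures discussed here, the sketch below computes in/out-degree and in/out-strength for one weighted, directed layer given as an edge list. The edge direction, the user names, and the weights are hypothetical choices made for this example and do not reproduce the feature extraction pipeline of this study.

```python
from collections import defaultdict


def degree_and_strength(edges):
    """Compute in/out-degree and in/out-strength for a weighted directed layer.

    edges: iterable of (source, target, weight), one entry per distinct
    (source, target) pair.
    """
    stats = defaultdict(lambda: {"in_degree": 0, "out_degree": 0,
                                 "in_strength": 0.0, "out_strength": 0.0})
    for source, target, weight in edges:
        stats[source]["out_degree"] += 1       # number of outgoing links
        stats[source]["out_strength"] += weight  # total outgoing weight
        stats[target]["in_degree"] += 1        # number of incoming links
        stats[target]["in_strength"] += weight   # total incoming weight
    return dict(stats)


# Hypothetical reply layer: (replying user, replied-to user, number of replies).
reply_layer = [("ana", "marko", 3), ("iva", "marko", 1), ("marko", "ana", 2)]
features = degree_and_strength(reply_layer)
```

Degree counts distinct neighbours, whereas strength sums edge weights, so a node with few but heavily weighted links can have a low degree and a high strength.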
Compared with other similar studies, our results are in line with the findings of Suh et al. [34]. They also examined the content and contextual features in the task of retweet prediction, but to a much lesser extent than our study. In general, their findings suggest that retweetability has a very close relationship with the social network context of the authors and the informational content and value contained in tweets, similar to our results. In particular, they showed that among content features, URLs and hashtags have strong potential in the prediction of retweeting. Furthermore, among contextual features which are related to the network, the number of followers and followees as well as the age of the account affect retweetability. In our results, the number of followers also has a high correlation with retweeting. However, according to our SHAP analysis, the presence of media has a much stronger impact than the presence of URLs. Next, the in-degree measures of the follower layer and the tweet layer are ranked as high, as well as the impact of the in-strength and out-strength centrality measures of the retweet layer. This suggests that multilayer network measures have the potential for retweet prediction which has not been shown before.

5. Discussion on Retweet Prediction Based on Heterogeneous Features

In this study, several aspects related to the retweet-prediction task are investigated. The two main objectives were to explore the potential of multilayer representation of the social network for the retweet prediction and to analyse the possibilities of retweet prediction based on heterogeneous data sources.
Overall, the multilayer network features perform better than text features for all six trained models. Accordingly, we can confirm that the multilayer network representation of Twitter has great potential for retweet prediction. These findings are in line with the results of the study [22], in which Pierri et al. have shown that multilayer network features perform better in the task of disinformation classification on Twitter. Although that study modeled Twitter differently than the MF method proposed here and used different network measures, all these results indicate that a multilayer approach to the task of retweet prediction is worthy of further examination.
Furthermore, according to the results presented in the previous section, we can conclude that the combination of multilayer network features with text features in general performs better than only one set of features in the task of predicting the number of retweets. We have to emphasize that the combination of features only slightly outperforms the multilayer network features. However, this is consistent for five of six models (only the RF algorithm has better performance in the case of multilayer network features). The potential of combining features from heterogeneous data sources has been considered in several studies before [1,13,14,15,16,17,34,35], and it has been shown that the combination of features is better than one set of features. Specifically, the authors of [15] showed that models based on multidimensional features extracted from author, tweet, and user outperform models based on the standard set of features. Our approach also combines features extracted from the tweet with data related to author and user, but in a different way.
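The three feature setups compared above can be sketched as follows. This is a minimal illustration that assumes the combined set is formed by concatenating the text embedding with the multilayer network features for each tweet; the helper name build_feature_sets and the values are hypothetical, and a real BERT-based embedding has far more dimensions.

```python
def build_feature_sets(text_embedding, network_features):
    """Return the three feature setups: text only, network only, combined."""
    return {
        "text": list(text_embedding),
        "network": list(network_features),
        # Assumption: "combined" is a plain concatenation of both vectors.
        "combined": list(text_embedding) + list(network_features),
    }


# Hypothetical values for a single tweet.
text_embedding = [0.12, -0.40, 0.33]
network_features = [5.0, 2.0, 0.25]
setups = build_feature_sets(text_embedding, network_features)
```

Each classifier can then be trained once per setup, so that differences in performance can be attributed to the feature set and not to the model.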
To the best of our knowledge, the combination of multilayer network measures and text embedding as features for information-spreading prediction has not been examined previously. The feature analysis performed by the SHAP approach indicates that among multilayer network measures, the in-degree calculated from the follower layer and the in-degree calculated from the tweet layer have the highest impact on the model. Furthermore, the impact of the in-strength and out-strength centrality measures of the retweet layer is also high. Besides that, as expected, the number of followers and some other metadata, such as the presence of media and URLs, have a positive influence on retweeting. These findings suggest that, in addition to the standard measures, multilayer network measures may be valuable in retweet-prediction models.
Here we need to emphasize that feature engineering and the selection of appropriate feature sets is an important step in all classification tasks, although some studies have examined the potential of using deep neural networks to avoid the manual construction of features. For example, in [12] the authors proposed attention-based deep neural networks for the task of retweet prediction, and in [71] the authors applied graph representation learning to extract the structural attributes of the ego network and predict user retweet behavior. However, it is still worth examining different possibilities in the construction of feature sets, especially the combination of features from heterogeneous data sources. In this way, it is possible to detect which data sources have a higher influence on information spreading, and in the next step we can include these sources as an input into a deep neural network. In this context, the next research direction of the proposed approach is to perform joint representation learning from heterogeneous data sources: multilayer network and text.
Another important aspect of this research is the comparison of the performance of six different ML algorithms in the task of retweet prediction. We identify the CEM model as the one with the best performance according to all used evaluation measures in all three feature set scenarios, whereas the overall lowest performance is achieved by the TabNet model. Again, it has to be emphasised that the differences across the algorithms are small. The only substantial difference is between models that use only text features and models that use multilayer network features: the latter (as well as the combined features) perform considerably better for all six models. This is again an indicator that multilayer network features have great potential in the analysis of information spreading.
This research is an extension of our previous studies of online communication on social media during the COVID-19 pandemic. In [72], we compared the retweeting of COVID-19-related tweets and tweets that are not related to COVID-19. Our findings indicate that nearly 60% of tweets related to COVID-19 belong to the high-spreadable class, whereas less than 40% of non-COVID-19 tweets belong to this high-spreadable class. This suggests that tweet content may have a high impact on retweeting (spreadability), especially during a global crisis, such as the COVID-19 pandemic. In another study [73], we explored the potential of graph neural networks (GNNs) in the task of predicting whether a user would tweet about COVID-19 or not. By using the proposed Multi-Cro-CoV-cseBERT model for retweet prediction, we will further analyse the information-spreading patterns in the domain of COVID-19-related communication on Twitter.
This research has several limitations that we plan to address in future work. First, our results are not directly comparable to other studies, because we modelled the task of retweet prediction as a binary classification task with two classes: (i) the class of tweets with only one retweet and (ii) the class of tweets with more than one retweet. In this way, we try to predict whether the number of retweets would be low or not, but we did not take into account tweets that are not retweeted at all. We decided to discard all tweets with no retweets because there are too many reasons why a tweet is not retweeted, and this may negatively affect the prediction. We assumed that the prediction models would perform better if we concentrated only on the dataset of retweeted tweets in this first step. In addition, we used this setup because it ensured balanced classes in the dataset. However, in future research we plan to include tweets with no retweets in the prediction task. Another limitation is that we used only one dataset of tweets to compare the performance of features and ML models. However, this dataset is a representative sample of tweets in the Croatian language posted during the pandemic years 2020 and 2021, and our intention was to analyse the crisis-related communication in Croatia during the COVID-19 pandemic period. That is the reason why we trained and compared ML models on this specific dataset of tweets.
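The labelling scheme described in this paragraph can be sketched in a few lines. The function name retweet_class, the 0/1 encoding of the two classes, and the example counts are assumptions made for illustration; only the rule itself (discard tweets without retweets, separate exactly one retweet from more than one) comes from the text.

```python
def retweet_class(retweet_count):
    """Map a retweet count to the binary label described in the text.

    Returns None for tweets without retweets (these are discarded),
    0 for exactly one retweet, and 1 for more than one retweet.
    """
    if retweet_count < 1:
        return None
    return 0 if retweet_count == 1 else 1


counts = [0, 1, 2, 15, 1, 0]  # hypothetical retweet counts
labels = [c for c in map(retweet_class, counts) if c is not None]
```

Including tweets with no retweets, as planned for future work, would amount to adding a third outcome to this mapping in place of discarding those tweets.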

6. Conclusions and Future Work

In this paper, we introduce a multilayer framework formalism for the representation of online communication on social media. We utilized this formalism for feature extraction from heterogeneous data sources: multilayer networks and text messages. We performed a detailed analysis of possible features and a combination of text and multilayer network features in the task of binary classification of tweets according to the amount of retweeting.
The main focus of this research is to compare the performance of different sets of features and their combination. In addition, we evaluated six different ML classification models: random forest, multilayer perceptron, light gradient boosting machine, category-embedding model, neural oblivious decision ensembles, and attentive interpretable tabular learning model.
According to the overall results, exclusively multilayer network features performed significantly better than exclusively text-based features for all six algorithms. Overall, our results indicate that the structural features of Twitter represented as the multilayer network might be effectively exploited in the retweeting-prediction task.
The combination of both feature sets has the best performance in the case of all classification models, except the random forest. We identify that the category-embedding model has the best performance according to the F1 score, which is 0.679. However, this result is only slightly better than the results of the other algorithms, and we can conclude that all six algorithms have similar performance in the task of retweet classification. Additionally, we explored the impact of different features by using SHAP analysis and determined that the number of followers in the network, the presence of media, the number of changed user statuses, the in-degree on the follower network layer, and the in-degree on the Twitter network layer have major impacts on the model. Thus, we believe that our multilayer network-based approach provides useful insights into the future development of a system for predicting information spreading on social media.
The proposed approach can be further extended in several directions, and we have several plans for future work. First, we plan to test more multilayer network measures as predictors and also to explore the potential of deep learning automatic feature extraction from the multilayer network in the task of retweet prediction. Secondly, we plan to extend the multilayer framework model with the dynamic aspect (in the sense that we capture the dynamics of users’ actions) and to use three sets of features for prediction of retweeting and information spreading on social media in general. Thirdly, we plan to utilize graph neural networks for link prediction.

Author Contributions

Conceptualization, A.M.; data curation, M.P.; formal analysis, S.B.; funding acquisition, A.M.; investigation, A.M., M.P. and S.B.; methodology, A.M. and S.B.; software, M.P. and S.B.; supervision, A.M.; visualization, validation, A.M. and S.B.; writing—original draft, A.M. and S.B.; writing—review & editing, A.M. and S.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported in part by the Croatian Science Foundation under the project IP-CORONA-04-2061, “Multilayer Framework for the Information Spreading Characterization in Social Media during the COVID-19 Crisis” (InfoCoV), and by University of Rijeka project number uniri-drustv-18-38.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented and used in this study (Cro-Tweets2021 dataset) are openly available at https://github.com/InfoCoV/InfoCoV/blob/main/Cro-Vect-Twitter.csv?fbclid=IwAR0m1Ahk6Jui200DQozGp4eeLa7n8AaBaf53ROLmMOUsSYCMaAvS2LTfwuc (accessed on 20 September 2022).

Acknowledgments

We would like to thank Velebit AI, especially Mladen Fernežir, for leading the implementation of the classifiers.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Acc         Accuracy
ASCII       American Standard Code for Information Interchange
AUROC       Area Under the Receiver Operating Characteristics
BERT        Bidirectional Encoder Representations from Transformers
CEM         Category Embedding Model
COVID-19    Corona Virus Disease-19
ELMo        Embeddings from Language Models
F1          F1 score
GloVe       Global Vectors for Word Representation
GNN         Graph Neural Networks
GPT         Generative Pre-trained Transformer
LGBM        Light Gradient Boosting Machine
ML          Machine Learning
MLP         Multilayer Perceptron
NLP         Natural Language Processing
NODE        Neural Oblivious Decision Ensembles
P           Precision
R           Recall
RF          Random Forest
SHAP        SHapley Additive exPlanations
TabNet      Attentive Interpretable Tabular Learning

References

  1. Firdaus, S.N.; Ding, C.; Sadeghian, A. Retweet Prediction based on Topic, Emotion and Personality. Online Soc. Netw. Media 2021, 25, 100165.
  2. Wang, J.; Yang, Y. Tweet retweet prediction based on deep multitask learning. Neural Process. Lett. 2022, 54, 523–536.
  3. Eysenbach, G. Infodemiology: The epidemiology of (mis) information. Am. J. Med. 2002, 113, 763–765.
  4. Petrović, M.; Levnajić, Z.; Meštrović, A. Analysis of the COVID-19 Communication on Twitter via Multilayer Network. In Proceedings of the 2nd International Symposium on Automation, Information and Computing (ISAIC 2021), Beijing, China, 3–6 December 2022.
  5. Cuello-Garcia, C.; Pérez-Gaxiola, G.; van Amelsvoort, L. Social media can have an impact on how we manage and investigate the COVID-19 pandemic. J. Clin. Epidemiol. 2020, 127, 198–201.
  6. Bunker, D. Who do you trust? The digital destruction of shared situational awareness and the COVID-19 infodemic. Int. J. Inf. Manag. 2020, 55, 102201.
  7. Malecki, K.M.; Keating, J.A.; Safdar, N. Crisis communication and public perception of COVID-19 risk in the era of social media. Clin. Infect. Dis. 2021, 72, 697–702.
  8. Babić, K.; Petrović, M.; Beliga, S.; Martinčić-Ipšić, S.; Matešić, M.; Meštrović, A. Characterisation of COVID-19-related tweets in the Croatian language: Framework based on the Cro-CoV-cseBERT model. Appl. Sci. 2021, 11, 10442.
  9. Jay, A. FinancesOnline. Available online: https://financesonline.com/number-of-twitter-users/ (accessed on 1 July 2022).
  10. Kuang, S.; Davison, B.D. Learning Word Embeddings with Chi-Square Weights for Healthcare Tweet Classification. Appl. Sci. 2017, 7, 846.
  11. Singh, C.; Imam, T.; Wibowo, S.; Grandhi, S. A Deep Learning Approach for Sentiment Analysis of COVID-19 Reviews. Appl. Sci. 2022, 12, 3709.
  12. Zhang, Q.; Gong, Y.; Wu, J.; Huang, H.; Huang, X. Retweet prediction with attention-based deep neural network. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 75–84.
  13. Yin, H.; Yang, S.; Song, X.; Liu, W.; Li, J. Deep fusion of multimodal features for social media retweet time prediction. World Wide Web 2021, 24, 1027–1044.
  14. Sharma, S.; Gupta, V. Role of twitter user profile features in retweet prediction for big data streams. Multimed. Tools Appl. 2022, 81, 27309–27338.
  15. Fu, X.; Cheng, S.; Zhao, L.; Lv, J. Retweet Prediction Based on Multidimensional Features. Wirel. Commun. Mob. Comput. 2022, 2022, 1863568.
  16. Dai, T.; Xiao, Y.; Liang, X.; Li, Q.; Li, T. ICS-SVM: A user retweet prediction method for hot topics based on improved SVM. Digit. Commun. Netw. 2022, 8, 186–193.
  17. Ma, R.; Hu, X.; Zhang, Q.; Huang, X.; Jiang, Y.G. Hot topic-aware retweet prediction with masked self-attentive model. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Paris, France, 21–25 July 2019; pp. 525–534.
  18. Boccaletti, S.; Bianconi, G.; Criado, R.; Del Genio, C.I.; Gómez-Gardenes, J.; Romance, M.; Sendina-Nadal, I.; Wang, Z.; Zanin, M. The structure and dynamics of multilayer networks. Phys. Rep. 2014, 544, 1–122.
  19. Kivelä, M.; Arenas, A.; Barthelemy, M.; Gleeson, J.P.; Moreno, Y.; Porter, M.A. Multilayer networks. J. Complex Netw. 2014, 2, 203–271.
  20. Martinčić-Ipšić, S.; Margan, D.; Meštrović, A. Multilayer network of language: A unified framework for structural analysis of linguistic subsystems. Phys. A Stat. Mech. Its Appl. 2016, 457, 117–128.
  21. Vukić, D.; Martinčić-Ipšić, S.; Meštrović, A. Structural analysis of factual, conceptual, procedural, and metacognitive knowledge in a multidimensional knowledge network. Complexity 2020, 2020, 9407162.
  22. Pierri, F.; Piccardi, C.; Ceri, S. A multi-layer approach to disinformation detection in US and Italian news spreading on Twitter. EPJ Data Sci. 2020, 9, 35.
  23. Nesi, P.; Pantaleo, G.; Paoli, I.; Zaza, I. Assessing the reTweet proneness of tweets: Predictive models for retweeting. Multimed. Tools Appl. 2018, 77, 26371–26396.
  24. Zaman, T.R.; Herbrich, R.; Van Gael, J.; Stern, D. Predicting information spreading in twitter. In Proceedings of the Workshop on Computational Social Science and the Wisdom of Crowds, Nips, Citeseer, Sierra Nevada, Spain, 17 December 2010; Volume 104, pp. 17599–17601.
  25. Kupavskii, A.; Ostroumova, L.; Umnov, A.; Usachev, S.; Serdyukov, P.; Gusev, G.; Kustarev, A. Prediction of retweet cascade size over time. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, Maui, HI, USA, 29 October–2 November 2012; pp. 2335–2338.
  26. Moreno, Y.; Pastor-Satorras, R.; Vespignani, A. Epidemic outbreaks in complex heterogeneous networks. Eur. Phys. J. Condens. Matter Complex Syst. 2002, 26, 521–529.
  27. Yang, R.; Wang, B.H.; Ren, J.; Bai, W.J.; Shi, Z.W.; Wang, W.X.; Zhou, T. Epidemic spreading on heterogeneous networks with identical infectivity. Phys. Lett. A 2007, 364, 189–193.
  28. Ikeda, K.; Okada, Y.; Toriumi, F.; Sakaki, T.; Kazama, K.; Noda, I.; Shinoda, K.; Suwa, H.; Kurihara, S. Multi-agent information diffusion model for twitter. In Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Warsaw, Poland, 11–14 August 2014; Volume 1, pp. 21–26.
  29. Daga, I.; Gupta, A.; Vardhan, R.; Mukherjee, P. Prediction of likes and retweets using text information retrieval. Procedia Comput. Sci. 2020, 168, 123–128.
  30. Kushwaha, A.K.; Kar, A.K.; Ilavarasan, P.V. Predicting retweet class using deep learning. In Trends in Deep Learning Methodologies; Elsevier: Amsterdam, The Netherlands, 2021; pp. 89–112.
  31. Firdaus, S.N.; Ding, C.; Sadeghian, A. Topic specific emotion detection for retweet prediction. Int. J. Mach. Learn. Cybern. 2019, 10, 2071–2083.
  32. Pierri, F.; Piccardi, C.; Ceri, S. Topology comparison of Twitter diffusion networks effectively reveals misleading information. Sci. Rep. 2020, 10, 1–9.
  33. Miao, W.Y.; Fang, B.; Meng, L.Q. Retweet Prediction within Communities on SNS Based on Social Network Analysis. J. Comput. 2018, 29, 147–160.
  34. Suh, B.; Hong, L.; Pirolli, P.; Chi, E.H. Want to be retweeted? large scale analytics on factors impacting retweet in twitter network. In Proceedings of the 2010 IEEE Second International Conference on Social Computing, Minneapolis, MN, USA, 20–22 August 2010; pp. 177–184.
  35. Tsur, O.; Rappoport, A. What’s in a hashtag? Content based prediction of the spread of ideas in microblogging communities. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, Seattle, WA, USA, 8–12 February 2012; pp. 643–652.
  36. Amitani, R.; Matsumoto, K.; Yoshida, M.; Kita, K. Buzz Tweet Classification Based on Text and Image Features of Tweets Using Multi-Task Learning. Appl. Sci. 2021, 11, 567.
  37. Omodei, E.; De Domenico, M.D.; Arenas, A. Characterizing interactions in online social networks during exceptional events. Front. Phys. 2015, 3, 59.
  38. Oro, E.; Pizzuti, C.; Procopio, N.; Ruffolo, M. Detecting topic authoritative social media users: A multilayer network approach. IEEE Trans. Multimed. 2017, 20, 1195–1208.
  39. Magnani, M.; Rossi, L. The ml-model for multi-layer social networks. In Proceedings of the 2011 International Conference on Advances in Social Networks Analysis and Mining, Kaohsiung, Taiwan, 25–27 July 2011; pp. 5–12.
  40. Hristova, D.; Noulas, A.; Brown, C.; Musolesi, M.; Mascolo, C. A multilayer approach to multiplexity and link prediction in online geo-social networks. EPJ Data Sci. 2016, 5, 24.
  41. Perc, M. Diffusion dynamics and information spreading in multilayer networks: An overview. Eur. Phys. J. Spec. Top. 2019, 228, 2351–2355.
  42. De Domenico, M.; Granell, C.; Porter, M.A.; Arenas, A. The physics of spreading processes in multilayer networks. Nat. Phys. 2016, 12, 901–906.
  43. Bródka, P.; Musial, K.; Jankowski, J. Interacting spreading processes in multilayer networks: A systematic review. IEEE Access 2020, 8, 10316–10341.
  44. Matas, N. Comparing Network Centrality Measures as Tools for Identifying Key Concepts in Complex Networks: A Case of Wikipedia. J. Digit. Inf. Manag. 2017, 15, 203–213.
  45. Newman, M. Networks; Oxford University Press: Oxford, UK, 2018.
  46. Bonacich, P. Power and centrality: A family of measures. Am. J. Sociol. 1987, 92, 1170–1182.
  47. Katz, L. A new status index derived from sociometric analysis. Psychometrika 1953, 18, 39–43.
  48. Opsahl, T.; Panzarasa, P. Clustering in weighted networks. Soc. Netw. 2009, 31, 155–163.
  49. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 26, 3111–3119.
  50. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 1188–1196.
  51. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
  52. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146.
  53. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, LA, USA, 1–6 June 2018; ACL: New Orleans, LA, USA, 2018; pp. 2227–2237.
  54. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
  55. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  56. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; ACL: Minneapolis, MN, USA, 2019; pp. 4171–4186.
  57. Ethayarajh, K. How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; ACL: Hong Kong, China, 2019; pp. 55–65.
  58. Babić, K.; Martinčić-Ipšić, S.; Meštrović, A. Survey of neural text representation models. Information 2020, 11, 511.
  59. Ulčar, M.; Robnik-Šikonja, M. Finest bert and crosloengual bert. In Proceedings of the International Conference on Text, Speech, and Dialogue, Brno, Czech Republic, 8–11 September 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–111.
  60. Khoshgoftaar, T.M.; Golawala, M.; Van Hulse, J. An empirical study of learning from imbalanced data using random forest. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Patras, Greece, 29–31 October 2007; Volume 2, pp. 310–317.
  61. Liu, X.Y.; Wu, J.; Zhou, Z.H. Exploratory undersampling for class-imbalance learning. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 2008, 39, 539–550.
  62. Dietterich, T.G. Ensemble methods in machine learning. In Proceedings of the International Workshop on Multiple Classifier Systems, Cagliari, Italy, 21–23 June 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 1–15.
  63. Ruck, D.W.; Rogers, S.K.; Kabrisky, M. Feature selection using a multilayer perceptron. J. Neural Netw. Comput. 1990, 2, 40–48.
  64. Kumar, S.; Mallik, A.; Panda, B. Link prediction in complex networks using node centrality and light gradient boosting machine. World Wide Web 2022, 25, 2487–2513.
  65. Popov, S.; Morozov, S.; Babenko, A. Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
  66. Arık, S.O.; Pfister, T. Tabnet: Attentive interpretable tabular learning. In Proceedings of the AAAI, Online, 2–9 February 2021; Volume 35, pp. 6679–6687.
  67. Roesslein, J. Tweepy Documentation. 2009. Available online: http://tweepy.readthedocs.io/en/v3 (accessed on 10 September 2022).
  68. Müller, M.; Salathé, M.; Kummervold, P.E. Covid-twitter-bert: A natural language processing model to analyse covid-19 content on twitter. arXiv 2020, arXiv:2005.07503.
  69. Hagberg, A.A.; Schult, D.A.; Swart, P.J. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference, Pasadena, CA, USA, 19–24 August 2008; pp. 11–15.
  70. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777.
  71. Guo, H.; Yang, L.; Liu, Z. UserRBPM: User Retweet Behavior Prediction with Graph Representation Learning. In Proceedings of the International Conference on Mobile Multimedia Communications, Okayama, Japan, 12–16 December 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 613–632.
  72. Babić, K.; Petrović, M.; Beliga, S.; Martinčić-Ipšić, S.; Pranjić, M.; Meštrović, A. Prediction of COVID-19 related information spreading on Twitter. In Proceedings of the 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 27 September–1 October 2021; pp. 395–399.
  73. Petrović, M.; Hrelja, A.; Meštrović, A. Prediction of COVID-19 tweeting: Classification based on graph neural networks. In Proceedings of the 2022 45th Jubilee International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 23–27 May 2022; pp. 307–311.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
