1. Introduction
Many real-world systems and complex phenomena can be modeled as networks in which actors represent entities or individuals and links denote their relationships or interdependencies. Because such networks arise across many disciplines, dynamic network data with time-stamped events have become widely available. An inherent property of these networked systems is that they evolve over time, so their overall dynamics change. The mechanisms behind this evolution are not yet fully understood, and no standard account of them exists. However, network science offers various methods for studying and modeling the evolution process that governs network dynamics [1]. Among them, link prediction is the fundamental computational problem that models the underlying growth mechanism of evolving networks [2]. Because of its central role in understanding network evolution, link prediction in complex social networks has attracted extensive research attention, and a wide range of methodological improvements have followed. Most of these methods estimate the likelihood that new links will emerge between non-connected actors by leveraging topological properties; actor and link attributes; local, global, or quasi-local network structures [3]; or probabilistic models [4]. Two major limitations of these methods are their dependency on feature engineering [5] and their failure to account for the temporal changes that arise in dynamic networks [6]. Furthermore, even strategies described as time-evolving models have generally overlooked the evolutionary aspects of the network.
In an evolutionary (i.e., ‘longitudinal’, ‘temporal’, or ‘dynamic’) network, temporal patterns emerge through the arrival and/or departure of actors and the creation and/or deletion of links among them. Actors (i.e., nodes) in dynamic social networks vary over time in their network positions, their neighborhoods, and the communities they belong to within the temporal network snapshots. Temporal variations in network activities (e.g., forming or severing links) change actors’ structural positions and neighborhoods. Furthermore, these actor-oriented microscopic changes may alter the mesoscopic structure of the network (e.g., its communities). These observations have led scholars to incorporate evolutionary community-related information from the dynamic network into the link prediction task.
Communities in social networks implicitly denote groups of actors with similar features or attributes, or actors closely tied by their roles, social interests, or collective behavior. As the attributes, social patterns, roles, and interests of actors change over time, so do their network activities and association patterns, which causes fluctuations in both local and global network structures. As their link structures evolve, actors either retain their existing community memberships or gain membership in different communities. Consequently, communities may shrink, grow, erode, or disappear completely, and new ones may emerge over time. It is therefore believed that, in evolving social networks, temporal microscale actor-level changes trigger mesoscopic or collective changes. By mining the similarity between actor-level temporal microscale (i.e., neighborhood changes) and mesoscale (i.e., community membership) fluctuations, it is possible to generate dynamic similarity metrics (i.e., dynamic features) for dynamic link prediction. This study therefore sought to develop such dynamic features by analyzing and mining the temporal community-aware information associated with actors in dynamic networks. The contributions of this study are as follows. First, it defines the rate of actor-level evolutionary change in community memberships and the associated neighborhood changes over time. Second, it computes dynamic features that represent the similarity between non-connected actors by mining community evolution. To compute these features, it considers network structure, temporal information, and evolutionary community-aware information over several network snapshots. Third, it reports extensive experiments that evaluate these features with supervised machine learning algorithms to measure their performance in dynamic link prediction tasks.
2. Related Work
Researchers have explored historical or temporal information [7] in conjunction with a wide range of techniques to infer the likelihood of future links among the actors of dynamic networks, a task known as link prediction in dynamic (i.e., temporal) networks or dynamic (i.e., temporal) link prediction. The many methods designed for this purpose differ in how they treat the temporal nature of the networks and in which network property they aim to preserve. Researchers have used different centrality measures [8] as an influence factor for network actors, both to predict future associations among them and to capture the associated structural changes in dynamic networks [9]. For example, Zhang et al. [10] used eigenvector centrality for temporal link prediction and compared the contributions of common neighbors to the emergence of future links. Chi et al. [11] categorized the actors in a dynamic network by their evolving influence strength relative to their neighbors and used this factor to compute an attraction force as the connection probability among them. Most early link prediction methods used either heuristics or structural and topological features to compute individual similarities as the connection probability between actors. A comprehensive list of similarity indices based on both neighborhood topology and nodal attributes was presented by Bliss et al. [12], who used the covariance matrix adaptation evolution strategy to compute the weights of the individual similarity measures.
The time-varying nature of dynamic networks and the interdependency between evolutionary patterns and link prediction make dynamic link prediction even more challenging. Researchers have therefore explored other techniques, including dynamic latent space representations of actors and random walks in temporal networks [13], probabilistic temporal measures [14], probabilistic generative models [15], matrix and tensor factorization [16], and deep learning methods [17]. Divakaran and Mohan [18] developed a taxonomy of dynamic link prediction methods with five main classes: (i) time series approaches, (ii) probabilistic approaches, (iii) matrix factorization, (iv) spectral clustering, and (v) deep learning methods. The frequently evolving structure of the network makes the time series approach a promising option for dynamic link prediction. Studies following this approach [19,20,21] built time series of actors’ centrality measures, network structural features, or similarity indices for each node pair, and applied forecasting models to predict future values of these features or indices as the connection probability between node pairs. For example, Wu et al. [22] combined eigenvector-based nodal centrality with a forecasting method (i.e., adaptive weighted moving) for link prediction in dynamic networks. A few techniques [23,24] employed probabilistic models based on maximum likelihood approaches or probability distributions, which allowed them to account for variation and quantify the uncertainty around emerging links. Matrix factorization, an effective tool for large-scale data processing and analysis, decomposes a matrix into its constituent factors to simplify complex operations. Dynamic link prediction methods based on matrix factorization [25,26] represent a network property as a matrix (e.g., an adjacency matrix) and factorize it to form the features for the link prediction task. Finally, a few studies [27,28] exploited spectral graph theory, which explores the properties of a graph via the eigenvalues and eigenvectors of its adjacency or Laplacian matrix. These techniques used a low-rank approximation approach that supports large-scale graphs for which the matrix factorization approach is not well suited.
In addition, the high representational ability [29] of the Deep Belief Network (DBN), which is built upon the Restricted Boltzmann Machine (RBM), allowed researchers to apply deep learning to dynamic link prediction problems. A few earlier deep-learning-based dynamic link prediction models concentrated on modeling an RBM, which is a special case of a Markov random field. These models incorporate temporal and neighbor information to train an RBM over a sequence of observations of the dynamic network structure and then compute the connection probability among neighbors. In one of the earliest models, Li et al. [7] proposed a generative model called the conditional temporal restricted Boltzmann machine (ctRBM), which integrates neighbors’ influence and individual transition variance in a dynamic network with nonlinear transitional patterns to compute the connection probability between neighbors. Among recent techniques, Jinyin Chen and coauthors [30] developed GC-LSTM, a deep learning model that embeds a graph convolutional network (GCN) in a long short-term memory (LSTM) network. It learns spatiotemporal features by extracting structural features from dynamic network snapshots via graph convolution and learning temporal structure through the LSTM. Their model can both predict emerging links and produce accurate predictions of the evolution of the whole dynamic network. Recently developed techniques such as network representation learning [31] and graph embedding [32] represent graphs in a low-dimensional vector space, which not only preserves network properties but also eases the laborious feature engineering process. Embedding-based dynamic link prediction approaches [33,34] aim to predict the emerging links of a network from these low-dimensional embedding vectors.
Despite their improved performance in predicting emerging or hidden links, some of the aforementioned methods have inherent limitations. For example, probabilistic models require the distribution of link occurrences to be defined in advance, which is difficult for temporal networks. Some probability-driven models (e.g., exponential random graphs) are suitable only for small networks. Furthermore, matrix- and tensor-based methods are not feasible for real-time link prediction in large networks because of their computational and processing time requirements [35]. Additionally, data-hungry deep learning methods not only rely on large amounts of labeled data to train and optimize their models effectively but also add an extra layer of complexity to data representation and the prediction process [36].
Researchers have also exploited network community information in dynamic link prediction. Yuhang Zhu et al. [37] drew on the concept of collective influence from percolation optimization theory, an effective attribute of nodes, together with community multi-feature fusion and embedded representation to predict links in dynamic networks. Their method integrated collective influence, community random-walk features, and centrality features for dynamic link prediction. In a more recent study, Kumar et al. [38] proposed a dynamic link prediction algorithm that used the parameterized influence area of actors and their contribution to community partitions. Their method considered different features based on local, global, and quasi-local similarity and community information. By mining the temporal patterns of evolutionary changes that actors undergo in their neighborhoods and communities, Choudhury and Uddin [39] developed dynamic similarity metrics (i.e., dynamic features) for supervised dynamic link prediction.
5. Results
Table 2 sets out the performance scores of three different classifiers in classifying positively and negatively labeled links using the dynamic features
, a topological similarity metric
known as “ResourceAllocation”, and a time series forecasting-based metric
for link prediction in dynamic networks. Note that
denotes all dynamic features (i.e.,
,
,
, and
). The metric
was computed by considering an aggregated network consisting of all SINs in the training phase of each network dataset. On the other hand, to compute
, as mentioned earlier, for each pair of actors in the classification dataset, the Jaccard coefficient was calculated for each SIN to build a time series of topological similarity values, and the ARIMA forecasting method was then used to predict future values of this coefficient. These forecasted values were fed into the classifiers for training. The classifiers’ performance is reported using three metrics (i.e., accuracy, AUCROC, and AUCPR), as described before. With regard to accuracy, both the linear and the ensemble-based classifiers performed reasonably well with the dynamic similarity metrics (dynamic features) constructed in this study compared with the other two. In each row of
Table 2, for each dataset (
,
,
,
, and
), the highest score for three different evaluation metrics (i.e., accuracy, AUCROC, and AUCPR), the top-performing metric (i.e.,
,
, and
) are presented in bold-faced numbers. For example, in
, for the bagging classifier,
was the highest performer in accuracy score (i.e., 77.26),
was the highest performer in AUCROC (i.e., 0.617), and
was the best performer in terms of AUCPR (i.e., 0.26). Alternatively, for each classifier, irrespective of the dataset, the highest score in each evaluation metric category is colored red, and the second highest is colored green. For example, for the same bagging classifier, the highest (i.e., 93.24) and the second-highest (i.e., 90.82) accuracy scores were recorded in the co-authorship dataset
. However, the highest and the second-highest AUCROC scores for the same classifier (bagging) were recorded in the co-authorship dataset
and the Facebook dataset
, respectively (i.e., 0.876, and 0.655). Conversely, the highest and the second-highest AUCPR scores for the same classifier (bagging) were recorded in the Facebook dataset alone
.
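The two baseline metrics described above can be sketched in a few lines. The following is a minimal illustration, not the study's implementation: the toy snapshots and actor names are made up, and a simple moving average stands in for the ARIMA forecaster used in the study so that the example needs no external library.

```python
def build_adjacency(edges):
    """Build an undirected neighbor map from a list of (u, v) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def jaccard(adj, u, v):
    """Jaccard coefficient of the neighbor sets of u and v (0.0 if both are empty)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0


def resource_allocation(adj, u, v):
    """Resource Allocation index: sum of 1/degree over the common neighbors."""
    common = adj.get(u, set()) & adj.get(v, set())
    return sum(1.0 / len(adj[z]) for z in common)


def moving_average_forecast(series, window=2):
    """Stand-in forecaster (NOT ARIMA): mean of the last `window` values."""
    tail = series[-window:]
    return sum(tail) / len(tail)


# Hypothetical short-interval network snapshots (illustrative data only).
snapshots = [
    [("a", "c"), ("b", "d")],
    [("a", "c"), ("b", "c"), ("b", "d")],
    [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d")],
]

# Time-series baseline: one Jaccard value per snapshot, then a forecast.
jaccard_series = [jaccard(build_adjacency(e), "a", "b") for e in snapshots]
forecast = moving_average_forecast(jaccard_series)

# Topological baseline: Resource Allocation on the aggregated network.
aggregated = build_adjacency([edge for snap in snapshots for edge in snap])
ra_score = resource_allocation(aggregated, "a", "b")
```

In a fuller pipeline, `forecast` and `ra_score` would each be computed for every actor pair in the classification dataset and passed to the classifiers as single-feature inputs.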
By considering the dynamic features, and the accuracy scores, the highest accuracy-based performance was achieved in the co-authorship dataset using the bagging classifier, and the lowest performance was recorded in considering the linear classifier logistic regression. Considering the AUCROC scores, the highest performance was also achieved in the dataset using the random forest classifier, whereas the lowest was logged again in by considering the linear classifier logistic regression. Considering the lowest AUCPR score as defined earlier, most of the classifiers demonstrated optimal performances exceeding the minimum values calculated earlier (i.e., for and for the rest of the datasets); however, the highest value was recorded in the dataset using both the ensemble classifier random forest and the linear classifier logistic regression. Conversely, the lowest was recorded in considering both the ensemble classifiers. Regarding different classifiers, using the random forest classifier, the dynamic similarity metrics outperformed both and in all datasets in regards to the accuracy scores. However, considering the AUCPR and AUCROC, it is four out of five (i.e., exceeded in ). On the other hand, considering the bagging algorithm, was outperformed by the other metrics in three out of five datasets in regards to the accuracy score, two out of five in regards to the AUCPR, and only in considering the AUCROC score. Overall, showed superior performance compared with the other two approaches considered in this study. However, in a few cases, and outperformed . Therefore, a further comparison of these three approaches is needed.
Table 3 serves this purpose by using RPI to compare these link prediction approaches across the three performance measures (i.e., accuracy, AUCROC, and AUCPR). According to this table, the proposed dynamic attribute-based approach for link prediction in complex networks outperformed the two existing well-known approaches considered in this study by a large margin on all three measures.
Considering the accuracy scores, the worst performance, although by an insignificant margin, was observed for logistic regression, where was overtaken by the other metrics in three datasets out of five. With regard to AUCROC and AUCPR, the linear classifier showed a similar pattern. These observations indicate that the dynamic features constructed in this study outperformed the existing link prediction metrics in most cases when ensemble classifiers were used. In the case of linear classification, the three different features (i.e., , , ) were competitive with one another. Nevertheless, logistic regression performed better than a random classifier, which suggests that the dynamic features of this study can be effective in predicting emerging links in dynamic networks even with a simple linear classifier. Among the ensemble-based classifiers, bagging, where a decision tree was used as the base classifier, is susceptible to overfitting and computationally expensive, as it considers all the available features when splitting a node in its decision trees. Conversely, the random forest, a special case of bagging, considers only a random subset of the available features at each split. Therefore, it outperformed bagging in some cases. In the co-authorship networks, with the dynamic features, the bagging algorithm achieved better performance on all three performance metrics.
In
, considering all three performance metrics and all three classifiers, improved performances were demonstrated, as in
Figure 4. This study presents the ROC and P-R curves of the other four datasets to portray a comparable picture of the three classifiers’ performances. It is noteworthy that in P-R plots, curves tend to lie in the bottom left corner of the graph. The closer a curve is to the diagonal line, the higher the classifier’s performance in classification. Conversely, in ROC plots, curves tend to lie in the top-left region of the plots. The farther a curve lies above the diagonal line, the better the predictor’s performance. Apart from
, considering both the P-R and ROC curves in the other datasets, logistic regression was found to compete with the random forest algorithm for superiority, whereas bagging was found comparably to be the least-performing one in most cases. In
, considering the ROC plots, all classifiers tend to perform similarly and close to a random classifier, whose expected AUCROC score is 0.50. However, the best performance was observed in the P-R curve (i.e., closer to the diagonal line), which corroborates the finding in
Table 2 with regard to the AUCPR score. Further study may reveal the reasons behind the differences in classification performance across classifiers; however, from the classification performances shown in the table and figure, it can be concluded that the dynamic similarity metrics constructed in this study can be successfully employed to predict future links in dynamic networks.
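To make the two curve-based scores concrete, the sketch below computes AUCROC through its rank interpretation and approximates AUCPR by average precision. This is an illustrative sketch with made-up labels and scores; average precision is only one common estimator of the area under the P-R curve and may differ from the estimator used in the study. Tied scores are ordered arbitrarily in `average_precision`.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical classifier output for eight candidate links (two true links).
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

roc = auc_roc(labels, scores)
ap = average_precision(labels, scores)

# Reference points: an uninformative scorer has AUCROC 0.5, and its expected
# P-R baseline equals the share of positive examples in the data.
uninformative_roc = auc_roc(labels, [0.5] * len(labels))
pr_baseline = sum(labels) / len(labels)
```

Because the P-R baseline depends on the share of positive links, AUCPR values are only comparable across datasets after accounting for each dataset's class imbalance.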
After the performance measurement of the dynamic features, at this stage, this study attempted to determine the relative importance of four different dynamic features (i.e.,
,
,
, and
) to assess their relative competency in dynamic link prediction tasks in all five datasets. For this purpose, this study took advantage of two different algorithms (i.e., information gain and chi-square evaluation) provided in the WEKA machine learning software (
https://sourceforge.net/projects/weka/files/).
Table 4 compares these features by the importance ranks obtained from these algorithms. Ranks are assigned in decreasing order of importance, with one denoting the highest rank. The information gain and chi-square evaluators assess the worth of a feature by calculating its information gain and chi-square statistic, respectively, with respect to the class variable. The last two columns give the rank of a feature under the support vector machine (SVM) and random forest classifiers. Finally, the ranks from all four algorithms were aggregated to generate the final rank. From this table, it is observable that
, which represents the dynamic similarity metric, constructed by considering the correlation between time series of actor-level community dynamicity values, became the most prominent feature in two datasets (i.e.,
,
), and the second-best in the co-authorship datasets
. On the other hand,
, the dynamic similarity metric constructed by considering the temporal similarity of community dynamicity values of actor pairs using the DTW method, became the leading feature in
,
, and
. Among the dynamic features, generated by considering the temporal community-aware network structures and with the help of two different community detection algorithms, the feature constructed by considering the Louvain algorithm was found to be more effective than the other.
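The aggregation of per-algorithm ranks into a final rank can be sketched as follows. The aggregation rule is not specified here, so this example assumes a mean-rank rule with ties broken by feature name; the feature names `f1` to `f4` and all rank values are hypothetical placeholders, not results from Table 4.

```python
def aggregate_ranks(ranks_by_algorithm):
    """Combine per-algorithm ranks into a final rank using the mean rank.

    Assumption: lower mean rank is better; ties are broken by feature name.
    """
    features = sorted(next(iter(ranks_by_algorithm.values())))
    n_algorithms = len(ranks_by_algorithm)
    mean_rank = {
        f: sum(ranks[f] for ranks in ranks_by_algorithm.values()) / n_algorithms
        for f in features
    }
    ordered = sorted(features, key=lambda f: (mean_rank[f], f))
    return {f: position for position, f in enumerate(ordered, start=1)}


# Made-up ranks from four evaluators (1 = most important).
ranks_by_algorithm = {
    "info_gain":     {"f1": 1, "f2": 2, "f3": 3, "f4": 4},
    "chi_square":    {"f1": 2, "f2": 1, "f3": 3, "f4": 4},
    "svm":           {"f1": 1, "f2": 3, "f3": 2, "f4": 4},
    "random_forest": {"f1": 1, "f2": 2, "f3": 4, "f4": 3},
}

final_rank = aggregate_ranks(ranks_by_algorithm)
```

Other aggregation rules, such as a Borda count or the median rank, would fit the same interface and could change the final ordering when evaluators disagree strongly.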
To answer the second research issue of this study, as described in the introduction section, distributions of dynamic feature values are represented in
Figure 5. For each network dataset, the first and second-best-performing features were selected from
Table 4; for example,
and
for the
dataset. This study observed from this figure that it is not always obvious whether either dissimilar or distant actors in regards to their dynamic feature values (i.e., lower values) or similar and closer actors (i.e., higher feature values) participate in emerging links considering any particular feature. For example, in
, considering
(i.e., the temporal similarity of community dynamicity values computed by the DTW method), the non-participating actor pairs (i.e., negatively labeled links in the classification datasets) had higher values than the actor pairs that genuinely formed links in the test phase. This signifies that actors with dissimilar temporal evolution had a higher possibility of forming emerging links. Conversely, in the same dataset, considering
(i.e., temporal community-aware network-structure-based features constructed by considering the Louvain community detection method), the picture is the opposite. In this case, the positively labeled links (i.e., actors participating in emerging links) had higher values. On the other hand, in the case of
, from the distribution of
, this study observed that actors having similar temporal evolution had a higher possibility of forming links.
6. Discussion and Conclusions
The link prediction problem in social networks has attracted considerable interest from various domains, and a diverse range of prediction strategies, metrics, and methodologies have emerged to address it. Because these strategies do not accommodate the dynamicity and evolutionary information of dynamic networks, they are poorly suited to dynamic link prediction, even though they otherwise meet performance expectations. Therefore, the “time” component needs to be integrated as a parameter of the dynamic link prediction problem to better approximate temporal network evolution. Researchers have accordingly applied both time series analyses and evolutionary aspects (e.g., temporal link decay, duration of link activeness) to the link prediction task in dynamic networks. Although topological information and actor attributes are the principal sources of information used in the prediction problem, the modular structure of social networks means that community information can also be exploited effectively for this purpose. The main rationales are: (i) community structure carries information about actors with similar behavior, which can help predict their future interactions [41], and (ii) the high or low density of links among actors can be an effective predictor of emerging links [60]. Furthermore, incorporating community-related structural information can drastically improve the accuracy of link prediction [61]. Scholars therefore tend to acknowledge that emerging links can be predicted by mining the evolutionary information extracted from network snapshots over time, in association with dynamic network topology, evolutionary mesoscale network structure, and temporal actor-level neighborhood changes. Motivated by these observations, this study proposed a novel solution to the problem of dynamic link prediction by defining dynamic similarity metrics from the dynamic community-aware information extracted at both the local (i.e., actor) and global (i.e., network) levels. It first defined an actor-level measure of temporal community-aware evolution, known as community dynamicity. It also considered the rate of change in an actor’s cliquishness, community participation, and associated neighborhood. These attributes were later used to develop evolutionary features. Because these features were constructed from the temporal evolution experienced by actors, an important aspect of the analysis is defining the optimal time scale at which to sample the network to generate the time series of network snapshots (i.e., SINs). For this purpose, we selected a method from the literature. Once the optimal temporal window size was defined, three different dynamic features were constructed: the first by measuring, with the DTW method, the temporal similarity of the two sequences of community dynamicity values associated with a pair of non-connected actors; the second by computing the correlation between these two sequences; and the third by integrating evolutionary community-aware topologies with both inter- and intra-community network structures, using two different existing community detection algorithms.
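The first two dynamic features compare the community dynamicity sequences of two actors. The sketch below shows the two comparison measures in their textbook form, as an assumption-laden illustration rather than the study's code: the sequences are made up, the DTW variant uses absolute differences with no warping window, and the convention of returning 0.0 for a constant sequence in `pearson` is an assumption.

```python
import math


def dtw_distance(x, y):
    """Classic dynamic time warping distance with |a - b| as the local cost."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Hypothetical community dynamicity sequences for three actors.
actor_u = [0.0, 1.0, 2.0, 1.0]
actor_v = [0.0, 0.0, 1.0, 2.0, 1.0]  # same shape as actor_u, delayed by one step
actor_w = [0.0, 2.0, 4.0, 2.0]       # actor_u scaled by two

dtw_uv = dtw_distance(actor_u, actor_v)
corr_uw = pearson(actor_u, actor_w)
```

The example highlights why the two measures are complementary: DTW tolerates time shifts and unequal lengths, whereas correlation ignores scale but requires aligned sequences of equal length.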
In a supervised link prediction setup, these features were applied to five undirected social networks of different sizes and domains. Two ensemble-based classifiers and one linear classifier were used to measure the performance of the dynamic features. Because time series analysis is well established in dynamic link prediction, this study used a well-defined time series forecasting method, exponential smoothing, to predict the future values of actor-level community dynamicity. However, unlike other dynamic link prediction strategies, instead of using these predicted values to train the classifiers, it computed the similarity between two time series with the DTW and Pearson correlation measures and used these similarity measures to train the classifiers. The performance metrics indicate that these features can be used for dynamic link prediction and can effectively support modeling network growth. The performance of the dynamic features was also compared with a traditional topological metric (i.e., ResourceAllocation), which is widely used for link prediction in cross-sectional networks, and with a time-series-based dynamic link prediction strategy as a baseline method. In both cases, the dynamic features, constructed by leveraging the evolutionary community-aware aspect of actors, not only matched the existing metrics but in most cases outperformed them. This study can be extended in several ways.
For example, instead of the temporal clustering tendency of actors, other network structures or topological properties (e.g., assortativity) can be exploited; other time series forecasting methods (e.g., ARIMA) can be used instead of exponential smoothing; and other similarity measures (e.g., Euclidean, Manhattan) can be employed to measure the similarity between temporal sequences. For the third metric, other community detection algorithms (e.g., edge betweenness) can be used to enhance the prediction performance. Finally, like many other applications of link prediction, this study can help define new dynamic similarity metrics for dynamic link prediction in networks that inherently evolve over time, including terrorist networks, online social networks (e.g., Twitter), scholarly and knowledge networks (e.g., keyword networks), and collaborative filtering to model consumers’ buying behavior.