Finding Influential Users in Social Media Using Association Rule Learning

Influential users play an important role in online social networks since users tend to have an impact on one another. Therefore, the proposed work analyzes users and their behavior in order to identify influential users and predict user participation. Normally, the success of a social media site depends on the activity level of the participating users. For both online social networking sites and individual users, it is of interest to find out whether a topic will be interesting or not. In this article, we propose association rule learning to detect relationships between users. In order to verify the findings, several experiments were executed based on social network analysis, in which the most influential users identified from association rule learning were compared to the results from Degree Centrality and Page Rank Centrality. The results clearly indicate that it is possible to identify the most influential users using association rule learning. In addition, the results also indicate a lower execution time compared to state-of-the-art methods.


Introduction
Online social networks play an important role in our society and have created a platform for people to communicate and express their thoughts. With the use of online social media, we have created a way to mimic real human communication in an online environment. Facebook alone attracts 1.3 billion users with 640 million minutes spent each month on the site. Consequently, discovering trending topics or influential users is of interest to many researchers in areas such as marketing [1]. Several studies have tried to identify user influence; however, most have used Page Rank Centrality [2,3] or Degree Centrality [3,4] based approaches to identify influential users. This paper builds on the initial discoveries on association rule learning in social networking sites [5].
In this article, we argue that users on Facebook groups are following each other and that it is possible to detect influential users and predict user participation. For example, if users A, B, C and D share common interests, there is a chance that if A, B, and C have already commented on a topic, D will also comment on it. Therefore, this paper relates to how users perform actions (e.g., comments or likes) on posts in Facebook pages. In addition, we use association rule learning to discover relationships between users in our dataset [6]. Given a list of posts from a specific domain, we extract users' actions, such as comments and likes. Using association rule learning on the data, we argue that it is possible to predict whether a particular user will or will not participate in a post discussion based on the other users' activity. This article has three major contributions: firstly, possibilities to identify influential users using association rule learning are presented; secondly, we present time performance of well-known [...]
[...] is done by using two criteria, namely, support and confidence. Support indicates the frequency of such items, while confidence indicates how many times those rules in the whole dataset are correct. An example of an association rule is the following: "Ninety-percent of transactions that purchase bread and butter also purchase milk" [33].
As stated in Section 1, we are trying to assess user participation in a post based on previous interactions with other users on common posts within one page. We assume that if user A participates in most of the posts where user B is participating as well, there is a high chance of A participating in a new post where B is already active, either because participation of B influences A to participate and/or because they both have similar interests. The method of matching items in different transactions is called association rule learning. We apply association rule learning to the domain of social media, where we model the data as follows. Items correspond to users on Facebook and transactions correspond to posts. A user is considered to be active, and part of the transaction as an item, if the user comments on a post. From the selected dataset described in Section 4, we firstly count the frequency of all posts where A and B are active, respectively. Secondly, we count all posts where A and B both participate, i.e., the item-set A ∪ B. This gives us two measures, length (the number of participating users in the set) and frequency (the number of posts where all users in the set are participating). These two steps can be summarized as building frequent item-sets ({I}). Finally, all possible rules from the computed {I}s are generated. In this step, we also compute the evaluation metrics described below.
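The data model described above can be illustrated with a small sketch. It is only a brute-force illustration on hypothetical toy data (the post ids, user names and the `count_item_sets` helper are assumptions made for this example), not the algorithm used in the experiments:

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each post (transaction) maps to the set of
# users (items) who commented on it.
posts = {
    "p1": {"A", "B", "C"},
    "p2": {"A", "B"},
    "p3": {"B", "C"},
    "p4": {"A", "B", "C"},
}


def count_item_sets(transactions, max_length=2):
    """Count, for every user set up to max_length, the posts that contain it."""
    counts = Counter()
    for users in transactions.values():
        for length in range(1, max_length + 1):
            for item_set in combinations(sorted(users), length):
                counts[frozenset(item_set)] += 1
    return counts


counts = count_item_sets(posts)
freq_a = counts[frozenset({"A"})]        # posts where A is active
freq_b = counts[frozenset({"B"})]        # posts where B is active
freq_ab = counts[frozenset({"A", "B"})]  # posts where A and B are both active
print(freq_a, freq_b, freq_ab)
```

Enumerating every combination grows exponentially with the number of users per post, which is why a dedicated frequent item-set algorithm is needed for real pages.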

Evaluation Metrics
Several metrics exist that help to understand the learned association rules. The first measure, Support, shows how large a portion of {D} the item-set covers. It is calculated by dividing the frequency of a given item-set, {I}, by the total number of transactions (posts) in our dataset, {D}, e.g., the number of occurrences of {A, B} divided by the number of transactions in {D}, as shown in Equation (1):
$$\mathrm{supp}(I) = \frac{\mathrm{freq}(I)}{|D|} \tag{1}$$
The second measure, Confidence, indicates the proportion of transactions containing {A, B} that also contain C in the set of transactions in {D}, given the rule {A, B} ⇒ C. Confidence is calculated as shown in Equation (2). Say that {A, B, C} participates in four common posts and {A, B} participates in eight posts in total. This leads to 4/8 = 0.5, i.e., the confidence that C will participate in a post where A and B are already active is 50%:
$$\mathrm{conf}(\{A, B\} \Rightarrow C) = \frac{\mathrm{supp}(\{A, B, C\})}{\mathrm{supp}(\{A, B\})} \tag{2}$$
The third measure, Lift, shows the ratio of interdependence of the observed values. As we see from Equation (3), a lift of 1 implies that the left-hand side and the right-hand side of the rule are independent of each other. However, a lift > 1 indicates a dependency between our item-sets:
$$\mathrm{lift}(\{A, B\} \Rightarrow C) = \frac{\mathrm{supp}(\{A, B, C\})}{\mathrm{supp}(\{A, B\}) \cdot \mathrm{supp}(C)} \tag{3}$$
Finally, Conviction is the ratio of the expected frequency with which {A, B} occurs without C, if the two were independent, to the observed frequency of such incorrect predictions, as shown in Equation (4). Notably, conviction is infinite (due to division by zero) when the confidence is 1:
$$\mathrm{conv}(\{A, B\} \Rightarrow C) = \frac{1 - \mathrm{supp}(C)}{1 - \mathrm{conf}(\{A, B\} \Rightarrow C)} \tag{4}$$
The described measures enable understanding of the learned rules in {D}, where higher values of all four measures indicate that the learned rule has relevance for prediction.
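As a worked illustration, the four measures can be computed from raw post counts. The sketch below uses the standard definitions of support, confidence, lift and conviction; the function name `rule_metrics` and the counts for C and for the total number of posts are hypothetical values chosen for the example, while 4 and 8 are the counts from the confidence example in the text:

```python
import math


def rule_metrics(n_both, n_antecedent, n_consequent, n_transactions):
    """Support, confidence, lift and conviction for antecedent => consequent.

    n_both:         posts containing antecedent and consequent together
    n_antecedent:   posts containing the antecedent, e.g. {A, B}
    n_consequent:   posts containing the consequent, e.g. C
    n_transactions: total number of posts in the dataset D
    """
    support = n_both / n_transactions
    confidence = n_both / n_antecedent
    consequent_support = n_consequent / n_transactions
    lift = confidence / consequent_support
    if confidence == 1.0:
        # Division by zero in the conviction formula: treated as infinite.
        conviction = math.inf
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return support, confidence, lift, conviction


# {A, B, C} appears in 4 posts and {A, B} in 8 posts (from the text);
# C in 10 posts and 100 posts in total are assumed for illustration.
support, confidence, lift, conviction = rule_metrics(4, 8, 10, 100)
print(support, confidence, lift, conviction)
```

With these assumed counts, the confidence is 0.5 as in the text, and a lift above 1 signals that C comments on posts with A and B more often than independence would suggest.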

Usage of the Eclat Algorithm
To build association rules from our dataset, we evaluated several implementations. Agrawal and Srikant [34] presented the Apriori algorithm, which was proven to be an efficient method for association rule learning. However, this algorithm has been shown to have efficiency issues with large datasets [35], and the identified implementation for Python is very slow (in our dataset, it was not possible to get a result within a reasonable time). Hence, other algorithms were tested, in particular, the Eclat algorithm [36]. The Eclat algorithm quickly discards items with low frequency by taking a minimum number of associations as an input parameter. In our dataset, we found a threshold of four to be a reasonable trade-off between resolution and speed; item-sets with a lower frequency are ignored. The use of four as a lower bound was identified empirically by starting at the number of comments divided by the number of users and then calculating the item-sets with a decreasing threshold until the execution time reached 10 s. At 10 s, all available RAM in our experiment environment was exhausted, and we stopped the execution. For one of the investigated pages, we saw that with a threshold of five, we can generate 4230 item-sets in 350 ms, and with a threshold of four, we can generate 9117 item-sets in 600 ms. A threshold of three fills up available resources and never completes the calculations.
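To make the role of the threshold concrete, the following is a minimal sketch of the Eclat idea: a vertical layout in which each user maps to the set of posts they commented on, extended depth-first by intersecting those sets. It is not the implementation used in the experiments; the function name `eclat` and the toy data are assumptions for illustration.

```python
def eclat(transactions, min_support):
    """Return {frozenset(users): frequency} for every item-set that occurs
    in at least min_support transactions, using tid-set intersections."""
    # Vertical layout: user -> set of post ids the user commented on.
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # Every (item, tids) pair in candidates is frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            item_set = prefix | {item}
            frequent[frozenset(item_set)] = len(tids)
            # Intersect with the remaining candidates and prune early.
            narrowed = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    narrowed.append((other, common))
            if narrowed:
                extend(item_set, narrowed)

    start = sorted(
        ((item, tids) for item, tids in tidsets.items()
         if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend(frozenset(), start)
    return frequent


# Hypothetical toy page with five posts.
toy_posts = {
    1: {"A", "B", "C"},
    2: {"A", "B"},
    3: {"B", "C"},
    4: {"A", "B", "C"},
    5: {"D"},
}
item_sets = eclat(toy_posts, min_support=3)
print(len(item_sets))
```

Lowering `min_support` keeps more candidates alive at every level of the recursion, which is consistent with the sharp growth in item-sets and memory use reported above.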

Data Model
The data used in this study have been obtained from the crawler described by Erlandsson et al. [37]. This crawler gathers complete posts from Facebook. In this context, the term complete stands for posts that contain all likes and comments created up to the crawling time as well as the data about the users who have created them. Our current dataset, captured from public pages and groups on Facebook, consists of over 56 million posts, 560 million comments and 7.3 billion likes made by 820 million Facebook users. The crawled data were parsed and made available from an SQL database, structured as described in [38], making all fields needed for our task available. In this study, we assume that the investigated posts will not get any new comments. We simplify the dynamics of social media by saying that the posts we are investigating were "dead" when the data were collected, where the term dead posts refers to posts that no longer attract attention, new comments, or likes.
This study is limited to active users only. Thus, we exclude posts with fewer than 20 comments and users who made fewer than five comments, as such users are considered to be occasional visitors and not real page participants.
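The activity filter can be sketched as follows. The source does not state in which order the two thresholds are applied, so this sketch assumes that both counts are taken on the unfiltered comments in a single pass; the function name `filter_active` and the `(post_id, user_id)` input format are also assumptions.

```python
from collections import Counter


def filter_active(comments, min_post_comments=20, min_user_comments=5):
    """comments: list of (post_id, user_id) pairs, one pair per comment.

    Returns {post_id: set(user_id)} restricted to posts with at least
    min_post_comments comments and users with at least min_user_comments
    comments (both counted on the unfiltered data, an assumption).
    """
    per_post = Counter(post for post, _ in comments)
    per_user = Counter(user for _, user in comments)
    transactions = {}
    for post, user in comments:
        if (per_post[post] >= min_post_comments
                and per_user[user] >= min_user_comments):
            transactions.setdefault(post, set()).add(user)
    return transactions


# Toy example with lowered thresholds so the effect is visible.
toy_comments = [
    ("p1", "A"), ("p1", "A"), ("p1", "B"), ("p1", "C"),
    ("p2", "A"), ("p2", "B"),
]
active = filter_active(toy_comments, min_post_comments=3, min_user_comments=2)
print(active)
```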

Data Selection
We have sampled 195 pages from our dataset, varying in terms of the number of users, posts, comments and user activity to make the sample of Facebook data as broad and as diverse as possible. Despite calculating the rules on a server with 144 GB of RAM and a 24-core processor, we could not calculate the rules for the biggest pages (44 of them); thus, we had to remove them from our dataset. An example of such a page is Fox News with 837,176 users, 4485 posts, 6,967,304 comments, and a lifetime of 2034 days (almost six years). An additional 43 pages had to be removed because they were too small, i.e., having fewer than 10 posts with more than 20 comments and/or fewer than 10 users with more than five comments. After the preprocessing, we still had 108 pages ranging from 152 to 675,200 active users, from 18 to 161,264 posts, and from 577 to 1,340,730 comments. Table 1 presents the descriptive statistics of this dataset.
For the initial results, the page OccupyTogether [39] has been selected. This page was selected based on the following properties: it is active, it has a high number of users, and it is political with a biased user group (most of the users have positive perceptions of the Occupy movement). It was also selected as it is a page in the median range of the complete dataset with respect to the number of active users, 2443, and active posts, 610.

Experiments and Results
To verify the findings, several experiments were executed. These experiments were first performed on the page OccupyTogether and were then extended to the whole dataset described in Section 4 for verification of the results. Firstly, a comprehensive experiment of association rule learning was conducted. Secondly, the learned rules were evaluated with respect to prediction accuracy of user participation using a training/test split (80/20). Finally, social network analyses for each page were performed to verify and evaluate ranked users identified as influential by the first experiment.

Item-Sets and Rules
Using the methods described in Section 3.2, an experiment was performed to create frequent item-sets and build association rules for these sets. The resulting frequent item-sets are depicted in Figure 1 for the page OccupyTogether. This figure illustrates frequency, or the number of occurrences for each item-set, with respect to the length of elements, or the number of collaborating users. The main scatter-plot illustrates how the frequency decreases when the number of users (length) increases, a natural feature of frequent item-sets. Figure 1 also depicts the distributions as histograms. The top histogram, in green, shows the distribution of frequency, and the histogram on the right-hand side, in red, shows the distribution of the length of the learned item-sets. The top histogram (in green) illustrates a significant density of user collaboration that occurs at a low frequency, between 1 and 10. This is natural as the frequency of user participation decreases for most of the users. Noticeable in the length distribution (in red) is the fact that the density is higher for two and three participating users than for just one. This is because there exist more combinations of users than single users. Association rules supporting the hypothesis of user participation based on other users' activities were computed from the calculated frequent item-sets. This resulted in 55,166 rules for the page OccupyTogether [39]. Table 2 shows descriptive statistics for all the computed rules. It can be noted that although the confidence median and mean are low, the high level of lift indicates a high dependency in the learned rules, i.e., the computed rules show that our hypothesis is valid and users tend to follow each other. Since our dataset is big, with many users and many posts, a low support mean and median are expected. Moreover, it is noticeable that users are not active in all posts but rather in a subset of them.
Figure 2 depicts the distributions of Support, Confidence, Lift, Conviction and Frequency, respectively, in our learned model. The figures are violin-plots, which illustrate the kernel density (shown as height and depth) in addition to normal box-plots with outer quartiles as thin lines, inner quartiles as bold lines and the mean as a white dot. Figure 2a shows a dense distribution of support at 0.025 and, interestingly, a higher density at 0.20. The confidence distribution is illustrated in Figure 2b, in which we obtained a dense distribution around 1.0, i.e., there is a significant number of learned rules with high confidence, and these rules are thus accurate. Figure 2c shows that the lift measure has a heavy-tailed distribution. In addition, Figure 2d shows that the distribution of conviction is concentrated between zero and five. Notably, when sorting by confidence and lift, the conviction is infinite; this is due to the confidence of 1.0, as follows from how conviction is calculated in Equation (4). All of the rules in Table 3 have high confidence and show high dependency (via the lift metric), i.e., the top five rules sorted by either Confidence, Lift or Conviction are relevant for predicting user participation.
The rule {u_580, u_861, u_1352, u_1466} ⇒ {u_896, u_1291}, presented in Table 3 with a confidence of 1.0 and a lift of 152.5, strongly indicates that the left-hand-side user set influences the right-hand-side user set, i.e., when the left-hand-side user set is active on a post, the right-hand-side user set will also be active. A confidence of 1.0 means that in 100% of the posts where the left-hand-side user set is active, the right-hand-side user set is also active. A lift value of 152.5, in this specific rule, shows that the right-hand-side user set is dependent on the left. Considering rules where at least two separate users affect another user with a confidence of at least 95%, we can reduce the 55,166 rules to 4959 rules, which have a median lift of 4.80 and a median support of 0.21. In other words, we have close to 5000 rules that strongly indicate that users are affected by each other when it comes to participating in online social networks. From the learned rules, we can also identify influential users, i.e., the users that exist on the left side of multiple rules, as presented in Section 5.3.
The learned rules of the complete dataset are presented in Table 4, after filtering out rules with Confidence below 95%.
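The rule generation and filtering step can be sketched as below. The function name `generate_rules`, the tuple layout of a rule and the toy frequencies are assumptions for illustration; the sketch also assumes that every subset of a frequent item-set is itself present in the input, which holds for the output of frequent item-set miners such as Eclat.

```python
from itertools import combinations


def generate_rules(frequent, n_transactions, min_confidence=0.95,
                   min_antecedent=2):
    """Generate rules lhs => rhs from frequent item-sets.

    frequent: {frozenset(users): frequency}; every subset of a listed
    item-set is assumed to be listed as well.
    Only rules with at least min_antecedent users on the left-hand side
    and a confidence of at least min_confidence are kept.
    """
    rules = []
    for item_set, freq in frequent.items():
        # The left-hand side must leave at least one user for the right.
        for size in range(min_antecedent, len(item_set)):
            for lhs_users in combinations(sorted(item_set), size):
                lhs = frozenset(lhs_users)
                rhs = item_set - lhs
                confidence = freq / frequent[lhs]
                if confidence >= min_confidence:
                    lift = confidence / (frequent[rhs] / n_transactions)
                    rules.append((lhs, rhs, confidence, lift))
    return rules


# Hypothetical frequencies for three users over ten posts.
A, B, C = "A", "B", "C"
toy_frequent = {
    frozenset({A}): 6, frozenset({B}): 5, frozenset({C}): 4,
    frozenset({A, B}): 5, frozenset({A, C}): 4, frozenset({B, C}): 4,
    frozenset({A, B, C}): 4,
}
toy_rules = generate_rules(toy_frequent, n_transactions=10)
for lhs, rhs, confidence, lift in toy_rules:
    print(sorted(lhs), "=>", sorted(rhs), confidence, round(lift, 2))
```

In this toy example, {A, B} ⇒ C is dropped because its confidence is 4/5 = 0.8, while the two remaining rules have a confidence of 1.0.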

Verification of Learned Rules
To test how well association rule learning works for predicting user participation, a split, learn and test pattern has been used. For the page in question, we sort all comments based on creation time and use the first 80% for learning and the last 20% of the posts for testing. The learning part is performed as described in Section 3.2, and the testing part is carried out as follows: for each post with comments in the testing set, the active users are considered by finding rules that affect the users with respect to temporal order. Say that user D is commenting on a post (in the testing set), and there exists a rule saying that A, B and C affect user D; this rule will only be considered valid if all of A, B and C have made at least one comment each before D makes a comment. Of the 787 intersecting users between the learning and test sets, it is possible to predict 113 (14.36%) users, making use of 5310 (9.63%) of the original 55,166 rules.
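The temporal validity check can be sketched as follows. The function name `rule_is_valid` and the `(timestamp, user)` representation of the comments on a single test post are assumptions for illustration.

```python
def rule_is_valid(post_comments, antecedent, target):
    """Check the temporal condition for a rule antecedent => target on one post.

    post_comments: list of (timestamp, user) pairs for a single test post.
    Returns True only if every user in antecedent has commented at least
    once before the first comment made by target.
    """
    seen = set()
    for _, user in sorted(post_comments):
        if user == target:
            return set(antecedent) <= seen
        seen.add(user)
    return False  # target never commented on this post


# A, B and C all comment before D: the rule {A, B, C} => D is valid here.
in_order = [(1, "A"), (2, "B"), (3, "C"), (4, "D")]
# D comments before B and C: the rule is not considered valid.
out_of_order = [(1, "A"), (2, "D"), (3, "B"), (4, "C")]

print(rule_is_valid(in_order, {"A", "B", "C"}, "D"))
print(rule_is_valid(out_of_order, {"A", "B", "C"}, "D"))
```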
To calculate the accuracy and precision of the learned rules, we have defined true/false positives/negatives as follows: a true positive is when a rule predicts user activeness, and the user is active. A false positive is when a rule predicts user activeness, but the user is not active. A true negative is when no user is active, and there is no rule. A false negative is when a user is active, but there is no rule. An example of all four classes is shown in Table 5. The resulting accuracy, precision, and recall are presented in Table 6. The recall, calculated as TP/(TP+FN), is low because there are many false negatives. The relatively high accuracy, calculated as (TN+TP)/(TP+FP+TN+FN), is achieved through a relatively high number of true negatives. In general, the unfiltered rules show a lower accuracy, precision, and recall compared to the filtered rules. Furthermore, the complexity of the rule set is reduced by filtering the rules, indicating the benefit of rule filtering. The rule set was reduced by approximately 93% on average. A less complex rule set could be easier to test and also to understand.
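The three scores follow directly from the four counts. The sketch below uses the accuracy and recall formulas given in the text and the standard definition of precision, TP/(TP+FP); the function name and the example counts are hypothetical and are not results from the study.

```python
def classification_scores(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: many true negatives and many false negatives.
accuracy, precision, recall = classification_scores(tp=10, fp=5, tn=900, fn=85)
print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

The example shows the pattern discussed above: a large number of true negatives keeps the accuracy high even when many false negatives pull the recall down.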

Identifying and Verifying Influential Users Using Social Network Analysis
The state-of-the-art method for identifying influential users is social network analysis (SNA), using the methods Page Rank Centrality [3] or Degree Centrality [40] for ranking users. It is of interest to see how well influential users identified using association rule learning (ARL) match the state-of-the-art techniques. Therefore, we have conducted an SNA of our pages as follows: for each page, we have created social networks in such a way that two users are linked together if they commented on the same post; next, for all social networks, Page Rank [3] and Degree [40] measures have been calculated; and, based on those measures, two ordered (descending) user lists were created, one for each of them.
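The network construction and the two rankings can be sketched in plain Python. The sketch assumes an undirected, unweighted network and a Page Rank damping factor of 0.85 with a fixed number of power iterations; none of these settings are stated in the text, and the function names are hypothetical.

```python
from itertools import combinations


def build_network(posts):
    """posts: {post_id: set(users)}. Two users are linked if they
    commented on the same post. Returns {user: set(neighbours)}."""
    graph = {}
    for users in posts.values():
        for user in users:
            graph.setdefault(user, set())
        for a, b in combinations(users, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_ranking(graph):
    """Users in descending order of degree (ties broken by name)."""
    return sorted(graph, key=lambda u: (-len(graph[u]), u))


def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration Page Rank on an undirected graph."""
    n = len(graph)
    rank = {u: 1.0 / n for u in graph}
    for _ in range(iterations):
        # Rank held by isolated users is spread evenly over all users.
        dangling = sum(rank[u] for u in graph if not graph[u])
        new_rank = {}
        for u in graph:
            incoming = sum(rank[v] / len(graph[v]) for v in graph[u])
            new_rank[u] = ((1 - damping) / n
                           + damping * (incoming + dangling / n))
        rank = new_rank
    return rank


toy_posts = {"p1": {"A", "B", "C"}, "p2": {"A", "D"}}
network = build_network(toy_posts)
by_degree = degree_ranking(network)
scores = pagerank(network)
by_pagerank = sorted(scores, key=lambda u: (-scores[u], u))
print(by_degree, by_pagerank)
```

Degree only inspects a user's direct neighbours, whereas Page Rank iterates over the whole network until the scores settle.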
We have created a similar list for the most influential users from association rule learning. The most influential users are defined as the top-k users from the left side of the rules, with a confidence level greater than 95%, that affect other users to comment on posts. In the most influential users list, users are ranked based on how often they appear on the left side of a rule, e.g., if user A has appeared three times in all rules and users B, C and D have appeared one, five and four times, respectively, the list will look as follows: [C, D, A, B].
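The ranking by left-hand-side appearances can be sketched as follows. The function name `rank_influential`, the `(lhs, rhs, confidence)` rule layout and the toy rules are assumptions; the toy rules are built so that A, B, C and D appear three, one, five and four times, as in the example above.

```python
from collections import Counter


def rank_influential(rules, min_confidence=0.95):
    """Rank users by how often they appear on the left-hand side of rules
    whose confidence is greater than min_confidence.

    rules: iterable of (lhs_users, rhs_users, confidence) tuples.
    """
    counts = Counter()
    for lhs, _rhs, confidence in rules:
        if confidence > min_confidence:
            counts.update(lhs)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [user for user, _ in ordered]


toy_rules = [
    ({"C", "D", "A"}, {"X"}, 1.00),
    ({"C", "D", "A"}, {"Y"}, 0.97),
    ({"C", "D", "A"}, {"Z"}, 0.96),
    ({"C", "D", "B"}, {"X"}, 1.00),
    ({"C"}, {"Y"}, 0.99),
    ({"A", "B"}, {"X"}, 0.50),  # below the confidence threshold, ignored
]
print(rank_influential(toy_rules))
```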
Finally, we compared the most influential users identified from association rule learning with the top users according to Degree and Page Rank. The comparisons between association rule learning, Degree and Page Rank considered the top 1%, 5%, 10%, 25%, 50%, 75%, and 100% of the most influential users identified by association rule learning, respectively. Each comparison was made as an intersection of two sets created from two lists. For example, if the top four users are [A, B, C, D] for Degree and [F, A, C, D] for association rule learning, the intersection of those two sets will be [A, C, D], the size of that set is three, and, in this case, the similarity is 75%.
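The similarity measure can be written in a few lines. The function name `top_k_similarity` is hypothetical; the example reproduces the Degree and association rule learning lists from the text.

```python
def top_k_similarity(ranking_a, ranking_b, k):
    """Share of users that the top-k of two rankings have in common."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return len(top_a & top_b) / k


degree_top = ["A", "B", "C", "D"]
arl_top = ["F", "A", "C", "D"]
similarity = top_k_similarity(degree_top, arl_top, k=4)
print(similarity)  # 3 common users out of 4
```

Because the comparison is set-based, it ignores the order of users within the top-k lists; only membership matters.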
An example of the SNA for one of the pages [39] is presented in Table 7. The table shows that for the top 209 users on the page OccupyTogether (the 50% most influential users from association rule learning), there is a similarity of 95% between the users ranked by Page Rank and Degree. When considering users ranked from association rule learning, there is a similarity of 51% compared to Degree and 53% compared to Page Rank. From the SNA, we gained yet another interesting insight into users' behavior in social media pages. We noticed that the 10% of users with the highest value of the degree measure created 82.64% of the posts on average, and an additional 10% of the most important users add only about four more percentage points, i.e., the 20% of users with the highest value of the degree measure create 86.84% of the posts on average. In Figure 3, the distribution of that phenomenon is depicted for all pages.
As described above, three different approaches were used to detect the most influential users. The intersections between the different user lists were then calculated to evaluate how much each method differs from the others. To detect whether any statistically significant difference exists, Friedman's test was used with the Nemenyi post hoc test. Friedman's test is a non-parametric statistical test that ranks the methods over datasets [41]. When a normal distribution cannot be assumed and several datasets are used, Friedman's test has been suggested as preferable when comparing algorithms [42]. The Nemenyi post hoc test evaluates between which intersections a significant difference exists. The means and standard deviations for the intersections of several posts are presented in Table 8. A low standard deviation indicates that the expected value, i.e., the intersection between two sets, is close to the mean. However, there might still exist results which are not close to the mean, e.g., as seen in Table 7. The averages show that, regardless of the size of the intersection, Page Rank ∩ Degree has more users in common than the other intersections. While Page Rank and Degree, considered state-of-the-art, have a high number of users in common (see Page Rank ∩ Degree in Table 8), the rule based learner has fewer users in common with both the Page Rank method (Page Rank ∩ ARL) and the Degree method (Degree ∩ ARL). Friedman's test shows that there are some significant differences between the intersections (χ² = 210, df = 2, p = 0.01). The Nemenyi test result (see Table 9) demonstrates that the Page Rank ∩ Degree set performs significantly better than the Degree ∩ ARL set at a confidence level of both 0.95 and 0.99.
The three methods also differ in the amount of time needed to identify influential users. This is shown in Table 10.
Rule based learning is suggested to be the fastest method, and Page Rank the slowest. This might be explained by Page Rank being a global measure, whereas Degree is a local measure. The execution times of the different methods with their confidence intervals are also presented in Figure 4, where intuitively it would seem that the rule based learner has a significantly lower execution time than the other methods. Whether there is any statistically significant difference is evaluated using a Kruskal-Wallis test followed by a pair-wise Wilcoxon post hoc test [41]. The Kruskal-Wallis test is used to see if there is a significant difference between any of the methods, and the post hoc test is used to detect between which methods the differences exist. The Kruskal-Wallis test detected a significant difference between the methods (χ² = 6.626, df = 2, p < 0.05). The Wilcoxon post hoc tests showed a significant difference between Rule based and Degree (p < 0.05, w = 14130). No other statistically significant differences were found. While there exists a large difference in mean, there is no detectable significant difference between the association rule based method and Page Rank (p = 0.054, w = 13704). This might be due to the high standard deviation.

Discussion
Users within online social networks generate a large amount of data in the form of interactions (comments and likes). Not enough attention has been paid to the analysis of how users influence each other and how to predict the behavior of users within Facebook groups. In this paper, we have collected a significant amount of user data and then, by using association rule learning, examined how users influence each other. Based on the results and analysis, we are able to determine to what extent users influence other users to participate and interact in new groups.
To verify the results from the page OccupyTogether, an additional 195 pages were sampled. These pages were reduced to 108 due to size constraints. Arguably, pages that were too large could have been processed by limiting the time span, i.e., instead of considering all six years of the page, a time span of the latest six months could have been considered. Association rules were computed for each page in our dataset. For association rules with a confidence of at least 95%, the number of rules per page had a mean of 33,426.89 (sd = 87,457.39) and a median of 2351.
The computed rules were tested resulting in an average of 0.913 (sd = 0.115) for accuracy, 0.614 (sd = 0.340) for precision, and 0.141 (sd = 0.256) for recall when predicting user activity on a post. In other words, it is possible to predict a subset of users' future participation with high correctness.
The results also indicate that influential users can be identified using association rule learning. That is, users on the left-hand side of a rule with high confidence and high lift are influencing users on the right-hand side to participate in the conversation. These results have been verified and compared with the traditional network analysis methods, Page Rank Centrality and Degree Centrality, showing that at best ∼30% of the users ranked using association rule learning overlap with the users ranked using traditional methods.
Interestingly, association rule learning is more than an order of magnitude faster in mean execution time for ranking users than the other methods. Another finding related to the ranking of users is that we see no significant difference between ranked influential users based on Page Rank or Degree. However, we show that Page Rank is a more time-consuming algorithm.
The main disadvantage of association rule learning is the fact that we cannot extract rules for the biggest pages in our dataset. We have not shown in this paper that association rule learning is better or worse than other approaches; however, that was not the point of our research. Since there is no ground truth, it is not possible to say which approach is better (or worse). Our objective was to present a different approach for identifying influential users and leave the final decision of which approach to use to the researcher.
Furthermore, from the list of influential users presented in Section 5.3, it is also possible to limit the size of the item-set. This will result in increased speed when building rules without a significant decrease in the quality of the rules. As a validity threat, information on Facebook is filtered by a secret algorithm. This poses a potential threat to our results, as users are presented posts filtered by the algorithm. For example, a user might not comment on a post due to visibility (the filtering algorithm is not presenting the post to the user) rather than due to the topic.

Conclusions
This article presents four contributions. Firstly, insights on user behavior on public pages on Facebook indicate that the top 10% and top 20% of users account for the vast majority of the content. Secondly, it is possible to identify influential users using association rule learning. The results indicate no statistically significant difference between our rule based method and Page Rank. Thirdly, execution times of well-known methods for ranking users in social media, together with our approach using association rule learning, are investigated. The results suggest that rule based ranking of users has a lower execution time compared to state-of-the-art methods, 9.0 vs. 633.1 and 329.1 seconds on average. Finally, the article verifies how association rule learning can be used to predict user participation in social media pages on Facebook. The results indicate an average prediction accuracy of 0.913 (sd = 0.115) for the association rule learning approach.
For future work, it would be interesting to investigate rule creation with a time series perspective of the data, e.g., using a sliding window approach. Additionally, methods for selecting a subset of users for rule creation need to be investigated.