Hot Topic Community Discovery on Cross Social Networks

The rapid development of online social networks has allowed users to obtain information, communicate with each other and express different opinions. Generally, in the same social network, users tend to be influenced by each other and have similar views. However, on another social network, users may have opposite views on the same event. Therefore, research undertaken on a single social network is unable to meet the needs of research on hot topic community discovery. “Cross social network” refers to multiple social networks. The integration of information from multiple social network platforms forms a new unified dataset. In the dataset, information from different platforms for the same event may contain similar or unique topics. This paper proposes a hot topic discovery method on cross social networks. First, text data from different social networks are fused to build a unified model. Then, we obtain latent topic distributions from the unified model using the Labeled Biterm Latent Dirichlet Allocation (LB-LDA) model. Based on the distributions, similar topics are clustered to form several topic communities. Finally, we choose hot topic communities based on their scores. Experimental results on data from three social networks show that our model is effective and has practical application value.


Introduction
The era of Web 2.0 has witnessed the rapid expansion of online social networks that allow users to obtain information, communicate with each other, and express different opinions. The content published on these social networks usually includes news events, personal life, social topics, etc. This information is used not only to discover hot topics and analyze topic evolution, but also to analyze and supervise public opinion.
Existing hot topic discovery methods are mainly limited to a single social network, such as Sina Weibo or Twitter. Generally, within the same social network, users influence each other, which results in similar points of view. However, on another social network, users may hold the opposite opinion about the same event. Therefore, research undertaken on a single social network is unable to meet the needs of research on hot topic community discovery.
In this paper, we propose a hot topic discovery method on cross social networks. "Cross social networks" refers to multiple social network platforms. The integration of information from multiple social network platforms forms a new unified dataset. In the dataset, information from different platforms about the same event may contain similar or unique topics. First, we fuse data from different social networks and build a unified model, which contains a large amount of short text. Second, a new topic model called Labeled Biterm Latent Dirichlet Allocation (LB-LDA) is proposed to obtain latent topic distributions. Third, topics are clustered to obtain multiple topic communities, each consisting of several latent topic labels. Finally, the scores of the topic communities are calculated, and communities with higher scores are regarded as hot topic communities. Experiments on data from three different social networks show that the method can find the hot topics of the period covered by the data, and the hot topics are verified to be effective. The main innovations of this paper are as follows:

• We conduct hot topic discovery on cross social networks instead of a single social network.
• We propose a new topic model called LB-LDA, which can relieve the sparseness of the topic distribution.
The remainder of the paper is organized as follows. Section 2 presents an overview of related work. In Section 3, we elaborate on our hot topic discovery method on cross social networks. Section 4 describes the experimental results and presents our final hot topic communities. In Section 5, conclusions and suggestions for future research are given.

Existing Research on Cross Social Network
There is not much research on cross social networks. Skeels et al. [1] studied the usefulness of different social networks in large organizations. In 2010, Morris et al. [2] compared information seeking using search engines and social networks, and the results showed that it was desirable to query search engines and social networks simultaneously. Most research focuses on user identification across different social networks in order to make recommendations. Dale and Brown [3] proposed a method to aggregate social networking data by receiving first authentication information for a first social networking service. Farseev et al. [4] performed cross social network collaborative recommendation and showed that fusing multi-source data achieves higher recommendation performance than various single-source baselines. Tian et al. [5] demonstrated a more powerful phishing attack by extracting users' social behaviors along with other basic user information among different online social networks. Shu et al. [6] proposed a CROSS-media joint Friend and Item Recommendation framework (CrossFire), which can recommend friends and items on a social media site.

Existing Research on Topic Model
Topic modeling techniques have been widely used in natural language processing to discover latent semantic structures. The earliest topic model was Latent Semantic Analysis (LSA), proposed by Deerwester et al. [7]. This model analyzed document collections and built a vocabulary-text matrix. Using Singular Value Decomposition (SVD), researchers can build the latent semantic space. Later, Hofmann et al. [8] proposed Probabilistic Latent Semantic Analysis (PLSA), which improved upon the LSA model. PLSA assumes that documents include many latent topics and that the topics are related to words. Building on PLSA, Blei et al. [9] introduced the Dirichlet distribution and proposed the Latent Dirichlet Allocation (LDA) approach. Owing to its generative characteristics, LDA has been improved and used in many different areas. The drawbacks of LDA are that its topic distributions tend to be less targeted and lack definite meaning. For example, Ramage et al. [10] improved the unsupervised LDA model by creating a supervised topic model called Labeled-LDA, in which researchers can attach meanings to the topics. Separately, many researchers chose to add a level to the three levels of document-topic-word. Ivan et al. [11] proposed a multi-grain model that divided the topics into two parts: local topics and global topics. This model was used to extract the ratable aspects of objects from online user reviews. A range of other approaches have been used as well. Chen et al. [12] modeled users' social connections and proposed a People Opinion Topic (POT) model that can detect social communities and analyze sentiment. Iwata et al. [13] took time into consideration and proposed a topic model for tracking time-varying consumer purchasing behavior. To recommend locations to be visited, Kurashima et al. [14] proposed a geographic topic model to analyze the location log data of multiple users. Chemudugunta et al. [15] suggested that a model can be used for information retrieval by matching documents both at a general topic level and at a specific level, and Lin et al. [16] proposed the Joint Sentiment Topic (JST) model, which can be used to analyze the sentiment tendency of documents. Wang et al. [17] proposed the Life Aspect-based Sentiment Topic (LAST) model to mine from other products the prior knowledge of aspect, opinion, and their correspondence. Targeting short text, Cheng et al. [18] proposed the Biterm Topic Model (BTM), which enlarged the text content by defining word pairs in one text as biterms.

Existing Research on Topic Discovery
Topic models have been widely used in hot topic discovery. Wang et al. [19] presented topical n-grams, a topic model that discovers topics as well as topical phrases. Vaca et al. [20] introduced a novel framework inspired by Collective Factorization for online topic discovery that is able to connect topics between different time slots. Li et al. [21] proposed a double-layer text clustering model based on a density clustering strategy and a Single-pass strategy to process network data and discover hot news based on a user's interest and topic. Liu [22] proposed an effective algorithm to detect and track hot topics based on chains of causes (TDT_CC), which can be used to track the heat of a topic in real time. All of these methods are limited to a single social network; therefore, it is necessary to discover hot topics on cross social networks.

Hot Topic Community Discovery Model
The general process of our method is shown in Figure 1. First, we collect text data from different social networks. Then, we preprocess the data and establish a unified model (dataset) as the corpus. A single record in this corpus includes time, label, content, and source. Because the corpus contains short text data, which may lead to sparse topic distributions, the LB-LDA topic model is proposed to obtain the topic distributions. Based on these topic distributions, similar topics are clustered to form topic communities, each of which contains a certain number of topic labels. Finally, the scores of the communities are calculated, and communities with higher scores are chosen as hot topic communities. Overall, the main purpose of our model is to discover topics from cross social networks and cluster similar ones to form hot topic communities.

Introduction to Cross Social Network
In cross social networks, different social networks often have different data formats and presentations. News sites and the Weibo website are representative of different social networks, as shown in Figures 2 and 3. As Figure 2 shows, news sites usually contain the news title, news time, news source, and news content. Figure 3 shows that Weibo information generally includes the user ID, Weibo time, Weibo source, and Weibo content. In this paper, we only take text content into consideration.

Unified Model Establishment Method
Although social networks have different data formats, their information shares common parts. To establish our unified model, we keep the data title, data time, and data content. If the data does not contain a title, as is the case for Weibo, this part is set to null temporarily.
To capture the meaning of the latent topics, labels need to be added to each piece of data. The data can be divided into different cases. For data containing a title, word segmentation is performed on the title only, and the entity words are selected as the labels. For data without a title, word segmentation is performed on the content, and some of the entity words are chosen as labels. For data from social networks that support hashtags, both the hashtags and the user-generated content are taken into consideration: the user-generated content is split into words and matched against the hashtags. If there is a match, the matched tags are used as the text labels; if not, the entity words are chosen as the labels of the text. In addition, before data fusion, a data item is added to identify the source of the data, such as Sina Weibo. Overall, a piece of data in our unified model contains four parts: data time, data labels, data content, and data source.
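The label selection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names UnifiedRecord and choose_labels are hypothetical, and word segmentation and entity-word extraction are assumed to have been done upstream by some other tool, so the function only receives their results.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class UnifiedRecord:
    """One piece of data in the unified model (hypothetical field names)."""
    time: str
    labels: List[str]
    content: str
    source: str


def choose_labels(title_entities: Optional[Sequence[str]],
                  content_entities: Sequence[str],
                  content_words: Sequence[str],
                  hashtags: Sequence[str]) -> List[str]:
    """Pick labels following the three cases described in the text.

    title_entities   -- entity words from the title, or None if there is no title
    content_entities -- entity words extracted from the content
    content_words    -- all words of the segmented content
    hashtags         -- hashtags attached to the post (may be empty)
    """
    # Case 1: data with a title uses the entity words of the title.
    if title_entities:
        return list(title_entities)
    # Case 2: hashtags that also occur in the content become the labels.
    words = set(content_words)
    matched = [tag for tag in hashtags if tag in words]
    if matched:
        return matched
    # Case 3: otherwise fall back to entity words of the content.
    return list(content_entities)


record = UnifiedRecord(
    time="2011-03-11",
    labels=choose_labels(None, ["earthquake", "Japan"],
                         ["earthquake", "relief", "Japan"], ["relief"]),
    content="earthquake relief Japan",
    source="Sina Weibo",
)
```

Keeping the source field on every record makes it possible to trace a label back to its platform after the data are fused.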

LB-LDA Model
By analyzing latent topics in documents, topic models can mine semantic connotations. However, the latent topics generated by the LDA model do not have clear meanings. In addition, when faced with short text, the topic distributions tend to become sparse. Therefore, this paper proposes an improved topic model called LB-LDA, drawing on the BTM model proposed by Cheng et al. [18] in 2014 and the Labeled-LDA (L-LDA) model proposed by Ramage et al. [10] in 2009.

Definition of Biterm
Extending text is an effective way to mine latent topics from short texts. Following the BTM model [18], this paper uses biterms to expand texts. A "biterm" is an unordered pair of words that co-occur in the same short text. For instance, if a short text contains four words {w_1, w_2, w_3, w_4}, the biterms are {(w_1, w_2), (w_1, w_3), (w_1, w_4), (w_2, w_3), (w_2, w_4), (w_3, w_4)}. Therefore, the number of biterms in one short text is C_n^2 = n(n - 1)/2, where n is the number of words in the text.
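As a minimal sketch of this definition, the biterms of a short text can be enumerated with the Python standard library. The function name biterms is an illustrative choice, and the input is assumed to be an already segmented list of words.

```python
from itertools import combinations
from typing import List, Tuple


def biterms(words: List[str]) -> List[Tuple[str, str]]:
    """Return every unordered pair of word positions in one short text."""
    # combinations() yields each pair once, in order of first appearance.
    return list(combinations(words, 2))


pairs = biterms(["w1", "w2", "w3", "w4"])
```

For the four-word example, the call yields six pairs, in agreement with C_4^2 = 6.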

LB-LDA Model Description
Suppose we are given a corpus of M documents denoted by C = {d_1, d_2, ..., d_M}, containing V terms denoted by W = {w_1, w_2, ..., w_V}. The corpus has K topic labels, expressed as T = {l_1, l_2, ..., l_K}. For a document d_m = {w_1, w_2, ..., w_r}, the topic labels are denoted by T_m = {t_1, t_2, ..., t_K} with t_k ∈ {0, 1}, where t_k indicates whether the k-th topic label in T is present in the document. For example, if the 1st, 3rd, and 7th topic labels exist in d_m, then t_1, t_3, and t_7 in T_m are set to 1 and the rest are set to 0. Based on Section 3.2.1, d_m can be enlarged to its set of C_r^2 biterms. Let α and β be the hyper-parameters. Similar to the LDA model, the LB-LDA model is a three-layer topic model comprising a document layer, a latent topic layer, and a word layer. In contrast to the traditional LDA model, the two words in one biterm (w_p, w_q) (p ≠ q) share one latent topic label, and the topics in the latent topic layer have definite meanings. The graphical representation of the generative process is shown in Figure 4 and described as follows.
For each topic label k ∈ {1, 2, ..., K}, generate a topic-word distribution ϕ_k.

LB-LDA Model Inference
In text mining, documents and words are observed, while the distributions are hidden. Therefore, the parameter distributions θ and ϕ need to be estimated. As in LDA, the Gibbs sampling algorithm is used to estimate them. The two words of a biterm share the same latent topic label. If the latent topic labels of all other biterms are known, Equation (2) can be used to estimate the probability of this biterm under each topic label. The meaning of each element in Equation (2) is shown in Table 1.
Table 1. The meaning of each element in Equation (2).

N d,i
The number of biterms in document d
The number of biterms in document d whose topic label is k
The number of words in the corpus whose topic label is k, excluding this word
The number of occurrences of word w_{i,1} in the corpus whose topic label is k, excluding this word
The number of occurrences of word w_{i,2} in the corpus whose topic label is k, excluding this word

The Gibbs sampling procedure is used to update each biterm's latent topic label. First, topic labels are randomly assigned to each biterm in the corpus. In every iteration, the elements in Table 1 are updated, and then Equation (2) is used to update each biterm's topic label. Sampling stops when the specified number of iterations is reached. The Gibbs sampling procedure is shown in Algorithm 1.

Input: Enlarged corpus C', topic label set T, hyper-parameters

The equations used to estimate the parameters θ and ϕ are shown as Equations (3) and (4). θ is an M × K matrix that represents the topic distribution of each document. ϕ is a K × V matrix that represents the word distribution of each topic label.
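To make the sampling loop concrete, the following Python sketch implements a collapsed Gibbs sampler for a labeled biterm model. It is not a transcription of Equations (2) to (4): the conditional probability and the estimators of θ and ϕ used here are assumptions modeled on the usual BTM and Labeled-LDA forms, with each biterm restricted to the labels of its own document. The function name lb_lda_gibbs, the integer word and label ids, and the default hyper-parameter values are all hypothetical, and the case of a biterm whose two words are identical is not treated specially.

```python
import numpy as np


def lb_lda_gibbs(docs, labels, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """docs[d]: list of biterms (w1, w2) as word ids; labels[d]: allowed label ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # biterms in document d with label k
    n_kw = np.zeros((K, V))           # occurrences of word w with label k
    n_k = np.zeros(K)                 # words with label k
    z = []

    def update(d, k, w1, w2, delta):
        n_dk[d, k] += delta
        n_kw[k, w1] += delta
        n_kw[k, w2] += delta
        n_k[k] += 2 * delta

    # Random initialization, restricted to each document's own labels.
    for d, bts in enumerate(docs):
        z.append([])
        for w1, w2 in bts:
            k = int(rng.choice(labels[d]))
            z[d].append(k)
            update(d, k, w1, w2, +1)

    for _ in range(iters):
        for d, bts in enumerate(docs):
            allowed = np.asarray(labels[d])
            for i, (w1, w2) in enumerate(bts):
                update(d, z[d][i], w1, w2, -1)  # exclude the current biterm
                # Assumed conditional (standard BTM-style form).
                p = ((n_dk[d, allowed] + alpha)
                     * (n_kw[allowed, w1] + beta)
                     * (n_kw[allowed, w2] + beta)
                     / ((n_k[allowed] + V * beta)
                        * (n_k[allowed] + 1 + V * beta)))
                k = int(rng.choice(allowed, p=p / p.sum()))
                z[d][i] = k
                update(d, k, w1, w2, +1)

    # Assumed estimators: theta is normalized over each document's labels only.
    mask = np.zeros_like(n_dk)
    for d, ls in enumerate(labels):
        mask[d, ls] = 1.0
    theta = (n_dk + alpha) * mask
    theta /= theta.sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_k[:, None] + V * beta)
    return theta, phi


docs = [[(0, 1), (0, 2), (1, 2)], [(3, 4)]]
labels = [[0, 1], [2]]
theta, phi = lb_lda_gibbs(docs, labels, K=3, V=5, iters=5)
```

In this sketch θ has shape M × K and ϕ has shape K × V, as in the text, and every row of both matrices sums to 1.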

Topic Similarity Calculation Method on Cross Social Networks
The distributions of words under different topics can be calculated by LB-LDA. Topics can then be clustered by the similarity of these distributions.

Topic-Word Distribution Dimension Reduction Strategy
The dimension of the topic-word distribution ϕ is K × V, where K is the number of topic labels and V is the number of terms. In practice, V is very large, which makes subsequent calculations difficult. Therefore, the dimensionality of ϕ needs to be reduced. In each topic, the words that appear with high frequency are usually limited to a small subset. Therefore, for each topic, words are sorted by probability and the first X words are chosen as the high-frequency words of that label. After dimension reduction, the reduced matrix ϕ' has the format shown in Figure 5, and its dimension is K × X.
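The reduction step can be sketched with NumPy as follows. The function name top_x_words and the representation of each entry as a (word, probability) pair are assumptions made for illustration; the toy matrix is not data from the experiments.

```python
import numpy as np


def top_x_words(phi, vocab, x):
    """Keep the x most probable words of every topic as (word, probability) pairs."""
    # Sort each row in descending order of probability and keep the first x columns.
    order = np.argsort(-phi, axis=1)[:, :x]
    return [[(vocab[j], float(phi[k, j])) for j in order[k]]
            for k in range(phi.shape[0])]


phi_toy = np.array([[0.1, 0.6, 0.3],
                    [0.5, 0.2, 0.3]])
reduced = top_x_words(phi_toy, ["a", "b", "c"], 2)
```

Each row of the result holds X pairs, so the reduced structure has K rows and X columns.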

Figure 5. The format of topic distribution after dimension reduction.

Topic Similarity Calculation Method
Jensen-Shannon (JS) divergence is often used to measure the discrepancy between distributions. In general, for two probability distributions P and Q, the value of JS divergence lies between 0 and 1. Because the elements of the matrix ϕ' are two-tuples, the JS calculation needs to be adapted. When two different words in P and Q (both taken from ϕ') come from the same document, the two words belong to similar latent topics, and they are treated as the same word in the JS divergence calculation. The improved JS divergence formula is shown in Equation (5).
Calculating the JS divergence between every pair of distributions in ϕ' by Equation (5) yields a K × K matrix S. S is a symmetric matrix with a zero diagonal, and S[i][j] represents the JS divergence between the i-th and the j-th topic labels in the topic label set T.
Moreover, the more similar the two distributions, the larger the value of JS divergence. Define a matrix called Distance to measure the distance between any two topics in the topic label set T. The size of Distance is K × K. The construction method of the Distance matrix is shown in Equation (6). The smaller the distance between two topic distributions, the more similar the two distributions are.
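For orientation, the sketch below computes the standard JS divergence with a base-2 logarithm and fills the pairwise matrix S. It does not implement the improved formula of Equation (5) or the Distance construction of Equation (6); it also assumes that all rows are probability vectors over one shared vocabulary, which is a simplification of the two-tuple format of ϕ'. The function names are illustrative. Under this standard definition, identical distributions give 0 and distributions with disjoint support give 1.

```python
import numpy as np


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence (base-2 logarithm), in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL divergence.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_matrix(phi):
    """Symmetric K x K matrix of pairwise JS divergences with a zero diagonal."""
    K = phi.shape[0]
    S = np.zeros((K, K))
    for i in range(K):
        for j in range(i + 1, K):
            S[i, j] = S[j, i] = js_divergence(phi[i], phi[j])
    return S


S = js_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]))
```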

Hot Topic Community Discovery Method
Based on a specified standard, the clustering method divides mass data into clusters according to the degree of similarity. For example, the distance can be considered as a standard. The data located in one cluster are as similar as possible, and the data in two different clusters tend to be more different.

Topic Clustering Method
Clustering algorithms can be divided into partition-based clustering, density-based clustering, layer-based clustering, graph theory-based clustering, etc. All of these methods can be applied to cluster topics.
Each topic in the topic label set T can be considered as a point in a graph. We know the distance between every pair of points but not the coordinates of each topic point. For some clustering methods, such as K-means, this is not enough to compute topic clusters. Under these circumstances, the Multidimensional Scaling (MDS) algorithm is used to obtain the coordinates. The MDS algorithm was proposed by Torgerson in 1958 [23], and its core idea is to display high-dimensional data in a low-dimensional space. With this algorithm, the coordinates of the topic points can be obtained from the Distance matrix.
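A minimal sketch of classical (Torgerson) MDS is given below to show how coordinates can be recovered from a distance matrix alone. It uses the textbook double-centering and eigendecomposition steps; the function name classical_mds, the target dimension of 2, and the toy points are assumptions for illustration, and the paper does not state which MDS variant or implementation was used.

```python
import numpy as np


def classical_mds(D, dims=2):
    """Embed n points in `dims` dimensions from an n x n distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double centering of the squared distances gives a Gram matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Keep the eigenvectors of the largest eigenvalues.
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dims]
    vals = np.clip(vals[idx], 0.0, None)  # guard against small negative values
    return vecs[:, idx] * np.sqrt(vals)


points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D_toy = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(D_toy, dims=2)
```

The recovered coordinates are only determined up to rotation, reflection, and translation, which is sufficient for distance-based clustering such as K-means.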

Hot Topic Community Calculation Method
Using the MDS algorithm and various clustering algorithms (K-means, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), etc.), topic communities can be obtained. Suppose that P topic clusters, denoted by Cluster = {C_1, C_2, ..., C_P}, have been obtained. Each C_p in Cluster is a topic community and contains a variable number of topic labels.
When defining a hot topic community, two factors ought to be considered: the number of topic labels in the community and the frequency of each topic label. Therefore, Equation (7) is defined to calculate the topic community score. In the equation, for each topic label l in cluster C_p, doc_m denotes a document containing the topic label l, and label_nums_m is the number of labels in doc_m. There may be one or more such doc_m.
The communities with higher scores are chosen as hot topic communities.
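The scoring idea can be illustrated with the sketch below. The exact form of Equation (7) is not reproduced here; the code assumes one plausible reading of the description, in which every document containing a label l of the community contributes 1 / label_nums_m to the community score. The function name community_scores and the toy labels are hypothetical.

```python
from typing import List, Sequence


def community_scores(clusters: Sequence[Sequence[str]],
                     doc_labels: Sequence[Sequence[str]]) -> List[float]:
    """Score each topic community under the assumed 1 / label_nums_m weighting."""
    scores = []
    for cluster in clusters:
        members = set(cluster)
        score = 0.0
        for labels in doc_labels:
            unique = set(labels)
            if not unique:
                continue
            # Each label of the community found in the document adds
            # 1 / (number of labels in that document).
            hits = sum(1 for label in unique if label in members)
            score += hits / len(unique)
        scores.append(score)
    return scores


clusters = [["quake", "tsunami"], ["film"]]
doc_labels = [["quake", "tsunami"], ["quake", "film"], ["film"], ["quake"]]
scores = community_scores(clusters, doc_labels)
```

With this weighting, a community scores higher both when it has more labels and when its labels occur in more documents, which matches the two factors named in the text.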

Experiment and Results
Data are collected from three different social networks: Tencent QQ Zone, Sina Weibo, and Netease News. Based on the method proposed in Section 3, an experiment is carried out, and some hot topic communities are obtained.

Cross Social Network Dataset
The experimental data were collected from Tencent QQ Zone, Sina Weibo, and Netease News. All of them come from earlier collections made in our laboratory and cover the year 2011. The data items in Tencent QQ Zone contain user ID, release time, and content, as shown in Figure 6. The data items in Sina Weibo contain user ID, release time, and content, as shown in Figure 7. The data items in Netease News contain news title, release time, news source, and content, as shown in Figure 8.

Unified Model Building
The collected data are complex and take different forms, so it is necessary to establish a unified model. Two items (time and content) are kept from the Tencent QQ Zone and Sina Weibo data, and three items (title, time, and content) are kept from the Netease News data.
Data preprocessing is required, including duplicate filtering, word segmentation, removal of stop words, and so on. The data to be preprocessed mainly include the content from the three sources and the titles of the Netease News data.
For data from Tencent QQ Zone and Sina Weibo, some entity words are chosen from the content as the labels. For data from Netease News, the title segmentation results are used as the labels. Mixing these three kinds of data yields the unified model. A record of this unified model is shown in Figure 9.

Since the data spans the whole of 2011, it is divided into four parts by quarter. The number of documents and other information are shown in Table 2. The number of documents differs little between quarters, and the numbers of texts from the different social networks are also close, so the data is reasonably balanced across social networks. However, Table 2 also shows that document length varies greatly, from fewer than 10 words to 1000-2000. To reduce the sparsity of the topic distributions, some documents are enlarged by the method proposed in Section 3.2.1: documents of fewer than 10 words are chosen for text expansion, and the others keep their original state.
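The quarterly split and the selection of documents for text expansion can be sketched as below. The function names, the `"expand"`/`"keep"` keys, and the sample documents are assumptions made for illustration; only the 10-word threshold and the division by quarter come from the text.

```python
from collections import defaultdict
from datetime import date

EXPANSION_LIMIT = 10  # documents with fewer than 10 words are expanded


def quarter_of(day: date) -> int:
    """Map a calendar date to its quarter (1 to 4)."""
    return (day.month - 1) // 3 + 1


def split_by_quarter(docs):
    """docs: iterable of (date, tokens) pairs.

    Returns {quarter: {"expand": [...], "keep": [...]}}.
    """
    parts = defaultdict(lambda: {"expand": [], "keep": []})
    for day, tokens in docs:
        kind = "expand" if len(tokens) < EXPANSION_LIMIT else "keep"
        parts[quarter_of(day)][kind].append(tokens)
    return parts


# Illustrative documents only.
docs = [
    (date(2011, 2, 3), ["spring", "festival", "gala"]),
    (date(2011, 10, 10), ["revolution"] * 12),
]
parts = split_by_quarter(docs)
```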

Topic Distribution Calculation Method
The process of obtaining topic distributions is shown in Figure 10. LB-LDA is applied to the short documents and L-LDA to the longer documents. Using the two topic models, the topic distributions of the corpus in each quarter can be obtained.
The Gibbs sampling algorithm of L-LDA is shown in Equation (8). In each sampling step of L-LDA, each word has its own latent topic label, rather than a word pair sharing one topic label. The meaning of the elements in Equation (8) is similar to that in Equation (2).

Comparisons with Other Topic Models
To demonstrate the effectiveness of LB-LDA in reducing the sparsity of topic distributions, a series of comparative experiments on different topic models is presented. JS divergence is chosen as the criterion for evaluating the sparseness of the topic distributions, and its calculation method is shown in Equation (5). For a group of distributions, the average JS divergence between every pair of distributions in the group is calculated. The experimental data is the text from the four quarters. We compare LB-LDA with three recent topic models, GPU-DMM [24], LF-DMM [25], and SeaNMF [26], and the results are shown in Figure 11. In Figure 11, the abscissa represents the quarters and the ordinate represents the average JS divergence value. The average JS divergence values of LB-LDA are generally larger than those of the other models, which means that LB-LDA generally performs better than the other three models in terms of sparsity reduction.
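The average pairwise JS divergence used as the criterion can be sketched as follows. This is a minimal illustration using the standard definition of JS divergence with base-2 logarithms; the base and the function names are assumptions, since Equation (5) may use a different convention, and the sample distributions are invented.

```python
import math
from itertools import combinations


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions of equal length."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def average_js(distributions):
    """Average JS divergence over every pair of distributions in the group."""
    pairs = list(combinations(distributions, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


# Illustrative topic distributions only.
group = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
avg = average_js(group)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint support), so a larger average means the distributions in the group are more clearly separated.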

Topic Distance Calculation Method
Firstly, the 20 words with the highest probability under each topic are selected as the high-frequency words of that topic label. Then the similarity between different topics is calculated by Equation (5), and the Distance matrix is formed by Equation (6). Each element of the Distance matrix describes the distance between two topics. K-means and DBSCAN need "topic coordinates" for clustering, so the MDS algorithm is applied to the Distance matrix to obtain coordinates before clustering. For hierarchical clustering and spectral clustering, the Distance matrix is used for clustering directly.
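One way to build such a distance matrix is sketched below. It is an illustration under stated assumptions, not the exact form of Equations (5) and (6): each topic is assumed to be a word-to-probability mapping, the top words are renormalized to sum to 1, and the base-2 JS divergence between the reduced topics is used directly as the distance. The function names and sample topics are hypothetical.

```python
import math


def top_words(topic, k=20):
    """Keep the k most probable words of a topic and renormalize them."""
    best = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in best)
    return {w: p / total for w, p in best}


def js(p, q):
    """JS divergence (base 2) between two word->probability dicts."""
    total = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = (pw + qw) / 2
        if pw > 0:
            total += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log2(qw / mw)
    return total


def distance_matrix(topics, k=20):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    reduced = [top_words(t, k) for t in topics]
    n = len(reduced)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = js(reduced[i], reduced[j])
    return dist


# Illustrative topics only.
topics = [{"a": 0.6, "b": 0.4}, {"a": 0.6, "b": 0.4}, {"c": 1.0}]
D = distance_matrix(topics)
```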

Evaluation Standard
The Silhouette Coefficient was proposed by Peter J. Rousseeuw in 1986 and is an evaluation standard for clustering algorithms. For element i in cluster C, the average distance between i and the other elements in C is called the cohesion degree, denoted by a(i). The average distances between i and the elements of each of the other clusters constitute a set B = {b_i1, b_i2, ...}, and the minimum value is chosen as the coupling degree, denoted by b(i). The calculation method of the Silhouette Coefficient of element i is shown in Equation (9).

s(i) = (b(i) − a(i)) / max{a(i), b(i)}    (9)

The average of the Silhouette Coefficients of all samples is defined as the Silhouette Coefficient of the current clustering algorithm. The value of the Silhouette Coefficient is between −1 and 1. The closer the value is to 1, the better the corresponding clustering method works; conversely, a low value indicates that the clustering method works poorly.
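The computation can be sketched directly from a distance matrix and a list of cluster labels. This is a minimal illustration of the standard Silhouette Coefficient; the function name, the convention that an element alone in its cluster scores 0, and the toy points are assumptions.

```python
def silhouette(dist, labels):
    """Mean Silhouette Coefficient from a distance matrix and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                      # singleton cluster: s(i) taken as 0
            scores.append(0.0)
            continue
        # a(i): average distance to the other elements of the same cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b(i): smallest average distance to the elements of another cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Four points on a line, forming two obvious groups (illustrative only).
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouette(dist, [0, 0, 1, 1])   # groups match the data
bad = silhouette(dist, [0, 1, 0, 1])    # groups cut across the data
```

Because the function takes a precomputed distance matrix, it can be applied to the topic Distance matrix regardless of which clustering algorithm produced the labels.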

Comparison of Different Clustering Methods
Figure 12 shows the Silhouette Coefficients of the different clustering algorithms in each quarter. In each subgraph, the abscissa represents the number of clusters and the ordinate represents the value of the Silhouette Coefficient. Generally, the Silhouette Coefficients of the spectral clustering algorithm are around 0.9, which indicates that this algorithm performs best. The Silhouette Coefficients of K-means are around 0.4, which shows that K-means performs less well. The Silhouette Coefficients of DBSCAN and hierarchical clustering are around −0.3, which indicates that neither algorithm is a good choice for our model. In addition, the number of clusters in DBSCAN is generated automatically, so its Silhouette Coefficient is independent of the value of the abscissa.

Hot Topic Community Results and Analysis
Figure 13 shows the hot topic communities obtained by the spectral clustering algorithm, which performs best among the four clustering algorithms. For better display, the 10 most frequently occurring topic labels are shown for each hot topic community. Table 3 lists some of the topic labels in the hot topic communities.

The hot topic communities in the first quarter are mainly focused on entertainment topics because of the New Year and the Spring Festival. In the second quarter, the earthquake in Japan becomes the focus of attention. In the third quarter, hot topics are mainly focused on social events, such as "Libyan Qaddafi arrested", "Guo Meimei incident", and "Rainstorm in Guangdong". In the fourth quarter, the 100th anniversary of the 1911 Revolution becomes a new hot topic.
To verify our topic discovery results, we consulted the hot news of 2011 summarized by the Xinhua News Agency (http://www.xinhuanet.com/2011xwfyb/). Some of the news items are shown in Table 4. The news items "The Ningbo-Wenzhou railway traffic accident", "The celebration of 1911 Revolution", "Earthquake happened in Japan", "Gaddafi was captured and killed", "NATO bombs Libya", etc. have been discovered in our hot topic communities. The matching topics in Table 3 and the related events in Table 4 are shown in bold. In the first quarter, we find no hot topic communities related to the hot news. We think this is because the hot topics that we find in that quarter are generally related to the Spring Festival, which cannot really be considered annual news. In reality, however, the Spring Festival must be a hot topic in the first quarter in China.
To verify the effectiveness of cross social networks, we conducted an experiment on each social network separately. Considering that the data volume of each social network is not large, we did not divide it into quarters as we did for the cross social network data. The resulting hot topics are shown in Table 5, and the topics that also appear in the cross social network result are bolded. As we can see, the hot topics from each social network are part of the hot topics from cross social networks. The hot topics from each social network also contain topics that are not mentioned in our previous result, because these topics are hot within that social network but cannot be regarded as hot topics across social networks. Sina Weibo and Netease News contain more hot topics, and QQ Zone contains fewer. This is because hot topics are usually associated with major events: information from Sina Weibo and Netease News usually relates to these events, whereas data from QQ Zone is usually associated with daily life. Compared with daily life, social events are more likely to become hot topics. The result shows that our cross social network method is effective.

Conclusions and Future Work
In this paper, a hot topic community discovery method on cross social networks is proposed. By building a unified data model across social networks, the improved LB-LDA topic model and clustering algorithms are used to discover hot topic communities. Using this method, the hot topic communities in the 2011 data from three social networks, Tencent QQ Zone, Sina Weibo, and Netease News, are obtained. A number of hot topic communities, including "The Ningbo-Wenzhou railway traffic accident", "The celebration of 1911 Revolution", "Earthquake happened in Japan", "Gaddafi was captured and killed", and "NATO bombs Libya", can be found in the hot news summarized by the Xinhua News Agency. The results show that our model is effective and has certain application value. Furthermore, the hot topics from each social network are part of the results from cross social networks, which shows that discovering hot topics from cross social networks is effective.
In the future, we will collect more comprehensive and up-to-date data from more Chinese websites such as Zhihu and Toutiao. It is also feasible to collect data from English-language social networks such as Twitter, Instagram, and Facebook. Furthermore, popular opinion can be analyzed on the basis of the hot topic communities. We can also obtain hot topic communities from data collected at different locations and times; from these communities, the hot topics in different locations and the evolution of hot topics can be analyzed. Moreover, this method can help the government to obtain a comprehensive understanding of public opinion and develop solutions to urgent problems.

Figure 1. General process of our model.
Figure 3 tells us that Weibo information generally includes user ID, Weibo time, Weibo source, and Weibo content. In this paper, we only take text content into consideration.

Figure 4. Generation process of LB-LDA.
In this procedure, we need to explain the calculation method of Λ_d and L_d; this part mainly refers to L-LDA. For each document d', we first get the Bernoulli distribution Λ_d. Then we define the vector of the document's labels to be λ_d = {k | Λ_d^(k) = 1}. Next, we define a document-label matrix L_d of size M_d × K, in which M_d = |λ_d|. The elements of L_d are set as in Equation (1), where i indexes the rows, i ∈ {1, ..., M_d}, and j indexes the columns, j ∈ {1, ..., K}.

... times iter
Output: Document-topic distribution θ, topic-word distribution ϕ
1 Initialize the topic label of each biterm randomly
2 For iter_times = 1 to iter
3     For each document d' in corpus C' do
4         For each biterm b in document d' do
5             Calculate the probability of each topic label by Equation (1)
6             Sample b's topic label based on the result of step 5
7 Calculate θ and ϕ based on Equations (2) and (3)

Figure 5. The format of topic distribution after dimension reduction.

Future Internet 2019, 11, x FOR PEER REVIEW
... items in Netease News contain news title, release time, news source, and content, which is shown in Figure 8.

Figure 8. Data item in Netease News.

Figure 9. Single data in unified model.

Figure 10. The process of obtaining topic distribution.

Figure 11. The JS divergence comparison among different topic models.

4.3. Hot Topic Community Discovery

4.3.1. Cluster Method
In Section 3.4.1, this paper mentions four clustering methods: partition-based clustering, density-based clustering, layer-based clustering, and graph-based clustering. The representative algorithms (K-means, DBSCAN, hierarchical clustering, and spectral clustering) are chosen to obtain topic clusters.

Figure 12. The Silhouette Coefficients of different clustering algorithms.

Figure 13. Hot topic community results in different quarters.
, which indicates the existence of the topic labels contained in the current text in the topic label set T. For example, if the 1st, 3rd, and 7th topic labels exist in d_m, then in the vector T_m the digits t_1, t_3, and t_7 are set to 1, and the rest are set to 0. Based on t_k ∈

Table 2. Information in unified model.

Table 5. Hot topics from each social network.

Table 3. Part of frequently occurring topic labels.

Table 4. Hot news summarized by the Xinhua News Agency.

Wenzhou railway traffic accident | Gaddafi captured and killed | The celebration of 1911 Revolution | NATO bombs Libya
The most restrictive property market in history | The death of Steve Jobs | Tiangong-1 successfully launched | US Army kills Bin Laden