GTIP: A Gaming-Based Topic Influence Percolation Model for Semantic Overlapping Community Detection

Community detection in semantic social networks is a crucial issue in online social network analysis, and has received extensive attention from researchers in various fields. Most conventional methods discover semantic communities based merely on users’ preferences towards global topics, ignoring the influence of the topics themselves and the impact of topic propagation on community detection. To better cope with such situations, we propose a Gaming-based Topic Influence Percolation model (GTIP) for semantic overlapping community detection. In our approach, community formation is modeled as a seed expansion process. The seeds are individuals holding high-influence topics, and the expansion is modeled as a modified percolation process. We use the concept of payoff in game theory to decide whether neighbors accept the passed topics, which is more in line with the real social environment. We compare GTIP with four traditional (GN, FN, LFM, COPRA) and seven representative (CUT, TURCM, LCTA, ACQ, DEEP, BTLSC, SCE) semantic community detection methods. The results show that our method is closer to the ground truth in synthetic networks and has a higher semantic modularity in real networks.


Introduction
In recent years, with the rapid development of mobile internet technology and the continuous popularization of mobile terminal devices, social platforms such as Micro-blog, WeChat, QQ, SNS, RSS, etc., have profoundly changed social interaction. People can join or set up their own communities and update their status in the form of text, pictures, and videos to realize the sharing, dissemination, and acquisition of personal information. According to statistics from comScore, Inc. (Reston, VA, USA), as of 2018, an average of 395,833 people logged in to WeChat per minute and 19,444 people were engaged in video or voice chat; Sina Micro-blog sent or forwarded 64,814 microblogs per minute; Facebook users shared an average of four billion dynamic items of information per day; Twitter processed 340 million items of data per day; Tumblr authors published an average of 27,000 new posts per minute; and Instagram users shared an average of 3600 photos per day. Facing this explosion of social media data, the traditional topological space of social networks is shifting towards a rich semantic form, which poses great challenges for the detection of social network communities.
Community detection can effectively improve the performance of social application systems. For example, by analyzing the social behavior patterns of network users and detecting the audience groups of social services, the commercial value of advertising and product marketing can be significantly improved [1]. Han et al. [2] used community detection to realize information transfer between networks and solved the cold start problem of recommendation systems caused by network sparsity. In addition, community detection is widely used in network embedding [3], public health [4], and link prediction [5].
In conventional community detection methods, the network is represented as a topology graph and the nodes do not contain semantic information. Representative methods in this field include the GN (Girvan-Newman) algorithm [6], FN (Fast Newman) algorithm [7], CPM (Cluster Percolation Method) algorithm [8], and Louvain algorithm [9]. In recent research, Qiao et al. [10] proposed Picaso, a parallel community discovery model which uses the Mountain model to calculate the weight of each edge in the network and applies a gradient algorithm to discover the community structure. To solve the problem of community detection in large-scale complex networks, Lu et al. [11] proposed an improved label propagation algorithm using node importance ranking. Lyzinski et al. [12] embedded graphs in Euclidean space to obtain a lower-dimensional representation, then used non-parametric graph reasoning technology to identify the structural similarity between communities. This method performed well in detecting fine-grained community structures. Tagarelli et al. [13] integrated multi-layer network community modularity, which retains multi-layer topology information and optimizes the edge connectivity of multi-relational communities.
In semantic community detection tasks, the nodes are the basic components of the topology graph as well as the carriers of semantic information which leads to fundamental changes in the community's form [14]. For example, after considering the document attributes of nodes, the common topics between nodes play a decisive role in the formation of the community. Two people who share a common topic may join the same community even if they do not have a strong connection in the topology graph [15]. Therefore, the use of semantic information to analyze the correlation between network nodes has become a critical issue in this field.
The Probabilistic Topic Model (PTM) is a common semantic representation method used for social network nodes [16]. For example, Xin et al. [17] defined the semantic feature of nodes according to the similarity between user documents and a set of global topics, then adopted multi-sampling to accelerate the convergence of the algorithm. He et al. [18] transformed LDA (Latent Dirichlet Allocation) and Markov Random Field (MRF) into a unified factor graph to form an end-to-end learning system for community detection, then derived an effective propagation algorithm to train their parameters. Jin et al. [19] stated that links in the network contain semantic information as well. They proposed a new probabilistic model for link community detection, and developed a dual nested Expectation Maximum (EM) algorithm to learn the model. Wang et al. [20] found that there are correlations between topics which significantly affect community structures. They proposed a Topic Correlations-based Community Detection (TCCD) model which can simultaneously output the community structure and the semantic interpretation of nodes. Node attributes can be used to address semantic data as well; for example, Fang et al. [21] grouped nodes that satisfied both structure cohesiveness and keyword cohesiveness into the same community.
Non-negative Matrix Factorization (NMF) has good performance in discovering implicit patterns from high-dimensional data. Therefore, scholars have integrated semantic information into the adjacency (or feature representation) matrix and used NMF to analyze the correlation between nodes. For example, Pei et al. [22] proposed a clustering framework based on Non-negative Matrix Tri-Factorization (NMTF) which can effectively identify both user similarity and message similarity. Qin et al. [23] introduced an adaptive parameter to control the contribution of the network topology and content information and use NMF to discover semantic communities. Wang et al. [24] set the member matrix and attribute matrix as two groups of parameters of NMF, which allows semantic interpretation for the communities to be added. Yang et al. [25] introduced an adaptive weighted group for sparse low-rank regularization in NMF in order to automatically obtain the number of semantic communities.
Deep learning has a natural advantage in attribute representation of high-dimensional data; thus, researchers have begun to introduce semantic attributes into the feature dimension of deep learning models [26]. For example, Jin et al. [27] proposed a uniformed graph representation of network topology and semantic information and developed a multi-component network embedding approach via a deep autoencoder. Cao et al. [28] designed a combination matrix consisting of a modularity matrix for linkage information and a Markov matrix for content information. After matrix factorization, the matrix is used as the input of the multi-layer deep auto-encoder framework for obtaining the deep representation of the graph. Jin et al. [29] proposed that the words in user documents have a hierarchical structure. They proposed a new Bayesian probability model which can explain the multiplex semantic community more clearly. He et al. [30] developed a co-learning strategy to jointly train the structure and semantic parts of the model by combining a nested EM algorithm and belief propagation.
While the above methods have made a great many exploratory contributions to the field of semantic community detection, there are several remaining deficiencies: (1) When measuring the semantic relevance between nodes, each topic receives the same status without considering the difference of topic influence. (2) There has been little exploration of the impact of topic propagation and influence propagation in community detection. (3) Methods based on deep learning require a large number of samples, high computational performance, and long training times. When the network evolves rapidly, these methods cannot meet the online requirements of social systems.
To better cope with these situations, and inspired by the information dissemination in social networks, we propose a user topic influence propagation model based on percolation theory that uses the Nash equilibrium to generate communities in a game-based way. Experiments with real social networks show that the proposed method has a high semantic modularity [17] in social networks with rich semantic attributes. In addition, the algorithm can converge in a short time without additional training. In summary, the contributions of this paper include: (1) Integrating topic influence into the correlation analysis of nodes, which makes the community detection process conform to the law of information dissemination in social networks. (2) A proposed one-dimensional diffusion model in percolation mechanics that can quantify the propagation of topic influence, which in turn can describe the impact of nodes near the topic source in the semantic space more accurately and solve the situation in which high-influence nodes in the network present a low influence score. (3) Use of the Nash equilibrium from game theory to generate communities, thereby identifying overlapping and non-overlapping communities at the same time and identifying community structures with smaller granularity.

LDA Representation of Nodes
The semantic space representation of nodes is generated based on LDA, a three-tier Bayesian probability model used for document-topic generation, including words, topics, and documents. LDA considers documents to be composed of topics, and each topic can be presented with a set of keywords. For example, technology topics have a high probability of containing the keywords: "Chip" and "Artificial Intelligence". The probability distribution of the document on each topic shows the relevance of the document to each topic. The mathematical symbols involved in LDA are shown in Table 1.
The LDA vector is stored as a triplet (w, d, z), where w_i, d_i, and z_i are the keyword number, the node number, and the topic number of keyword i, respectively [31]. Figure 1 shows the data storage structure of the LDA vector, in which the shaded part represents the same elements in the vector. For example, w_i1 = w_i2 = w_i4 = w_i5 indicates that w_i1, w_i2, w_i4, w_i5 are the same word; d_i1 = d_i3 = d_i5 = d_i6 indicates that w_i1, w_i3, w_i5, w_i6 are keywords of the same node d_i1, and the keyword w_i1 appears twice in d_i1. Additionally, z_i1 = z_i2 = z_i6 indicates that z_i1, z_i2, z_i6 belong to the same topic z_i1, the keyword w_i1 appears twice in z_i1, and z_i1 belongs to d_i1 and d_i2, respectively. According to [31], the mathematical descriptions of w, d, z are as follows: (1) θ ∼ Dir(α): the topic distribution θ of nodes follows the Dirichlet distribution (noted as Dir in the formula) with parameter α.
(2) z_i | θ^(d_i) ∼ Multinomial(θ^(d_i)): the probability of topic z_i in node d_i under topic distribution θ follows the Multinomial distribution (noted as Multinomial in the formula). (3) λ ∼ Dir(β): the keyword distribution follows the Dirichlet distribution with parameter β.
(4) w_i | z_i, λ^(z_i) ∼ Multinomial(λ^(z_i)): the probability of keyword w_i in topic z_i under keyword distribution λ follows the Multinomial distribution.
To generate the LDA model, the first step is to extract the distribution of keywords satisfying λ ∼ Dir(β). Next, the topic distribution is extracted for each document in the corpus, satisfying θ ∼ Dir(α). Finally, for each keyword, topics and keywords are further extracted to satisfy z_i | θ^(d_i) ∼ Multinomial(θ^(d_i)) and w_i | z_i, λ^(z_i) ∼ Multinomial(λ^(z_i)), respectively.

Table 1. Description of notations.

z_i — The topic number to which w_i belongs
θ — The topic distribution probability of node i
λ — The distribution of keywords in topic j
B_i,j — The probability that w_i belongs to topic j
α — Prior parameter of topic distribution for each node
β — Prior parameter of keyword distribution within a topic

Gibbs Iterative Process
In statistics, Gibbs sampling is a Markov chain Monte Carlo (MCMC) algorithm used to approximately draw a sample sequence from a multivariate probability distribution when direct sampling is difficult. The key is to establish a posterior estimate for the sample and perform Gibbs sampling on the posterior estimate expression.
The expression of the Bayesian relation between z and w is P(z | w) = P(z, w)/P(w). After transformation, we obtain the conditional distribution P(z_i = j | z_−i, w) of Equation (3). The process of Gibbs sampling is as follows: (1) z_i is initialized as a random integer between 1 and K (i = 1, 2, ..., N), which is the initial state of the Markov chain.
(2) According to the literature [32], the right side of Equation (3) can be expanded; therefore, we have

P(z_i = j | z_−i, w) ∝ (f^(w_i)_−i,j + β)/(f^(·)_−i,j + |w|β) · (f^(d_i)_−i,j + α)/(f^(d_i)_−i,· + |z|α)   (6)

In Equation (6), |w| and |z| denote the number of keywords and topics, respectively; f^(w_i)_−i,j represents the number of words assigned to topic j that are the same as w_i; f^(·)_−i,j represents the number of words assigned to topic j; f^(d_i)_−i,j represents the number of words assigned to topic j in node d_i; f^(d_i)_−i,· represents the number of all the words in node d_i assigned to a topic; and z_i is updated iteratively according to Equation (6).
(3) When step (2) has been iterated enough times (i.e., when P(z_i = j | z_−i, w_i) converges), the process ends. We then normalize P(z_i = j | z_−i, w_i) to obtain the keyword-topic probability matrix B_i,j, where B_i,j = P(z_i = j | w_i) and Σ_j B_i,j = 1.
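The iterative update above can be sketched as a minimal collapsed Gibbs sampler. This is an illustrative sketch rather than the paper's implementation: the document-length denominator f^(d_i)_−i,· + |z|α is dropped from the sampling weights because it does not depend on the candidate topic j, and the hyperparameters, iteration count, and tiny corpus are arbitrary choices.

```python
import random

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA; docs is a list of lists of
    word ids. Each pass resamples every z_i from the Equation (6)-style
    weights (word-topic term x document-topic term)."""
    rng = random.Random(seed)
    V = 1 + max(w for doc in docs for w in doc)      # vocabulary size
    n_wk = [[0] * K for _ in range(V)]   # words equal to w assigned to topic j
    n_k = [0] * K                        # total words assigned to topic j
    n_dk = [[0] * K for _ in docs]       # words in doc d assigned to topic j
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_wk[w][t] += 1; n_k[t] += 1; n_dk[d][t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]          # remove current assignment (the "-i" counts)
                n_wk[w][t] -= 1; n_k[t] -= 1; n_dk[d][t] -= 1
                weights = [(n_wk[w][j] + beta) / (n_k[j] + V * beta)
                           * (n_dk[d][j] + alpha) for j in range(K)]
                t = rng.choices(range(K), weights=weights)[0]
                z[d][i] = t
                n_wk[w][t] += 1; n_k[t] += 1; n_dk[d][t] += 1
    # normalize word-topic counts into the keyword-topic matrix B (rows sum to 1)
    B = [[(row[j] + beta) / (sum(row) + K * beta) for j in range(K)]
         for row in n_wk]
    return z, B
```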

Semantic Feature Representation of Nodes
In a semantic social network G = (V, E, T), the node set V represents the users in the semantic social network, the edge set E represents the relationship between users, and T is the document collection, representing the text information published by users.
We used Gensim (a topic modeling toolkit in Python) to extract K topics from T as the basis of a K-dimensional semantic space. The coordinate m_i of node v_i (v_i ∈ V) in the semantic space can be expressed as the mean value of the keyword coordinates in the document t_i (t_i ∈ T) published by v_i, as shown in Equation (7):

m_i = (1/N_i) Σ_{j=1}^{N_i} B_{N_i,j},·   (7)
In Equation (7) , N i represents the number of keywords (the words with the highest cosine similarity to the topic that t i belongs to) in document t i , N i,j represents the j-th keyword in document t i , and B N i,j ,· represents the coordinate (expressed as the sequence of the cosine similarity between the j-th keyword and K topics) of the j-th keyword in document t i in the K-dimensional semantic space.
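Assuming B is the keyword-topic probability matrix from the Gibbs step, the averaging of Equation (7) can be sketched as follows; `node_coordinate` and `keyword_ids` are illustrative names, not from the paper.

```python
def node_coordinate(keyword_ids, B):
    """Semantic-space coordinate m_i of a node: the mean of the K-dimensional
    rows B[w] of the keywords w appearing in the node's document (Equation (7))."""
    K = len(B[0])
    m = [0.0] * K
    for w in keyword_ids:
        for k in range(K):
            m[k] += B[w][k]
    n = len(keyword_ids)
    return [x / n for x in m]
```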

Motivation
The flow of a fluid through porous media (soil voids or other permeable media) is called percolation. Each percolation source point contains a certain amount of substance, which diffuses into the areas of a finite space that have not yet been penetrated. In the example shown in Figure 2, the grid represents the percolation area. We assume that there are three percolation source points in the figure, labeled red, blue, and green. In a real percolation process, percolation occurs when the difference between the source point and the adjacent area reaches a threshold, which is measured by the point source function. In this example, we simply assume that the probability of percolation is 50%. After four infiltrations, the percolation state changes from Figure 2a to Figure 2b: the substance gradually penetrates from the three source points into the adjacent areas. Inspired by this, we propose to construct the semantic social network topic percolation equation using percolation theory. Our motivation stems from the following four perspectives. First, both fluid percolation and semantic percolation require adjacency to the infiltration area. Second, similar to fluid percolation, in semantic social networks, whether users receive topics from neighbors (i.e., semantic percolation) is subject to a threshold, which in this paper is measured by the payoff concept from game theory. Next, in both fluid percolation and semantic percolation, multiple source points percolate simultaneously, and this property can be simulated for community detection using a seed expansion strategy. Finally, all source points have the same status, which avoids the problem that nodes with less local influence cannot expand and promotes the formation of local communities. The differences between fluid percolation and semantic percolation are shown in Table 2.

Modeling Topic Influence
In this section, we construct the topic percolation differential equation; the symbols used are provided in Table 3. We propose the topic influence percolation strength to measure the capacity of topics to influence the percolation area. In our model, each node is a fixed-size solid sphere filled with unequal topic influence in the semantic space, and S has a virtual dimension [λγ^−1]. In the semantic space, the inner product m_i · m_j represents the semantic correlation between nodes v_i and v_j; the more similar the semantic coordinates of v_i and v_j are, the larger m_i · m_j is. We define Z_i→j = 1/(m_i · m_j) to represent the topic propagation space coordinate of node v_j with node v_i as the source point, which satisfies Z_i→i = 0, and Z_i→j → ∞ when m_i · m_j → 0. Table 3. Description of notations.

S — The topic influence percolation strength
λ — The dimension of topic influence
γ — The sphere volume
Z_i→j — The topic propagation space coordinate
D — The hops between the source point and the affected nodes
η_z — The percolation coefficient of topic propagation
κ_0 — The initial topic influence value of the source point

We design three rules to construct the percolation dynamics of topic influence, based on which the second-order partial differential equation of topic percolation is provided in Equation (8):

∂S/∂D = η_z · ∂²S/∂Z²   (8)

(1) The topic influence of a percolation source point is greatest at the initial state, and spreads outward with the percolation of topic influence.
(2) As the topic influence of the source point continuously penetrates into the surrounding area, the influence of the source point on other nodes becomes smaller.
(3) While the nodes under the influence of the source point absorb and weaken the topic influence of the source point, the influence of the topic contained in the source point is enhanced.
The initial condition of Equation (8) is as follows:

S(Z, 0) = κ_0 δ(Z)   (9)

Here, δ(Z) is the Dirac delta function, which satisfies the requirement that the value of the function (except at the source point a) be equal to 0 and its integral over the entire domain be equal to 1:

δ(Z) = ∞ for Z = 0, δ(Z) = 0 otherwise, with ∫ δ(Z) dZ = 1   (10)

Here, S(Z, 0) denotes the topic influence percolation strength when the distance between the source point and the affected node is 0. At this point, the influence is concentrated on the source point, S(Z, 0) = κ_0.
The boundary conditions of Equation (8) are as follows:

S → 0 and ∂S/∂Z → 0 as Z → ∞   (11)

Equation (11) indicates that S and the partial derivative of S with respect to Z become 0 when Z → ∞.
Because the partial differential equation is established from a physical phenomenon, we use Dimensional Analysis (DA) to solve Equation (8). The basic principle of DA is the Buckingham π theorem, which states that if the formula of a physical process contains n physical quantities and k of them have independent dimensions, then the formula can be transformed into an equivalent function containing n − k dimensionless numbers π_i composed of these physical quantities.
The topic influence percolation strength S is a function of κ, Z, D, and η_z. Suppose that F(S, κ, Z, D, η_z) = 0; then, the dimensions of S and κ are [λγ^−1] and [λ], respectively, and S is proportional to κ/√(η_z D). Using the Buckingham π theorem and selecting S, D, and η_z as the basic variables, we obtain Equation (12). Next, we determine the undetermined function f. Let the variable ψ = Z/√(4η_z D); combined with Equation (8), the boundary conditions of Equation (11) become conditions on f. After simplification, we obtain an ordinary differential equation in f with a constant c. By substituting Equation (8) into Equation (17), we have c = 0; therefore, the general solution of Equation (17) is f = ω_0 e^(−ψ²). According to the hypothesis, the topic influence of the source point is conserved; after the transposition of terms, we obtain Equation (20):

S(Z, D) = (κ_0/√(4πη_z D)) exp(−Z²/(4η_z D))   (20)

Equation (20) is a typical standard normal (Gaussian) function with the topic propagation space coordinate Z as the horizontal axis and the topic influence percolation strength S as the vertical axis. According to the mathematical properties of the standard normal function, the instantaneous influence of the source point follows a normal distribution along the Z direction at any D in the strength field of the one-dimensional unbounded semantic space. With increasing distance D, the peak value of the influence strength decreases while the range of affected nodes becomes wider, and the distribution curve flattens.
According to the 3σ principle, the probability of topic influence of each node outside (µ − 3σ, µ + 3σ) is less than 0.3%. Therefore, µ − 3σ < Z ≤ µ + 3σ can be regarded as the actual range of the random variable Z, and the topic influence of nodes is only valid within the range 3σ = 3√(2η_z D).
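As a sketch, assuming the Gaussian dependence S ∝ exp(−Z²/(4η_z D)) with the source influence κ as prefactor (this is the form the later worked example uses, e.g., S = 11.60 × exp{−0.125} for η_z = 0.5, D = 1, and an assumed Z = 0.5), the strength and the 3σ validity test can be written as:

```python
import math

def percolation_strength(kappa, Z, eta_z, D):
    """Topic influence percolation strength, assuming the form
    S = kappa * exp(-Z^2 / (4 * eta_z * D))."""
    return kappa * math.exp(-Z * Z / (4.0 * eta_z * D))

def within_3sigma(Z, eta_z, D):
    """Topic influence is treated as valid only within 3*sigma = 3*sqrt(2*eta_z*D)."""
    return abs(Z) <= 3.0 * math.sqrt(2.0 * eta_z * D)
```

With η_z = 0.5 and D = 1 this reproduces the worked-example values S_6,3 ≈ 10.237 and S_2,3 ≈ 6.296.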

The Game Process of Topic Influence Percolation
In social networks, each individual has free will and can decide whether to join a community after weighing the advantages and disadvantages, which is consistent with the behavior of players in game theory. In semantic social networks, users influence the people around them with their preferred topics and are influenced in turn by the topics held by others. When affected by different topics, people react differently: for high-influence topics that they prefer and that are hotly discussed by the public, they continue to track the progress of these topics and further spread them; otherwise, they pay no further attention. From the perspective of game theory, all social individuals are considered rational and selfish players who follow certain rules to join the semantic community with greater influence that is closer to their preferred topics, in order to maximize their payoffs and achieve Nash equilibrium.

Basic Elements
The basic elements of our game model are as follows.
(1) Players: all nodes except the seed nodes (unequilibrium nodes) in semantic social networks.
(2) Strategy P i : each player chooses a single strategy; P i = 1 (P i = 0) means that after being affected by the topic, node v i does (does not) spread the topic and joins (refuses to join) the community to which the topic belongs.
(3) Payoff U_i: in the percolation dilemma game model, the payoff of node v_i is defined as follows. Here, U_i(P_i, P_j) represents the payoff to v_i of spreading the topic from v_j, S_j,i represents the percolation strength of the topic from v_j to v_i, and ξ represents the topic percolation loss. The correlation between P_i and U_i is as follows: in a semantic social network, if there is a node with greater topic influence than node v_i in the percolation area, v_i is percolated by topic influence, and the percolation with smaller strength is covered by the percolation with higher strength. Otherwise, the influence percolation strength S_i of node v_i in this area is considered infinite. In this way, it is only necessary to calculate the payoffs of spreading the topics of the nodes that can percolate v_i, instead of calculating the payoffs of the global nodes. For faster calculation, the topic influence percolation strength S is stored in a max-heap (large root heap).
In Equation (23), the nodes only propagate one topic and join one community. However, communities in real semantic social networks generally overlap. If joining multiple communities can increase its payoff, a player joins multiple communities, although doing so also incurs a loss of payoff. For semantic overlapping communities, the payoff is defined as follows: here, ζ is the loss factor, |R(i)| represents the number of different topics spread by node v_i, and U_i(P_i, P_j) represents the payoff of v_i spreading only one topic. Obviously, spreading more topics results in a greater loss through ζ.
Players pursue the maximization of payoff as well as the maximization of efficiency. In general, the payoff of joining many communities is higher than that of joining a few; in certain cases, however, joining a small number of high-payoff communities can obtain payoffs equivalent to joining a large number of low-payoff communities. To maximize payoff and efficiency at the same time, we define a payoff satisfaction function ρ(i). Here, N_i represents the number of communities that node v_i has joined. When N_i = 1, ρ(i) is set to U_i/2 to prevent the initial payoff satisfaction of node v_i from being so large that it cannot join other communities. When N_i > 1, the payoff satisfaction is the average of the payoff function. If U_G(i) < ρ(i), joining the new community results in decreased payoff; in this case, v_i chooses strategy P_i = 0.
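The satisfaction rule can be sketched as follows; the function names are illustrative. With the numbers from the later worked example, ρ(3) = 10.237/2 = 5.1185 and U_G(3) = 8.266 > ρ(3), so the node joins the second community.

```python
def payoff_satisfaction(payoffs):
    """rho(i): half of the single payoff when only one community has been
    joined (N_i = 1), otherwise the average of the collected payoffs."""
    if len(payoffs) == 1:
        return payoffs[0] / 2.0
    return sum(payoffs) / len(payoffs)

def accepts_new_community(u_g, payoffs):
    """A player joins a further community only if the overlapping payoff
    U_G(i) does not fall below its current payoff satisfaction rho(i)."""
    return u_g >= payoff_satisfaction(payoffs)
```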

Selecting the Source Point
Random selection of the source point may result in percolation failure due to the low influence of the selected node and cause additional time cost. Based on the PageRank algorithm, a source point selection algorithm for topic influence maximization is proposed.
(1) Initialize seedSet, hashMap, and outlink[v i ], where seedSet stores the ranked topic influence, hashMap stores the feature pairs (node ID and topic in f luence), and outlink[v i ] is an array that stores the pointing nodes of v i .
(2) According to different transfer probabilities, each node percolates its influence to the nodes it points to. We construct a transfer matrix to represent the probability of influence passing from v_j to v_i, where M(i, j) is a weighted adjacency matrix entry with the formula shown in Equation (27).
If node v i points to node v j , the edge weight of arc (i, j) is m i · m j ; otherwise, the edge weight is 0.
(3) The influence of each node depends on the influence of the nodes that point to it. In the iteration process, we use vector vec to store the influence score of each node, which is updated based on Equation (28).
Here, α is the damping factor, which prevents the influence of any single node from becoming excessive, while τ/N is the self-restart vector, which establishes a transition probability for node pairs that have no direct link. Equation (28) is repeated until the entire network converges.
(4) We define conversion coefficient ε and multiply the influence score of each node by ε to obtain the topic influence κ, then update hashMap and seedSet. The pseudo-code of the ranking procedure is provided in Algorithm 1.
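Steps (2) and (3) can be sketched as a small weighted PageRank iteration. This is a sketch under stated assumptions: the damping value α = 0.85, restart weight τ = 0.15, uniform start vector, and normalization of each node's out-weights are illustrative choices not fixed by the text.

```python
def influence_scores(weights, alpha=0.85, tau=0.15, tol=1e-9, max_iter=500):
    """weights[j][i] is the edge weight from v_j to v_i (m_j . m_i if v_j
    points to v_i, else 0, in the spirit of Equation (27)). Each sweep
    applies vec = alpha * M @ vec + tau / N (Equation (28))."""
    n = len(weights)
    out = [sum(row) or 1.0 for row in weights]   # total out-weight per node
    vec = [1.0 / n] * n
    for _ in range(max_iter):
        new = [alpha * sum(weights[j][i] / out[j] * vec[j] for j in range(n))
               + tau / n for i in range(n)]
        if sum(abs(a - b) for a, b in zip(new, vec)) < tol:
            vec = new
            break
        vec = new
    return vec
```

On a symmetric network every node receives the same score, as expected for nodes of equal status.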

Game Rules for Overlapping Community Detection
Based on the topic influence percolation, we propose a game algorithm for overlapping community detection.
(1) A strategy combination is considered to be in Nash equilibrium if no player can increase their payoff by changing decisions unilaterally. In the initial stage, the nodes in the semantic social network are isolated, no payoff is generated, and all local communities are in a state of unequilibrium.
(2) The percolation is a local movement; therefore, choosing a reasonable propagation range (hops) can ensure both the effectiveness of the influence and the fast convergence of the algorithm. According to the 3σ principle applied to Equation (20), the topic propagation space coordinate Z satisfies µ − 3√(2η_z D) < Z ≤ µ + 3√(2η_z D). Here, Z_i→j = 1/(m_i · m_j), with m_i · m_j ∈ (0, 1). When m_i · m_j = 0.2, |Z|_max = 3√(2η_z D) = 5 and D_max = 3 (after rounding). The experiments in Section 5.3.1 show that the community quality decreases rapidly when D_max > 3. Therefore, to speed up the algorithm, we assume that there is no percolation between v_i and v_j when m_i · m_j < 0.2.
(3) Select nodes sequentially from the head of seedSet; if the node is marked as "divided" in hashMap, select new nodes from seedSet until the node is marked as "not divided", making it the source point of the percolation.
(4) For v_i within three hops of source point v_j, if v_i has not joined any community, calculate the non-overlapping payoff function U_i(P_i, P_j). If U_i(P_i, P_j) > 0, then v_i joins v_j's community, v_i is marked as "divided" in hashMap, and the hashMap element count is decreased by 1. If U_i(P_i, P_j) < 0, skip v_i and analyze the next node.
(5) If v_i has joined a community and is not in the same community as v_j, calculate the cosine similarity between v_j and the source point of v_i's community:

sim(m_seed(i), m_j) = (m_seed(i) · m_j)/(|m_seed(i)||m_j|)   (30)

Here, we use ς(v_i) to represent the community collection of v_i. If sim(m_seed(i), m_j) > 0.8, we merge ς(v_i) and ς(v_j). If sim(m_seed(i), m_j) ≤ 0.8 and the payoff is greater than the payoff satisfaction (U_G(i) ≥ ρ(i)), we add v_i to v_j's community; otherwise, we skip v_i and move to the next node.
(6) When performing an optimal strategy can improve its payoff, a node acts accordingly to achieve local Nash equilibrium. Next, we select further nodes from the seedSet to play the game until the whole network reaches Nash equilibrium.
(7) When the seedSet is empty and there are still elements marked "not divided" in the hashMap, we accelerate the convergence of the algorithm by randomly assigning these elements to the nearest community.
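The similarity test of rule (5) (Equation (30)) and the resulting merge/join/skip decision can be sketched as follows; `rule5_action` is an illustrative name, and the 0.8 threshold is the one given in rule (5).

```python
import math

def cosine_sim(mi, mj):
    """Equation (30): cosine similarity between two semantic coordinates."""
    dot = sum(a * b for a, b in zip(mi, mj))
    ni = math.sqrt(sum(a * a for a in mi))
    nj = math.sqrt(sum(a * a for a in mj))
    return dot / (ni * nj) if ni and nj else 0.0

def rule5_action(sim_value, u_g, rho, merge_threshold=0.8):
    """Rule (5) sketch: merge the two communities above the similarity
    threshold; otherwise join only if the payoff clears the satisfaction."""
    if sim_value > merge_threshold:
        return "merge"
    return "join" if u_g >= rho else "skip"
```

With the practical-case numbers (sim = 0.5, U_G(3) = 8.266, ρ(3) = 5.119), the decision is "join", producing the overlapping structure described below.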

A Practical Case
The topic propagation space coordinate Z_i→j = 1/(m_i · m_j); therefore, the coordinate matrix of G_a follows from the pairwise inner products of the node coordinates. The topic influence scores of the nodes in G_a are shown in Table 4. First, the most influential node v_6 in Table 4 is selected as the source point of percolation. Due to the small amount of data, we assume that the influence range of the topic is one hop, i.e., D = 1.
The nodes affected by v_6 include v_3, v_4, and v_5. Node v_3 is affected by v_6, v_2, and v_5. Let the percolation coefficient η_z = 0.5 and the dimensionless number π = 3. According to Equation (19), the percolation strengths of v_6, v_2, and v_5 towards v_3 are S_6,3 = 11.60 × exp{−0.125} = 10.237, S_2,3 = 10.38 × exp{−0.5} = 6.296, and S_5,3 = 4.70 × exp{−0.5} = 2.849, respectively. Therefore, the node with the greatest influence on v_3 is v_6. Assuming that the cost of propagating topics to v_3 is the topic influence of v_3 itself, U_6(P_6, P_3) > 0; thus, v_3 accepts and continues to spread the topic of v_6 and joins v_6's community. Similarly, v_4 and v_5 are divided into v_6's community.
The local area covered by the influence of v_6 reaches Nash equilibrium. Next, v_2 is selected as the source point of percolation. The influence of v_2 covers v_1 and v_3; v_3 is marked as "divided", so we need to compare the topic similarity between v_2 and the source point of v_3's community (i.e., v_6) according to Equation (30). Suppose that m_2 · m_6 = 1, |m_2| = 2, and |m_6| = 1; then, sim(m_2, m_6) = m_2 · m_6/(|m_2||m_6|) = 0.5. Since sim(m_2, m_6) < 0.8, the communities of v_2 and v_6 are not merged. According to Equations (24) and (25), the payoff and payoff satisfaction of v_3 are U_G(3) = 10.237 + 6.296 − 8.267 = 8.266 and ρ(3) = 5.119, respectively. Since U_G(3) > ρ(3), v_3 joins v_2's community, forming an overlapping structure. Similarly, we can calculate the topic influence of v_2 on v_1 to make the local region reach Nash equilibrium. The community detection result of G_a is shown in Figure 3b.

Experimental Environment
All experiments in this paper were performed on a computer with an Intel(R) Core(TM) i5-7500 CPU at 3.40 GHz and 16 GB of Yuzhan DDR4 RAM. All of the proposed and compared algorithms were implemented in Python.
For semantic networks, GTIP was compared with seven semantic community detection algorithms: CUT (Community User Topic) [35], TURCM (Topic User Recipient Community Models) [36], LCTA (Latent Community Topic Analysis) [37], ACQ (Attributed Community Query) [21], DEEP (Deep Learning Method) [28], BTLSC (Background and Two-Level Semantic Community) [29], and SCE (Single Chromosome Evolutionary) [14]. CUT, TURCM, and LCTA generate communities based on topic probability models; ACQ is an attributed-graph community detection method; DEEP and BTLSC are both deep learning-based semantic community detection methods; and SCE is a recent semantic community detection method based on single-chromosome evolution.

Evaluation Criteria
Shen et al. [38] introduced Extended Q-modularity (EQ) to evaluate the quality of algorithms for identifying highly clustered communities; it is defined as follows:

EQ = (1/M) ∑_C ∑_{v_i, v_j ∈ C} (1/(O_i O_j)) [A_ij − (K_i K_j)/M],    (34)

where K_i is the degree of node v_i, M = ∑_{i,j} A_ij is the total degree of the network, A_ij is the adjacency matrix of the network, and O_i is the number of communities to which v_i belongs.
In a semantic social network, the community structure should satisfy both link density and semantic cohesion between nodes. Xin et al. [17] introduced Semantic Q-modularity (SQ) to evaluate the semantic cohesion of the community structure, which is defined as follows:

SQ = (1/M) ∑_C ∑_{v_i, v_j ∈ C} (sim(m_i, m_j)/(O_i O_j)) [A_ij − (K_i K_j)/M].    (35)

In Equation (35), m_i and m_j are the coordinates of nodes v_i and v_j in the semantic feature space and sim(m_i, m_j) is the cosine similarity between v_i and v_j (Equation (30)). The range of both EQ and SQ is (0, 1); the closer the value is to 1, the higher the quality of the community.
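A minimal sketch of computing EQ, and its semantic variant SQ, follows. It assumes the reconstructed formulas above, represents the graph as adjacency sets, and takes a caller-supplied sim() function; all names here are illustrative, not the authors' implementation.

```python
from itertools import product

def eq_modularity(communities, adj, degree, M, memberships, sim=None):
    # communities: list of node sets; adj: dict node -> set of neighbors;
    # degree: dict node -> K_i; M: total degree; memberships: dict node -> O_i.
    # With sim=None this computes EQ; passing sim(i, j) weights each node
    # pair by semantic similarity, giving the SQ variant.
    total = 0.0
    for com in communities:
        for i, j in product(com, com):
            w = 1.0 if sim is None else sim(i, j)
            a_ij = 1.0 if j in adj[i] else 0.0
            total += w * (a_ij - degree[i] * degree[j] / M) \
                     / (memberships[i] * memberships[j])
    return total / M

# Two disjoint triangles, each forming its own (non-overlapping) community:
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
degree = {n: len(ns) for n, ns in adj.items()}
M = sum(degree.values())                 # total degree = 12
memberships = {n: 1 for n in adj}        # O_i = 1 everywhere
eq = eq_modularity([{0, 1, 2}, {3, 4, 5}], adj, degree, M, memberships)  # 0.5
```

With O_i = 1 for every node, EQ reduces to Newman's standard modularity, which is 0.5 for two disconnected triangles.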
Lancichinetti et al. [33] introduced Normalized Mutual Information (NMI) to compare the similarity between the ground truth and the detected communities. The normalized mutual information between partitions C_X and C_Y is defined as follows:

NMI(C_X, C_Y) = 1 − (1/2)[H(C_X|C_Y)/H(C_X) + H(C_Y|C_X)/H(C_Y)],

where H(C_X) is the entropy of C_X and H(C_X|C_Y) is the conditional entropy of C_X given C_Y. In the experiments, NMI is used to compare the communities discovered by each algorithm with the ground-truth communities of the artificial networks.
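For hard (non-overlapping) partitions, the measure reduces to the standard mutual-information form; the sketch below uses the common 2I/(H_X + H_Y) normalization, which is equivalent in spirit, though not identical, to the conditional-entropy form used for overlapping covers.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (bits) of a hard labeling given as a per-node list.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def nmi(x, y):
    # NMI between two hard partitions, normalized by the mean entropy.
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = sum(c / n * math.log2(c * n / (px[a] * py[b]))
             for (a, b), c in pxy.items())
    hx, hy = entropy(x), entropy(y)
    return 2 * mi / (hx + hy) if hx + hy > 0 else 1.0

# Identical partitions (up to label renaming) score 1; independent ones score 0:
perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # 1.0
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # 0.0
```

Because NMI is invariant to label renaming, it compares the grouping itself rather than the arbitrary community identifiers.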

Artificial Networks
For our experiments, we produced ten artificial networks with ground-truth communities using the LFR (Lancichinetti Fortunato Radicchi) benchmark [33]. The parameter settings of the LFR benchmark are provided below.
The number of nodes in the network was set to n = 10,000 and the average node degree was set to d = 5. The minimum and maximum community sizes were set to |C|_min = 5 and |C|_max = 500, respectively. The overlap degree of each overlapping node was set to O_m = 2 and the number of overlapping nodes in the network was set to O_n = 500. The mixing parameter µ was set to {0.1 : 0.1 : 0.8}; that is, µ varied from 0.1 to 0.8 in steps of 0.1. As µ increases, community boundaries become blurred and the communities in the network become less identifiable.
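networkx ships a non-overlapping LFR generator that can reproduce part of this setup. The power-law exponents below (tau1 = 3 for degrees, tau2 = 1.5 for community sizes) are assumptions not stated in the text, the overlap parameters O_m and O_n require the authors' original LFR implementation, and a small n is used here purely to keep the sketch fast.

```python
import networkx as nx

# Non-overlapping LFR benchmark (overlap parameters O_m / O_n are not
# supported by this generator and need the original LFR tool).
G = nx.LFR_benchmark_graph(
    n=250,              # the paper uses n = 10,000
    tau1=3,             # assumed degree exponent
    tau2=1.5,           # assumed community-size exponent
    mu=0.1,             # mixing parameter, varied 0.1 .. 0.8 in the paper
    average_degree=5,
    min_community=20,
    seed=10,
)

# Ground-truth communities are attached to the nodes as a set-valued attribute:
communities = {frozenset(G.nodes[v]["community"]) for v in G}
```

The resulting `community` node attribute provides the ground truth against which NMI is computed.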

Complex Networks
Complex networks were used to validate the performance of GTIP against traditional community detection methods.
(1) The College Football Network. This network contains 115 nodes and 616 edges, where nodes represent football teams and an edge between two nodes indicates that the two teams have played a game against each other.
(2) The Political Book Network. This network was generated from the sales records of political books on Amazon.com during the U.S. presidential election in the early 21st century, and consists of 105 nodes and 441 edges. Nodes represent books and edges represent co-purchasing of two books by the same buyers. The network forms three natural communities: "liberal", "neutral", and "conservative".
(3) The Dolphin Family Network. This network consists of two dolphin families with 62 nodes and 159 edges. Nodes represent dolphins and edges represent frequent contact between two dolphins.

Real-World Networks
Real-world networks were used to validate the performance of GTIP and the semantic community detection methods. The five semantic-rich real-world networks used in the experiments can be downloaded from https://www.aminer.cn (accessed on 1 August 2022) and https://snap.stanford.edu/data/index.html (accessed on 1 August 2022).
(1) Academic Social Network (ASN): this dataset includes paper information, paper citations, author information, and author collaborations, and contains 1,712,433 authors (nodes) and 4,258,615 collaboration relationships (edges).
(2) Youtube social network: Youtube is a video-sharing website where users can establish friendships and create groups. This dataset contains 1,134,890 nodes and 2,987,624 edges.
(3) DBLP collaboration network: the DBLP computer science bibliography provides a comprehensive list of research papers in computer science. This dataset contains 317,080 nodes and 1,049,866 edges.
(4) Amazon product co-purchasing network: this network was collected by crawling the Amazon website. If a product i is frequently co-purchased with a product j, the graph contains an undirected edge between i and j. This dataset contains 334,863 nodes and 925,872 edges.
(5) Enron email network: this dataset was originally made public and posted to the web by the Federal Energy Regulatory Commission during its investigation of Enron; it contains 36,692 nodes and 183,831 edges.

Analysis on the Influence Range of Percolation
We use the parameter jump to represent the influence range of percolation, which can affect the aggregation of nodes inside communities.
In the non-artificial network experiments (Tables 5 and 6), as jump increases, the number of detected communities decreases and the quality (EQ and SQ) of the communities declines, especially when jump > 3. According to Equation (29), the source point has a strong influence on the nodes within three hops. Beyond this range, percolation becomes uncertain, which fragments communities and reduces the quality of the community structure. In comparison, SQ decays slightly faster than EQ. Comparing Equations (34) and (35), changes in the percolation range are more likely to affect the similarity between nodes within a community than the community's ratio of internal to external links.

In the artificial network experiments (Table 7), the performance of GTIP varies with the parameter µ. As µ increases, the communities in the network become less identifiable and the NMI score gradually decreases. The performance of GTIP drops rapidly when jump > 3. In contrast to the non-artificial network experiments, the differences in NMI score for jump = 1, 2, and 3 are not significant. One possible reason is that the link distribution of the artificial networks is relatively uniform, which decreases the difference in node influence within three hops. In summary, the performance of GTIP begins to weaken when jump = 3, and percolation becomes ineffective when jump > 3. Without loss of generality, we set jump = 3 in the following experiments.

Analysis on the Number of Topics
The number of topics (#Topics) in a document collection T affects the size of the basis of the semantic space; therefore, we examined the change in community quality as the number of topics varied over T = 1, 2, ..., 20.
The experimental results are shown in Figures 4-6. When #Topics ranges from 1 to 8, the quality (EQ, SQ, and NMI) of the communities increases sharply; when #Topics ranges from 8 to 12, the quality is stable; and when #Topics ranges from 12 to 20, the quality decreases rapidly. The reason is that as #Topics increases, the differences between the semantic space coordinates of the nodes grow larger, which increases the possibility of community division. In this experiment, EQ, SQ, and NMI reach their optimal values when the number of topics is around 10. In addition, the SQ values of the community structures are higher in networks with obvious topic attributes. For example, the topics in the Enron email network mostly concern finance, stock prices, and energy transportation, which gives its communities strong topic consistency. To better demonstrate the performance of our algorithm, we set #Topics = 10 in the following experiments.

Figure 6. The NMI score on artificial networks with #Topics ranging from 1 to 20.

Experimental Results on Artificial Networks
We executed the eleven community detection algorithms on the LFR artificial networks and recorded their NMI values. From Table 8, it can be seen that the complex network community detection methods (GN, FN, LFM, and COPRA) have lower NMI values, although their NMI values decrease only slowly as µ grows. Among them, COPRA performs better and remains effective in mining the community structure when the community boundaries are blurred (µ = 0.6, 0.7, and 0.8). As the community boundaries become clearer, the performance of the semantic community detection algorithms improves. When µ = 0.4 and 0.5, ACQ and CUT have higher NMI values; GTIP and DEEP perform better when µ = 0.1, 0.2, and 0.3. However, because DEEP requires a large number of ground-truth communities as training samples, its NMI decays faster as µ grows. In comparison, GTIP performs better. The reason is that when the community boundaries are clearer, node cohesiveness and central tendency are stronger, which is more consistent with the community generation principle of GTIP.

Experimental Results on Complex Networks
We chose the Football, Books, and Dolphins networks as the experimental datasets. The algorithms used for the comparison included GN [6], FN [7], LFM [33], and COPRA [34].
GN and FN are non-overlapping community detection algorithms, while LFM and COPRA are overlapping community detection algorithms. We compared the EQ and SQ of each algorithm on the three complex networks described in Section 5.2.2. Table 9 shows the EQ and SQ scores of each algorithm. GN and FN discover communities by cutting edges, and the communities they produce do not overlap; therefore, their EQ values are lower. LFM and COPRA aim to increase the ratio of internal to external links of a community; therefore, the EQ values of these two algorithms are higher than that of GTIP (5.229% higher on average). The goal of GTIP is semantic similarity among the nodes in a community; therefore, the SQ value of GTIP is higher than those of the other four algorithms (27.153% higher on average). The COPRA algorithm has the highest EQ value in the experiment; its SQ value, however, is lower than that of GTIP (8.184% lower on average). In general, traditional non-semantic community detection algorithms perform well in mining communities based on topological structure and poorly in community detection with rich semantic information. A horizontal comparison shows that the EQ values of the classical community detection algorithms are higher than their SQ values (10.169% higher on average). COPRA and GTIP both show good performance on complex networks. Both discover communities based on information diffusion, which indicates that accurately simulating the interaction behavior of social individuals is an effective way to detect communities with tight structure and semantic cohesion.

Experimental Results on Real-World Networks
In this section, we compare GTIP with seven semantic community detection algorithms: CUT [35], TURCM [36], LCTA [37], ACQ [21], DEEP [28], BTLSC [29], and SCE [14]. We used the five real-world networks described in Section 5.2.3 as the experimental data; the results are shown in Tables 10 and 11. BTLSC and SCE perform better on the ASN and Youtube networks. For example, in the EQ comparison experiment, BTLSC and SCE outperform GTIP by 0.294% and 11.233%, respectively; in the SQ comparison experiment, they outperform GTIP by 2.369% and 12.384%, respectively. On the DBLP, Amazon, and Enron networks, GTIP has a clear performance advantage, outperforming the other algorithms in the EQ and SQ comparison experiments by an average of 18.386% and 19.973%, respectively. The reason is that the nodes in these three networks generally have a strong propensity towards particular topics. Taking the Enron network as an example, Figure 7 depicts the word clouds of the Enron network. It can be seen that the network has a strong topic concentration with six distinct topics, which enhances the accuracy of GTIP in selecting the source points of percolation. Additionally, in networks with rich semantic information, SQ is typically lower than EQ. The reason is that in a semantic social network, although two users may focus on the same topic, different sentiment tendencies concerning that topic can lead to a split in the community.

Conclusions
This paper proposes GTIP, a semantic community detection method based on topic influence percolation. First, we modeled topic propagation in semantic social networks as the flow of a fluid through porous media based on percolation mechanics, then constructed a partial differential equation to solve the percolation intensity of topic influence. Second, based on game theory, the rules of accepting and forwarding topics were formulated to maximize the benefits of users and achieve Nash equilibrium. Finally, a semantic community was generated based on the seed expansion process.
We conducted experiments on artificial networks, complex networks, and semantic social networks. The results show that when community boundaries are obvious and the corpus is rich, the modularity and NMI scores of GTIP are significantly better than those of the other algorithms. This demonstrates that GTIP captures both the structural density and the semantic cohesion of the network, and that it has a clear performance advantage in networks with high topic concentration.
In fact, users have different emotional perceptions of different topics; even if we gather users with similar topics into one community, the community has the potential to split. In future work, we intend to integrate sentiment attributes into the basis of the semantic space in order to improve the structural stability of the detected communities.

Data Availability Statement: The publicly available datasets analyzed in this study can be found at https://www.aminer.cn (accessed on 1 August 2022) and https://snap.stanford.edu/data/index.html (accessed on 1 August 2022). Further inquiries can be directed to the corresponding author.