Geographical Structural Features of the WeChat Social Networks

Abstract: Recently, spatial interaction analysis of online social networks has attracted considerable attention. Early studies of geographical characteristics and community detection in online social networks have shown that nodes within the same community tend to cluster geographically. However, community detection rests only on the idea that there are more links within a community than between communities, and it offers no explanation for this phenomenon. Statistical models for network analysis, by contrast, investigate the characteristics of a network on the basis of probability theory. This paper analyzes a series of statistical models and selects the MDND model to classify links and nodes in social networks. The model achieves the same performance as the community detection algorithm when analyzing the structure of the online social network. The construction assumptions of the model explain, to a degree, why nodes in the same community aggregate geographically. The research provides new ideas and methods for node classification and geographic characteristics analysis of online social networks and mobile communication networks, and it makes up for a shortcoming of community detection methods, which do not explain the principle of network generation. A natural progression of this work is to analyze the geographical characteristics of social networks in order to assist advertising delivery and Internet management.

Section 6 discusses the advantages and limitations of the model. Section 7 summarizes the content of the article and concludes.

Notation
In this paper, M denotes the number of nodes in the network, Z represents the M × M network, and N represents the number of links. $z_{sr}$ represents the number of connections between node s and node r, which is a non-negative integer. A non-zero $z_{sr}$ indicates that there is a link between node s and node r. Ω(·) indicates whether the expression between the brackets is true or not: if it is true, the value is 1; if not, the value is 0. "Spatio-info network" in this paper refers to a city-based network constructed from a social network containing geographic information. The specific construction method is described in Section 3. The amount of interactions refers to the number of all user message propagation records between two cities.

Community Detection
Community detection in complex networks is an important research field. A community detection algorithm aims to obtain a satisfactory classification of nodes. The labels of the nodes indicate their classes and are the output of the algorithm. A community detection algorithm usually runs for many iterations, and the node labels are updated in each iteration. Iteration terminates when the goal of having more links within communities than between communities is achieved. Algorithms differ in how they change the node labels in each iteration and in how they measure whether the goal has been achieved.
Newman and Girvan [23] proposed modularity to measure the results of community detection. Modularity is defined as the difference between the ratio of edges in communities of a network and the expected ratio in a random network.
In Equation (1), k is the number of communities, $\mathrm{realflow}_{ijk}$ gives the ratio of the edge between nodes i and j in the same community, and $\mathrm{estflow}_{ijk}$ gives the expected value in a random network. The modularity of networks of different scales fluctuates between 0.3 and 0.7.
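As an illustration of the modularity measure, the following minimal Python sketch computes the standard Newman-Girvan modularity of an undirected, unweighted graph as the sum over communities of the observed fraction of internal edges minus the fraction expected in a random network with the same degrees. The function name `modularity`, the edge-list input format, and the toy graph are assumptions made for illustration; they are not taken from the paper, and the exact form of Equation (1) may differ.

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    degree = defaultdict(int)
    internal = defaultdict(int)   # edges with both ends in the same community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    degree_sum = defaultdict(int)  # total degree per community
    for node, d in degree.items():
        degree_sum[community[node]] += d
    # Q = sum_c [ e_c / m - (d_c / 2m)^2 ]
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Toy example (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q = modularity(edges, labels)
```

For the toy graph, splitting the nodes into the two triangles gives a clearly positive value, whereas placing all nodes in a single community gives zero.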
Newman [24] proposed the most widely used method, the Newman modularity maximization method. In practical projects, several algorithms are available to support the analysis of real networks, taking time complexity and flexibility into account. The fast greedy algorithm is a hierarchical agglomerative algorithm with time complexity O(md log n), where m represents the number of edges, n is the number of nodes, and d is the depth of the dendrogram describing the community structure.
Blondel [25] applied the multilevel community detection algorithm. However, that work was based on a mobile phone communication dataset, and the network was built on individuals and was devoid of spatial interaction information. Thus far, studies that try to detect multilevel communities based on spatial interaction are rare, except for the works of Sobolevsky et al. [26], De Montis et al. [24], and Guanghua [16]. The former two works are based on the modularity maximization algorithm, and the last one is based on the Infomap algorithm. Sobolevsky et al. [26] detected deep and small communities based on the communities that formed before. Different from others, this kind of method gets large communities after those small communities are obtained. The work in this paper is based on the Infomap algorithm, without detecting multilevel communities. There are many algorithms to detect communities [26]. Lancichinetti and Fortunato [27,28] found that the Infomap algorithm achieves satisfactory performance in different situations. Rosvall and Bergstrom [29] give a more detailed description of this algorithm.
In the research of Chuan et al. [30] and Guanghua [16], cities and mobile phone base stations are regarded as nodes. The nodes can be divided into classes by community detection. The most prominent features of the communities are the geographical aggregation and continuity of the city or base station nodes. Intuitively, interaction in an online network should not follow the laws of face-to-face interaction, so the city nodes of a community should not be geographically aggregated in one area. However, many previous results show that the effect of geographical distance is preserved in network interaction, and the city nodes of a community are also located together. This can be taken to mean that the communication of public opinion is affected by geographical factors to a large extent.

Stochastic Block-Model
The Stochastic Block-model (SB) and its associated models form a very important class of statistical models. The basic SB assumes that each node belongs to one of K latent classes. For any two nodes (s, r), the probability that they are connected is determined by a specific parameter $\theta_{c_s,c_r}$. Given the class assignments $c_s$ and $c_r$, the presence of a link between nodes s and r follows a distribution with $\theta_{c_s,c_r}$ as its parameter. Figure 1a is the adjacency matrix of a network generated in this way.
Following Snijders' [19] method, the SB can be reconstructed with Bayesian statistics by placing conjugate priors on the classification and the other parameters. From the conjugate priors and the likelihood function, the posterior distribution can be written down, which yields a new network model.
The Beta distribution and the discrete Dirichlet distribution are used as the priors for the connection probability and the node classification, respectively. The SB formed by Bayesian statistics theory is shown in Equation (2).
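The generative process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the formulation of Equation (2): links are taken to be binary (Bernoulli) rather than counts, self-links are excluded, and the function name `sample_sbm` and the hyperparameters `alpha`, `a`, and `b` are hypothetical.

```python
import numpy as np

def sample_sbm(num_nodes, num_classes, alpha=1.0, a=1.0, b=1.0, seed=0):
    """Draw one network from a Bayesian stochastic block-model (sketch)."""
    rng = np.random.default_rng(seed)
    # Class proportions from a symmetric Dirichlet prior.
    pi = rng.dirichlet(np.full(num_classes, alpha))
    # Class label c_s for every node s.
    classes = rng.choice(num_classes, size=num_nodes, p=pi)
    # Connection probability theta[k, l] for every ordered class pair,
    # drawn from a Beta(a, b) prior.
    theta = rng.beta(a, b, size=(num_classes, num_classes))
    # probs[s, r] = theta[c_s, c_r]
    probs = theta[classes][:, classes]
    # z[s, r] ~ Bernoulli(theta[c_s, c_r]) (binary links are an assumption).
    z = (rng.random((num_nodes, num_nodes)) < probs).astype(int)
    np.fill_diagonal(z, 0)  # no self-links in this sketch
    return classes, theta, z

classes, theta, z = sample_sbm(num_nodes=30, num_classes=3, seed=1)
```

Sorting the rows and columns of `z` by `classes` makes the block structure of the adjacency matrix visible.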
The SB can be extended into many forms. For example, the Infinite Relational Model (IRM) [31] allows an unbounded number of classes by placing a Dirichlet process prior on the classification distribution. This eliminates the need to fix the number of classes in advance and allows the number of classes to increase as the number of nodes increases. Most of these models, including the SB, the IRM, and the Mixed Membership Stochastic Blockmodel (MMSB), describe the structural characteristics of the network. However, they cannot describe the links of the network from a global perspective and cannot predict new links. None of these models can model networks with an increasing number of nodes.

Dirichlet Network Distribution
The Dirichlet Network Distribution (DND) constructs a network without limiting the number of nodes in a very simple way. It defines a distribution over a countably infinite set of nodes. Thus, a network can be represented as a sequence of (sender, receiver) pairs, and each pair corresponds to a phone call, message, mail, or journey from the sender to the receiver.
The SB and associated models assume that the network is static and fully observable. This is why these models cannot predict new links. DND regards the network as a sequence of observed connections, aiming at predicting the classification of new links rather than nodes.
To generate a pair of nodes, one can simply sample the sender and the receiver from a distribution G over all nodes. As in Equation (3), taking N as the total number of links, a Dirichlet process prior is placed on G. $y_n$ represents the nth link, which is a message from sender $s_n$ to receiver $r_n$.
This network construction method is called the Symmetric Dirichlet Network Distribution (SDND). Figure 1b shows a network generated in this way. The defining feature of this model is that the sender and the receiver of any link in the network follow exactly the same distribution.
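When G is integrated out, sampling senders and receivers from a Dirichlet process reduces to a Chinese restaurant process over link endpoints: an endpoint is an existing node with probability proportional to how often that node has already appeared, and a new node with probability proportional to the concentration parameter. The sketch below illustrates this for the symmetric case. The function name `sample_sdnd_links` and the parameter name `gamma` are assumptions for illustration and are not taken from Equation (3).

```python
import random

def sample_sdnd_links(num_links, gamma=1.0, seed=0):
    """Sample (sender, receiver) pairs from a symmetric urn scheme (sketch)."""
    rng = random.Random(seed)
    counts = []   # counts[i] = times node i has appeared as an endpoint
    links = []

    def draw_node():
        total = sum(counts)
        # New node with probability gamma / (total + gamma).
        if rng.random() * (total + gamma) < gamma:
            counts.append(1)
            return len(counts) - 1
        # Otherwise reuse node i with probability counts[i] / total.
        node = rng.choices(range(len(counts)), weights=counts)[0]
        counts[node] += 1
        return node

    for _ in range(num_links):
        # Sender and receiver are drawn from the same urn, so both
        # follow exactly the same distribution.
        sender = draw_node()
        receiver = draw_node()
        links.append((sender, receiver))
    return links, counts

links, counts = sample_sdnd_links(num_links=100, gamma=2.0, seed=42)
```

Because the number of nodes is not fixed in advance, new nodes keep appearing as more links are generated, at a rate governed by `gamma`.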
However, there are some problems. In real-life networks, senders and receivers are not completely independent and identically distributed. For example, on Facebook or Twitter, far more users follow political celebrities such as Trump than follow ordinary users. From the perspective of national migration, with the process of urbanization, many of the migrants in a population migration network come from remote areas such as rural areas or townships and move to more economically developed cities or county towns. That is to say, the sender and the receiver do not obey the same distribution when modeling the network.
Therefore, it is more reasonable to describe the sender and the receiver with two asymmetric distributions A and B instead of the single distribution G. Based on this idea, Sinead et al. [25] constructed a sender distribution A and a different receiver distribution B from a discrete base measure H, as shown in Equation (4).
Figure 1c is a network generated with this method. However, this method can only solve the problem of asymmetry between the sender and the receiver. Real-life social networks are more complex. For example, the followers of a certain topic in a social network are generally different from the followers of other topics. Animation, cars, and fashion are communication topics that are partly separated and partly intertwined. The senders and receivers of one topic are likely to differ from those of other topics. Generally speaking, who sends and who receives a message is related to the topic of the message. In this case, all the models mentioned above have obvious defects and cannot be applied to such networks.

Data and Networks
We construct the Spatio-info network to analyze the geographical characteristics with the method of Chuan et al. [30]. The main idea of this method is to abstract the information interaction among users in online social networks as information transmission among cities. In this way, a network of cities is produced.
WeChat is the most widely and frequently used social media software in China. It is mainly used on mobile phones and can also be used on PCs and tablets. It provides instant messaging, photo display, and comment functions. It also has Official WeChat Accounts, which are managed by government agencies, schools, companies, individuals, etc. General users follow these accounts and then receive their messages, news, reviews, articles, and so on. A record is produced when a WeChat user clicks to browse a page shared by others. These records are our data source. The data collection company (Fabonacci) collected a large number of historical records of pages propagated among users, forming a dataset.
Each record in the data corresponds to one event in which a user clicked and browsed a webpage. A record includes the sharer ID, the viewer ID, the webpage ID, the viewer's IP, and the browsing time.
To protect privacy, the data collection company assigns each user an ID that corresponds uniquely to the user's WeChat ID but is not identical to it. The geographical location of a user can be obtained by querying an IP library. Song Jian et al. [32] analyzed the coverage and coincidence rate of several major IP address libraries, including IP2Location Lite, GeoLite2, Pure IP Address Library, Taobao IP Address Library, Sina IP Address Library, and Baidu IP Address Library. They concluded that the Taobao IP Address Library has the highest credibility, including at the city level. In this paper, the Taobao IP Address Library is therefore used to geographically locate the IP addresses involved in the data.
$M_h$ indicates the number of users in the dataset and $M_c$ denotes the number of cities in the dataset. First, the interaction records in the data can be considered as a sequence of links. N represents the number of links in the sequence. The interactive network can be represented as in Equation (5). This is an example of the online social network.
Each user node i has a city assignment $g_i$. The interactions of users between two cities are regarded as interactions of the cities. Each interaction record can be regarded as a directed edge, which is expressed as a city pair. The dataset can be seen as a sequence of city pairs, as in Equation (6). The interactive network between cities can be represented in the form of an adjacency matrix. The network generation process is also shown in Figure 2. In Figure 2a, the logo of WeChat represents the media users, and the links represent the information spreading among the users. The users and the links form a network. In Figure 2b, the users in a city are regarded as an overall entity, and the links from one city to another are regarded as a single link between the respective cities.

$x_n = (g_{s_n}, g_{r_n})$ (6)

The data from 05:00:00 to 06:00:00 on 22 April 2015 are analyzed, and three successively enlarged regions in central China are selected to form three networks with cities as nodes. The basic statistical characteristics of the networks are shown in Table 1.
Network 2 adds the city nodes of two provinces to Network 1, and Network 3 adds the city nodes of another two provinces to Network 2. A Mixture of Dirichlet Network Distribution (MDND) model is constructed for these networks. In Section 5, the analysis results of the MDND model are compared with the community detection results.
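The construction of the city network described in this section can be sketched as follows: each user-level record is mapped to a directed city pair, and the pairs are then counted into an adjacency matrix. This is a minimal sketch under assumptions. Records are reduced to (sharer ID, viewer ID) pairs, the sharer is treated as the sender, and the function name `build_city_network`, the user IDs, and the city assignments in the example are all hypothetical.

```python
from collections import Counter

def build_city_network(records, user_city):
    """Aggregate user-level records into a city-level network (sketch).

    records: iterable of (sharer_id, viewer_id) pairs.
    user_city: dict mapping user ID -> city (the assignment g_i).
    """
    records = list(records)
    # Each record becomes a directed city pair (city of sender, city of receiver).
    pairs = [(user_city[s], user_city[r]) for s, r in records]
    cities = sorted(set(user_city[u] for rec in records for u in rec))
    index = {c: i for i, c in enumerate(cities)}
    # matrix[i][j] = number of interactions from city i to city j.
    matrix = [[0] * len(cities) for _ in cities]
    for (src, dst), n in Counter(pairs).items():
        matrix[index[src]][index[dst]] = n
    return pairs, cities, matrix

# Hypothetical toy data.
records = [("u1", "u2"), ("u1", "u3"), ("u3", "u1"), ("u2", "u1")]
user_city = {"u1": "Wuhan", "u2": "Wuhan", "u3": "Changsha"}
pairs, cities, matrix = build_city_network(records, user_city)
```

Interactions between users in the same city appear on the diagonal of the matrix, so intra-city traffic is kept rather than discarded.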

Mixture of Dirichlet Network Distribution
Based on the three networks obtained in Section 4.1, the MDND model of Spatio-info networks is constructed according to the Hierarchical Dirichlet Process, and a Gibbs sampler is built by adapting the sampling methods of the Hierarchical Dirichlet Process.
In addition to the issues of social networks mentioned in Section 2, there are some issues specific to online social networks. Generally speaking, different topics attract different people, so the people involved in information dissemination differ from topic to topic. Therefore, the distribution of senders and receivers in such networks is related to specific information topics. When modeling such networks, the senders and receivers obey different distributions corresponding to the topics. MDND can describe the case where the class of a link is strongly related to the nodes of the link. This is a feature that many other statistical network models, such as SB and DND, lack. Figure 2b shows the adjacency matrix of a random network generated in this way. According to Sinead's work, the MDND model based on the Spatio-info network can be represented by Equations (7) and (8), in which α controls the number of classes, γ controls the number of nodes, and τ controls the overlap between classes. The process of generating a network with MDND can be simply understood as a process of gradually generating a series of links.
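The link-by-link generative process can be illustrated with a finite approximation. The sketch below replaces every Dirichlet process with a finite Dirichlet distribution over a fixed number of classes and nodes, which is an assumption made so that the example stays short; it is not the nonparametric model of Equations (7) and (8). The function name `sample_mdnd_links` and the default parameter values are hypothetical, while `alpha`, `gamma`, and `tau` play the roles described in the text.

```python
import numpy as np

def sample_mdnd_links(num_links, num_nodes=20, num_classes=3,
                      alpha=1.0, gamma=5.0, tau=10.0, seed=0):
    """Finite approximation of a mixture of Dirichlet network distributions."""
    rng = np.random.default_rng(seed)
    # Class weights: finite stand-in for a Dirichlet process with parameter alpha.
    class_weights = rng.dirichlet(np.full(num_classes, alpha / num_classes))
    # Shared base measure over nodes: finite stand-in governed by gamma.
    base = rng.dirichlet(np.full(num_nodes, gamma / num_nodes))
    # Per-class sender and receiver distributions centred on the base measure.
    # A larger tau keeps every class closer to the base, i.e. more overlap.
    conc = np.maximum(tau * base, 1e-6)  # floor avoids degenerate parameters
    senders = rng.dirichlet(conc, size=num_classes)
    receivers = rng.dirichlet(conc, size=num_classes)
    links = []
    for _ in range(num_links):
        k = rng.choice(num_classes, p=class_weights)   # class of the link
        s = rng.choice(num_nodes, p=senders[k])        # sender given the class
        r = rng.choice(num_nodes, p=receivers[k])      # receiver given the class
        links.append((int(s), int(r), int(k)))
    return links

links = sample_mdnd_links(num_links=200, seed=3)
```

Each link carries its own class label, and the sender and receiver are drawn from class-specific distributions, which is what ties the class of a link to the nodes it connects.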

Sampling Method
The symmetric Dirichlet network model is sampled to predict new links based on the Chinese restaurant process [33]. The MDND model is based on a hierarchical Dirichlet process, thus it is appropriate to construct a collapsed Gibbs sampler [34].
For each class k, the sampler tracks the number of links assigned to the class. $m^{(1)}_{k,i}$ indicates the number of links in class k whose sender is i, and $m^{(2)}_{k,i}$ indicates the number of links in class k whose receiver is i. The probabilities that node i is connected to class k as sender and as receiver are $\rho^{(k)}_{i\cdot}$ and $\rho^{(k)}_{\cdot i}$, respectively. The calculation of the posterior probability distribution is shown in Equation (9), in which s(·, ·) is the unsigned Stirling function. The probability of each node is determined by the probability measure of the edges to which it is connected, given by Equations (10) and (11). $\beta_1, \ldots, \beta_{M_c}$ correspond to existing nodes, while $\beta_\mu$ corresponds to a new node.
Given the classification of all other links, the distribution of the classification of link n is shown in Equations (12) and (13).
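The unsigned Stirling function that appears in the sampler can be tabulated by a simple recurrence. The sketch below assumes that the function in question is the unsigned Stirling number of the first kind, as is usual in samplers for hierarchical Dirichlet processes; the function name `unsigned_stirling_first` is hypothetical.

```python
def unsigned_stirling_first(n_max):
    """Table s[n][k] of unsigned Stirling numbers of the first kind."""
    s = [[0] * (n_max + 1) for _ in range(n_max + 1)]
    s[0][0] = 1
    for n in range(n_max):
        for k in range(1, n + 2):
            # Recurrence: s(n+1, k) = n * s(n, k) + s(n, k-1)
            s[n + 1][k] = n * s[n][k] + s[n][k - 1]
    return s

table = unsigned_stirling_first(6)
```

These numbers grow factorially, so for large link counts an implementation would typically store their logarithms instead; Python's arbitrary-precision integers make the direct table adequate for small examples.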

Results and Analysis
We use the MDND model to analyze the network and obtain the links' class assignments and nodes' class assignments by sampling. We compare the node classification effect of MDND and community detection algorithms, and then the MDND model's characterization of Spatio-info networks is verified, thereby explaining the formation principle of Spatio-info networks. We focus on the characteristics of the hierarchical structure in the geographic information network. The characteristics of the hierarchical structure in the Spatio-info networks are mainly reflected in that there are more interactions between cities within the provincial administrative division and fewer interactions with the outside. This phenomenon has been confirmed by the community detection algorithm. However, from the actual situation analysis, it can be speculated that the interaction classification has a strong relationship with the regional distribution of city nodes.

Classification of Links
The three networks are analyzed separately, and each network is inferred from its data with the MDND. Different link classes are denoted by different colors, as shown in Figure 3. Each subfigure of Figure 3 is similar to a matrix: the x-axis and y-axis represent the series of city numbers. The cities in the same province are placed together. The order of the city numbers on the x-axis and y-axis is the same. If two cities (i and j) have a link, there is a point with a specific color at (i, j).
It can be clearly seen that the links among cities in the same province basically fall into the same category, and the links among different provinces also form blocks. The figure thus shows that the MDND model has a strong ability to describe the structure of the network. Regardless of the size of the network, the increase of nodes does not significantly affect the quality of the model.

Classification of Nodes
Based on a simple rule, the class of a single node is determined by the dominant class of its links. The goal of this paper is to explore the model's ability to describe the structural characteristics of Spatio-info networks. Therefore, it is natural to compare the classification of nodes with the administrative division. Figure 4 is drawn in the form of the Spatio-info networks, in which the nodes represent cities, the links represent the links between cities, and different node colors indicate different classes. This figure shows the differences among the administrative division, the community detection result, and the MDND classification from the perspective of the network. It can be clearly seen in the two figures that both the community detection algorithm and the MDND model can discover the structural characteristics of the network that are closely related to geographical factors. However, there are also some differences. For Network 1, the community detection result, the MDND result, and the administrative division are completely consistent. For Network 2, the community detection result is consistent with the administrative division, but the MDND result is obviously inconsistent. For Network 3, the community detection algorithm maps the cities of Jiangxi and Fujian provinces into the same category, whereas the MDND model successfully separates the cities of the two provinces. To better analyze the classification results of the MDND model, this paper introduces the Adjusted Rand Index (ARI) [16], which measures the similarity between two divisions. A contingency table is first constructed, as shown in the matrix in Equation (15), and the ARI is then calculated from it.
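The rule that assigns each node the dominant class of its links can be sketched as a majority vote. In this minimal sketch, a link counts once for its sender and once for its receiver, and ties are broken in favour of the smallest class label; both choices, as well as the function name `classify_nodes` and the toy data, are assumptions, since the text does not specify them.

```python
from collections import Counter, defaultdict

def classify_nodes(classified_links):
    """Assign each node the most frequent class among its incident links.

    classified_links: iterable of (sender, receiver, link_class) triples.
    """
    votes = defaultdict(Counter)
    for sender, receiver, link_class in classified_links:
        votes[sender][link_class] += 1
        votes[receiver][link_class] += 1
    # Dominant class per node; ties go to the smallest class label (assumption).
    return {node: min(c, key=lambda k: (-c[k], k)) for node, c in votes.items()}

# Hypothetical toy data: five city nodes, two link classes.
toy_links = [("A", "B", 0), ("A", "C", 0), ("A", "D", 1),
             ("D", "E", 1), ("B", "C", 1)]
node_class = classify_nodes(toy_links)
```

Nodes whose links are split evenly between classes are the ones most sensitive to the tie-breaking rule, which is worth keeping in mind when comparing the result with the administrative division.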
According to the calculation method shown in Equation (16), the ARIs between any two divisions are shown in Table 2, where AD, MC, and CR denote the administrative division, the modularity class, and the classification result of the MDND model, respectively. It can be seen that both the MDND and the community detection results are nearly the same as the administrative divisions. As the network scale increases, the similarity decreases for both the MDND and the community detection results. This indicates that the MDND can model Spatio-info networks that are obviously affected by geographical factors, and that it accurately reflects the structural information related to geographical factors in the network.
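The calculation of the ARI from a contingency table can be sketched as follows. The sketch uses the standard pair-counting form of the Adjusted Rand Index, which is assumed to match Equation (16); the function name `adjusted_rand_index` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    n = len(labels_a)
    # Contingency table and its row and column sums.
    contingency = Counter(zip(labels_a, labels_b))
    rows = Counter(labels_a)
    cols = Counter(labels_b)
    sum_cells = sum(comb(v, 2) for v in contingency.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # both divisions are the same trivial partition
    return (sum_cells - expected) / (maximum - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

The index depends only on how items are grouped, not on the label names, so two divisions that differ only by a renaming of classes score 1.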

Computational Performance Analysis
Because the MDND is fitted with a Gibbs sampler, the results change as the number of iterations increases. Figure 6 shows the ARI between the MDND result and the administrative division as the number of iterations increases. The ARI value gradually approaches 1, which means that the classification results of the MDND become consistent with the administrative division. The ARI value increases rapidly in the initial stage, and then its rate of increase gradually slows down, with little volatility. The speed at which the ARI value approaches 1 also depends on the network scale: Table 2 shows that the final results of Networks 1 and 2 both reach 1, but the increase becomes slower as the scale grows.

Two Special Cases
There are some special circumstances in the Spatio-info network with cities as nodes. First, there may be cases where the network is extremely sparse. For example, the economic situation in some areas is not good, so the Internet infrastructure is relatively poor. In some areas, because of the strong protection of data privacy, only a small amount of information can be collected. Further analysis is needed to determine whether the MDND is suitable in the case where only minimal interaction information can be gathered. In addition, many neighboring regions have different levels of economic and Internet development. For example, Guangdong and Guangxi provinces in China border each other, but the economic situation and the Internet infrastructure of Guangdong province are much better. This fact causes the unbalanced nature of the Spatio-info network in the scope of Guangdong and Guangxi, and whether this feature will be reflected in the MDND is of great significance.

Sparse Network
For the sake of comparison, the Spatio-info Network 1 formed in Section 5 is sampled to obtain a sparse network. The basic information of the sparse network is shown in Table 3. The average degree of the network is 3.24. The network is modeled with the MDND, and the links and nodes are classified. We then compare the class assignments of the nodes with the administrative divisions and calculate the ARI. The change of the ARI value with the number of iterations is shown in Figure 7.
It can be seen in Figure 7 and Table 4 that, under the condition of a sparse network, MDND performs poorly at mining geo-related structural information in social networks. The ARI value does not increase significantly as the number of iterations increases, and the volatility is very large. Under the condition of an extremely sparse network, the class assignments of both methods are quite different from the administrative division, but the community detection results are better. Therefore, the MDND model is not stable enough to deal with sparse Spatio-info networks.

Table 3. Interaction amounts.

               GD to GD   GX to GX   GD to GX   GX to GD
Link amounts   548        50         22         20

Notes: GD, Guangdong Province; GX, Guangxi Province.

Unbalanced Network
Imbalance in this paper refers to the fact that, in a Spatio-info network, the interaction amount of one area is significantly less than that of another area. The geographic information interaction networks with cities as nodes in Guangdong and Guangxi provinces are obtained in the same way as the networks in Section 5. In total, 640 records are obtained, of which 548 are within Guangdong Province and 50 are within Guangxi. The details are shown in Table 4. The interactions in Guangdong are significantly more numerous than those in Guangxi.
MDND is used to model and analyze the Spatio-info network, and the comparison between the classification result and the administrative division is shown in Figure 8 and Table 5. It can be seen that the ARI value increases noticeably as the number of iterations increases, and the volatility is very large. Moreover, neither the MDND classification nor the community detection result is similar to the administrative division. The results show that it is difficult for MDND to mine the geographically related structural information in the Spatio-info network in this special case. A new model should be built when dealing with unbalanced geographic information networks.

Conclusions
This paper focuses on the analysis of geographic characteristics in various networks with complex network methods. Such studies usually extract geographic information from different networks, such as mobile phone communication networks and user interaction networks in social media applications. The networks are then converted into networks with location nodes, that is, spatial information networks, and the theory and methods of complex networks are used to analyze them. The most typical result is that the nodes in a community are often geographically aggregated, in a pattern that is generally similar to administrative divisions. The principle of community detection is that there are more links within a community. This explains, to a certain extent, why the nodes of a community also aggregate geographically. That is, whether in online social networks or phone-call communication networks, there is more communication between geographically close regions than between other regions. This paper tries to analyze spatial information networks with statistical network models other than community detection. The principles, advantages, and disadvantages of a series of statistical network models are analyzed, and the MDND model is chosen. The construction and implementation details of the MDND model are also introduced. Finally, the analysis results are presented in the form of node and link classifications, and they are compared with the results of the community detection algorithm and with the administrative division. The three types of divisions show a high degree of agreement, which shows that the MDND model can also capture the geographical aggregation of nodes in the same community.
At the same time, the construction of the MDND model rests on explicit hypotheses and a mathematical basis. Therefore, the hypotheses of the model explain the network constructed in the paper to a degree. That is to say, the geographical aggregation of nodes in this network may be due not only to the fact that there are more links within the community, but also to the fact that the classification of links is correlated with the nodes. Of course, such hypotheses and explanations require more empirical data to analyze and confirm. Nevertheless, the ability of the MDND model to classify the links and nodes of such networks can be used for further analysis and research.
The node grouping ability of the MDND model is similar to that of the community detection method, and its link classification ability can be used to classify information in social media networks in the future. For Spatio-info networks, the MDND model, like other statistical models, can predict the rest of the network by building a model of the network from part of the data. For example, if 90% of the links in the network are known, the remaining 10% can be predicted with these methods. This is a typical link prediction task. Therefore, in the future, with the help of the MDND model, link prediction methods from complex networks can be used to analyze networks related to geographic information, and more valuable results can be obtained. Link prediction methods and community detection algorithms could be used in advertising delivery and Internet management.
The significance of the work is that it analyzes the spatial structure characteristics of social networks from the perspective of statistical models for the first time. The node classification performance of the model matches that of community detection. The model can mathematically reflect the spatial structure features contained in the network. At the same time, as a statistical model of the network, it has the great advantage of predictive ability. Naturally, this work can be extended in the future to predict the geographical scope of information dissemination and to provide assistance in areas such as advertising delivery and Internet management.