ISPRS International Journal of Geo-Information
  • Article
  • Open Access

23 May 2022

Organizational Geosocial Network: A Graph Machine Learning Approach Integrating Geographic and Public Policy Information for Studying the Development of Social Organizations in China

1 College of Humanities and Development Studies, China Agricultural University, Beijing 100083, China
2 College of Economics and Management, China Agricultural University, Beijing 100083, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.

Abstract

This study aims to give an insight into the development trends and patterns of social organizations (SOs) in China from the perspective of network science, integrating the geography and public policy information embedded in the network structure. Firstly, we constructed a first-of-its-kind database which encompasses almost all social organizations established in China throughout the past decade. Secondly, we proposed four basic structures to represent the homogeneous and heterogeneous networks between social organizations and related social entities, such as government administrations and community members. Then, we pioneered the application of graph models to the field of organizations and embedded the Organizational Geosocial Network (OGN) into a low-dimensional representation of the social entities and relations while preserving their semantic meaning. Finally, we applied advanced graph deep learning methods, such as graph attention networks (GAT) and graph convolutional networks (GCN), to perform exploratory classification tasks, training models on a dataset of county-level OGNs to predict which geographic region each county-level OGN belongs to. The experiments show that different regions possess a variety of development patterns and economic structures in which local social organizations are embedded, thus forming different OGN structures that graph machine learning algorithms can detect and use to make relatively accurate predictions. To the best of our knowledge, this is the first application of graph deep learning to the construction and representation learning of geosocial network models of social organizations, which has certain reference significance for research in related fields.

1. Introduction

With economic and social development, Chinese social organizations have been developing rapidly, participating in planning and governance, providing professional services in various fields such as health care, social security, and public education [1]. Although social organizations often work with or alongside government agencies, and may even receive funding or commissions from the government, they are actually independent third parties outside of the government in most domains.
When the People’s Republic of China was founded, there were only about 100 national social organizations and 6000 local social organizations. Soon after the beginning of the Cultural Revolution in 1966 when the Ministry of the Interior, which was in charge of all Chinese social organizations, was abolished, social organizations almost vanished in mainland China. Thanks to the increasingly liberal social climate in China after the reform and opening up, the announcement of the Regulations on Registration of Social Organizations and the Fund Management Measures laid a solid legal foundation for the development of social organizations, whose number nearly doubled in the following decade.
In the first decade of the 21st century, social organizations in China put on a spurt. Nowadays, however, confronted with a saturated market and continuously perfecting policies and legal systems, the growth rate has decreased (Figure 1), which indicates the shift of development philosophy in China, from the pursuit of speed to the pursuit of quality.
Figure 1. The development trend of social organizations in China before COVID-19 pandemic.
Social organizations in China can be divided into three categories: “top-down”, “bottom-up”, and “externally imported”. Government-run organizations and foundations are typical “top-down” social organizations. In contrast, “bottom-up” social organizations include all kinds of local industry associations and private non-profit organizations. Since China’s accession to the World Trade Organization (WTO), the “externally imported” ones, whose funding, project operation, and governance are mainly derived from foreign social organizations, have become a force to be reckoned with, bringing new ideas and innovations to fields such as environmental protection, poverty alleviation, and women’s rights. The vast territory, the uneven distribution of natural resources, the intermingling of various social classes, the unbalanced development, and the cultural diversity of China have contributed to great differences in social development, as well as in the composition of social organizations, across China. Generally speaking, geographic location, which encompasses the local economy, culture, and policies, is an important factor in the growth of social organizations, and it is crucial to explore the impact of the abstract structures embedded in geographic information on the development of social organizations in China.
A social network is a structure composed of various social entities; the most familiar example is no doubt the Internet-based social network (e.g., Facebook, LinkedIn, or WeChat). However, apart from individuals online, social organizations can also be an important component of a social network [2]. This perspective provides a set of methods and theories for analyzing the structure of social entities as a whole, as well as explaining the patterns observed in these structures [3]. Social network analysis (SNA) has recently become increasingly popular owing to the rise of graph machine learning [4,5]. Building on the mathematical concept of graphs, the simple and straightforward form of graphs enables us to obtain a clearer picture of community structures and their interactions. However, previous literature has paid little attention to the quantitative and structural exploration of organizational networks. In this paper, we construct graph models and carry out an exploratory analysis with specific machine learning algorithms by synthesizing the political and economic information embedded in the organizational social network (OGN), based on real-world data.
Figure 2 illustrates the distribution of social organizations in China using the database constructed in this paper, revealing a nationwide organizational social network (OGN), where the dots represent the social organizations of each administrative unit and the brightness of each dot represents its degree centrality. The concentration of social organizations is consistent with the distribution of prominent economic zones, such as the Yangtze River Delta and the Pearl River Delta. There is an imaginary diagonal line across China, called the Hu Line. The Hu Line has vast demographic significance and also reflects the distribution of social organizations: the number of social organizations on the west side of the line is considerably lower than on the east side.
Figure 2. Distribution of social organizations in China.
The main contributions of this paper are as follows. Firstly, we used the open source data of the Ministry of Civil Affairs of China to construct a pioneering large-scale database of social organizations fusing public policy and geographic information, which is, to our knowledge, the first large-scale database of social organizations for research use. Secondly, we pioneered the application of graph structure to model the development of social organizations that integrate geographic information and public policy. Last, but not least, based on the graph attention mechanism, we propose a new graph attention network integrating textual information of social organizations, and apply it to the task of classifying graph networks based on geographic information and achieve a good result, laying a foundation for exploring the dynamic development model of regional social organizations.
The structure of this paper is organized as follows: Section 1 presents the introduction, with a brief history of social organizations in mainland China and main research ideas of the article. Section 2 introduces several research topics related to this research, including social networks, geographic information systems, natural language processing and graph neural network models. Section 3 focuses on the construction process of our brand new database and some descriptive statistical analysis of the collected data. In Section 4, we propose four basic types of organizational social networks based on the theory of homogeneous and heterogeneous graphs, and attributed network embedding based on BERT and CNN. In Section 5, we investigate the organizational social network using graph machine learning models to explore the relationship between the network and geographic regions to which they belong. In Section 6, we draw conclusions for the paper.

3. The Novel Database of Social Organizations in China

In China, information related to social organizations can be browsed online by the public through the National Social Organization Credit Information Public Platform (hereafter, the Platform; https://xxgs.chinanpo.mca.gov.cn/gsxt/newList, accessed on 17 May 2022), supervised by the Ministry of Civil Affairs. The Platform stores the basic information entries of every organization; Figure 8 shows an example.
Figure 8. A flow chart showing the database construction based on open data platform of the Ministry of Civil Affairs of China.
However, users can only search for information about one specific organization by entering keywords or the exact social credit code, and can only search for one organization at a time, which severely limits the amount of data that researchers can access for research purposes. Furthermore, users have to pass a human–machine verification operation before every single search. In China, where tens of thousands of social organizations are established every year and the Platform stores all of their basic information, if we try to manually perform the acquisition of all social organizations, millions of searches and downloads are required, which is a huge drain in terms of manpower, money, and time, thus limiting or even preventing the role of big data analysis of social organizations in China. Therefore, the use of web data scraping methods for bulk collection and collation of web data is a must.

3.1. Design and Implementation of Web Crawlers

In this paper, we wrote a web crawler with a data processing program in Python. The web crawler accesses web pages through the hypertext transfer protocol (HTTP). It starts from a set of seed URLs and, after establishing a successful connection with the seed URL server, parses the contents of the corresponding web pages to obtain all the URLs that can be reached from them [33]. It then searches each web page and downloads the target data, which, as shown in Figure 8, may be encoded in Hypertext Markup Language (HTML) or obtained through links to JS code. The number of pages visited and searched depends on the parameters set in the program prior to startup. New URLs are added to the queue to be crawled until the termination conditions are met, and the parsed results are then stored. The crawler we designed fully complies with the prescribed robots protocol and sets the request information for legal requests. The final step is to transform the data and integrate it into a structure suitable for analysis; the resulting data, in DataFrame format, are saved as CSV files to the cloud for subsequent use.
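The queue-driven loop described above can be sketched as follows. This is a minimal illustration rather than the crawler used in this study: the `crawl` function, the `Fetcher` type, and the in-memory `FAKE_SITE` are hypothetical names introduced here, and the page fetcher is injected so that the sketch runs without any network access.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple

Page = Tuple[dict, List[str]]   # (parsed record, outgoing links)
Fetcher = Callable[[str], Page]  # hypothetical: downloads and parses one URL


def crawl(seeds: Iterable[str], fetch: Fetcher, max_pages: int = 100) -> List[dict]:
    """Breadth-first crawl: pop a URL, parse it, enqueue links not seen before."""
    queue = deque()
    seen = set()
    for url in seeds:                      # start from the seed URLs
        if url not in seen:
            seen.add(url)
            queue.append(url)
    records: List[dict] = []
    # Termination conditions: empty queue or page budget exhausted.
    while queue and len(records) < max_pages:
        url = queue.popleft()
        record, links = fetch(url)
        records.append(record)
        for link in links:
            if link not in seen:           # never crawl the same page twice
                seen.add(link)
                queue.append(link)
    return records


# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE: Dict[str, Page] = {
    "seed": ({"name": "index"}, ["org/1", "org/2"]),
    "org/1": ({"name": "Org A"}, ["org/2"]),
    "org/2": ({"name": "Org B"}, ["seed"]),
}


def fake_fetch(url: str) -> Page:
    return FAKE_SITE[url]


records = crawl(["seed"], fake_fetch, max_pages=10)
```

Injecting the fetcher keeps the traversal logic separate from HTTP handling, so request headers, rate limiting, and robots rules can be handled in one place.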
As seen in Table 1, each web page contains the details of a specific social organization. After using regular expressions to obtain the body information, we can obtain the text information easily. However, difficulties in the design and writing of the web crawler program lie in how to crack the encryption of the web URLs (Figure 9), skipping the human–machine verification and searching process, and directly obtaining the web address of each social organization point-to-point.
Table 1. The detailed elements published in the platform can be used as the basic variables that constitute the database.
Figure 9. Composition rules and decryption of target URLs.
Through the collection and collation of the basic components of social organizations, which are shown in Table 1, data cleaning was carried out to establish a database of social organizations. As of January 2022, we have accessed a total of 1.09 million social organizations and their related information. We declare that the data obtained in this study are public and for research use only, without any commercial or malicious use. In addition, for legal reasons, we do not publish the exact technical details of how to break the encryption on the website.

3.2. Data Cleaning and Geographic Information Integration

The quality of data plays a key role in the results of data mining. Data cleaning usually includes dealing with missing values, redundant values, and noise. The text collected by web crawlers is mostly unstructured and noisy. By observation, we found that a certain percentage of the acquired data is noise that does not help in understanding the semantics of the text. We deduce that this is because the Platform of the Ministry of Civil Affairs only serves as a tool for integrating and publishing information, while the detailed data are filled in and uploaded by local civil affairs departments, so problems and errors may arise during the uploading process, such as meaningless symbols or tags, JS code, traditional or obsolete Chinese characters, line breaks, and inconsistent time formats. We therefore need to clean and standardize the obtained data and integrate the relevant geographical information of each social organization to build a complete, high-quality database for research use.
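As an illustration of the kind of cleaning described above, the following sketch strips leftover tags, collapses line breaks, and maps several date notations to one format. The function names and the list of date formats are assumptions made for this example, not the rules used to build the database; converting traditional characters to simplified ones is not covered.

```python
import re
import unicodedata
from datetime import datetime
from typing import Optional

TAG_RE = re.compile(r"<[^>]+>")   # leftover HTML tags
SPACE_RE = re.compile(r"\s+")     # line breaks, tabs, repeated spaces
# Assumed set of date notations; extend as new formats are observed.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y年%m月%d日")


def clean_text(raw: str) -> str:
    """Remove tags, normalize full-width characters, collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = unicodedata.normalize("NFKC", text)  # e.g. full-width letters to ASCII
    return SPACE_RE.sub(" ", text).strip()


def normalize_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` for unparseable dates, instead of raising, lets such records be counted and inspected later as missing values.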
After normalizing the temporal data, the study of the temporal dimension could be carried out. For example, Figure 10 uses the data of the registration time of the organizations. Among the established social organizations, 50,774 have been in existence for less than one year, 152,661 have been operating for one to three years, 155,881 have been operating between three and five years, the largest proportion of social organizations have been functioning for five to ten years, and even more than 240,000 have been running for more than 10 years.
Figure 10. Social organizations categorized by the time of operation.
Meanwhile, the geographical information of social organizations can be obtained by two different methods. The first is to take the registered address contained in the database and call a geocoding API to obtain its precise latitude and longitude coordinates; this, however, is relatively time-consuming and cannot be applied on a large scale. The second method, which we consider more efficient, is to categorize the locations directly according to the coding rules of the unified social credit code. As shown in Table 2, the unified social credit code, a unique 18-character national registration number, follows a standard pattern, which means that we can directly use the 6-digit area code embedded in it to locate social organizations down to the exact county-level administrative division in which they are registered.
Table 2. The composition rules of the Chinese social unified credit code.
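A minimal sketch of the area-code extraction is given below. The position layout used here (one department character, one category character, six area-code digits, nine organization-code characters, one check character) is our reading of the national coding standard and should be checked against Table 2; the function names are hypothetical, the sample code is made up, and the check character is not validated.

```python
from typing import Dict, NamedTuple


class CreditCodeParts(NamedTuple):
    department: str   # position 1: registration management department
    category: str     # position 2: organization category
    area_code: str    # positions 3-8: administrative division code
    org_code: str     # positions 9-17: organization identifier
    check_char: str   # position 18: check character (not verified here)


def split_credit_code(code: str) -> CreditCodeParts:
    """Split an 18-character unified social credit code into its fields."""
    code = code.strip().upper()
    if len(code) != 18 or not code[2:8].isdigit():
        raise ValueError(f"not a well-formed 18-character code: {code!r}")
    return CreditCodeParts(code[0], code[1], code[2:8], code[8:17], code[17])


def admin_levels(area_code: str) -> Dict[str, str]:
    """Administrative division codes are hierarchical: 2 + 2 + 2 digits."""
    return {
        "province": area_code[:2],
        "prefecture": area_code[:4],
        "county": area_code,
    }


parts = split_credit_code("51110108MJ0123456X")  # made-up example code
levels = admin_levels(parts.area_code)
```

Grouping organizations by the `county` key then yields one node set per county-level division without any geocoding call.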
After obtaining the basic geographical information of social organizations, we can explore and study social organizations in the spatial dimension. The map in Figure 11 shows how the number of newly established social organizations varies by province. The shade of the province corresponds to the magnitude of the indicator: the darker the shade, the higher the value.
Figure 11. The number of newly established social organizations between August 2020 and August 2021.

3.3. Text Data Analysis

Since most of the information in the database is Chinese text, obtaining and analyzing the features and semantic information of the Chinese text is of great significance to our study and determines the research direction. We first performed basic word segmentation on the names of social organizations and their business introductions in the database.
Table 3 shows the high-frequency words in each lexical category, giving a more intuitive sense of the development of social organizations in China. The first line of each cell is the English translation of the word, the second line in parentheses is the original Chinese text, and the third line in italics is the number of times the word appears. The shade of the cell corresponds to the magnitude of the indicator: the darker the shade, the higher the value. In the listed categories, vn refers to gerunds, n to nouns, s to prepositions, nl to noun idioms, and adj to adjectives.
Table 3. Top 10 popular keywords ranked by the occurrence frequency.
Table 3 reveals that the nouns in the results are all suffixes of certain words. The words “kindergarten” and “school” appearing after “association” reflect the current boom in China’s education market, which corresponds to the fact that private education in China, as an essential form of social forces, has developed rapidly and accumulated effective experience in the dissemination of knowledge. Note that the gerund “poverty alleviation” is in first place, which suggests that the Chinese government focuses on improving the living conditions of poor households and helping poor areas to develop production and change the face of poverty, while social organizations, as a third-party force, complement the synergistic effect of multi-subject governance. Similarly, we notice that the word “pension” is in second place and “nursing homes” is in sixth place, reflecting the serious aging situation in China and the active participation of social organizations in the pension business.

4. Graph Model in Organizational Social Networks

4.1. Overview of the Graph Structure

Data exist in a plethora of different forms and sizes, but most of them can be presented as two types: structured data and unstructured data (Figure 12).
Figure 12. Euclidean structure data and Non-Euclidean structure data.
Structured data, for example, temperature, names, dates, stock information, location, and pictures, comprise clearly defined data types with patterns in a standardized format that enable them to organize searchable information efficiently. Modern machine learning algorithms have achieved amazing performance in processing structured data (such as AlphaGo [34], ResNet [35], etc.).
Graphs, a typical form of unstructured data, are more flexible and variable than structured data, which at the same time makes it relatively more difficult to perform machine learning tasks on graph-structured data. However, given the wide application of graph models in human society, it is of great importance to study graphs and the related machine learning algorithms. One of the most vivid applications of graph-structured data is the virus transmission models constructed during the COVID-19 pandemic to characterize the transmission pattern of the virus across countries [36], which played a huge role in controlling the spread of the epidemic.
A graph $G = (V, E)$ consists of two sets, nodes $V$ (also called vertices) and edges $E$ (also called arcs), and can represent the entities and their relations in graph-structured data. An edge $e_{ij} = (u_i, u_j) \in E$ represents an edge pointing from $u_j$ to $u_i$, and the neighboring nodes of node $v$ are defined as $N(v) = \{u \in V \mid (v, u) \in E\}$. The adjacency matrix $A$ is a matrix of size $n \times n$, where $n$ is the number of nodes in the graph. If there exists an edge connecting nodes $u_i$ and $u_j$, then $A_{ij} = 1$; otherwise $A_{ij} = 0$. The nodes of a graph may have attributes, or features, collected in the node attribute matrix (also called the node feature matrix) $X \in \mathbb{R}^{n \times d}$, where $X_v \in \mathbb{R}^{d}$ is the attribute vector of node $v$. A graph may also have edge attributes: $X^e \in \mathbb{R}^{m \times c}$ is the edge attribute matrix, where $X^e_{v,u} \in \mathbb{R}^{c}$ is the attribute vector of edge $(v, u)$ and $c$ is the dimension of the attribute. The terms attribute and feature are used interchangeably.
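To make the notation concrete, the sketch below builds the adjacency matrix $A$ and the neighbor sets $N(v)$ from an edge list. It assumes an undirected graph with nodes numbered from 0, which is a simplification chosen for this example; `build_graph` is a hypothetical helper name.

```python
from typing import Dict, List, Set, Tuple


def build_graph(n: int, edges: List[Tuple[int, int]]):
    """Return the n x n adjacency matrix A and the neighbor sets N(v)."""
    A = [[0] * n for _ in range(n)]
    neighbors: Dict[int, Set[int]] = {v: set() for v in range(n)}
    for i, j in edges:
        A[i][j] = A[j][i] = 1        # undirected: the matrix is symmetric
        neighbors[i].add(j)
        neighbors[j].add(i)
    return A, neighbors


# A path graph 0 - 1 - 2 - 3 with a 2-dimensional feature vector per node.
A, N = build_graph(4, [(0, 1), (1, 2), (2, 3)])
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # X has shape n x d
```

For a directed graph following the edge convention above, only one of the two matrix entries would be set per edge.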

4.2. Homogeneous Networks of Organizations

Homogeneous networks are network structures composed of a single type of node and a single type of link.
As shown in Table 4, we introduce two types of homogeneous networks: competition and cooperation networks, and supply-chain networks. Each of these types is potentially useful in modeling social organizations and their relationships.
Table 4. Homogeneous networks.

4.3. Heterogeneous Networks of Organizations

Heterogeneous networks contain multiple types of nodes and links. Their advantage is the ability to represent and encode information and relationships from different perspectives. During the development of social organizations, different types of social entities are involved, for example, government, policymakers, policies, services, community members, and, of course, social organizations. Table 5 below provides two types of heterogeneous networks for modeling the relationships between social organizations and other social entities: policy networks and service networks.
Table 5. Heterogeneous networks.

4.4. Attributed Network Embedding with Text Information

In addition to the structural features of the social organization network, the text content in the database, such as name, business scope, registered capital, and so on, needs to be processed in order to obtain the basic information of the social organization before being input into the machine learning model (Figure 13).
Figure 13. Attributed social network embedding.
In this paper, the length of the text content is limited to $L$. If the text content is longer than $L$, the excess part is truncated; if it is shorter than $L$, placeholders are used to pad the text to length $L$. Let $x_j^i \in \mathbb{R}^{d}$ denote the word vector of the $j$th word in the text $p_i$, so that the text $p_i$ can be expressed as $X_{1:L}^i = [x_1^i; x_2^i; \ldots; x_L^i]$, where $X_{1:L}^i \in \mathbb{R}^{L \times d}$, $x_1^i$ denotes the word vector of the first word in the text $p_i$, $x_2^i$ that of the second word, and $x_L^i$ that of the $L$th word (Figure 14).
Figure 14. Attention mechanism: natural language processing.
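The truncate-or-pad rule for a fixed text length $L$ can be written in one line. In this small sketch, `fit_length` and the placeholder token `[PAD]` are assumptions introduced for illustration.

```python
from typing import List

PAD = "[PAD]"  # hypothetical placeholder token


def fit_length(tokens: List[str], L: int, pad: str = PAD) -> List[str]:
    """Truncate to L tokens, or right-pad with placeholders up to L tokens."""
    return tokens[:L] + [pad] * max(0, L - len(tokens))


short = fit_length(["social", "organization"], 4)
long_ = fit_length(["a", "b", "c", "d", "e"], 3)
```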

4.4.1. Multi-Headed Self-Attention Mechanism

In the next step, we adopt a multi-headed self-attention mechanism to update the word vectors in the text content of each social organization in the database. The multi-headed self-attention mechanism can explore the connections among word vectors from different perspectives, thus improving the expressiveness of the word vectors. Let $h$ denote the number of heads of the self-attention mechanism and $j$ the ordinal number of a head. The three input matrices of the $j$th head are the query matrix $Q_j \in \mathbb{R}^{L \times d_h}$, the key matrix $K_j \in \mathbb{R}^{L \times d_h}$, and the value matrix $V_j \in \mathbb{R}^{L \times d_h}$. Taking the embedded vector of text $p_i$,
$$X_{1:L}^i = [x_1^i; x_2^i; \ldots; x_L^i],$$
as an example, and writing $X$ for $X_{1:L}^i$ for simplicity, we have $K_j = X W_j^K$, $Q_j = X W_j^Q$, and $V_j = X W_j^V$, where $W_j^Q, W_j^K, W_j^V \in \mathbb{R}^{d \times d_h}$ are the parameter matrices corresponding to the query, key, and value matrices of the $j$th head, respectively. The output of the $j$th head of the self-attention mechanism is represented as
$$Z_j = \mathrm{Attention}(Q_j, K_j, V_j) = \mathrm{softmax}\left(\frac{Q_j K_j^{T}}{\sqrt{d}}\right) V_j$$
where $Z_j \in \mathbb{R}^{L \times d_h}$. The outputs of the $h$ heads are collected as $[Z_1; Z_2; \ldots; Z_h]$, where $Z_1$ is the output of the 1st head, $Z_2$ is the output of the 2nd head, and $Z_h$ is the output of the $h$th head. Then we have
$$Z = \mathrm{MultiHead}(X, X, X) = \mathrm{Concat}(Z_1, \ldots, Z_h)\, W_0$$
where $Z \in \mathbb{R}^{L \times d}$, $W_0 \in \mathbb{R}^{d \times d}$, and $W_0$ denotes the parameter matrix of the $h$-headed self-attention mechanism.
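The computation above can be traced with a dependency-free sketch. It uses randomly initialized weights instead of trained ones, assumes a per-head width of $d_h = d / h$, and scales the scores by $\sqrt{d}$; many implementations scale by $\sqrt{d_h}$ instead. All function names are hypothetical, and a real model would use a tensor library such as PyTorch.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m: Matrix) -> Matrix:
    out = []
    for row in m:
        peak = max(row)                      # subtract the max for stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(X: Matrix, h: int, rng: random.Random) -> Matrix:
    L, d = len(X), len(X[0])
    d_h = d // h                             # assumed per-head width d_h = d / h
    heads = []
    for _ in range(h):
        W_q, W_k, W_v = (random_matrix(d, d_h, rng) for _ in range(3))
        Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
        K_t = [list(col) for col in zip(*K)]
        scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, K_t)]
        heads.append(matmul(softmax_rows(scores), V))      # Z_j, shape L x d_h
    concat = [sum((head[r] for head in heads), []) for r in range(L)]  # L x d
    W_0 = random_matrix(d, d, rng)
    return matmul(concat, W_0)               # Z, shape L x d


rng = random.Random(0)
X = random_matrix(5, 8, rng)                 # L = 5 words, d = 8
Z = multi_head_self_attention(X, h=2, rng=rng)
```

The output keeps the input shape $L \times d$, which is what allows the updated word vectors to be passed on to the convolution step.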

4.4.2. Convolutional Neural Networks and Pooling Operations

Then, we use a CNN and pooling operations to obtain semantic information from the text contents in the database. We apply a convolution kernel $W \in \mathbb{R}^{k \times d}$ to the text vector $X_{e:e+k-1}^i$, where $X_{e:e+k-1}^i$ denotes the $e$th to the $(e+k-1)$th word vectors in the text content $p_i$ and $k$ denotes the receptive field size of the kernel. For all word vectors in $X_{e:e+k-1}^i$, the convolution operation can be expressed as
$$t_e = \sigma\left(W * X_{e:e+k-1}^i + b\right)$$
where $t_e$ is the feature obtained, $*$ denotes the convolution operation, $b \in \mathbb{R}$ is the bias term, $\sigma$ is the activation function, such as $\tanh$, and $e$ denotes the ordinal number, namely the $e$th word vector in the text $p_i$. Finally, by convolving all possible windows in the text vector $X$ using the convolution kernel $W$, the feature map of the text $p_i$ is obtained as $t = [t_1, t_2, \ldots, t_{L-k+1}]$ with $t \in \mathbb{R}^{L-k+1}$, where $t_1$ denotes the output feature of the first sliding window in the CNN, $t_2$ that of the second sliding window, and $t_{L-k+1}$ that of the $(L-k+1)$th sliding window. The feature map $t$ is then processed using maximum pooling with step size $L-k+1$, that is, over the whole map: $\hat{t} = \max\{t\}$. In this paper, we apply receptive field sizes of $k \in \{5, 6, 7\}$. After the maximum pooling operation, three feature vectors of length $d/3$ are obtained and concatenated to form the final text content feature $m_i \in \mathbb{R}^{d}$ of the text $p_i$, which is in turn concatenated with the graph-structured feature of the social organization network.
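The sliding-window convolution followed by max pooling can be sketched as below. This is an illustration under stated assumptions: each kernel is a $k \times d$ list of lists, $\tanh$ is used as the activation, the text has at least $k$ words, and `conv_max_pool` and `text_feature` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def conv_max_pool(X: Matrix, W: Matrix, b: float) -> Tuple[float, List[float]]:
    """Slide a k x d kernel over L x d word vectors, then max-pool the map.

    Requires L >= k so that at least one window exists.
    """
    L, k = len(X), len(W)
    features = []
    for e in range(L - k + 1):                       # one feature per window
        window = X[e:e + k]
        z = sum(w * x
                for w_row, x_row in zip(W, window)
                for w, x in zip(w_row, x_row)) + b
        features.append(math.tanh(z))
    return max(features), features                   # pooled value and full map


def text_feature(X: Matrix, kernels: List[Tuple[Matrix, float]]) -> List[float]:
    """Concatenate the pooled feature of every (W, b) kernel into one vector."""
    return [conv_max_pool(X, W, b)[0] for W, b in kernels]


X = [[1.0], [2.0], [3.0], [4.0]]                     # L = 4 words, d = 1
pooled, feature_map = conv_max_pool(X, [[1.0], [1.0]], 0.0)
m = text_feature(X, [([[1.0], [1.0]], 0.0), ([[1.0], [1.0], [1.0]], 0.0)])
```

With kernels of several widths, each width contributes its own pooled values, and their concatenation plays the role of the text content feature $m_i$.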

5. Exploratory Analysis of Organizational Geosocial Network with Graph Machine Learning

5.1. Experimental Deployment Environment

In this paper, we completed organizational social network data integration, analysis, and machine learning model construction based on Python version 3.8; feature representation of text for network embedding with BERT; machine learning model (RF, KNN, LR) construction and model performance evaluation with Sklearn. We used DGL [44] for network dataset partitioning, graph construction, and graph neural network (GAT, GCN, MPNN) model construction, and PyTorch for deep learning model training and prediction.
The experiments were conducted on the Google Colab platform with a Tesla P100 GPU. The pretrained BERT model has a dimension of 200 and was fine-tuned with a learning rate of $2 \times 10^{-5}$.

5.2. Dataset Construction for Classification Task

China’s administrative divisions can be roughly divided into three levels: provincial, municipal, and county levels. With our database, we are able to pinpoint the social organizations and build a county-level OGN. As a country with a vast territory, China has thousands of county-level administrative divisions, which yields a dataset of thousands of graphs and provides us with sufficient data for training and testing machine learning models. The network of social organizations in southern Jiangsu in Figure 15 below gives a vivid glimpse of the database.
Figure 15. The organizational social network structure in the southern Jiangsu Province. The area framed by the triangle in the left side of the figure is the Yangtze River Delta.
In this paper, we selected three representative regions in China (Table 6): the Beijing–Tianjin–Hebei region, known as the “capital economic circle” of China; the Yangtze River Delta, which has experienced rapid economic development in recent years; and the Pearl River Delta region, which was the first to implement reform and opening up in China. Using these three regions as the labels of the county-level OGNs that belong to them, we trained machine learning models for the geographic-area-affiliation prediction task on these networks. Different regions have different development patterns under the influence of various factors, such as economic, social, cultural, and geographical features, in which the development of social organizations is embedded. If graph machine learning can effectively classify them, it would be strong evidence that graph machine learning models can map the socioeconomic development patterns embedded in the network structure from an abstract dimension.
Table 6. Three regions used to construct the dataset.

5.3. Graph Attention Network Model Construction

In this paper, we use a graph attention network (GAT) layer for representation learning on the embedding vectors of the OGN structure, followed by a maximum-aggregation READOUT function that aggregates the node features of the network. The result is then passed through a linear neural network layer and a sigmoid activation function in turn to obtain the classification probability, yielding a GAT-based social organization–regional economic classification prediction model.
For training and prediction, we chose binary cross entropy as the loss function and Adam as the optimizer, and the parameters are initialized with Xavier initialization. The learning rate is $2 \times 10^{-5}$, the dropout coefficient is 0.2, the training batch size is 16, the maximum number of iterations is 100, the graph attention network has 2 layers, the hidden layer dimension is 256, and the coefficient of the $L_2$ regularization term is $1 \times 10^{-3}$.
Each GAT layer updates the node features as
$$h_i^{l+1} = \sigma\left(\sum_{j \in N_i} \alpha_{ij}\, h_j^{l}\, W^{l}\right) \tag{7}$$
where $h_i^{l}$ and $h_i^{l+1}$ are the vector representations of node $i$ at layers $l$ and $l+1$, respectively; $N_i$ is the set of neighbor nodes of node $i$; $h_j^{l}$ is the layer-$l$ representation of neighbor $j$; $\alpha_{ij}$ is the attention coefficient between nodes $i$ and $j$; $W^{l}$ is the parameter matrix of the $l$th layer; and $\sigma$ is the nonlinear activation function.
The calculation procedure of $\alpha_{ij}$ is shown in Equation (8).
$$\alpha_{ij} = \frac{\exp\left(e_{ij}\right)}{\sum_{k \in N_i} \exp\left(e_{ik}\right)} \tag{8}$$
where $e_{ij}$ is the edge vector representation of the connected nodes $i$ and $j$.
After the feature update of the nodes is completed by the GAT feature extraction layer, the node feature aggregation and model output are shown in Equations (9) and (10).
$$\mathrm{READOUT} = \max\left\{ h_i^{l} \mid n_i \in N \right\} \tag{9}$$
$$y_{\text{output}} = \sigma\left(\mathrm{Linear}\left(\mathrm{READOUT}\right)\right) \tag{10}$$
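The pipeline of attention-weighted aggregation, max READOUT, and sigmoid output can be traced on a toy graph with the sketch below. It rests on several assumptions: the scores $e_{ij}$ are taken as given scalars, the aggregation runs over neighbor features as in the standard GAT formulation, ReLU stands in for the unspecified activation in the layer, and all function names are hypothetical. The experiments themselves rely on DGL and PyTorch rather than on code of this kind.

```python
import math
from typing import Dict, List

Vector = List[float]


def attention_weights(scores: Dict[int, float]) -> Dict[int, float]:
    """Softmax over one node's neighbor scores e_ij, giving alpha_ij."""
    peak = max(scores.values())
    exps = {j: math.exp(e - peak) for j, e in scores.items()}
    total = sum(exps.values())
    return {j: v / total for j, v in exps.items()}


def gat_layer(h: Dict[int, Vector], e: Dict[int, Dict[int, float]],
              W: List[Vector]) -> Dict[int, Vector]:
    """h_i' = ReLU((sum_j alpha_ij * h_j) W) for every node i with neighbors."""
    d_in, d_out = len(W), len(W[0])
    out = {}
    for i, scores in e.items():
        alpha = attention_weights(scores)
        agg = [sum(alpha[j] * h[j][c] for j in alpha) for c in range(d_in)]
        proj = [sum(agg[c] * W[c][o] for c in range(d_in)) for o in range(d_out)]
        out[i] = [max(0.0, v) for v in proj]         # ReLU as the activation
    return out


def readout_max(h: Dict[int, Vector]) -> Vector:
    """Element-wise maximum over all node feature vectors."""
    return [max(col) for col in zip(*h.values())]


def predict(readout: Vector, w: Vector, b: float) -> float:
    """Linear layer followed by a sigmoid, giving a class probability."""
    z = sum(x * y for x, y in zip(readout, w)) + b
    return 1.0 / (1.0 + math.exp(-z))


h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
e = {0: {1: 0.0, 2: 0.0}, 1: {0: 0.0}, 2: {0: 0.0}}  # equal scores per neighbor
h1 = gat_layer(h0, e, W=[[1.0, 0.0], [0.0, 1.0]])    # identity weights
graph_vec = readout_max(h1)
prob = predict(graph_vec, w=[0.0, 0.0], b=0.0)
```

Because the READOUT is an element-wise maximum, the graph-level vector has a fixed size regardless of how many organizations a county-level network contains.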

5.4. Evaluation Metrics

In this paper, accuracy (Acc), F1-score, and precision are used as evaluation indicators, and the calculation of the indexes is shown in Equations (12) and (13).
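$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$F_1\text{-}\mathrm{Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$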
$TP$ denotes the true positives, the number of positive-class samples correctly predicted as positive; $TN$ denotes the true negatives, the number of negative-class samples correctly predicted as negative; $FP$ denotes the false positives, the number of negative-class samples incorrectly predicted as positive; and $FN$ denotes the false negatives, the number of positive-class samples incorrectly predicted as negative.
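The four indicators can be computed directly from the confusion counts, as in the sketch below for a binary labelling with 1 as the positive class. The `binary_metrics` name and the convention of returning 0.0 for an empty denominator are assumptions made for this example.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute recall, precision, accuracy, and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


scores = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

For a task with more than two regions, one common option is to compute these values per class and average them; whether that matches the averaging used for Figure 16 is not stated here.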

5.5. Comparison Experiments with Baseline Models

In the geographic-area-affiliation prediction task, we compared the GAT-based prediction model with three traditional machine learning models (RF, KNN, LR) and two graph neural network models (GCN, MPNN) as baselines. The F1-score and accuracy results of the six models are shown in Figure 16 below.
Figure 16. The experimental results.

5.5.1. Machine Learning Baseline Model

We chose random forest (RF), the k-nearest neighbors (KNN) algorithm, and logistic regression (LR) as traditional machine learning baseline models. RF builds an ensemble of decision trees from the training data using random feature selection: it repeatedly samples the training set with replacement and builds one decision tree for each sample. KNN is a nearest-neighbor algorithm for classification tasks [45]: it finds the K nearest neighbors of the sample to be classified in the feature space and then assigns the class according to the classes of those neighbors.
LR is a generalized linear regression analysis model [46] which constructs a linear hyperplane in the sample feature space by fitting the linear equation $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$, dividing the feature space into sub-regions so that the data of each category fall in the same sub-region, thus completing the classification task.
For the machine learning baseline models, the input network feature representations used for model training are generated by Node2vec [47].
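To make the KNN baseline concrete, here is a small sketch of majority-vote classification over fixed-length embedding vectors. It assumes the embeddings have already been computed (for example by Node2vec) and uses Euclidean distance; the vectors, the region labels "A" and "B", and the function name are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    # Pair each training vector's Euclidean distance to the query with its label.
    dists = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-dimensional embeddings for six county-level graphs.
train_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_vectors, train_labels, [0.2, 0.1])
near_b = knn_predict(train_vectors, train_labels, [5.5, 5.2])
```

Because the vote depends only on distances in the embedding space, the quality of such a baseline is bounded by how well the embedding step captures the network structure.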
The experimental results in Figure 16 show that the graph machine learning models achieve a performance improvement of at least 8% over the traditional machine learning models, mainly because traditional machine learning finds it difficult to learn complex semantic information. The RF model performs well in some simple classification tasks but is prone to overfitting on complex data structures. The LR model alleviates this problem to some extent, but its improvement is not significant because it is limited to a linear classification space. The KNN model achieves relatively good results, which indirectly reflects the importance of network structure.

5.5.2. Graph Neural Network Baseline Model

We use graph convolution network (GCN) [29] and message passing neural network (MPNN) [48] to build a baseline model of a graph neural network for the organizational social network classification task. In the graph neural network baseline model, the structures of the aggregation and classification prediction models are consistent with the GAT-based prediction model except that GCN and MPNN are used for network structure feature extraction, respectively.
GCN is a classical graph neural network whose core idea is to transfer the image processing method based on convolutional neural network (CNN) to the graph structure data and learn the relationship of the graph structure by aggregating the information around the nodes, and its update mechanism is shown in Equation (15).
$$h^{l+1} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} h^{l} W^{l} \right) \tag{15}$$
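where $\tilde{A} = A + I$ is the adjacency matrix with self-loops added and $\tilde{D} = D + I$ is the corresponding degree matrix; together they give the symmetrically normalized adjacency matrix $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$.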
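The GCN propagation rule can be sketched in pure Python as follows. This is an illustrative toy under stated assumptions, not the baseline's actual code: ReLU is assumed for $\sigma$, and the three-node path graph, features, and weights are hypothetical.

```python
import math

def normalized_adjacency(A):
    """Return D~^(-1/2) (A + I) D~^(-1/2) for an adjacency matrix A."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]  # diagonal of D~ = D + I
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(X, Y):
    """Plain dense matrix product of two nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(A_hat H W), with ReLU as an assumed activation."""
    Z = matmul(matmul(normalized_adjacency(A), H), W)
    return [[max(0.0, z) for z in row] for row in Z]

# Hypothetical path graph 0 - 1 - 2 with one scalar feature per node.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
H = [[1.0], [2.0], [3.0]]
W = [[1.0]]

A_hat = normalized_adjacency(A)
H_next = gcn_layer(A, H, W)
```

Note that the weights in `A_hat` depend only on node degrees, which is the fixed, structure-determined weighting that the attention coefficients of GAT replace with learned values.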
MPNN is a general computational framework for graph neural networks that learns features from graphs through message passing, node updating, and aggregation, and can be made invariant to graph isomorphism. The update mechanism is shown in Equation (16).
$$h_i^{l+1} = U^{l}\left( h_i^{l}, \sum_{j \in N_i} M^{l}\left( h_i^{l}, h_j^{l}, e_{ij} \right) \right) \tag{16}$$
where $U^{l}$ represents the update function and $M^{l}$ represents the message passing function.
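A minimal sketch of one message passing step in the form of Equation (16) is given below. The message and update functions are deliberately simple placeholders (an edge-weighted copy of the neighbor state and an element-wise sum); in an actual MPNN both would be learned. All names and toy values are hypothetical.

```python
def mpnn_layer(H, neighbors, edge_feats, message_fn, update_fn):
    """One step: h_i' = U(h_i, sum_j M(h_i, h_j, e_ij))."""
    new_H = []
    for i, nbrs in enumerate(neighbors):
        agg = [0.0] * len(H[i])  # assumes messages have the same size as h_i
        for j in nbrs:
            msg = message_fn(H[i], H[j], edge_feats[(i, j)])
            agg = [a + m for a, m in zip(agg, msg)]
        new_H.append(update_fn(H[i], agg))
    return new_H

def weighted_message(h_i, h_j, e_ij):
    """Placeholder M: scale the neighbor state by a scalar edge weight."""
    return [e_ij * x for x in h_j]

def additive_update(h_i, agg):
    """Placeholder U: add the aggregated message to the current state."""
    return [x + m for x, m in zip(h_i, agg)]

# Hypothetical path graph 0 - 1 - 2 with scalar states and edge weights.
H = [[1.0], [2.0], [3.0]]
neighbors = [[1], [0, 2], [1]]
edge_feats = {(0, 1): 1.0, (1, 0): 1.0, (1, 2): 0.5, (2, 1): 0.5}

H_next = mpnn_layer(H, neighbors, edge_feats, weighted_message, additive_update)
```

Swapping in different `message_fn` and `update_fn` choices recovers other graph neural networks as special cases, which is why MPNN is described as a general framework.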
The results in Figure 17 show that the accuracy of GAT is about 4% higher than that of the other graph machine learning models on the OGN dataset. The classical graph models are less effective than GAT because GCN and MPNN are updated with full-graph computation and their learned parameters are tied to the complexity of the graph structure, whereas GAT computes attention coefficients point by point without relying on the Laplacian matrix, which makes it more adaptive and better able to use attention mechanisms to improve model performance based on syntactic dependencies. Compared with GCN and MPNN, the GAT-based model uses adaptive attention coefficients to represent the weights of edges between nodes, so that the neural network can attend to the more influential neighboring nodes (namely, those with larger weights) when nodes are updated, and learn more meaningful spatial and semantic information.
Figure 17. The experimental results.
All six machine learning models achieve relatively good results on the prediction task, with the lowest reaching an accuracy of 60%, which indicates that both deep-learning-based and traditional-machine-learning-based methods are able to learn the connection between organizational social networks and geographic, economic, and cultural factors. We hope that subsequent studies can use interpretable machine learning to explore further the specific links between development patterns and geographic regions.

5.6. Ablation Experiment

In the field of artificial intelligence (AI), especially machine learning (ML), ablation refers to the removal of a component of an AI system [49]. An ablation study requires that the system exhibit graceful degradation: the system continues to function even if a component is lost or weakened. In the ablation experiment, we chose precision as the index to evaluate the performance of the model.
To further investigate the model's performance, two sets of ablation experiments were conducted with the proposed model on the OGN dataset: Experiment 1 replaced the word embedding layer with GloVe [50] at the same dimension of 200; Experiment 2 replaced the attention in the encoding layer with a multi-head attention mechanism. The results of the ablation experiments are shown in Figure 17. In Experiment 1, where the embedding layer used a GloVe model of the same dimension for word embedding, the accuracy differed significantly from that obtained with the pretrained BERT model. Compared with GloVe, the fine-tuned BERT is more effective in capturing the semantic information of the text, i.e., accurate semantic information extraction plays an important role in improving the performance of the model. In Experiment 2, adopting multi-head attention did not improve the results. The reason is that when the OGN structure contains multiple aspect targets, the attention mechanism may focus the socioeconomic embedding on the wrong aspect target, which further illustrates the importance of the information in the network structure as a whole for the classification task.

6. Conclusions

Society is a complex system whose development comes from the collision and convergence of different social entities. In this paper, we constructed a novel database of social organizations in China with related information, using the open data platform provided by the Ministry of Civil Affairs of the People's Republic of China. To our knowledge, it is one of the few social organization databases that have been applied to computational social science research. We believe that this database can offer increasingly valuable support for researchers exploring the development of Chinese social organizations and the macro changes of Chinese society in the future.
With the database, we explored the network structure composed of social organizations and related social entities. We proposed four types of social organization networks based on graph theory in order to give structure to the development patterns of social organizations in different regions, which are characterized by local policy, economic, and cultural factors. We constructed a graph-model-based organizational geosocial network (OGN) and used natural language processing (NLP) technology to embed textual information into the network, which enables it to fuse more dimensions of information and thus represent richer structural and semantic features of the complex network.
Using machine learning models, we conducted exploratory research on the relationship between the development patterns of organizational social networks and the geographic zones to which they belong. Our machine learning models achieved relatively good results on the training data, with an average accuracy of 70%. However, it is important to emphasize that our aim is not simply to pursue accuracy or to set a new state of the art (SOTA), but to explore the correlation between the graph-structured network data and the socioeconomic differences embedded in geographic space through the geographic-area-affiliation prediction task.
In future research, we hope to build larger and more complex graph network structures from a multidimensional perspective [51,52], and we also hope to highlight the role of interpretable machine learning [53] to decrease the black box nature of deep learning and help us gain an in-depth understanding of the causal relationship between the development of social organization and relevant policy, economic, and cultural factors.

Author Contributions

Conceptualization, Xinjie Zhao, Hao Wang, and Shiyun Wang; methodology, Xinjie Zhao and Hao Wang; validation, Xinjie Zhao, Shiyun Wang, and Hao Wang; formal analysis, Shiyun Wang and Hao Wang; data curation, Xinjie Zhao; writing—original draft preparation, Xinjie Zhao and Shiyun Wang; writing—review and editing, Xinjie Zhao, Hao Wang, and Shiyun Wang; visualization, Xinjie Zhao and Shiyun Wang. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Youth Project of the National Social Science Foundation of China “Research on Unbalanced and Insufficient Development of Social Organizations Based on Big Data Method” (20CSH089).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Since our database has just been developed and the amount of data is huge, there are still some instability factors in it. Therefore, after some more in-depth testing, we will look for a suitable opportunity to publish it, and if you are interested in the database, you can also contact us directly to collaborate on the research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yang, A.; Cheong, P.H. Building a Cross-Sectoral Interorganizational Network to Advance Nonprofits: NGO Incubators as Relationship Brokers in China. Nonprofit Volunt. Sect. Q. 2019, 48, 784–813.
  2. Ianni, M.; Masciari, E.; Sperli, G. A survey of Big Data dimensions vs Social Networks analysis. J. Intell. Inf. Syst. 2021, 57, 73–100.
  3. Wasserman, S.; Faust, K. Social Network Analysis: Methods and Applications; Cambridge University Press: Cambridge, MA, USA, 1994.
  4. Xiang, Y.; Fujimoto, K.; Schneider, J.; Jia, Y.; Zhi, D.; Tao, C. Network context matters: Graph convolutional network model over social networks improves the detection of unknown HIV infections among young men who have sex with men. J. Am. Med. Inform. Assoc. 2019, 26, 1263–1271.
  5. Peng, H.; Li, J.; Song, Y.; Yang, R.; Ranjan, R.; Yu, P.S.; He, L. Streaming Social Event Detection and Evolution Discovery in Heterogeneous Information Networks. ACM Trans. Knowl. Discov. Data 2021, 15, 1–33.
  6. Boyd, D.M.; Ellison, N.B. Social Network Sites: Definition, History, and Scholarship. J. Comput.-Mediat. Commun. 2007, 13, 210–230.
  7. Dhand, A.; White, C.C.; Johnson, C.; Xia, Z.; De Jager, P.L. A scalable online tool for quantitative social network assessment reveals potentially modifiable social environmental risks. Nat. Commun. 2018, 9, 3930.
  8. Bonacich, P. Some unique properties of eigenvector centrality. Soc. Netw. 2007, 29, 555–564.
  9. Gong, H.; Chen, C.; Bialostozky, E.; Lawson, C.T. A GPS/GIS method for travel mode detection in New York City. Comput. Environ. Urban Syst. 2012, 36, 131–139.
  10. Borgatti, S.; Mehra, A.; Brass, D.; Labianca, G. Network Analysis in the Social Sciences. Science 2009, 323, 892–895.
  11. Stephure, R.J.; Boon, S.D.; MacKinnon, S.L.; Deveau, V.L. Internet Initiated Relationships: Associations between Age and Involvement in Online Dating. J. Comput.-Mediat. Commun. 2009, 14, 658–681.
  12. Kane, G.C.; Alavi, M.; Labianca, G.; Borgatti, S.P. What’s Different About Social Media Networks? A Framework and Research Agenda. MIS Q. 2014, 38, 275–304.
  13. Espín-Noboa, L.; Wagner, C.; Strohmaier, M.; Karimi, F. Inequality and inequity in network-based ranking and recommendation algorithms. Sci. Rep. 2022, 12, 2012.
  14. Shiau, W.L.; Dwivedi, Y.K.; Yang, H.S. Co-citation and cluster analyses of extant literature on social networks. Int. J. Inf. Manag. 2017, 37, 390–399.
  15. Sanchez-Lozano, J.M.; Teruel-Solano, J.; Soto-Elvira, P.L.; Socorro Garcia-Cascales, M. Geographical Information Systems (GIS) and Multi-Criteria Decision Making (MCDM) methods for the evaluation of solar farms locations: Case study in south-eastern Spain. Renew. Sustain. Energy Rev. 2013, 24, 544–556.
  16. Quartulli, M.; Olaizola, I. A review of EO image information mining. Int. J. Photogramm. Remote Sens. 2013, 75, 11–28.
  17. Liao, S.H.; Chu, P.H.; Hsiao, P.Y. Data mining techniques and applications—A decade review from 2000 to 2011. Expert Syst. Appl. 2012, 39, 11303–11311.
  18. Tenney, I.; Das, D.; Pavlick, E. BERT rediscovers the classical NLP pipeline. arXiv 2019, arXiv:1905.05950.
  19. Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014; pp. 55–60.
  20. Piktus, A.; Petroni, F.; Karpukhin, V.; Okhonko, D.; Broscheit, S.; Izacard, G.; Lewis, P.; Oğuz, B.; Grave, E.; Yih, W.t.; et al. The Web Is Your Oyster—Knowledge-Intensive NLP against a Very Large Web Corpus. arXiv 2021, arXiv:2112.09924.
  21. Ribeiro, M.T.; Singh, S.; Guestrin, C. Model-agnostic interpretability of machine learning. arXiv 2016, arXiv:1606.05386.
  22. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting Pre-Trained Models for Chinese Natural Language Processing. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 657–668.
  23. Liu, Q.; Zheng, Z.; Zheng, J.; Chen, Q.; Liu, G.; Chen, S.; Chu, B.; Zhu, H.; Akinwunmi, B.; Huang, J.; et al. Health Communication Through News Media During the Early Stage of the COVID-19 Outbreak in China: Digital Topic Modeling Approach. J. Med. Internet Res. 2020, 22, e19118.
  24. Chami, I.; Abu-El-Haija, S.; Perozzi, B.; Ré, C.; Murphy, K. Machine Learning on Graphs: A Model and Comprehensive Taxonomy. arXiv 2021, arXiv:2005.03675.
  25. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24.
  26. Zhou, J.; Cui, G.; Hu, S.; Zhang, Z.; Yang, C.; Liu, Z.; Wang, L.; Li, C.; Sun, M. Graph neural networks: A review of methods and applications. AI Open 2020, 1, 57–81.
  27. Bandinelli, N.; Bianchini, M.; Scarselli, F. Learning long-term dependencies using layered graph neural networks. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; pp. 1–8.
  28. Zhang, S.; Tong, H.; Xu, J.; Maciejewski, R. Graph convolutional networks: A comprehensive review. Comput. Soc. Netw. 2019, 6, 11.
  29. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2018, arXiv:1710.10903.
  30. Song, W.; Xiao, Z.; Wang, Y.; Charlin, L.; Zhang, M.; Tang, J. Session-based Social Recommendation via Dynamic Graph Attention Networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, Melbourne, Australia, 11–15 February 2019; pp. 555–563.
  31. Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, S.H.; Savarese, S. Social-BiGAT: Multimodal Trajectory Forecasting using Bicycle-GAN and Graph Attention Networks. arXiv 2019, arXiv:1907.03395.
  32. Piao, J.; Zhang, G.; Xu, F.; Chen, Z.; Li, Y. Predicting Customer Value with Social Relationships via Motif-based Graph Attention Networks. In Proceedings of the Web Conference, Ljubljana, Slovenia, 19–23 April 2021; pp. 3146–3157.
  33. Baek, H.; Ahn, J.; Choi, Y. Helpfulness of Online Consumer Reviews: Readers’ Objectives and Review Cues. Int. J. Electron. Commer. 2012, 17, 99–126.
  34. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the game of Go without human knowledge. Nature 2017, 550, 354–359.
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
  36. Luo, C.; Ma, Y.; Jiang, P.; Zhang, T.; Yin, F. The construction and visualization of the transmission networks for COVID-19: A potential solution for contact tracing and assessments of epidemics. Sci. Rep. 2021, 11, 8605.
  37. Bengtsson, M.; Kock, S. “Coopetition” in Business Networks—To Cooperate and Compete Simultaneously. Ind. Mark. Manag. 2000, 29, 411–426.
  38. Klimas, P. Organizational culture and coopetition: An exploratory study of the features, models and role in the Polish Aviation Industry. Ind. Mark. Manag. 2016, 53, 91–102.
  39. Roininen, S.; Westerberg, M. Network Structure and Networking capability among new ventures: Tools for competitive advantage or a waste of resources? (summary). Front. Entrep. Res. 2008, 28, 3.
  40. Krajewski, L.J.; Malhotra, M.K.; Ritzman, L.P. Operations Management. Processes and Supply Chains, 11th ed.; Pearson: Boston, MA, USA, 2016.
  41. Kim, J. Networks, Network Governance, and Networked Networks. Int. Rev. Public Adm. 2006, 11, 19–34.
  42. Leicht, A.; Heiss, J.; Byun, W.J. Issues and Trends in Education for Sustainable Development; Education on the Move; UNESCO Publishing: Paris, France, 2018; p. 271.
  43. South, J.; Button, D.; Quick, A.; Bagnall, A.M.; Trigwell, J.; Woodward, J.; Coan, S.; Southby, K. Complexity and Community Context: Learning from the Evaluation Design of a National Community Empowerment Programme. Int. J. Environ. Res. Public Health 2020, 17, 91.
  44. Wang, M.; Yang, S.; Sun, Y.; Gao, J. Discovering urban mobility patterns with PageRank based traffic modeling and prediction. Phys.-Stat. Mech. Its Appl. 2017, 485, 23–34.
  45. Zhang, S.; Li, X.; Zong, M.; Zhu, X.; Wang, R. Efficient kNN classification with different numbers of nearest neighbors. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 1774–1785.
  46. Hosmer, D.W., Jr.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 398.
  47. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
  48. Gilmer, J.; Schoenholz, S.S.; Riley, P.F.; Vinyals, O.; Dahl, G.E. Neural Message Passing for Quantum Chemistry. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 11 August 2017; pp. 1263–1272.
  49. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497.
  50. Pennington, J.; Socher, R.; Manning, C. Glove: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
  51. Aliakbary, S.; Motallebi, S.; Rashidian, S.; Habibi, J.; Movaghar, A. Distance metric learning for complex networks: Towards size-independent comparison of network structures. Chaos Interdiscip. J. Nonlinear Sci. 2015, 25, 023111.
  52. Zhang, D.; Yin, J.; Zhu, X.; Zhang, C. Network Representation Learning: A Survey. IEEE Trans. Big Data 2020, 6, 3–28.
  53. Belle, V.; Papantonis, I. Principles and Practice of Explainable Machine Learning. Front. Big Data 2021, 4, 688969.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
