A Thematic Similarity Network Approach for Analysis of Places Using Volunteered Geographic Information

The research presented in this paper proposes a thematic network approach to explore rich relationships between places. We connect places in networks through their thematic similarities by applying topic modeling to the textual volunteered geographic information (VGI) pertaining to the places. The network approach enhances previous research involving place clustering using geo-textual information, which often simplifies relationships between places to be either in-cluster or out-of-cluster. To demonstrate our approach, we use a case study in Manhattan (New York) that compares networks constructed from three different geo-textual data sources: TripAdvisor attraction reviews, TripAdvisor restaurant reviews, and Twitter data. The results showcase how the thematic similarity network approach enables us to conduct clustering analysis as well as node-to-node and node-to-cluster analysis, which is fruitful for understanding how places are connected through individuals' experiences. Furthermore, by enriching the networks with geodemographic information as node attributes, we discovered that some low-income communities in Manhattan have distinctive restaurant cultures. Even though geolocated tweets are not always related to the place they are posted from, our case study demonstrates that topic modeling is an efficient method to filter out place-irrelevant tweets, thereby refining how places can be studied.


Introduction
Place and space are among the most fundamental concepts of geography [1,2]. Space is often considered to be points of locations represented by coordinates. Place, on the other hand, is an "experience-based dynamic construct" [3]. Compared to space, the concept of place emphasizes the meaning-making process, which is complex, dynamic, and individualistic [4]. In this paper, we study how different places are semantically similar, based on textual topics that appear in Volunteered Geographic Information (VGI) in these places. Our goal is to create a thematic similarity network that connects places of similar topics regardless of their physical distance. By applying a network clustering algorithm, we find groups of semantically similar places and analyze their topics and spatial autocorrelation qualitatively and quantitatively.
Analyzing and theorizing about places from a variety of perspectives has a long history in geographical analysis, from social area analysis [5,6,7,8] to more recent geodemographic analysis that derives collective behaviors and characteristics from demographic data of geographic regions [9,10]. In the past decade, studies on place have taken advantage of a new data source, that of VGI [11]. VGI comes in many different forms, from maps created by users to text from Wikipedia that has a geographic component (e.g., place names) [12]. The research presented in this paper uses the textual form of VGI, specifically crowdsourced reviews from TripAdvisor and geolocated Twitter data (see Section 3). Such platforms provide large amounts of textual data with either explicit or implicit geographic information contributed by users [13,14]. Leveraging the unstructured geographical information found in such texts allows us to comprehend the complexity of places at scale [15,16,17].
Generally speaking, the most common method used by prior research to analyze geo-textual data is to structure the unstructured texts into themes (i.e., topics) through topic models (e.g., [18]). This is then often followed by applying clustering algorithms (e.g., k-means) to expose the underlying patterns of sentiments, experiences, or activities captured in the text (e.g., [12,19,20,21,22,23,24]). When places are clustered for further analysis, however, those in the same cluster are assumed to carry similar characteristics, and relationships between places are reduced to being either in-cluster or out-of-cluster. We would argue that the connectedness and relationships of places are, in reality, more complex. For instance, when connecting places in a network, places at the edge of their own clusters still have relatively weak out-of-cluster connections. The network approach presented in this paper recognizes them as places with both in-cluster and out-of-cluster connections. Thus, this approach allows us not only to perform network-level place clustering, but also to discover unique places based on their positions in the networks. To highlight this, we use a case study to demonstrate the approach in the context of Manhattan, New York. In the remainder of the paper, we first discuss related research pertaining to topic modeling and thematic similarity network analysis (Section 2). This is followed by introducing the data (Section 3) and the methodology (Section 4) that we applied to our case study. The results are then presented in Section 5 and, finally, the implications and conclusions of our research are presented in Section 6.

Related Work
The approach proposed in this paper involves two major steps: topic modeling using geo-textual data and thematic similarity network analysis. In what follows, we review related work with respect to these steps. For step one, topic modeling is a widely used language model for understanding large amounts of unstructured textual data. Previous research has adopted approaches ranging from generic topic modeling algorithms (e.g., [18]) to ones that incorporate geographic information (e.g., [25,26]). Using these geographical topic models, studies have been able to relate the topics derived from travel blogs and Flickr tags to specific geographical units, such as states [25,26]. Other work has analyzed the relationships between topics and countries from online news articles and blogs [27], or generated activity patterns from check-in data [23]. In addition, topic modeling has been used to recommend travel destinations using travel blogs [28,29], create location-related question-answering systems using Twitter and blogs [30], and predict the future distribution of topics [31]. Despite research on innovating geographic topic models, many researchers have chosen to use generic topic models (such as Latent Dirichlet Allocation, LDA [18]) to analyze geo-textual data. In such instances, the geolocational information in the text does not contribute to the results of the topic models but is used only after applying the model. For example, Adams et al. [20] explored the temporal themes related to places using travel blogs and applied a similarity score between places based on the topics. Jenkins et al. [32] compared themes of geographic areas from Twitter and Wikipedia, whereas Xu et al. [33] modeled the topics of restaurants in a city.
In addition, previous work has defined "place" at various levels of aggregation (countries, cities, neighborhoods, buildings), but such aggregations artificially split geographical areas. For example, at the neighborhood level, Cranshaw et al. [34] detected boundaries of neighborhoods using check-in data and Foursquare venue descriptions in order to show that crowdsourced and official neighborhood definitions differed. At a more aggregated level, Preoţiuc-Pietro [35] viewed cities as collections of Foursquare venues and clustered cities hierarchically using venue descriptions to show that similarities between cities can be captured through crowdsourced data. Since Foursquare venue data also provides venue categories, Noulas et al. [36] clustered both geographic areas (in terms of 625 × 625 square meters) and users based on their visit history in order to enhance recommendation systems for different users. In another work, Crooks et al. [12] proposed a multi-level (individual building, street, and neighborhood level) approach for discovering social functions through mining place topics. Clustering at different aggregations also allows us to find places where people share similar experiences [20,37] along with places with similar functions [12]. When applying clustering, however, the relationship between places becomes binary (similar or not similar), and thus the relationships between places in different clusters are often ignored.
Turning to work pertaining to thematic similarity network analysis (i.e., the second step of our approach), previous studies have analyzed place similarities but rarely used a network approach in the context of geo-textual data. For example, Janowicz et al. [38] used semantic similarity for developing geographic information retrieval applications, while Yan et al. [39] trained word embeddings for place types, which were then used for exploring similarity and relatedness between point-of-interest types. In terms of using a similarity network-based approach, Quercini and Samet [40] created graph-based similarity measures to address the spatial relatedness of a concept to a location using Wikipedia articles. In another work, Hu et al. [41] placed cities into networks based on their semantic relatedness (i.e., the number of news articles in which the two cities co-occur). Similarity networks, however, have seen much wider application in domains outside of geography, ranging from the analysis of protein sequences and structures [42] and genome data [43] to that of hospital patients [44,45].
Methodologically, such studies have demonstrated that one of the most important analyses for similarity networks is clustering (i.e., community detection), which captures groups of nodes that are most similar to each other. Although place clustering does not require connecting places in networks, one of the advantages of conducting network-based clustering is that it enables downstream node-level analysis in relation to clusters. For example, Valavanis et al. [42] discovered structural similarities of protein folds and classes in the downstream analysis after carrying out network clustering. Similarly, in the case study presented throughout the rest of this paper, we apply clustering to the similarity network and also identify special nodes (i.e., places) based on their positions in the network.

Data Collection
To apply our methodology (see Section 4) and showcase how a network approach can be used to study place, data needed to be collected. In this study we used two geo-textual data sources: TripAdvisor and Twitter. The rationale for choosing these data sources is two-fold. First, they are open sources that have been widely used by previous research (as discussed in Section 2). Therefore, future studies could use these data sources to extend the research presented in this paper. Secondly, most prior research using geo-textual data chooses only one data source. In the research presented in this paper, we aim to provide a thematic similarity network approach that can compare multiple geo-textual data sources. The TripAdvisor data was collected in September 2019 and included reviews for attractions and restaurants in New York City. For each attraction and restaurant, the address, neighborhood, and reviews were retrieved. An example of this is shown in Figure 1, in which we highlight content that was used in our analysis (i.e., locational information and reviews). With respect to Twitter, we were only interested in tweets that had a precise geographical coordinate. The Twitter data was collected from 1 January 2015 to 31 December 2015 within a bounding box with latitude ranging from 40.481867 to 40.9325 and longitude ranging from −74.2721 to −73.626201, which includes New York City.

Data Pre-Processing and Aggregation
As the locational information from TripAdvisor attractions and TripAdvisor restaurants was in the format of addresses, the first step of data pre-processing was to geocode the TripAdvisor addresses using the Google Maps Geocoding application programming interface (API) [46]. After all the data was geocoded, the second step was to define what a place is. As was discussed in Section 2, previous studies have treated places at various levels of aggregation. For this case study, a place is a census tract as defined by the United States Census [47]. Although the aggregation may result in the modifiable areal unit problem, in which statistical summaries of the aggregated area are influenced by the shape and size of the area [48], the reason for using census tracts in this research was to incorporate census demographic data into the analysis. The rationale for this was to be able to explore the connection between the patterns found in crowdsourced reviews (or tweets) and the underlying geodemographics of an area. Furthermore, by using census data, our case study not only demonstrates a novel approach to studying places through thematic similarity networks, but also allows others to apply the approach in different areas within the United States or in other countries where census data is available (e.g., the United Kingdom). It should be noted, however, that if readers are not interested in comparing the geo-textual data to census data, our approach could be applied to other levels of aggregation such as grids, road segments, etc. (see [12]).
After aggregating data to the census tract level, the final step was to select only tracts that appeared in all three datasets to make them comparable. It was found that most of the attractions within New York City from TripAdvisor were located in Manhattan, and thus for the analysis in this paper, only the tracts in Manhattan were analyzed. Table 1 shows the number of restaurants/attractions, reviews/tweets, and census tracts after restricting the study to Manhattan. Furthermore, the texts (i.e., reviews and tweets) were filtered to retain only those in English. Although special characters such as "@", emojis, and stop words may contribute to the meaning of the text [49], we do not consider them in this work, as is common in text pre-processing (e.g., [50]). Next, all words were converted into lower case so that differently capitalized forms of the same word are treated as one. Finally, in order to reduce the size of the vocabulary (e.g., by merging word forms with the same meaning such as walking, walk, walked), a stemmer (i.e., the Porter Stemmer) was applied and only the stems of the words were retained [51].

Methodology
In this section, we first introduce how the topic model was trained on each dataset and how the thematic similarity network was constructed based on the similarities between the derived topics (Section 4.1). Figure 2 illustrates the workflow from data input to thematic similarity network output, including the data collection and preprocessing described in Section 3. After the thematic similarity networks were constructed, we carried out network community detection on these networks (Section 4.2) and node-level network analysis (Section 4.3). Finally, the algorithm and implementation are presented in Section 4.4.
Figure 2. Workflow from data input to the construction of the thematic similarity network and analysis (i.e., community detection and unique nodes discovery).

Topic Modeling and Thematic Similarity Networks
As noted in Section 2, the first major step towards gaining meaning from large collections of text is topic modeling. Topic modeling is a type of statistical model that discovers latent topics in documents. For example, when writing a review, the reviewer might not be thinking of specific topics, but topic modeling assumes that there are underlying topics, which are known as latent topics. One of the most widely used topic models is Latent Dirichlet Allocation (LDA) [18], which is a generative probabilistic model that treats each document as a distribution of latent topics and each topic as a distribution of words. A document can be a news article, a review, or a social media post. In this research, each document comprises all the texts for one census tract from one dataset. For example, for the TripAdvisor restaurant dataset, tract "36061000700" had 57 restaurants, which had a total of 1951 reviews. The 1951 reviews were treated as one document in the LDA model.
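The aggregation of texts into one document per tract can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `build_tract_documents`, the whitespace tokenization, the second tract ID, and the toy reviews are all assumptions made for the example.

```python
from collections import defaultdict


def build_tract_documents(records):
    """Merge (tract_id, text) pairs into one token list per census tract."""
    docs = defaultdict(list)
    for tract_id, text in records:
        # Lowercasing and whitespace splitting stand in for the full
        # pre-processing pipeline (stop-word removal, stemming, etc.).
        docs[tract_id].extend(text.lower().split())
    return dict(docs)


# Hypothetical toy records: the reviews and the second tract ID are invented.
records = [
    ("36061000700", "Great pizza near the park"),
    ("36061000700", "Friendly staff"),
    ("36061000800", "Lovely museum"),
]
tract_docs = build_tract_documents(records)
```

Each value of `tract_docs` would then play the role of a single document passed to the topic model.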
One LDA model was trained for each of the three datasets. To train LDA models, we need to tune the hyper-parameters using the datasets to obtain models that have the best performance. An LDA model can predict the words in each topic and the topics in each document. Therefore, the hyper-parameters that need to be tuned include the a-priori probability vector α that maps each topic to a probability, and the a-priori probability vector β that maps each word to a probability. Moreover, the total number of topics (K) needs to be determined from the data as well. The model was implemented using the gensim LDA library, in which alpha and beta were set to "auto" so that both hyper-parameters can be learned from the data [52]. To ensure better model interpretability, we detected the bi-grams in the texts, after which the corpus became a mixture of uni-grams and bi-grams. In addition, the vocabulary of the corpus was truncated, since otherwise many of the most frequent words that bear less concrete meaning in the context of our tasks, such as "I" and "is", would become the top words of the topics. However, the threshold of the word frequency (top_n) to be truncated is a parameter of the model that needs to be tuned during training as well. To tune K and top_n, experiments on each dataset were carried out. Since there is no ground truth for topic models, the common model evaluation metrics are perplexity and coherence [53]. However, the experiments showed that optimizing coherence or perplexity scores in all three datasets did not generate models with better topic interpretability (code: https://bitbucket.org/xiaoyiyuan/network_vgi/src/master/script/topic_model_results.ipynb?viewer=nbviewer). As a result, we adopted interpretability and manual observation as the criteria for evaluating topic model quality.
For instance, when K is too high, the model produces topics that have many common words, meaning that new topics are not contributing to generating new knowledge about the data. When top_n is too high, more documents have only one or two topics, which makes it hard to comprehend topic meanings. Table 2 shows the parameters that produce the most interpretable model for each dataset. Each of the experiments and the results with various values for the hyper-parameters can be found in the shared source code. When the LDA model is trained, each document is represented by a distribution of topics. The square root of the Jensen-Shannon divergence is a commonly used metric for measuring the distance between discrete distributions. The Jensen-Shannon distance between two (topic) probability distributions P and Q is defined formally as:
$$\mathrm{JSdist}(P, Q) = \sqrt{\mathrm{JSD}(P \,\|\, Q)}, \qquad \mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D(P \,\|\, M) + \frac{1}{2} D(Q \,\|\, M),$$
where $M = \frac{1}{2}(P + Q)$ and $D$ denotes the Kullback-Leibler divergence. The Jensen-Shannon divergence is symmetrical (i.e., JSD(P||Q) = JSD(Q||P)). As a result, the edges of the similarity networks are not directed but weighted, and the weights are the similarity scores. Using the same method, three similarity networks were constructed from the three datasets. Since there is always a similarity score between each pair of tracts, there is always an edge between them in the networks as well, making the networks fully connected.
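To make the construction concrete, the sketch below computes the Jensen-Shannon distance between per-tract topic distributions and builds a fully connected, weighted edge list. It is an illustration under stated assumptions: the conversion of distance to similarity as `1 - distance` (with a base-2 logarithm, so that the distance lies in [0, 1]) is an assumption of this example and is not confirmed by the text, and the function names and toy distributions are hypothetical.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding errors


def similarity_network(topic_dists):
    """Fully connected undirected network: one weighted edge per pair of places."""
    edges = {}
    for a, b in combinations(sorted(topic_dists), 2):
        # ASSUMPTION: similarity is taken as 1 - distance for illustration only.
        edges[(a, b)] = 1.0 - js_distance(topic_dists[a], topic_dists[b])
    return edges


# Hypothetical topic distributions for three places over three topics.
topic_dists = {
    "tract_a": [0.7, 0.2, 0.1],
    "tract_b": [0.6, 0.3, 0.1],
    "tract_c": [0.0, 0.1, 0.9],
}
network = similarity_network(topic_dists)
```

In this toy input, `tract_a` and `tract_b` receive a heavier edge than either does with `tract_c`, while all three pairs remain connected.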

Community Detection
Discovering communities in a fully connected network requires network sparsification [54]. Network sparsification reduces the number of edges while preserving structural and statistical properties of interest. Thus, the principle is to reduce the network size by retaining only the important edges; in a similarity network, the edge weight (i.e., the similarity score) is an indicator of edge importance [55]. The cut-off value for edge weights used to sparsify the networks depends on the data and the clustering algorithm. In this work, we used the Girvan-Newman algorithm [56] to conduct network community detection (i.e., clustering) on the sparsified networks. The Girvan-Newman algorithm is a hierarchical method for detecting communities in complex networks, which can also be applied to weighted networks [56]. At each step, the algorithm removes the edges with the highest edge betweenness centrality (i.e., the edges with the highest number of shortest paths passing through them) and recalculates edge betweenness after each iteration of removal. By removing these high-betweenness edges, the communities are separated from each other and, consequently, the underlying community structure of the network is revealed. In a weighted network, the Girvan-Newman algorithm calculates the edge betweenness as described above, ignoring the edge weights. It then divides the edge betweenness by the weight of the corresponding edge. As with unweighted networks, the algorithm then removes the edges with the highest resulting score. The result of the algorithm is a dendrogram: the steps are repeated until no edges can be removed or the most ideal communities have been achieved (i.e., the highest modularity of clusters). Therefore, we need to cross-validate two parameters: the edge weight cut-off threshold for sparsification and the number of iterations of the Girvan-Newman algorithm.
The process from sparsifying a fully connected network to community detection is shown for a stylized network in Figure 3. A common evaluation metric for network community detection quality is modularity, which is a measure of the strength of division of a network into communities [57]. High modularity means that there are dense connections within each detected community and sparse connections between nodes in different communities. However, using modularity as the sole metric for the Girvan-Newman algorithm is not sufficient for our task. Figure 4 shows that when modularity is at its highest value, the network has become too scattered, with a large number of one-node communities, which hinders the downstream analysis of interactions between communities and relationships between nodes and communities. To mitigate this problem, we selected the set of parameters that has the highest modularity without generating many one-node communities. For instance, the highest modularity for the TripAdvisor attraction network (Figure 4a, left, brown) is 0.8 at iteration 8, but it produced more than 50 one-node communities (Figure 4a, right, brown). However, choosing a model with a slightly lower modularity value, e.g., 0.7 (Figure 4a, left, purple), significantly reduced the number of one-node communities (Figure 4a, right, purple). Using the same heuristic, the best parameters for each network are presented in Table 3.
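The two steps described above (threshold-based sparsification followed by one Girvan-Newman split with betweenness divided by edge weight) can be sketched in plain Python. This is an illustrative reading of the description, not the authors' implementation: the function names, the toy weights, and the cut-off of 0.5 are assumptions, and positive edge weights are assumed so that the division is defined.

```python
from collections import deque


def sparsify(edges, cutoff):
    """Keep only edges whose similarity weight is at least `cutoff`."""
    return {e: w for e, w in edges.items() if w >= cutoff}


def adjacency(nodes, edges):
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def components(adj):
    """Connected components found by breadth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in comp:
                    comp.add(w)
                    queue.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def edge_betweenness(adj):
    """Unweighted edge betweenness (Brandes-style accumulation).

    Every unordered node pair is counted from both endpoints, so values are
    twice the usual ones; the ranking of edges is unaffected.
    """
    bet = {}
    for s in adj:
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(order, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                key = frozenset((v, w))
                bet[key] = bet.get(key, 0.0) + c
                delta[v] += c
    return bet


def girvan_newman_split(nodes, edges):
    """Remove the highest-scoring edge repeatedly until a component splits off."""
    edges = dict(edges)
    start = len(components(adjacency(nodes, edges)))
    while edges:
        bet = edge_betweenness(adjacency(nodes, edges))
        # Score = unweighted betweenness divided by the (positive) edge weight.
        worst = max(edges, key=lambda e: bet.get(frozenset(e), 0.0) / edges[e])
        del edges[worst]
        comps = components(adjacency(nodes, edges))
        if len(comps) > start:
            return comps
    return components(adjacency(nodes, edges))


# Hypothetical similarity network: two tight triangles, a weaker bridge (c-d),
# and a very weak edge (a-f) that the cut-off removes.
nodes = ["a", "b", "c", "d", "e", "f"]
full = {("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9,
        ("d", "e"): 0.9, ("d", "f"): 0.9, ("e", "f"): 0.9,
        ("c", "d"): 0.6, ("a", "f"): 0.1}
sparse = sparsify(full, cutoff=0.5)
communities = girvan_newman_split(nodes, sparse)
```

Calling `girvan_newman_split` repeatedly on each resulting community would yield successive levels of the dendrogram; in practice a library implementation may be preferable for networks of realistic size.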
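The selection heuristic can be illustrated with a short sketch that scores candidate partitions by modularity and discards those with too many one-node communities. The unweighted modularity formula used here, the function names, and the `max_singletons` cap are assumptions of this example; the text does not state a numeric limit on one-node communities.

```python
def modularity(edges, communities):
    """Newman modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    member = {n: i for i, comm in enumerate(communities) for n in comm}
    inside = [0] * len(communities)   # edges with both ends in the community
    degree = [0] * len(communities)   # total degree of the community's nodes
    for u, v in edges:
        degree[member[u]] += 1
        degree[member[v]] += 1
        if member[u] == member[v]:
            inside[member[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(inside, degree))


def pick_partition(edges, candidates, max_singletons):
    """Highest-modularity candidate with at most `max_singletons` one-node communities."""
    allowed = [c for c in candidates
               if sum(len(comm) == 1 for comm in c) <= max_singletons]
    if not allowed:
        return None
    return max(allowed, key=lambda c: modularity(edges, c))


# Hypothetical graph: two triangles joined by one bridge edge.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
two_triangles = [{"a", "b", "c"}, {"d", "e", "f"}]
over_split = [{"a"}, {"b"}, {"c"}, {"d", "e", "f"}]
best = pick_partition(edges, [two_triangles, over_split], max_singletons=1)
```

Here the candidates would come from different combinations of cut-off threshold and Girvan-Newman iteration count.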

Discovering Unique Nodes
Since the edge weights represent similarity, a node (i.e., a place) with a low degree centrality value bears low similarity to other nodes in the network. It is straightforward, therefore, to discover the ends of the uniqueness spectrum: at one end, the highly unique places are nodes with a low degree centrality, while at the other end, high degree centrality nodes are the least unique places. Other than these two extremes, there are nodes that act as bridges between communities that carry their unique characteristics, i.e., the community boundary nodes. The concept of community boundaries is rarely applied in similarity network analysis but is often used in social network analysis. In social networks, the community boundaries are the people who convey outside information to those in the community with no out-of-community connections [58,59]. We adopted and modified the definition of community boundaries from the social network analysis by Guerra et al. [58]. Under this modified definition, a node v is a boundary node of community C_i for community C_j when:
1. node v ∈ C_i has at least one edge connecting to community C_j, and
2. all the neighbors of v have no edge connecting to community C_j.
The community boundaries are identified in the three sparsified networks instead of the original fully connected ones because identifying boundary nodes relies on the community structures detected in the sparsified network. Finally, the last step in Figure 3 illustrates a stylized network with community boundaries. For community C_i, node b and node c both have edges connecting to the outside community C_j. Node b qualifies as a boundary node for community C_i because it has a neighbor (node a) with no edges connecting to community C_j. Since node c does not have a neighbor meeting this requirement, node c does not count as a boundary node of community C_i to C_j. In social networks, the definition of boundary nodes guarantees that node b is the only node that brings outside information to node a. In the context of place similarity networks, a node with no connection to the outside community (i.e., node a) has characteristics that are unique to its own community, and the boundary nodes (e.g., node b) are the ones that connect the uniqueness of the communities.
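The boundary-node test can be sketched as below. The sketch follows the worked example with nodes a, b, and c: a node of C_i qualifies when it has an edge into C_j and at least one same-community neighbor with no edge into C_j. This reading of the second condition, the function name, and the toy topology (which is not taken from Figure 3) are assumptions of the example.

```python
def boundary_nodes(adj, member, ci, cj):
    """Nodes of community `ci` that act as boundary nodes toward community `cj`."""

    def touches(node, community):
        # True if `node` has at least one neighbor inside `community`.
        return any(member[n] == community for n in adj[node])

    result = set()
    for v in adj:
        if member[v] != ci or not touches(v, cj):
            continue  # condition 1: v must belong to ci and link into cj
        # Condition 2 (assumed reading): some neighbor of v inside ci has no
        # edge into cj, so v is that neighbor's only route to cj.
        if any(member[n] == ci and not touches(n, cj) for n in adj[v]):
            result.add(v)
    return result


# Hypothetical stylized network: a, b, c form community "Ci"; x, y form "Cj".
adj = {
    "a": {"b"},
    "b": {"a", "c", "x"},
    "c": {"b", "y"},
    "x": {"b", "y"},
    "y": {"c", "x"},
}
member = {"a": "Ci", "b": "Ci", "c": "Ci", "x": "Cj", "y": "Cj"}
found = boundary_nodes(adj, member, "Ci", "Cj")
```

In this toy network node b is returned, whereas node c is not, because its only same-community neighbor (b) already connects to C_j.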

Algorithm and Implementation
To summarize what has been discussed above with respect to our methodology, the pseudo-code for it is described in Algorithm 1. The algorithm takes one data source (e.g., Twitter corpus or TripAdvisor) as an input. The loop in Lines 1-6 constructs topic models from the input texts and Lines 7-11 calculate similarities between each pair of documents in the input. Line 12 constructs a thematic similarity network (which was explained in Section 4.1). Finally, Lines 12-14 detect communities in the network (Section 4.2) and the loop in Lines 15-22 discovers boundary nodes (Section 4.3). The complete Python code and information pertaining to the software versions are available at https://bitbucket.org/xiaoyiyuan/network_vgi.

Results
Building upon our methodology, in this section we present the results of the thematic similarity network analysis of places in Manhattan, New York. Specifically, Section 5.1.1 maps and visualizes the major network communities and their topics, and Section 5.1.2 presents the results of testing these communities for spatial autocorrelation using Moran's I. In Section 5.2, we enrich the network nodes with geodemographic data and, finally, in Section 5.3 we identify and analyze nodes by their degrees of uniqueness.

Major Network Communities and Their Topics
In this section, we evaluate the clusters found using the community detection approach described in Section 4.2. For this purpose, we first visualize and qualitatively analyze the communities. Then, to determine whether the communities that we found are significantly clustered in space, we test each community for spatial autocorrelation using Moran's I. The sizes of the communities are shown in Figure 5. Even though we lowered the number of one-node communities in the community detection (as discussed in Section 4.2), the distributions of the community sizes still appear to be long-tailed. For the sake of clear visualization, only the major communities (i.e., communities with five or more nodes) from the community detection are presented for each network. The topic modeling results for all communities are available at https://bitbucket.org/xiaoyiyuan/network_vgi/src/master/script/topic_model_results.ipynb?viewer=nbviewer.
In the TripAdvisor attraction network (Figure 6a), some communities have tracts which are visually close to each other and their topics reflect the main characteristics of the attractions in these geographic regions. For instance, topics of Community 12 are about "broadway", "theater", "concert", and "venu" (venue), and most of these tracts are clustered around the Broadway theater district. Furthermore, as shown on the map (Figure 6a), not all communities are clustered perfectly in one geographic region; in some, only part of the tracts of a community are in the same region. For example, even though most of the tracts of Community 12 are located in Midtown, the rest of the tracts are scattered around the Downtown area. The reason is that the topics of Community 12 include not only Broadway but also, more broadly, "concert", "game", and "venu" (venue) (Figure 7a).
A similar example is Community 13, which has a dominant topic with keywords "harlem", "church", and "theater". Most of the tracts of Community 13 are located in Harlem, and the tracts that are not in Harlem have church-related attractions such as The New York Mosque in Midtown Manhattan and Mariners' Temple Baptist Church in Downtown Manhattan. Such findings indicate that the network communities are reflections of people's similar experiences of various attractions, as they are mined from a large amount of crowdsourced reviews from individuals.
For the restaurant thematic similarity network, communities show higher level of spatial proximity (Figures 6b and 7b). One of the most prominent of such is that of Community 3, which is shown in Figure 6b clustered in Downtown Manhattan. Primary topics of community 3 (Figure 7b) are "pub" and "eatali" (Eataly food market), and "financi district" (financial district). In addition, tracts of Community 8 have close geographic proximity as well. This is evident from the map of Figure 7b, where most of the tracts in Community 8 are located between Downtown and Midtown Manhattan. Community 8 has Topics 14 and 32 featuring word stems such as "greenwich_villag" (Greenwich Village), "west_villag" (West Village), "japanes" (Japanese), and "bagel". Similarly, Community 17 has Topic 36 that can be interpreted as Central Park related even though it has the common Topic 32 that shows up across many other communities (e.g., Community 2,8,10,14,and 17). Interestingly, communities from TripAdvisor attractions network have counterparts from the restaurants network communities. For example, Community 13 from attraction network and Community 44 from restaurant network are about Harlem, which can be seen from the geographic clusters on the map and their dominant topics. A similar finding is for the theater district, which appears in both Community 12 of attraction network and Community 23 of the restaurant network. This suggests that people's dining experiences can be intertwined with the characteristics of the surrounding attractions or vice versa.  Figure 6. Network visualization of all communities from the thematic similarity networks using Gephi [60] Fruchterman-Reingold layout with major communities highlighted. Only the major communities are shown on the map for the sake of clarity. Major communities in Network visualization and mapping for each network are colored the same and thus the legend applies for both.  
Turning to the results of the Twitter thematic similarity network, one of the most noticeable patterns is that Community 10 (i.e., the blue community in Figure 6c) dominates this network, unlike the TripAdvisor attraction and restaurant networks, where community sizes are more evenly distributed. Furthermore, communities from the Twitter dataset have more diverse topics than those of the restaurant and attraction networks from TripAdvisor (Figure 7c). This could partly be due to the distinction between Twitter and TripAdvisor as data sources for studying places: TripAdvisor reviews are directly about places, but this is not necessarily the case for Twitter, which is a more generic social media platform where users can contribute on a whole variety of topics [13]. Therefore, some topics from Twitter (e.g., Topic 54 and Topic 60 in Figure 7c) are not about places but relate to news or to social and political discussions. This indicates that although geolocated tweets can be used to study people's perceptions and experiences of places, they need to be used with the awareness that the texts may need to be filtered. The results in Figure 7c show that it is viable to use topic modeling to filter out the non-related topics (e.g., Topic 54, which relates to police reporting and New Jersey). The reason could be that tweets pertaining to social discussions often use different vocabularies than texts directly about places. Since topic modeling is a bag-of-words approach, the model is sensitive to vocabularies and thus can "tell them apart" as separate topics.
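To make the filtering idea concrete, the following is a minimal sketch, not the procedure used in this study. It assumes that each tract already has a topic distribution from a fitted topic model and that the place-irrelevant topics have been identified by inspecting their keywords. The function name `drop_topics`, the tract labels, Topic 12, and all weights are hypothetical; Topics 54 and 60 are used only because they are the examples named above.

```python
def drop_topics(doc_topics, irrelevant):
    """Remove flagged topics from one topic distribution and renormalize.

    doc_topics maps topic id -> weight; irrelevant is a set of topic ids
    judged (by reading their keywords) not to be about places.
    """
    kept = {t: w for t, w in doc_topics.items() if t not in irrelevant}
    total = sum(kept.values())
    if total == 0:
        # Nothing place-related remains for this tract.
        return {}
    return {t: w / total for t, w in kept.items()}


# Hypothetical per-tract topic weights (illustrative values only).
tract_topics = {
    "tract_A": {12: 0.6, 54: 0.4},
    "tract_B": {54: 0.7, 60: 0.3},
}
irrelevant = {54, 60}  # topics flagged as news or political discussion

filtered = {
    tract: drop_topics(dist, irrelevant)
    for tract, dist in tract_topics.items()
}
```

In this sketch a tract whose tweets are entirely about flagged topics ends up with an empty distribution, so it would carry no thematic signal into a similarity network.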

Quantitative Test for Spatial Autocorrelation of Communities
In the previous section (Section 5.1.1), we discussed network communities and whether their tracts are geographically proximate. In this section, we present the results of Moran's I, a measure of spatial autocorrelation, to quantify the geographical proximity of communities. Results for the major communities are given in Table 4, and results for all communities are found in Table A1 in the appendix. Moran's I is often applied to continuous data. To measure each community's level of autocorrelation, we therefore encoded the tracts of that community as 1 and all other tracts as 0. We defined neighborhoods using Queen's contiguity, i.e., any polygons (i.e., tracts) that share a point-length border are neighbors.
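The binary encoding and Queen's contiguity described above can be illustrated with a small sketch. It is an assumption-laden toy, not the computation behind Table 4: tracts are replaced by cells of a regular grid, the weights are binary and not row-standardized (the text does not state which scheme was used), and no significance test is performed. The function names `queen_neighbors` and `morans_i` are hypothetical. The statistic computed is the standard global Moran's I, (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all weights.

```python
def queen_neighbors(rows, cols):
    """Map each grid cell to the cells sharing an edge or a corner with it."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            nbrs[(r, c)] = [
                (r + dr, c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
                and 0 <= r + dr < rows
                and 0 <= c + dc < cols
            ]
    return nbrs


def morans_i(values, nbrs):
    """Global Moran's I with binary contiguity weights (w_ij = 1 if neighbors)."""
    n = len(values)
    mean = sum(values.values()) / n
    dev = {k: v - mean for k, v in values.items()}
    s0 = sum(len(js) for js in nbrs.values())  # sum of all weights
    cross = sum(dev[i] * dev[j] for i, js in nbrs.items() for j in js)
    denom = sum(d * d for d in dev.values())  # zero if all values are equal
    return (n / s0) * cross / denom


nbrs = queen_neighbors(4, 4)

# Community membership encoded as 1, all other cells as 0.
clustered = {(r, c): 1 if c < 2 else 0 for r in range(4) for c in range(4)}
checkerboard = {(r, c): (r + c) % 2 for r in range(4) for c in range(4)}

i_clustered = morans_i(clustered, nbrs)        # positive: members are adjacent
i_checkerboard = morans_i(checkerboard, nbrs)  # negative: members are dispersed
```

A community whose member cells sit side by side yields a positive value, while an alternating pattern yields a negative one, which is the sense in which higher Moran's I indicates geographic proximity of a community's tracts.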
Among the three networks, only the TripAdvisor restaurant network has statistically significant spatial autocorrelation for all of its major communities (Table 4). Communities 3, 8, 10, 23, and 44 of the restaurant network, which are generally about pubs, West Village, vegetarian Indian, theater, and Harlem, respectively, show spatial autocorrelation at the 99% confidence level. These results inform us that these geographic clusters in Manhattan have their own restaurant culture, manifested by the topical summaries from TripAdvisor reviews. Other major communities from the restaurant network (i.e., Communities 2, 13, 14, and 17) also show statistical significance (at a confidence level of over 95%) in their spatial autocorrelation results. These communities have relatively lower Moran's I scores, which can be observed on the map as they are more spread out over Manhattan. For the attraction network, Communities 4, 6, 12, and 13 have spatial autocorrelation and have topics pertaining to zoo/High Line/kid, cathedral/gallery, Broadway/seat/venue, and Harlem/Broadway, which can be summarized from the word clouds in Figure 7a. Attractions from these communities are thus more of a mixture of many different topics, which also explains why Moran's I for the attraction network is relatively lower than that for the restaurant network. Similarly, for the Twitter thematic network, all communities other than the biggest one (i.e., Community 10) have low Moran's I values, which suggests that the topics discussed on Twitter are less correlated with their geographic locations.

Enriching Network Communities with Geodemographic Attributes
One advantage of examining places as Census tracts is that Census demographic data can be combined with the results from the derived networks. If we were to use demographic data from the US Census such as the American Community Survey (ACS), a tract would be described by multiple variables (e.g., total population, mean household income, educational attainment, and marital status). An alternative is that proposed by Spielman and Singleton [61], who clustered the ACS data to generate a single-variable description known as a geodemographic classification (e.g., "Hispanic and Kids" and "Wealthy Nuclear Families"). We enriched our node attributes with this geodemographic classification at the tract level to explore the relationship between the network communities and their demographics.
Based on the results from Spielman and Singleton [61] shown in Table 5, most of the tracts that we study are classified as wealthy; the column "Percentage" shows the percentage of tracts in each demographic classification. We use these percentages as a baseline against which to compare the percentage of each demographic classification within network communities. For instance, if more than 22.90% of a community's tracts are of Demographic Type 8 (the share of that type in Table 5), we define that community as having a high proportion of low-income residents. Using this baseline, we discovered that even though Manhattan tracts are mostly rich, the majority of low-income tracts reside in a few communities of the restaurant network, namely Communities 5, 8, and 44. Judging from their topics, two of these communities are in Chinatown and Harlem (presented in Figure 8). This suggests that these low-income areas have a distinctive restaurant culture. When applying the same method to the communities from the TripAdvisor attraction and Twitter thematic networks, we do not find communities with high percentages of particular demographic types. This implies that discussions on Twitter and about TripAdvisor attractions in Manhattan do not have patterns that correspond to the characteristics of its residents.
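The baseline comparison can be sketched as follows. This is an illustrative reading of the rule stated above, not the code used for the analysis: a community is flagged for a demographic type when the share of its tracts with that type exceeds the study-area share. Only the 22.90% baseline for Demographic Type 8 comes from the text; the function name `overrepresented_types`, the example community, and the other type labels (1 and 2) are hypothetical.

```python
from collections import Counter


def overrepresented_types(community_types, baseline_pct):
    """Return {type: share in %} for types whose share exceeds the baseline.

    community_types lists the demographic type of every tract in one
    community; baseline_pct maps type -> study-area percentage.
    Types without a baseline entry are never flagged.
    """
    counts = Counter(community_types)
    n = len(community_types)
    flagged = {}
    for demo_type, count in counts.items():
        pct = 100.0 * count / n
        if pct > baseline_pct.get(demo_type, float("inf")):
            flagged[demo_type] = pct
    return flagged


baseline_pct = {8: 22.90}  # Type 8 baseline quoted in the text; others omitted

# Hypothetical community of ten tracts, six of which are Type 8.
community = [8, 8, 8, 1, 1, 2, 8, 1, 8, 8]

flagged = overrepresented_types(community, baseline_pct)
```

Here six of ten tracts are Type 8, so the community's 60% share exceeds the 22.90% baseline and the community would be described as having a high proportion of low-income residents.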

Identifying Nodes with Degrees of Uniqueness
Besides network-level analysis, node-level analysis allows us to identify important or interesting places. As discussed in Section 4.3, nodes with the lowest weighted degree centrality are the most unique ones, and vice versa. In this section, we first examine the central nodes (the 5 nodes with the highest weighted degree centrality) and the outliers (the 5 nodes with the lowest weighted degree centrality), and then explore the community boundary nodes in the networks (i.e., nodes that act as bridges between communities that carry their own unique characteristics). Table 6 shows the topics for the central and outlier nodes in the network of TripAdvisor restaurants. Judging by the number of topics for the two kinds of nodes, central nodes tend to have more diverse topics than the outliers. The topics of the outlier nodes show that these are the tracts with attractions that are unique to Manhattan, including "Skylin" (Skyline), "rockefel_center" (Rockefeller Center), "time_squar" (Times Square), "grand_central" (Grand Central Station), "Statu" (Statue of Liberty), and "elli" (Ellis Island). Since they are unique and distinctive, the outlier nodes have very low weighted degree centralities. This pattern of low-degree-centrality nodes with distinctive topics also applies to the thematic similarity networks from Twitter and TripAdvisor restaurants. In contrast, the central nodes have a combination of common topics that enables them to have connections with many other nodes.
Figure 9 shows the positions of community boundary nodes in the three networks, which, as expected, are often at the edges of the communities. Identifying nodes in these special positions helps us identify places with hybrid characteristics from both communities. Community boundary nodes connect the unique characteristics of different communities. To demonstrate how the topics of community boundary nodes include topics from the communities they connect, we show two examples from the TripAdvisor networks.
First, Figure 10a shows an example of the topics and characteristics of a community boundary node and how the boundary node has the topics from two communities (i.e., Communities 14 and 12). However, when two communities have overlapping topics, the topics of the community boundary node are not always the perfect combination of topics from the two communities as shown in Figure 10b.
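The two node-level measures used in this section can be sketched on a toy network. This is a hedged illustration rather than the study's implementation: weighted degree centrality is taken to be the sum of a node's edge weights, and a boundary node is assumed here to be any node with at least one edge to a node in a different community, which may be looser than the definition in Section 4.3. The function names, node labels, edge weights, and community labels are all hypothetical.

```python
def weighted_degree(edges):
    """Sum the weights of all edges incident to each node."""
    totals = {}
    for u, v, w in edges:
        totals[u] = totals.get(u, 0.0) + w
        totals[v] = totals.get(v, 0.0) + w
    return totals


def boundary_nodes(edges, community):
    """Nodes with at least one edge into a different community (assumed rule)."""
    found = set()
    for u, v, _ in edges:
        if community[u] != community[v]:
            found.update((u, v))
    return found


# Toy thematic similarity network: (node, node, similarity weight).
edges = [
    ("a", "b", 0.9), ("a", "c", 0.8), ("b", "c", 0.7),
    ("c", "d", 0.4), ("d", "e", 0.6), ("e", "f", 0.1),
]
community = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}

degree = weighted_degree(edges)
ranked = sorted(degree, key=degree.get)  # ascending weighted degree
outlier = ranked[0]   # lowest weighted degree: the most unique place
central = ranked[-1]  # highest weighted degree: shares common topics
bridges = boundary_nodes(edges, community)
```

In this toy network the weakly connected node "f" is the outlier, node "c" is the most central, and nodes "c" and "d" are the boundary nodes because the only cross-community edge runs between them.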

Conclusions
A place is a geographic location imbued with individuals' experiences and meaning-making processes. As a place has different meanings to different individuals, it is difficult to summarize the collective sense of place (e.g., [4,13,16]). As more and more textual data describing people's experiences of places become available online via social media and similar platforms, it is possible to study places by crowdsourcing online textual data from place reviews and geolocated social media. Furthermore, through the use of network science and natural language processing of crowdsourced experiences collected from individuals, we can also study the connections between places, which reveals not only richer information about the place itself but also how places relate to each other.
To this end, in this paper we used TripAdvisor reviews and geolocated Twitter data to understand the characteristics and the connectedness of places. The complex relationships between places were modeled via thematic (i.e., topical) similarity networks. While previous research using crowdsourced data allowed for the clustering of places (e.g., [32]), it did not explore the connections between places and clusters. The network approach developed in this paper enables us to perform clustering (i.e., network community detection) and to discover network- and node-level patterns of places. More specifically, the contributions of this research are as follows. First, similar to previous research on place clustering, the network approach allows places to be clustered using a network community detection algorithm. The case study in Section 5.1 shows that communities detected from the thematic similarity network of restaurant reviews tend to have higher Moran's I values (i.e., greater geographical proximity) than those of attraction reviews and tweets. This suggests that certain geographical clusters correspond to certain restaurant cultures in Manhattan (e.g., the bar and pub area in Downtown). Second, by using the network approach (as discussed in Section 4), we can discover places of interest by exploiting the positions of the places in the network. In the case study shown in the paper, the places of interest are places with different levels of uniqueness (Section 5.3). Third, from the TripAdvisor restaurant network results, we found that even though most of the study area in Manhattan is high income, the low-income communities have a distinctive restaurant culture that the high-income areas do not have (Section 5.2). Fourth, by comparing different datasets (i.e., TripAdvisor restaurant and attraction reviews and Twitter), we show the implications of using such data for studying places (Section 5.1).
TripAdvisor review data represent experiences and perceptions that people have directly about places, whereas geolocated Twitter data do not necessarily reflect places. However, as our case study shows, topic modeling can overcome this challenge by filtering out place-irrelevant topics, and it does not require the time-consuming hand-labeling process that supervised learning does (as shown in Section 5.1.1).
While this study has shown how places are connected through individuals' experiences and adds to the growing area of geographic data science [62], there are several limitations to this research. First, although the clustering algorithm used in this study (i.e., the Girvan-Newman algorithm [56]) produces deterministic results, it might not be an ideal choice when the networks become larger, say when expanding this research to larger areas. Therefore, researchers who expand this research might want to consider less computationally expensive algorithms such as the Louvain community detection algorithm [63], which uses modularity optimization and has been shown to be scalable [64]. Second, in this study we define places as census tracts, and further analysis is required to test whether the results still stand when places are defined otherwise (e.g., zip codes, city blocks, etc.). Nonetheless, using census tracts in this research had the advantage of allowing textual VGI data to be combined with Census data for further analysis (as shown in Section 5.2). Turning to future work, other centrality measures (e.g., betweenness centrality, eigenvector centrality) could be explored to discover places of interest beyond those identified by degree centrality and boundary nodes. Additionally, topic models could be trained on the merged data from the three datasets so that the topics are comparable across networks. As this work does not take tokens such as emojis into consideration, future work could explore topic models that incorporate them (e.g., [65,66]). The network can also be constructed differently, with edges representing similarities measured by methods other than topic similarity, such as similarity based on users' visit history, which has often been used in collaborative recommender systems [67].
Even with these limitations and potential areas of further work, the research presented in this paper demonstrates a novel approach to studying places and their connections by combining textual VGI with network analysis.
Author Contributions: All authors contributed to the design of the methodology and the writing of the paper. Xiaoyi Yuan collected the data and carried out the analysis. All the authors contributed to the preparation of the manuscript and approved the final version to be published. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.