Heri-Graphs: A Workflow of Creating Datasets for Multi-modal Machine Learning on Graphs of Heritage Values and Attributes with Social Media

Values (why to conserve) and Attributes (what to conserve) are essential concepts of cultural heritage. Recent studies have been using social media to map values and attributes conveyed by the public to cultural heritage. However, it is rare to connect heterogeneous modalities of images, texts, geo-locations, timestamps, and social network structures to mine the semantic and structural characteristics therein. This study presents a methodological workflow for constructing such multi-modal datasets using posts and images on Flickr for graph-based machine learning (ML) tasks concerning heritage values and attributes. After data pre-processing using state-of-the-art ML models, the multi-modal information of visual contents and textual semantics is modelled as node features and labels, while the social relationships and spatiotemporal contexts of the posts are modelled as links in Multi-Graphs. The workflow is tested in three cities containing UNESCO World Heritage properties - Amsterdam, Suzhou, and Venice - and yielded datasets with high consistency for semi-supervised learning tasks. The entire process is formally described with mathematical notations, ready to be applied in provisional tasks both as ML problems with technical relevance and as urban/heritage study questions with societal interest. This study could also benefit the understanding and mapping of heritage values and attributes for future research in global cases, aiming at inclusive heritage management practices.


Introduction
In the context of the UNESCO World Heritage (WH) Convention, "values" (why to conserve) and "attributes" (what to conserve) have been used extensively to detail the cultural significance of heritage [UNESCO, 1972, 2008]. Meanwhile, researchers have provided categories and taxonomies for heritage values and attributes, respectively [Pereira Roders, 2007, Tarrafa Silva and Pereira Roders, 2010, Veldpaus, 2015]. Both concepts are essential for understanding the significance and meaning of cultural and natural heritage, and for making more comprehensive management plans [Veldpaus, 2015]. However, heritage values and attributes are not only used to define the significance of Outstanding Universal Value (OUV) in the particular context of the World Heritage List (WHL), but all kinds of significance, ranging

3. Multi-graphs have been constructed to reflect the temporal, spatial, and social relationships among the data samples of collected User-Generated Content (UGC), ready to be further tested on several provisional tasks with both scientific relevance for Graph-based Multi-modal Machine Learning and Social Network research, and societal interest for Urban Studies, Urban Data Science, and Heritage Studies.
2 Materials and Methods

Selection of Case Studies
Without loss of generality, this research selected three cities in Europe and China that are related to UNESCO WH and HUL as case studies: Amsterdam (AMS), the Netherlands; Suzhou (SUZ), China; and Venice (VEN), Italy. All three cities are either themselves entirely or partially inscribed in the WHL, such as Venice and its Lagoon 2 and the Seventeenth-Century Canal Ring Area of Amsterdam inside the Singelgracht 3 , or contain WHL properties in multiple spots of the city, such as the Classical Gardens of Suzhou 4 , showcasing different spatial typologies of cultural heritage in relation to its urban context [Pereira Roders, 2010, Valese et al., 2020].
As shown in Table 1, the three cases have very different scales, yet they all strongly demonstrate the relationship between urban fabric and water systems. Interestingly, Amsterdam and Suzhou have been respectively referred to as "the Venice of the North/East" by the media and the public. Moreover, the concept of OUV introduced in Section 1 reveals the core heritage values of WH properties. The OUV of a property is justified with ten selection criteria, where criteria (i)-(vi) reflect various cultural values, and criteria (vii)-(x) natural ones [Jokilehto, 2007, UNESCO, 2008, Bai et al., 2021b], as explained in Appendix Table A11. The three selected cases cover a broad variety of the cultural heritage OUV selection criteria, implying the representativeness of the datasets constructed in this study.

2.2 Data Collection and Pre-processing
Numerous studies have collected, annotated, and distributed open-source datasets from the social media platform Flickr due to its convenient Application Programming Interface (API), including MirFlickr-1M [Huiskes and Lew, 2008], NUS-WIDE [Chua et al., 2009], Flickr [Tang and Liu, 2009], ImageNet [Deng et al., 2009, Krizhevsky et al., 2012], Microsoft Common Objects in Context (MS COCO) [Lin et al., 2014], Flickr30k [Plummer et al., 2015], SinoGrids [Zhou and Long, 2016], and GraphSAINT [Zeng et al., 2019], etc. These datasets, containing one or more of the visual, semantic, social, and/or geographical information of UGC, are widely used and tested, but also sometimes challenged by different ML communities including Computer Vision, Multi-modal Machine Learning, and Machine Learning on Graphs. However, they are mostly suited for benchmarking general ML tasks and testing computational algorithms, and are not necessarily tailor-made for heritage and urban studies. The motivation of data collection in this research is to provide datasets that could be both directly applicable for ML communities as a test-bed, and theoretically informative for heritage and urban scholars to draw conclusions from for planning and decision-making.
The FlickrAPI python library 5 was used to access the API methods provided by Flickr 6 , using the geo-locations in Table 1 as the centroids to search a maximum of 5000 IDs of geo-tagged images within a fixed radius covering the major urban area, making the datasets from the three cities comparable and compatible. To test the scalability of the methodological workflow, another larger dataset without an ID number limit was also collected in Venice (VEN-XL). Only images with a candownload flag indicated by the owner were further queried, respecting the privacy and copyrights of Flickr users. The following information of each post was collected: the owner's ID; the owner's registered location on Flickr; the title, description, and tags provided by the user; the geo-tag of the image; the timestamp marking when the image was taken; and URLs to download the Large Square (150×150 px) and Small 320 (320×240 px) versions of the original image. Furthermore, the public friend and subscription lists of all the retrieved owners were queried, while all personal information was only kept as a [semi-]anonymous ID, respecting the privacy policy.

2 https://whc.unesco.org/en/list/394
3 https://whc.unesco.org/en/list/1349/
4 http://whc.unesco.org/en/list/813
5 https://stuvel.eu/software/flickrapi/
6 https://www.flickr.com/services/api/
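The filtering-and-extraction step above can be sketched in a few lines. This is a minimal illustration assuming photo records shaped like the parsed-JSON responses of the Flickr API; apart from `candownload`, the exact field names are illustrative assumptions, not the authors' code.

```python
# Sketch of the post-filtering step: keep only owner-downloadable photos and
# collect the per-post fields listed in the text. Record shapes are assumed.

def filter_downloadable(photos):
    """Keep only photos whose owner set the `candownload` flag."""
    return [p for p in photos if p.get("candownload") == 1]

def extract_post_fields(photo):
    """Collect the per-post fields described above into one record."""
    return {
        "owner_id": photo.get("owner"),
        "title": photo.get("title", ""),
        "geo": (photo.get("latitude"), photo.get("longitude")),
        "taken": photo.get("datetaken"),
    }

photos = [
    {"owner": "u1", "candownload": 1, "title": "Rialto",
     "latitude": 45.438, "longitude": 12.336, "datetaken": "2019-05-04"},
    {"owner": "u2", "candownload": 0, "title": "private"},
]
kept = [extract_post_fields(p) for p in filter_downloadable(photos)]
```

In the actual pipeline, `photos` would come from paginated `flickr.photos.search` calls rather than a hand-written list.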
The retrieved textual fields of description, title, and tags could all provide useful information, yet not all posts have these fields, and not all posts are necessarily written to express thoughts and share knowledge about the place (which is what is considered valid in the context of this study). The textual fields of the posts were cleaned, translated, and merged into a Revised Text field as the raw English textual data, after recording the detected original language of each post at sentence level using the Google Translator API from the Deep Translator python library 7 . Moreover, many posts shared by the same user were uploaded at once, thus duplicating the same textual fields across all of them. To handle such redundancy, a separate dataset of all the unique processed textual data at sentence level was saved for each city, while the original post of each sentence was marked and could easily be traced back.
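The sentence-level deduplication with traceback described above can be sketched as follows; the data shapes and names are illustrative, not the authors' implementation.

```python
# Minimal sketch of sentence-level deduplication: each unique sentence is
# stored once, while a mapping records which posts it originally came from.

def deduplicate_sentences(posts):
    """posts: {post_id: [sentence, ...]} -> (unique sentences, traceback map)."""
    unique, origins = [], {}
    for post_id, sentences in posts.items():
        for s in sentences:
            s = s.strip()
            if not s:
                continue
            if s not in origins:
                origins[s] = []
                unique.append(s)
            origins[s].append(post_id)   # trace back to the original post
    return unique, origins

posts = {
    "p1": ["A canal view.", "Lovely bridge."],
    "p2": ["A canal view."],   # duplicated bulk upload by the same user
}
unique, origins = deduplicate_sentences(posts)
```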
A detailed description of the data collection and pre-processing methods can be found in Appendix B. Table 2 shows the number of data samples (posts) and owners (users) for the three case study cities at each stage. To formally describe the data, define the problem, and propose a generalizable workflow, mathematical notations are used in the rest of this manuscript. Since the same process is valid for all three cities (and probably also for other unselected cases worldwide) and has been repeated exactly three times, no distinctions are made among the cities, except for the cardinality of sets reflecting sample sizes. Let i be the index of a generic sample of the dataset for one city; then its raw data could be denoted as a tuple

d_i = (I_i, S_i, u_i, t_i, l_i), i ∈ {1, ..., K},

where K is the sample size of the dataset in a city (as shown in Table 2); I_i is a three-dimensional tensor of the size of the image with three RGB channels; S_i = {s_i^(1), ..., s_i^(|S_i|)} or S_i = ∅ is a set of revised English sentences that can also be an empty set for samples without any valid textual data; u_i ∈ U is a user ID that is one instance from the user set U = {μ_1, μ_2, ..., μ_|U|}; t_i ∈ T is a timestamp that is one instance from the ordered set of all the unique timestamps T = {τ_1, τ_2, ..., τ_|T|} of the dataset at the level of weeks; and l_i = (x_i, y_i) is a geographical coordinate of latitude (y_i) and longitude (x_i) marking the geo-location of the post. A complete nomenclature of all the notations used in this paper can be found in the Appendix Tables A9 and A10. Figure 1 demonstrates the workflow of one sample post in Venice, which will be explained in the following sections.
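The raw-data tuple d_i = (I_i, S_i, u_i, t_i, l_i) can be mirrored by a simple container in code; this is a sketch, with a nested list standing in for the 3-D RGB tensor and all field names being illustrative.

```python
# Container mirroring the raw-data tuple d_i = (I_i, S_i, u_i, t_i, l_i).

from typing import NamedTuple, List, Tuple

class RawSample(NamedTuple):
    image: list                    # 3-D RGB tensor I_i (H x W x 3)
    sentences: List[str]           # revised English sentence set S_i (may be empty)
    user_id: str                   # u_i, one instance of the user set U
    week: int                      # t_i, index into the ordered timestamp set T
    location: Tuple[float, float]  # l_i = (x_i, y_i) = (longitude, latitude)

sample = RawSample(image=[], sentences=["A canal view."], user_id="u1",
                   week=12, location=(12.336, 45.438))
```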

2.3.1 Visual Features
Places365 is a dataset that contains 1.8 million images from 365 scene categories, covering a relatively comprehensive collection of indoor and outdoor places [Zhou et al., 2014, 2017]. The categories can be informative for urban and heritage studies to identify the depicted scenes of images and to further infer heritage attributes [Veldpaus, 2015, Ginzarly et al., 2019]. A few Convolutional Neural Network (CNN) models have been pretrained by Zhou et al. [2017] using state-of-the-art backbones to predict the depicted scenes in images, reaching a top-1 accuracy of around 55% and a top-5 accuracy of around 85%. Furthermore, the same set of pretrained models has been used to predict 102 discriminative scene attributes based on the SUN Attribute dataset [Patterson and Hays, 2012, Patterson et al., 2014], reaching a top-1 accuracy of around 92% [Zhou et al., 2017]. These scene attributes are conceptually different from heritage attributes, as the former are mostly adjectives and present participles describing the scene and the activities taking place, yet both heritage values and attributes could be effectively inferred therefrom.

Figure 1: The workflow of the multi-modal feature generation process of one sample post in Venice, while graph construction requires all data points of the dataset. The original post owned by user 17726320@N03 is under CC BY-NC-SA 2.0 license. The question marks in the right part indicate some provisional tasks for this dataset, which will be discussed in Section 4.1 and Table 8.

This study used the openly-released ResNet-18 model [He et al., 2016] pretrained on Places365 with PyTorch 8 . This model was adjusted to effectively yield three output vectors: 1) the last softmax layer of the model, l^s_{365×1}, as logits over all scene categories; 2) the last hidden layer of the model, h^v_{512×1}; and 3) a vector l^a_{102×1} as logits over all scene attributes.
Such a process for any image input I_i could be described as:

(l^s_i, h^v_i, l^a_i) = f^vis(I_i; Θ^vis),

or preferably in a vectorized format:

(L^s, H^v, L^a) = f^vis(I; Θ^vis), where L^s := [l^s_i]_{365×K}, H^v := [h^v_i]_{512×K}, L^a := [l^a_i]_{102×K}. (3)

Considering that the models have a reasonable performance in top-n accuracy, to keep the visual features explainable, an n-hot soft activation filter σ^(n) is applied to both logit outputs, keeping the top-n prediction entries active while smoothing all the others based on the confidence of the top-n predictions (n = 5 for scene categories L^s and n = 10 for scene attributes L^a). Let max(l, n) denote the n-th maximum element of a d-dimensional logit vector l (the sum of all entries of l equals 1); then the activation filter σ^(n) could be described as:

σ^(n)(l) = l ⊙ m + ((1 − l^⊤m) / (d − n)) (1 − m), with m_j = 1 if l_j ≥ max(l, n), and m_j = 0 otherwise,

where m is a mask vector indicating the positions of the top-n entries, and l^⊤m is effectively the total confidence of the model for the top-n predictions. Note that this function could also take a matrix as input, processing it as several column vectors to be concatenated back.
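One plausible reading of the n-hot soft activation filter can be sketched as follows: the top-n entries of a probability vector (summing to 1) are kept, and the remaining mass 1 − l^⊤m is spread uniformly over the other d − n entries, so the result still sums to 1. This is an illustration of the described behaviour, not the authors' exact formula.

```python
# Sketch of the n-hot soft activation filter sigma^(n): keep the top-n
# entries of a probability vector, smooth the rest uniformly.

def n_hot_filter(l, n):
    d = len(l)
    top = sorted(range(d), key=lambda j: l[j], reverse=True)[:n]
    mask = [1.0 if j in top else 0.0 for j in range(d)]
    kept = sum(l[j] for j in top)         # l^T m: total top-n confidence
    rest = (1.0 - kept) / (d - n)         # smoothed value for the others
    return [l[j] if mask[j] else rest for j in range(d)]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
out = n_hot_filter(probs, 2)              # keeps 0.5 and 0.3 as-is
```

The output remains a valid probability vector, which keeps the filtered features comparable to the raw softmax outputs.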
Furthermore, as the Places365 dataset is tailor-made for scene detection tasks rather than recognizing faces [Zhou et al., 2017], the models pretrained on it may get confused when a new image is mainly composed of faces, as in "typical tourism pictures" and selfies, which are not uncommon in the case studies as popular tourism destinations. As the ultimate aim of constructing such datasets is not to precisely predict the scene each image depicts, but to help infer heritage values and attributes, it would be unfair to simply exclude the images with a significant proportion of faces in them. Rather, the presence of humans in the images showing their activities would be a strong cue of the intangible dimension of heritage properties. Under such consideration, an Inception ResNet-V1 model 9 pretrained on the VGGFace2 dataset [Schroff et al., 2015, Cao et al., 2018] has been used to generate features about the depicted faces in the images. A three-dimensional vector f_i was obtained for any image input I_i, where the non-negative first entry f_{1,i} ∈ N counts the number of faces detected in the image, the second entry f_{2,i} ∈ [0, 1] records the confidence of the model for face detection, and the third entry f_{3,i} ∈ [0, 1] calculates the proportion of the total area of all the bounding boxes of detected faces to the area of the image. Similarly, the vectorized format could be written as F := [f_i]_{3×K} over the entire dataset.
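Given detector output (bounding boxes and per-face confidences, e.g., from a face detector such as MTCNN), the three-entry face feature f_i can be computed as below. The box format (x1, y1, x2, y2) and the use of the mean confidence for f_{2,i} are assumptions for illustration.

```python
# Sketch of the three-dimensional face feature f_i = [count, confidence,
# area ratio], computed from assumed detector output.

def face_features(boxes, probs, img_w, img_h):
    """boxes: (x1, y1, x2, y2) per face; probs: detector confidences."""
    if not boxes:
        return [0, 0.0, 0.0]
    area = sum((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    return [len(boxes),
            sum(probs) / len(probs),        # mean detection confidence
            area / (img_w * img_h)]         # face area over image area

f = face_features([(10, 10, 40, 50), (60, 20, 90, 60)], [0.99, 0.95], 150, 150)
```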
Finally, all the obtained visual features were concatenated vertically to generate the final visual feature X^vis_{982×K}:

X^vis := [σ^(5)(L^s)^⊤, (H^v)^⊤, σ^(10)(L^a)^⊤, F^⊤]^⊤,

where [·, ·] denotes the horizontal concatenation of matrices.
This final matrix is to be used in future MML tasks as the vectorized descriptor of the uni-modal visual contents of the posts, containing both more abstract hidden features and more specific information about the predicted categories, which is a common practice in the MML literature [Baltrusaitis et al., 2019]. All models were tested on both the 150×150 and 320×240 px images to compare the consistency of the generated features. The workflow of generating visual features is illustrated in the top part of Figure 1.

2.3.2 Textual Features
In the last decade, attention- and Transformer-based models have taken over the field of Natural Language Processing (NLP), increasing the performance of models in both general machine learning tasks and domain-specific transfer-learning scenarios [Vaswani et al., 2017]. As an early version, the pretrained Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al., 2019] is still regarded as a powerful base model to be fine-tuned on specific downstream datasets and to perform various NLP tasks. Specifically, the output on the [CLS] token of BERT models is regarded as an effective representation of the entire input sentence, being used extensively for classification tasks [Clark et al., 2019, Sun et al., 2019]. In the heritage studies domain, Bai et al. [2021a] fine-tuned BERT on the dataset WHOSe Heritage that they constructed from UNESCO World Heritage inscription documents, followed by a Multi-Layer Perceptron (MLP) classifier to predict the OUV selection criteria a sentence is concerned with, showing a top-1 accuracy of around 71% and a top-3 accuracy of around 94%.
This study used the openly-released BERT model fine-tuned on WHOSe Heritage with PyTorch 10 . The BERT model took both the entire sentence sets S_i and the individual sentences of the sets {s_i^(1), ..., s_i^(|S_i|)} as paragraph-level and sentence-level inputs, respectively, for the comparison of the consistency of its predicted outputs on this new dataset. Furthermore, taking the entire sentence set S_i as input, the 768-dimensional output vector h^BERT_{768×1} of the [CLS] token was retrieved for samples that have valid textual data:

h^BERT_i = BERT(S_i), ∀ S_i ≠ ∅,

or preferably in a vectorized format:

H^BERT := [h^BERT_i]_{768×K},

where the columns of samples without valid textual data are zero vectors. Moreover, the original language of each sentence may provide additional information for understanding the verbal context of posts, which can also be informative to effectively identify and compare locals and tourists. A three-dimensional vector o_i ∈ {0, 1}^3 was obtained with the Google Translator API. The three entries respectively mark whether there were sentences in English, in local languages (Dutch, Chinese, or Italian, respectively), and in other languages in the set S_i. The elements of the vector o_i or the matrix form O := [o_i]_{3×K} could range from all zeros (when there was no textual data at all) to all ones (when the post was composed of different languages in separate sentences).
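The three-entry language indicator o_i can be sketched as below, assuming one detected ISO language code per sentence (the codes and function name are illustrative).

```python
# Sketch of the language indicator o_i: [has English, has local language,
# has any other language], from detected per-sentence language codes.

def language_indicator(detected_codes, local_code):
    has_en = any(c == "en" for c in detected_codes)
    has_local = any(c == local_code for c in detected_codes)
    has_other = any(c not in ("en", local_code) for c in detected_codes)
    return [int(has_en), int(has_local), int(has_other)]

o = language_indicator(["en", "it", "it"], local_code="it")  # a Venice post
```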
Similar to the visual features, the final textual features X^tex_{771×K} could be obtained by concatenation:

X^tex := [(H^BERT)^⊤, O^⊤]^⊤.

The workflow of generating textual features is illustrated in the bottom part of Figure 1.

9 https://github.com/timesler/facenet-pytorch
10 https://github.com/zzbn12345/WHOSe_Heritage

2.3.3 Contextual Features

As mentioned in Section 2.2, the user ID u_i and timestamp t_i of a post are both instances from their respective sets U and T, since multiple posts could be posted by the same user, and multiple images could be taken during the same week. To help formulate and generalize the problem under the practice of relational databases [Reiter, 1989], both could be transformed into one-hot embeddings U := [u_{j,i}]_{|U|×K} ∈ {0, 1}^{|U|×K} and T := [t_{k,i}]_{|T|×K} ∈ {0, 1}^{|T|×K}, such that:

u_{j,i} = 1 if the post indexed i is owned by user μ_j, and 0 otherwise; (10)

and

t_{k,i} = 1 if the image of the post indexed i was taken during week τ_k, and 0 otherwise. (11)

Furthermore, Section 2.2 also mentioned the collection of the public contacts and groups of all the users μ_j from the set U. To keep the problem simple, only direct contact pairs were considered to model the back-end social structure of the users, effectively filtering out any contacts of a user μ_j that were not in the set of interest U, resulting in an adjacency matrix among the users A^U := [a^U_{j,j'}]_{|U|×|U|} ∈ {0, 1}^{|U|×|U|}, j, j' ∈ [1, |U|], marking their direct friendship:

a^U_{j,j'} = 1 if users μ_j and μ_{j'} are on each other's public contact lists, and 0 otherwise. (12)

Let I(μ_j) denote the set of public groups a user μ_j follows (which can be an empty set if μ_j follows no group), and let IoU(A, B) denote the Jaccard Index (size of intersection over size of union) of two generic sets A and B:

IoU(A, B) = |A ∩ B| / |A ∪ B|; (13)

then another weighted adjacency matrix among the users Ã^U := [ã^U_{j,j'}]_{|U|×|U|} ∈ [0, 1]^{|U|×|U|}, j, j' ∈ [1, |U|], could be constructed, marking the mutual interests among the users in terms of group subscriptions on Flickr:

ã^U_{j,j'} = IoU(I(μ_j), I(μ_{j'})). (14)

To further simplify the problem, although the geo-location l_i = (x_i, y_i) of each post is typically distributed in a continuous range in the 2D geographical space, it would be beneficial to further aggregate and discretize the distribution in a topological abstraction of a spatial network [Batty, 2013, Nourian, 2016], which has also proven effective in urban spatial analysis, including but not limited to Space Syntax [Hillier and Hanson, 1989, Penn, 2003, Ratti, 2004, Blanchard and Volchenkov, 2008]. The OSMnx python library 11 was used to query the simplified spatial network data on OpenStreetMap including all means of transportation [Boeing, 2017] in each city, with the same centroid location and radius described in Section 2.2. This operation effectively saved a spatial network as an undirected weighted graph G_0 = (V_0, E_0, w_0), where V_0 is the set of spatial nodes, E_0 is the set of all links possibly connecting two spatial nodes (by different sorts of transportation such as walking, biking, and driving), and w_0 ∈ R_+^{|E_0|} is a vector with the same dimension as the cardinality of the edge set, marking the average travel time needed between node pairs (dissimilarity weights). The distance.nearest_nodes method of the OSMnx library was used to retrieve the nearest spatial node to any post location l_i = (x_i, y_i). By only keeping the spatial nodes that have at least one data sample posted nearby, and restricting the link weights between nodes so that the travel time on any link is no more than 20 minutes, which ensures a comfortable temporal distance forming neighbourhoods and communities [Howley et al., 2009], a subgraph G = (V, E, w) of G_0 could be constructed, so that V ⊆ V_0, E ⊆ E_0, and w ∈ [0, 20.0]^{|E|}. As a result, another one-hot embedding matrix S := [s_{l,i}]_{|V|×K} ∈ {0, 1}^{|V|×K} could be obtained:

s_{l,i} = 1 if spatial node υ_l is the nearest node to the post indexed i, and 0 otherwise. (15)

The contextual features constructed as matrices/graphs would be further used in Section 2.5 to link the posts together.

2.4.1 Heritage Values as OUV Selection Criteria

Heritage values (HV) have multiple categorization systems [Pereira Roders, 2007, Jokilehto, 2007, Tarrafa Silva and Pereira Roders, 2010].
To keep the initial step simple, this study arbitrarily applied the value definition in the UNESCO WHL with regard to the ten OUV selection criteria, as listed in Appendix Table A11, with an additional class Others representing scenarios where no OUV selection criterion suits the scope of a sentence (resulting in an 11-class category). A group of ML models has been trained and fine-tuned to make such predictions by Bai et al. [2021a], as introduced in Section 2.3.2. Besides BERT, already used to generate textual features as mentioned above, a Universal Language Model Fine-tuning (ULMFiT) model [Howard and Ruder, 2018] has also been trained and fine-tuned, reaching a similar performance in accuracy. Furthermore, it has been found that the average confidence of both BERT and ULMFiT models on the prediction task showed significant correlation with expert evaluation, even on social media data [Bai et al., 2021a]. This suggests that it may be possible to use both trained models to generate labels about heritage values in a semi-supervised active learning setting [Prince, 2004, Zhu and Goldberg, 2009], as this is a task too knowledge-demanding for crowd-workers, yet too time-consuming for experts [Pustejovsky and Stubbs, 2012].
The pseudo-label generation step could be formulated as:

y^*_i = g^*(S_i; Θ^*), * ∈ {BERT, ULMFiT},

where g^* is an end-to-end function including both the pre-trained model and the MLP classifier, and y^*_i is an 11-dimensional logit vector of soft-label predictions. Let argmx(l, n) denote the function returning the index set of the largest n elements of a vector l; together with the previously defined max(l, n), the confidence and [dis-]agreement of the models for the top-n predictions could be computed as:

κ^HV(0)_i = (1/2) Σ_* Σ_{n'=1}^{n} max(y^*_i, n'), κ^HV(n)_i = IoU(argmx(y^BERT_i, n), argmx(y^ULMFiT_i, n)).

This confidence indicator matrix K^HV could presumably be regarded as a filter for the labels on heritage values Y^HV, to only keep the samples with high inter-annotator (model) agreement [Nowak and Rüger, 2010] as the "ground-truth" [pseudo-]labels, while treating the others as unlabeled [Lee et al., 2013, Sohn et al., 2020].
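The filtering logic for the top-1 case can be sketched as follows: a sample is kept as 'labelled' only if the two models agree on the top-1 class and their mean top-1 confidence clears a threshold. The threshold value and function names are illustrative assumptions.

```python
# Sketch of the confidence/agreement filter for pseudo-labels (top-1 case).

def keep_as_labelled(logits_a, logits_b, conf_threshold=0.7):
    top_a = max(range(len(logits_a)), key=lambda j: logits_a[j])
    top_b = max(range(len(logits_b)), key=lambda j: logits_b[j])
    agreement = top_a == top_b                            # kappa^(1) = 1
    confidence = (logits_a[top_a] + logits_b[top_b]) / 2  # mean kappa^(0)
    return agreement and confidence > conf_threshold

kept = keep_as_labelled([0.8, 0.1, 0.1], [0.75, 0.2, 0.05])   # models agree
dropped = keep_as_labelled([0.8, 0.1, 0.1], [0.2, 0.7, 0.1])  # models disagree
```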

2.4.2 Heritage Attributes as Depicted Scenery
Heritage attributes (HA) also have multiple categorization systems [Veldpaus and Roders, 2014, Veldpaus, 2015, Gustcoven, 2016, Ginzarly et al., 2019, UNESCO, 2020], and are arguably more vaguely defined than HV. For simplicity, this study arbitrarily combined the attribute definitions of Veldpaus [2015] and Ginzarly et al. [2019] and kept a 9-class category of tangible and/or intangible attributes that are visible in an image. More precisely, this category should be framed as the "depicted scenery" of an image [Ginzarly et al., 2019] from which heritage attributes could possibly be induced. The depicted scenes themselves are not yet valid heritage attributes; this semantic/philosophical discussion, however, is out of the scope of this paper. The definitions of the nine categories are listed in Appendix Table A12.
An image dataset collected in Tripoli, Lebanon, and classified with expert-based annotations, presented by Ginzarly et al. [2019], was used to train state-of-the-art ML models to replicate the experts' behaviour in classifying depicted scenery, with the Scikit-learn python library [Pedregosa et al., 2011]. For each image, a unique class label was provided, effectively forming a multi-class classification task. The same 512-dimensional visual representation H^v introduced in Section 2.3.1 was generated from the images as the input. Classifiers including a Multi-Layer Perceptron (MLP) (a shallow neural network) [Hinton, 1990], K-Nearest Neighbours (KNN) [Altman, 1992], Gaussian Naive Bayes (GNB) [Rish et al., 2001], Support Vector Machine (SVM) [Platt et al., 1999], Random Forest (RF) [Breiman, 2001], and a Bagging Classifier [Breiman, 1996a] with SVM core (BC-SVM) were first trained and tuned for optimal hyperparameters using 10-fold cross-validation (CV) with grid search [Arlot and Celisse, 2010]. Then the individually-trained models were put into ensemble-learning settings as both a voting [Zhou, 2012] and a stacking classifier [Breiman, 1996b]. All trained models were tested on validation and test datasets to evaluate their performance. Details of the machine learning models are given in Appendix C.
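The voting setting can be illustrated with a toy soft-vote over hypothetical classifier outputs; this is a simplified stand-in for scikit-learn's VotingClassifier, with invented probability vectors.

```python
# Toy illustration of soft voting: average the class-probability vectors
# of several classifiers and pick the class with the highest mean.

def soft_vote(prob_vectors):
    """Average class-probability vectors; return (label, averaged vector)."""
    n = len(prob_vectors)
    d = len(prob_vectors[0])
    avg = [sum(p[j] for p in prob_vectors) / n for j in range(d)]
    return max(range(d), key=lambda j: avg[j]), avg

# Three hypothetical classifiers scoring one image over three classes:
label, avg = soft_vote([[0.6, 0.3, 0.1],
                        [0.2, 0.5, 0.3],
                        [0.5, 0.4, 0.1]])
```

A stacking classifier would instead feed these per-model probability vectors into a second-level learner rather than averaging them.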
Both ensemble models were further applied to the images collected in this study. Similar to the HV labels described in Section 2.4.1, the label generation step of HA could be formulated as:

y^*_i = h^*(h^v_i; Θ_M), * ∈ {vote, stack},

where h^* is an ensemble model taking all parameters Θ_M from each ML model in the set M, and y^*_i is a 9-dimensional logit vector of soft-label predictions. Similarly, the confidence and agreement of the models for the top-n predictions are:

κ^HA(0)_i = (1/2) Σ_* Σ_{n'=1}^{n} max(y^*_i, n'), κ^HA(n)_i = IoU(argmx(y^vote_i, n), argmx(y^stack_i, n)).

This confidence indicator matrix K^HA could also act as the filter for the heritage attribute labels Y^HA.

Multi-Graph Construction
Three types of similarities/relations among posts were considered to compose the links connecting the post nodes: temporal similarity (posts with images taken during the same time period), social similarity (posts owned by the same person, by friends, or by people who share mutual interests), and spatial similarity (posts with images taken at the same or nearby locations). All three could be deduced from the contextual information in Section 2.3.3. As a result, an undirected weighted multi-graph (also known as a multi-dimensional graph in Ma and Tang [2021]) G = (V, {E^TEM, E^SOC, E^SPA}, {w^TEM, w^SOC, w^SPA}) could be constructed, with the same node set V of posts and three different link sets.
The three weighted adjacency matrices could be respectively obtained as follows.

Temporal Links. Let T̃_{|T|×|T|} denote a symmetric tridiagonal matrix whose diagonal entries are all 1 and whose non-zero off-diagonal entries are all α_T, where α_T ∈ [0, 1) is a parametric scalar; then the weighted adjacency matrix A^TEM_{K×K} for temporal links could be formulated as:

A^TEM = T^⊤ T̃ T,

where T_{|T|×K} is the one-hot embedding of timestamps for posts mentioned in Equation 11. For simplicity, α_T is set to 0.5. With such a construction, all the posts whose images were originally taken in the same week have a weight of w^TEM_e = 1 connecting them in G^TEM, and posts with images taken in chronologically adjacent weeks have a weight of w^TEM_e = 0.5.
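The temporal weights can be written directly over week indices, which is equivalent entry-by-entry to the matrix construction above (self-links omitted here); function and variable names are illustrative.

```python
# Sketch of the temporal adjacency: weight 1 for the same week, alpha_T
# (0.5 in the text) for chronologically adjacent weeks, 0 otherwise.

def temporal_adjacency(weeks, alpha=0.5):
    """weeks: week index t_i per post -> K x K weight matrix (lists)."""
    K = len(weeks)
    A = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            if i == j:
                continue                    # self-links omitted
            gap = abs(weeks[i] - weeks[j])
            if gap == 0:
                A[i][j] = 1.0               # same week
            elif gap == 1:
                A[i][j] = alpha             # adjacent weeks
    return A

A = temporal_adjacency([3, 3, 4, 9])        # posts from weeks 3, 3, 4, 9
```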
Social Links. Let Ũ_{|U|×|U|} denote a symmetric matrix as a linear combination of three matrices marking the social relations among the users:

Ũ = I_{|U|} + α^(1)_U A^U + α^(2)_U 1[Ã^U > β_U],

where I_{|U|} is the identity matrix marking the same-user relation, A^U is the direct-friendship matrix, 1[Ã^U > β_U] binarizes the common-interest relation of Equation 14 above a certain threshold β_U ∈ (0, 1), and α^(1)_U, α^(2)_U ∈ R_+ are parametric scalars to balance the weights of the different social relations. The weighted adjacency matrix A^SOC_{K×K} for social links could be formulated as:

A^SOC = U^⊤ Ũ U,

where U_{|U|×K} is the one-hot embedding of owner/user for posts mentioned in Equation 10. For simplicity, the threshold β_U is set to 0.05 and the scalars α_U are all set to 1. With such a construction, all posts uploaded by the same user have a weight of w^SOC_e = 1 connecting them in G^SOC, while posts by friends and by users with common interests (more than 5% common group subscriptions) accumulate the corresponding relation weights.
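The per-user-pair social weight can be sketched as below. How the text combines pairs satisfying several relations at once is not fully recoverable, so the additive combination here (with all α weights set to 1, and β_U = 0.05 as in the text) is an assumption.

```python
# Sketch of the pairwise social weight: same user, direct friendship, and
# shared group interests above the threshold beta_U (additive, assumed).

def social_weight(user_i, user_j, friends, jaccard_groups, beta=0.05):
    """friends: set of user-ID pairs; jaccard_groups: IoU of group sets."""
    if user_i == user_j:
        return 1.0                               # same-user relation
    w = 0.0
    if (user_i, user_j) in friends or (user_j, user_i) in friends:
        w += 1.0                                 # direct friendship
    if jaccard_groups > beta:
        w += 1.0                                 # common-interest relation
    return w

same = social_weight("u1", "u1", set(), 0.0)
friends_only = social_weight("u1", "u2", {("u1", "u2")}, 0.01)
both = social_weight("u1", "u3", {("u1", "u3")}, 0.2)
```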
Spatial Links. Let Ṽ_{|V|×|V|} denote a symmetric matrix marking the travel-time proximity between spatial node pairs, whose diagonal entries are all 1, and whose off-diagonal entries are

ṽ_{l,l'} = 1 − w_e / 20.0 if the e-th element of E is (υ_l, υ_{l'}), and 0 otherwise;

then the weighted adjacency matrix A^SPA_{K×K} for spatial links could be formulated as:

A^SPA = S^⊤ Ṽ S,

where S_{|V|×K} is the one-hot embedding of spatial location for posts mentioned in Equation 15. With such a construction, posts located at the same spatial node have a weight of w^SPA_e = 1 in G^SPA, and posts from nearby spatial nodes have a weight linearly decayed based on travel time, within a maximum transport time of 20 minutes.
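The pairwise spatial weight can be sketched directly: same node gives 1, connected nodes decay linearly with travel time, and the weight reaches 0 at the 20-minute cutoff. The exact decay form is an assumption read from the text.

```python
# Sketch of the spatial link weight with linear travel-time decay.

def spatial_weight(node_i, node_j, travel_time, max_time=20.0):
    """travel_time in minutes between two spatial nodes, or None if unlinked."""
    if node_i == node_j:
        return 1.0                           # same spatial node
    if travel_time is None or travel_time > max_time:
        return 0.0                           # not linked within the cutoff
    return 1.0 - travel_time / max_time      # linear decay with travel time

same = spatial_weight("v1", "v1", None)      # -> 1.0
near = spatial_weight("v1", "v2", 5.0)       # -> 0.75
far = spatial_weight("v1", "v3", 25.0)       # -> 0.0
```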
Additionally, the multi-graph G could be simplified into a simple composed graph G′ = (V, E′) with a binary adjacency matrix A′ ∈ {0, 1}^{K×K}, such that:

a′_{i,i'} = 1 if a^TEM_{i,i'} + a^SOC_{i,i'} + a^SPA_{i,i'} > 0 and i ≠ i', and 0 otherwise,

which connects two post nodes if they are connected and similar in at least one contextual relationship.
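Collapsing the three link types into the simple composed graph amounts to an element-wise OR over the adjacency matrices, as in this sketch:

```python
# Sketch of the composed binary adjacency: two posts are linked if ANY of
# the three weighted adjacency matrices connects them.

def compose_binary(*adjacency_matrices):
    K = len(adjacency_matrices[0])
    return [[1 if any(A[i][j] > 0 for A in adjacency_matrices) else 0
             for j in range(K)]
            for i in range(K)]

A_tem = [[0, 1], [1, 0]]
A_soc = [[0, 0], [0, 0]]
A_spa = [[0, 0.75], [0.75, 0]]
A_composed = compose_binary(A_tem, A_soc, A_spa)
```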
All graphs were constructed with the NetworkX python library [Hagberg et al., 2008]. The rationale behind constructing these various graphs has been briefly described in Section 1: posts close to each other (in the temporal, social, or spatial sense) could arguably be similar in their contents, and therefore also similar in the heritage values and attributes they might convey. Instead of regarding these similarities as redundancy and, e.g., removing duplicated posts by the same user to avoid biasing the analysis, as in Ginzarly et al. [2019], this study intends to take advantage of as much available data as possible, since similar posts may enhance and strengthen the information, compensating for the redundancies and/or nuances via the back-end graph structures. At a later stage of the analysis, the graph of posts could even be coarsened with clustering and graph partitioning methods [Karypis and Kumar, 1995, Lafon and Lee, 2006, Gao and Ji, 2019, Ma and Tang, 2021], to give an effective summary of possibly similar posts.

3.1.1 Generated Visual and Textual Features

Table 3 shows the consistency of the generated visual and textual features. The visual features compared the scene and attribute predictions on images of different sizes (150×150 and 320×240 px); the textual features compared the paragraph-level OUV selection criteria predictions with the aggregated (averaged) sentence-level predictions on each sentence from the set {s_i^(1), ..., s_i^(|S_i|)}. For both scene and attribute predictions, the means of the top-1 Jaccard index were always higher than those of the top-n; however, the smaller variance proved the necessity of using top-n predictions as features. Note that the attribute prediction was more stable than the scene prediction when the image shape changed; this is probably because the attributes usually describe low-level features which can appear in multiple parts of the image, while some information critical for judging the image scene may be lost during cropping and resizing in the original ResNet-18 model. Considering the relatively high consistency of model performance and the storage cost of images when the dataset would ultimately scale up (e.g., VEN-XL), the following analyses were only performed on the smaller square images of 150×150 px.
The high Jaccard index of the OUV predictions showed that averaging the textual features derived from the sub-sentences of a paragraph yields a performance similar to directly feeding the whole paragraph into the models, especially when the top-3 predictions are of main interest. Note that the higher consistency in Suzhou was mainly a result of the higher proportion of posts consisting of only one sentence.

Table 4 gives descriptive statistics of results that were not compared across different scenarios as in Table 3. Only a small portion of posts had detected faces in them. While Amsterdam had the highest proportion of face pictures (17.9%), Venice had a larger average area of faces in the pictures (i.e., more selfies and tourist pictures). These numbers are also assumed to help associate a post with human-activity-related heritage values and attributes. Considering the languages of the posts, Amsterdam showed a balance between Dutch-speaking locals and English-speaking tourists, Venice showed a balance between Italian-speaking people and non-Italian-speaking tourists, while Suzhou showed a lack of Chinese posts. This is consistent with the popularity of Flickr as a social medium in different countries, and also implies that data from other social media could compensate for this imbalance if the provisional research questions were sensitive to the nuance between local and tourist narratives.
3.1.2 Pseudo-Labels for Heritage Values and Attributes

As argued in Section 2.4.1, the label generation process of this paper did not involve human annotators. Instead, it used thoroughly trained ML models as machine replicas of annotators and considered their confidences and agreements as a filter to keep the 'high-quality' labels as pseudo-labels. Similar operations can be found in semi-supervised learning [Zhou and Li, 2010, Lee et al., 2013, Sohn et al., 2020].

The officially inscribed OUV selection criteria of the cases listed in Table 1 could all be observed as significantly present (e.g., criteria (i), (ii), (iv) for Amsterdam), except for criterion (v) in Venice and Suzhou, which might be caused by the relatively fewer examples and poorer class-level performance of criterion (v) in the original paper. Remarkably, criterion (iii) in Amsterdam and criterion (vi) in Amsterdam and Suzhou were not officially inscribed, but appeared to be relevant when induced from social media, inviting further heritage-specific investigations. The distributions of Venice and Venice-large (VEN-XL) were more similar in sentence-level predictions (Kullback-Leibler Divergence D_KL = .002, Chi-square χ² = 39.515) than in post-level ones (D_KL = .051, χ² = 518.895), which might be caused by the specific set of posts sub-sampled in the smaller dataset. For heritage attributes, Table 5 shows the performance of the ML models mentioned in Section 2.4.2. The two ensemble models with voting and stacking settings performed equally well and significantly better than the other models (except for the CV accuracy of SVM), supporting the rationale of using both classifiers for heritage attribute label prediction. An average top-1 confidence of κ^HA(0) > 0.7 and a top-1 agreement of κ^HA(1) = 1 were used as the filter for Y^HA. This filter resulted in around 35-50% of the images in each city being 'labelled', and the rest 'unlabelled'. Figure 3 demonstrates the distribution of 'labelled' data about heritage attributes in each city.
It is remarkable that although the models were only trained on data from Tripoli, they performed reasonably well in the unseen cases of Amsterdam, Suzhou, and Venice, capturing typical scenes of monumental buildings, architectural elements, and gastronomy, respectively. Although half of the collected images were treated as 'unlabelled' due to low confidence, the negative examples are not necessarily incorrect (e.g., with Monuments and Buildings). For all cities, Urban Form Elements and People's Activity and Association are the most dominant classes, consistent with the fact that most Flickr images are taken on the streets. As seen from the bar plots in Figure 3, the classes were relatively unbalanced, suggesting that more images from the small classes might need to be collected, or at least augmented, in future applications. Furthermore, the distributions of Venice and Venice-large are similar to each other (D_KL = .076, χ² = 188.241), suggesting a good representativeness of the sampled small dataset.
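The reported distribution comparisons can be computed with SciPy as follows; the class counts shown are hypothetical stand-ins for the VEN and VEN-XL label distributions over the 9 depicted-scene classes:

```python
import numpy as np
from scipy.stats import chisquare, entropy

# Hypothetical class counts over the 9 depicted-scene classes
venice = np.array([120, 80, 300, 90, 40, 30, 260, 25, 15])
venice_large = np.array([1150, 790, 3100, 880, 420, 280, 2600, 240, 140])

p = venice / venice.sum()                 # observed class distribution (VEN)
q = venice_large / venice_large.sum()     # reference distribution (VEN-XL)

d_kl = entropy(p, q)                      # Kullback-Leibler divergence D_KL(P||Q)
chi2, _ = chisquare(venice, f_exp=q * venice.sum())
print(round(d_kl, 4), round(chi2, 2))
```

Small values of both statistics indicate that the sub-sampled dataset represents the larger one well, as argued above.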

Back-end Geographical Network
The back-end spatial structures of post locations as graphs G = (V, E, w) are visualized in Figure 4, and further graph statistics for all cities are given in Table 6. The urban fabric is more visible in Venice than in the other two cities, as there is always a dominant large component connecting most nodes in the graph, leaving fewer isolated nodes. In Amsterdam, more small connected components exist alongside a large one, and in Suzhou, the graph is even more fragmented into small components. This is possibly related to the distribution of tourism destinations, which is also consistent with the zoning typology of WH properties concerning urban morphology [Pereira Roders, 2010, Valese et al., 2020]: for Venice, the Venetian islands are included together with a larger surrounding lagoon in the WH property (formerly referred to as the core zone), and are generally regarded as a tourism destination as a whole; for Amsterdam, the WH property is only a part of the old city being mapped, where tourists can freely wander and take photos in areas that are not listed yet are interesting as tourism destinations; while for Suzhou, the WH properties are themselves fragmented gardens distributed in the old city, also being the main destinations visited by (foreign) tourists. Furthermore, the two types of rank-size plots, showing respectively the degree distribution and the posts-per-node distribution, exhibited similar patterns, the latter being more heavy-tailed, a typical characteristic of large-scale complex networks [Barabási, 2013, Eom and Jo, 2015], while the back-end spatial networks are relatively more regular.

Table 7 shows the graph statistics of the three constructed sub-graphs G^TEM, G^SOC, G^SPA with different link types within the multi-graph G, as well as the simple composed graph for each city, while Figure 5 plots their [weighted] degree distributions, respectively. The multi-graphs are further visualized in Appendix Figure A6.
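The component and degree statistics discussed here can be reproduced with NetworkX along these lines; the toy graph below is hypothetical, standing in for a back-end spatial graph weighted by travel time:

```python
import networkx as nx

# Toy stand-in for a back-end spatial graph G = (V, E, w) weighted by travel time
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 5.0), (1, 2, 8.0), (2, 0, 12.0),  # a dominant component
                           (3, 4, 6.0)])                            # a small fragment
G.add_node(5)                                                       # an isolated node

# Rank-size views: connected-component sizes and the degree distribution
components = sorted((len(c) for c in nx.connected_components(G)), reverse=True)
degrees = sorted((d for _, d in G.degree()), reverse=True)
print(components, degrees)  # [3, 2, 1] [2, 2, 2, 1, 1, 0]
```

A Venice-like graph would show one dominant component; a Suzhou-like graph, many small ones.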
The three link types provided heterogeneous characteristics: 1) the temporal graph is by definition connected, where the highest density in Amsterdam suggested the largest number of photos taken at consecutive times, while the largest diameter in Venice suggested the broadest span of time; 2) the social graph is structured by the relationships of users, where the largest connected components showed clusters of posts shared either by the same user or by users who are friends or share mutual interests, the size of which in Suzhou is small because of the fewest users shown in Table 1; 3) the spatial graph shows a similar connectivity pattern to the back-end spatial graphs, where the extremely small diameter and the largest density in Suzhou reconfirmed the fragmented positions of posts; 4) although the degree distributions of the three sub-graphs fluctuated due to the different socio-economic and spatio-temporal characteristics of the cities, those of the simple composed graphs showed similar elbow-shaped patterns, with similar density and diameter. Moreover, the heterogeneous graph structures suggest that different parameters and/or backbone models need to be fitted and fine-tuned for each link type, a common practice for deep learning on multi-graphs.

Provisional Tasks for Urban Data Science

The datasets introduced in this paper could be used to answer questions from the perspectives of machine learning and social network analysis as well as heritage and urban studies. Table 8 gives a few provisional tasks that could be realised using the collected datasets of this paper, and further datasets to be collected using the same workflow; its entries are listed below, each pairing a machine learning formulation with an urban/heritage research question.

0. X_vis → Y_HV | K_HV, Image Classification (semi-supervised): using visual features to infer categories induced from (possibly missing) texts with co-training [Blum and Mitchell, 1998] in few-shot learning settings [Wang et al., 2020]. As the latest advances in heritage value assessment have been discovering the added value of inspecting texts [Tarrafa Silva and Pereira Roders, 2010], can values also be seen and retrieved from the scenes of images?

1. X_tex → Y_HA | K_HA, Text Classification (semi-supervised): using textual features to infer categories induced from images, possibly with attention mechanisms [Vaswani et al., 2017]. How can the textual descriptions be related to certain heritage attributes [Gomez et al., 2019]? Are there crucial hints other than the nouns that appear?

2. Multi-modal Classification: using multi-modal (multi-view) features to make inferences, either by training joint representations or by making early and/or late fusions [Blum and Mitchell, 1998, Baltrusaitis et al., 2019]. How can heritage values and attributes be jointly inferred from the combined information of both visual scenes and textual expressions [Ginzarly et al., 2019]? How can they complement each other?

3. Node Classification (semi-supervised): test-beds for different graph filters such as Graph Convolution Networks [Kipf and Welling, 2016] and Graph Attention Networks [Veličković et al., 2017]. How can the contextual information of a post contribute to the inference of its heritage values and attributes? What is the contribution of time, space, and social relations [Miah et al., 2017]?

4. Link Prediction: test-beds for link prediction algorithms [Adamic and Adar, 2003] considering the current graph structure and node features: what is the probability that other links should also exist? Considering the similarity of posts, would there be heritage values and attributes that also suit the interests of another user, fit another location, and/or reflect another period of time [Majid et al., 2013]?

5. X, Y, A → X̂, Ŷ, Â, Graph Coarsening (unsupervised): test-beds for graph pooling [Ma and Tang, 2021] and graph partitioning [Karypis and Kumar, 1995] algorithms to generate coarsened graphs [Pang et al., 2021] at different resolutions. How can we summarize, aggregate, and eventually visualize the large-scale information from the social media platforms based on their contents and contextual similarities [Cho et al., 2022]?

6. Graph Classification: test-beds for graph classification algorithms [Zhang et al., 2018] once similar datasets have been collected and constructed in more case study cities. Can we summarize the social media information of any city with a World Heritage property so that the critical heritage values and attributes can be directly inferred [Monteiro et al., 2014]?

7. X, Y, A → I, S, Image/Text Generation (supervised): using multi-modal features to generate the missing and/or unfit images and/or textual descriptions, probably with Generative Adversarial Networks [Goodfellow et al., 2014]. How can a typical image and/or textual description of certain heritage values and attributes at a certain location, in a certain time, by a certain type of user in a specific case study city be queried or even generated [Gomez et al., 2019]?

8. Graph Embedding: respectively generating a universal embedding and a context-specific embedding for each type of link in the multi-dimensional network [Ma et al., 2018], probably with random walks on graphs. How are heritage values and attributes distributed and diffused in different contexts? Is the First Law of Geography [Tobler, 1970] still valid in the specific social, temporal, and spatial graphs?

9. Dynamic Prediction (self-supervised): given the current graph structure and its features stamped with time steps, how will they evolve in the next time steps [Nguyen et al., 2018, Ren et al., 2019]? How are the current expressions of heritage values and attributes in a city influencing the emerging post contents, the tourist behaviours, and the planning decision making [Bai et al., 2021c, Zhang and Cheng, 2020]?

These problems would use some or all of the extracted features (visual, textual, contextual), the generated labels (heritage values and attributes), the constructed graph structures, and even the raw data as input and output components, to find the relationship function among them. Some problems (such as 0, 1, 2, and 6) were marked in Figure 1 as question marks. Some problems are more interesting as ML/SNA problems (such as 4, 7, and 8), while some are more fundamental for urban data science (such as 0, 1, and 6). While the former tend towards the technical and theoretical end of the potential range of the datasets, the latter tend towards the application end. However, to reach a reasonable performance during applications and discoveries, as is the main concern and interest for urban data science, further technical investigations and validations would be indispensable.
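The node-classification test-bed with graph filters such as Graph Convolution Networks [Kipf and Welling, 2016] can be reduced to a single propagation layer; a minimal NumPy sketch on a toy adjacency follows (a real test-bed would use a dedicated GNN library, and all tensors here are hypothetical):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph convolution: H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W),
    following the propagation rule of Kipf and Welling [2016]."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ X @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # toy sub-graph adjacency
X = rng.normal(size=(3, 4))   # node features, e.g. a slice of X_vis or X_tex
W = rng.normal(size=(4, 2))   # learnable weights mapping to 2 label logits
H = gcn_layer(A, X, W)
print(H.shape)  # (3, 2)
```

On the multi-graphs, one such filter per link type (TEM, SOC, SPA) would be fitted and their outputs fused, in line with the per-link-type tuning noted above.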
In summary, the core question behind collecting such datasets can be formulated as follows: while heritage values and attributes have historically been inspected by experts through site visits and document reviews, can computational methods and/or artificial intelligence help accelerate and aid the process of knowledge documentation and comparative studies by mapping and mining multi-modal social media data? Even if accelerating the process is not a priority, the provision of such a workflow aims to encourage consistency and the inclusion of communities in the discourse of cherishing, protecting, and preserving cultural heritage. In other words, the machine could help represent the voice of the community.
Note that a further distinction needs to be made within the extracted heritage values and attributes, as they may be clustered into three categories: 1) core heritage values and attributes officially listed and recognized, which define the heritage status; 2) values and attributes relevant to conservation and preservation practice; 3) other values and attributes not specifically heritage-related, yet conveyed onto the same heritage property by ordinary people. This distinction should be made clear for practitioners intending to make planning decisions based on conclusions drawn from studying such datasets.

Limitations and Future Steps
No thorough human evaluations and annotations have been performed during the construction of the datasets presented in this paper. This manuscript provides a way to work around that step by using only the confidence and [dis-]agreement of presumably well-trained models as a proxy for the more conventional 'inter-annotator' agreement, both to show the quality of the datasets and to generate [pseudo-]labels [Nowak and Rüger, 2010]. This resembles the idea of using consistency, confidence, and disagreement to improve model performance in semi-supervised learning [Zhou and Li, 2010, Lee et al., 2013, Sohn et al., 2020]. For the purpose of introducing a general workflow that can generate more graph datasets, it is preferable to exclude humans from the loop, as they would be a bottleneck limiting the process, both in terms of time and monetary resources and in the domain knowledge demanded. However, for applications where more accurate conclusions are needed, human evaluations of the validity, reliability, and coherence of the models are still needed. It is suggested to inspect some predicted results to get a clear sense of the performance before use. As the step of [pseudo-]label generation is relatively independent from the other steps introduced in this paper, higher-quality labels annotated and evaluated by experts and/or crowd-workers could still be added at a later stage as an augmentation or even a replacement, as in an active learning process [Settles, 2011, Prince, 2004, Zhu and Goldberg, 2009]. Moreover, generating labels of heritage values and attributes was only one choice among many to showcase the label generation process. It is also possible to apply the same workflow while only replacing the classifiers mentioned in Section 2.4 with ones for specific topics related to the interests of future researchers, to answer questions from urban data science and computational social sciences.
While scaling up the dataset construction process, such as from VEN to VEN-XL, a few changes need to be adopted. For data collection, an updated strategy is described in Appendix B. For feature and label generation, mini-batches and GPU computing significantly accelerated the process. However, the small graphs from the case study cities containing around 3000 nodes already contained edges at the scale of millions, making it challenging to scale up to cases such as VEN-XL, whose adjacency list would be at the scale of billions, easily exceeding computer memory. As a result, VEN-XL has not yet been constructed as a multi-graph. Further strategies such as using sparse matrices [Yuster and Zwick, 2005] and parallel computing should be considered. Moreover, the issue of scalability should also be considered for later graph neural network training, since the multi-graphs constructed in this study can get quite dense locally. Sub-graph sampling methods should be applied to avoid 'neighbourhood explosion' [Ma and Tang, 2021].
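The sparse-matrix strategy could be sketched with SciPy (one assumed choice of library; the edge list and node count below are hypothetical):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical weighted edge list (src, dst, weight) of one sub-graph
edges = np.array([[0, 1, 0.9], [1, 2, 0.4], [0, 2, 0.1]])
n = 1_000_000  # a VEN-XL-scale node count: a dense float64 matrix would need ~8 TB

A = coo_matrix((edges[:, 2], (edges[:, 0].astype(int), edges[:, 1].astype(int))),
               shape=(n, n))
A = (A + A.T).tocsr()   # symmetrise the undirected adjacency; CSR for fast row ops
print(A.nnz)            # only the non-zero entries are stored
```

Memory then scales with the number of edges rather than with |V|², which is what makes a billion-edge adjacency tractable at all.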
Although the motivation for constructing datasets about heritage values and attributes from social media was to promote inclusive planning processes, the selection of social media platforms already automatically excluded those not using, or even not aware of, the platform, let alone those not using the internet. The scarce usage of Flickr in China, as an example, also suggests that conclusions drawn from such datasets may reflect perspectives from the 'tourist gaze' [Urry and Larsen, 2011] rather than from local communities, therefore losing some representativeness and generality. However, the main purpose of this paper is to provide a reproducible workflow with mathematical definitions, not limited to Flickr. Images and descriptions from other platforms such as Weibo, Dianping, RED, and TikTok that are more popular in China could add complementary local perspectives. With careful adaptations, archives, official documents, news articles, academic publications, and interview transcripts could also be constructed in similar formats for fairer comparisons.

Conclusions
This paper introduced a novel workflow to construct graph-based multi-modal datasets, HeriGraph, concerning heritage values and attributes using data from the social media platform Flickr. State-of-the-art machine learning models were applied to generate multi-modal features and domain-specific pseudo-labels. A full mathematical formulation is provided for the feature extraction, label generation, and graph construction processes. Three case study cities containing UNESCO World Heritage properties, Amsterdam, Suzhou, and Venice, were tested with the workflow to construct sample datasets, which were evaluated and filtered using the consistency of models and qualitative inspections. Such datasets have the potential to be applied by both the machine learning community and urban data scientists to answer interesting questions with scientific/technical and societal relevance, and the workflow could also be applied around the globe.
The visual feature introduced in Section 2.3.1 was generated from the images as the input of the ML models, while the class label was used as a categorical output in a multi-class single-label classification task.
For each of the selected ML models, the GridSearchCV function with 10-fold cross-validation was used to wrap the model with a small set of tunable hyper-parameters to be selected, with the average top-1 accuracy used as the criterion for model selection. All 812 training samples were input to the cross-validation to tune the hyper-parameters, after which the trained models with their optimal hyper-parameters were tested on the 203 validation samples and on the unseen test set with the remaining 90 samples. For the latter steps, the top-1 accuracy and the macro-average F1 score (averaging the harmonic mean of precision and recall over all classes) were used as evaluation metrics. All experiments were conducted on a 12th Gen Intel(R) Core(TM) i7-12700KF CPU. The implementation details of the models are as follows:

RF
The model did not restrict the maximum depth of the trees. It was tuned on the class weight in settings of uniform, balanced, and balanced over sub-samples, and on the minimum number of samples required to split a tree node in {2, 7, 12, ..., 97}. The best model had balanced class weights and a minimum of 17 samples to split a tree node.
Bagging The model had 10 base estimators in the ensemble. It was tuned on the base estimator among SVM, Decision Tree, and KNN classifiers, and on the proportion of maximum features used to train the internal weak classifiers within the range [0.1, 1.0] ⊂ R. The best model used at most 50% of all features to fit SVM as the internal base estimator.
Voting The model took the first six aforementioned trained models as inputs to the ensemble to vote for the output, and was tuned on the choice between the hard (voting on the top-1 predictions) and soft (voting on the averaged logits) voting mechanisms. The best model used the soft voting mechanism.
Stacking The model stacked the outputs of the first six aforementioned trained models in the ensemble, followed by a final estimator, and was tuned on the choice of the final estimator between SVM and Logistic Regression. The best model used Logistic Regression as the final estimator.

Table A9 and Table A10 give an overview of the mathematical notations used in this paper.
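Assuming scikit-learn as the underlying library (suggested by the GridSearchCV naming above), the tuning of the voting mechanism and the stacking ensemble could be sketched as follows; the synthetic features stand in for the real visual features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the visual features and attribute classes
X, y = make_classification(n_samples=300, n_features=16, n_classes=3,
                           n_informative=6, random_state=0)
base = [("svm", SVC(probability=True)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier())]

# Wrap the ensemble in GridSearchCV with 10-fold CV and top-1 accuracy,
# tuning the hard vs. soft voting mechanism as described above
grid = GridSearchCV(VotingClassifier(base),
                    param_grid={"voting": ["hard", "soft"]},
                    scoring="accuracy", cv=10).fit(X, y)
stack = StackingClassifier(base, final_estimator=LogisticRegression()).fit(X, y)
print(grid.best_params_, stack.score(X, y))
```

The real pipeline uses six base models rather than three; the wrapping pattern is the same.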

E Definition of Categories for Heritage Values and Attributes

Table A11 and Table A12 respectively give detailed definitions of the heritage value categories (in terms of Outstanding Universal Value selection criteria) and the heritage attribute categories (in terms of depicted scenes) applied in this paper.

F Multi-Graph Visualization
The connected components of each type of temporal, social, and spatial links of each case study city are visualized in Figure A6, respectively. The spring_layout algorithm of the NetworkX Python library, with an optimal distance between nodes k of 0.1 and a random seed of 10396953, is used to lay out the graphs.

Table A9: The nomenclature of mathematical notations used in this paper, in alphabetical order. All superscripts of matrices are merely tags, not to be confused with exponents or operations, with the exception of the transpose operator T.

Matrix of Booleans: The adjacency matrix of all post nodes in the set V that have at least one link connecting them, as a composed simple graph.

Matrix of Floats: The weighted adjacency matrix of each of the three sub-graphs G^(*) of the multi-graph G, where '(*)' represents one of the link types in {TEM, SOC, SPA}.

Matrix of Booleans: The adjacency matrix of all unique users U marking their direct friendship, which also includes the relationship of users with themselves.

Matrix of Floats: The weighted adjacency matrix of all unique users U marking their mutual interest, in terms of the Jaccard Index of the public groups that they follow.

Float scalars: Parameters adjusting the weights of the linear combination in the relationship matrices T and U.

β_U (Float scalar, β_U ∈ (0, 1)): The threshold defining the mutual interest of two users, as the Jaccard Index of their public groups.

Tuple: The tuple of all raw data (image, sentences, user ID, timestamp, and geo-location) of one sample point.

D_KL (Float scalar): The Kullback-Leibler (KL) divergence of two distributions.

F (Matrix of Integers and Floats): The face recognition result of an image sample, in terms of the number of detected faces f_{1,i}, the model confidence of the prediction f_{2,i}, and the proportion of the total area of the bounding boxes of detected faces to the total area of the image f_{3,i}.

G0 (Undirected weighted graph, G0 = (V0, E0, w0)): The complete spatial network in a city, weighted by the travel time with all sorts of transportation between spatial nodes.

G (Undirected weighted graph, G = (V, E, w)): The spatial network in a city, weighted by the travel time (no more than 20 minutes) between spatial nodes that have at least one sample posted near them.

G (Multi-graph): The graph including the temporal, social, and spatial links E_* among the post nodes of the set V, weighted by the respective connection strengths w_*.

G^(*): The sub-graph of the multi-graph G, where '(*)' represents one of the link types in {TEM, SOC, SPA}.

Matrix of Floats: The last hidden layer for the [CLS] token of the BERT model pretrained on WHOSe_Heritage.

H_v: The last hidden layer of the ResNet-18 model pretrained on Places365.

i, i′ (Integer indices, i, i′ ∈ {1, 2, ..., K} ⊂ N): The indices of samples in the dataset D of one case city.

I_i (Tensor of Integers within [0, 255] ⊂ N, of size 150 × 150 × 3 or 320 × 240 × 3): The raw image data of one sample post with RGB channels.

K_HA: The confidence indicator matrix for the heritage attribute labels, including the top-n confidence and the agreement between the VOTE and STACK models.

K_HV: The confidence indicator matrix for the heritage value labels, including the top-n confidence and the agreement between the BERT and ULMFiT models.

l, l′ (Integer indices, l, l′ ∈ {1, 2, ..., |V|} ⊂ N): The indices of nodes in the ordered set V of all spatial nodes of one case city.

Matrix of Floats: The language detection result of the original language of the sentences in each sample, in terms of English o_1, the local language o_2, and other languages o_3.

w: The weight vectors of the spatial network G and the post graphs G^TEM, G^SOC, G^SPA; these weights are directly interchangeable with the adjacency matrices.

X_vis (Matrix of Floats and Integers): The final visual feature, concatenating the hidden layer H_v, the face detection results F, the filtered top-5 scene prediction σ(5)(L_s), and the filtered top-10 attribute prediction σ(10)(L_a).

X_tex (Matrix of Floats and Integers): The final textual feature derived from the sentences of each post.

Y_HA: The final generated label of heritage attributes on 9 depicted scenes, as the average of the predictions of the VOTE and STACK models.

Y_HV: The final generated label of heritage values on 10 OUV selection criteria and an additional negative class, as the average of the predictions of the BERT and ULMFiT models.

g_ResNet-18(I | Θ_ResNet-18) (Function inputting an image tensor or a batch of image tensors): The ResNet-18 model pretrained on the Places365 dataset with model parameters Θ_ResNet-18, which processes the image tensor I into the predicted vector of scenes l_s, the predicted vector of attributes l_a, and the last hidden layer h_v.

g_BERT(S | Θ_BERT) (Function inputting a sentence/paragraph or a batch of sentences/paragraphs, outputting a vector or a matrix of vectors): The end-to-end pretrained uncased BERT model fine-tuned on WHOSe_Heritage with model parameters Θ_BERT, together with the MLP classifiers, which processes textual inputs into the logit prediction vector y_BERT of 11 heritage value classes concerning OUV.

g_ULMFiT(S | Θ_ULMFiT) (Function inputting a sentence/paragraph or a batch of sentences/paragraphs, outputting a vector or a matrix of vectors): The end-to-end pretrained ULMFiT model fine-tuned on WHOSe_Heritage with model parameters Θ_ULMFiT, together with the MLP classifiers, which processes textual inputs into the logit prediction vector y_ULMFiT of 11 heritage value classes concerning OUV.

h_VOTE(h_v | Θ_VOTE, M, Θ_M) (Function inputting a vector or a batch of vectors, outputting a vector or a matrix of vectors): The ensemble Voting Classifier with model parameters Θ_VOTE over the machine learning models in M with their respective parameters Θ_M, which processes the visual feature vector h_v into the logit prediction vector y_VOTE of 9 heritage attribute classes concerning depicted scenes.

h_STACK(h_v | Θ_STACK, M, Θ_M) (Function inputting a vector or a batch of vectors, outputting a vector or a matrix of vectors): The ensemble Stacking Classifier with model parameters Θ_STACK over the machine learning models in M with their respective parameters Θ_M, which processes the visual feature vector h_v into the logit prediction vector y_STACK of 9 heritage attribute classes concerning depicted scenes.

I(µ_j) (Function outputting an ordered set of objects): The set of public groups that are followed by user µ_j.

IoU(A, B) (Function outputting a positive float): The Jaccard Index of any two sets A and B, as the cardinality of the intersection of the two sets over that of their union.

max(l, n) (Function outputting a float): The n-th largest element of any float vector l.

σ(n)(l) (Function both inputting and outputting a logit vector): The activation filter keeping the top-n entries of any logit vector l and smoothing all the other entries based on the total confidence of the top-n entries.

Table A11 (excerpt) defines the heritage value categories in terms of OUV selection criteria:
(iii) Testimony: To bear a unique or at least exceptional testimony to a cultural tradition or to a civilization which is living or which has disappeared;
(iv) Typology: To be an outstanding example of a type of building, architectural or technological ensemble or landscape which illustrates (a) significant stage(s) in human history;
(v) Land-Use: To be an outstanding example of a traditional human settlement, land-use, or sea-use which is representative of a culture (or cultures), or human interaction with the environment, especially when it has become vulnerable under the impact of irreversible change;
(vi) Associations: To be directly or tangibly associated with events or living traditions, with ideas, or with beliefs, with artistic and literary works of outstanding universal significance;
(vii) Natural Beauty: To contain superlative natural phenomena or areas of exceptional natural beauty and aesthetic importance.
The heritage attribute categories in Table A12 are based on Veldpaus [2015], Gustcoven [2016], and Ginzarly et al. [2019].

Table A12 defines the heritage attribute categories in terms of depicted scenes:

Monuments and Buildings (Tangible): The exterior of a whole building, structure, construction, edifice, or remains that host(ed) human activities, storage, shelter, or other purposes;
Building Elements (Tangible): Specific elements, details, or parts of a building, which can be constructive, constitutive, or decorative;
Urban Form Elements (Tangible): Elements, parts, components, or aspects of/in the urban landscape, which can be a construction, structure, or space, being constructive, constitutive, or decorative;
Urban Scenery (Tangible): A district, a group of buildings, or a specific urban ensemble or configuration in a wider (urban) landscape, or a specific combination of cultural and/or natural elements;
Natural Features and Landscape Scenery (Tangible): Specific flora and/or fauna, such as water elements of/in the historic urban landscape produced by nature, which can be natural and/or designed;
Interior Scenery (Tangible/Intangible): The interior space, structure, construction, or decoration that host(ed) human activity, showing a specific (typical, common, special) use or function of an interior place or environment;
People's Activity and Association (Intangible): Human associations with a place, element, location, or environment, which can be shown with the activities therein;
Gastronomy (Intangible): The (local) food-related practices, traditions, knowledge, or customs of a community or group, which may be associated with a community or society and/or their cultural identity or diversity;
Artifact Products (Intangible): The (local) artifact-related practices, traditions, knowledge, or customs of a community or group, which may be associated with a community or society and/or their cultural identity or diversity.

Figure A6: The sub-graphs of the multi-graphs in each case study city, visualized using the spring layout in NetworkX. The node size and colour reflect the degrees, and the link thickness the edge weights.