Inferring urban social networks from publicly available data

The emergence of social networks and the definition of suitable generative models for synthetic yet realistic social graphs are widely studied problems in the literature. By not being tied to any real data, random graph models cannot capture all the subtleties of real networks and are inadequate for many practical contexts -- including areas of research, such as computational epidemiology, which are recently high on the agenda. At the same time, the so-called contact networks describe interactions, rather than relationships, and are strongly dependent on the application and on the size and quality of the sample data used to infer them. To fill the gap between these two approaches, we present a data-driven model for urban social networks, implemented and released as open source software. Given a territory of interest, and only based on widely available aggregated demographic and social-mixing data, we construct an age-stratified and geo-referenced synthetic population whose individuals are connected by"strong ties"of two types: intra-household (e.g., kinship) or friendship. While household links are entirely data-driven, we propose a parametric probabilistic model for friendship, based on the assumption that distances and age differences play a role, and that not all individuals are equally sociable. The demographic and geographic factors governing the structure of the obtained network, under different configurations, are thoroughly studied through extensive simulations focused on three Italian cities of different size.


Introduction and Background
Defining accurate models for real-world social networks is instrumental in several research fields, e.g., in sociology [1], epidemiology [2] or marketing [3]. In combination with computer simulations these models may represent a valuable tool to understand social phenomena, along with classic analytical studies. Dynamic processes, such as the spread of a disease or a rumour, can be represented upon suitable networks that encode the patterns of connection and interaction among the individuals of a population. Moreover, the comparison of synthetic networks produced by different generative models helps to infer how each factor contributes to the emergence of experimentally measured properties of real networks [4].
In this paper, we present a novel computational model for urban social networks, that combines a data-driven framework with a set of adjustable parameters. A fully operational open source implementation of the model is available under the GPL v3 at gitlab.com/cranic-group/usn. The software allows to generate a synthetic social network of "strong ties" [5] among geo-referenced and age-stratified individuals. The graph encodes information on the urban social fabric and, as such, it increases the plausibility of dynamic (e.g., transmission) processes that may be influenced by preferences and actions of agents and groups of related agents. On the one hand, our social graph may be used to simulate the fact that friends and relatives may go out together, organize public or private meetings, and are, in general, more likely to interact. This can be achieved by using proximity, measured on the network, as a driving factor in the simulation of • their age group, whose role in the edge creation process can be described by means of a n × n social mixing matrix S = {s i,j }, where s i,j measures the frequency of relationships between individuals of age groups i and j; • their geographical distance, whose impact can be controlled through a suitable non-increasing function D (u, v) of the distance d(u, v) between u and v; • the sociability of the two individuals (i.e., the propensity of each of them to have friends), measured in our model by a social-fitness score f u associated to each agent u; While f u and D might be adjusted to the type and strength of the social ties that one aims at reproducing, the coefficients s i,j should instead be derived from social survey data, when possible. Even in the absence of information about social relationships, publicly available data regarding patterns of physical contacts and interactions should be regarded as a valuable source to estimate the ratios of intra-and inter-age group connections. In the proposed model, D and f u thus only impact on the selection of edges that exist between any two groups i and j, whereas the overall social mixing structure defined by the data-driven S is preserved, at least on average. For the scope of the present paper, we estimate the coefficients s i,j based on data from Polymod [14], extracted through the recently released SOCRATES [15] Data Tool 1 . The tool allows to easily specify parameters such as age breaks, gender, day of the week, duration or location of the contacts, and it produces a social contact matrix drawing from the best public survey datasets for the country of interest. Any suitable mixing matrix can, however, be fed to the simulator through a dedicated configuration file.
Besides the available real-world data, the resulting network depends on a number of design choices. Based on the related literature, we identified a set of minimum requirements to guide the definition of suitable range of values for the parameters of our model, which can be summarized as follows: Tie strength Our model aims at representing strong ties as the union of two layers: the layer of households and the layer of friendship. The former is entirely data-driven, and so is its average degree ν. Friendship ties, on the other hand, are generated using a probabilistic model and, to retain control over the density of this layer, the average number of friends µ is an input parameter. There are at least two reasons to only consider fairly small values of µ: (i) friendship links should represent social ties comparable to kinship and, as such, rare [5], with acquaintances modeled by short, but > 1, distances on the graph; (ii) a small value of µ ensures that our network is quite sparse, a necessary feature in most practical applications. Minimum connectivity Recent work highlighted that spatial social networks are generally well connected, with a single giant component covering the great majority of the graph, at least for urban areas [12,10,16,11]. To represent social relationships in a city in a faithful way, we therefore argue that, while small enough to mimic strong ties, µ must also be large enough so that the combination of household and friendship edges guarantees connectivity. Heterogeneity The parameters f u and D are meant to break the homogeneity of the network, in such a way to mimic well known features of real-world spatial social and contact networks. In particular, previous works agree on a heavy-but not fat-tailed degree distribution [17,10,16,18] and on an inverse-power-law dependence on the distance with exponents in the range [0.5, 2] [17,19,20]. Special attention will be therefore paid to the combination of f u following a (shifted) Lognormal distribution with limited skewness and D = d(u, v) −β for suitable β.
Our network model is described in details in Section 2. The model is experimentally evaluated by looking at a set of empirical regularities often observed in related real-world social networks. In Section 3 we address the impact of single parameters and provide evidence in support to the main design choices -including using all the available data. In Section 4 we instead focus on a set of selected configurations and we provide insights into the structural properties of the resulting social graph. Finally, a summary of the results, a few guidelines for the users of our simulator and suitable directions for future research activities are discussed in Section 5. For an overview of the model, that includes the analysis of an epidemic use case, we refer the reader to [21].
1.1 Related Work

Synthetic Population
The approaches used in the literature for the definition of a synthetic population can be broadly classified in two major groups: Synthetic Reconstruction (SR) and Combinatorial Optimization (CO). According to the SR [22] approach, the population is generated by giving to each agent some relevant socio-demographic variables drawn from suitable joint-distributions deduced by putting together aggregate data covering the whole population with disaggregated data from a sample (usually gathered from surveys). In the CO [23] approach, instead, the area of interest is divided into mutually exclusive zones for which a set of marginal distributions is available, then a sample of real individuals from the target population is directly used to generate the whole population through replication/resampling methods. A comparison between SR and CO can be found in [24].
A peculiar shortcoming of the SR approach in its basic implementation [22] is that it is not possible to satisfy joint distributions of attributes at either household level or individual level simultaneously. A way to overcome this problem is the reconstruction algorithm proposed by Guo and Bhat [25]. Ye et al. [26] proposed to extend the usual Iterative Proportional Fitting method (usually employed in the SR approach) with the Iterative Proportional Updating (IPU) method where the algorithm adjusts and reallocates weights among households until both household and individual-level attributes match. Unfortunately both CO and SR approaches require the collection of several kind of data which can be difficult to obtain. To avoid this problem, a few sample-free models were proposed. The main objective of these methods is to achieve a synthesis of the population starting from the most disaggregated level that is actually available. Barthelemy and Toint [27] extended the SR approach by generating a synthetic population based on households' and individuals' attributes for 589 municipalities of Belgium. For a comparison between sample-free and sample-based methods, see Lenormand and Deffuant [28].
Barrett et al. [8] proposed a "first principle" method for synthesizing populations on both urban and national scale. Their method is capable of reconstructing the contact network of the United State by simulating individuals in the population, their household structure, demographics attributes and a 24-hour activity sequence. To do so, they relied on a large data set, including census data, activity location, traffic data and roads network as well as several thousand responses to an activity, time-use survey, etc. The definition of our synthetic population, described in Section 2.1 and analyzed in Appendix A, follows a partially similar approach to [8], but it requires significantly less data.

Social and Contact Network Models
A number of random graph models, proposed across decades of research, have been widely used in computational social sciences [6]. The lack of available data about real-world social ties fostered the choice of rather simple models (e.g., Erdös-Rényi [29], Barabasi-Albert [30] and Watts-Strogatz [31]) which rely on just a few, easy to explain, assumptions. While these models capture in a simple and elegant way some essential features of different kinds of complex networks, it is well know that, in practice, they have significant limitations. Models designed to mimic the scale-free degree distribution emerging in many real networks, for instance, may fail to yield the expected clustering structure [32]. Exponential random graphs have been shown to overcome some of these limits [33,34].
Generally speaking, a possible adjustment to those models consists in introducing a homophily principle [35], according to the widely acknowledged insight that individuals tend to socialize with their peers [36,37]. Among other aspects, such as education or economy, age emerged as a critical element in the formation of social ties [38,9,39,40], possibly thanks to the availability of age-related data at different spatial scales [41]. As far as we know, however, there are no quantitative studies that report the relative frequency of social links (i.e., relationships) by age, and previous network models that incorporate real data on social mixing by age are designed to generate synthetic physical interactions. Another widely studied type of homophily is spatial proximity, which gives rise to the so-called spatial networks.
Most authors considered variations of well known random network models obtained by embedding the vertices in a metric space. The imposed spatial constraints influence the topological properties of the network [42,43] and the imposed penalty on "long" edges causes the spatial distribution of the vertices to impact on clusters, path lengths, degree distribution, and more [44,4]. Again, there is no much research about data-driven spatial social network models, in which the location of the individuals can be retrieved from real data.
A somehow symmetrical problem consists in inferring social or contact networks from real data. For virtual populations, e.g., online social networks, relationships and actually occurred interactions may be directly retrieved [11,16,45,46], at least to some extent. The dependability of this information is however debatable, because the virtual population may convey significantly different information with respect to self-reported friendships [47]. When the focus is, instead, on the physical interactions among the individuals of a population, a synthetic version of the population may be created on the basis of census and/or survey data [48]. The final goal is the definition of a set of human mixing patterns (i.e., the frequency of contacts among people of different ages and/or in different places) which allow to reproduce the dynamics of the network, for instance to model the diffusion of a disease in real populations [49,50,51,48]. This approach usually requires to focus on a set of primary social settings (e.g., households, schools and workplaces) and to collect sample data (e.g., surveys, questionnaires, diaries, mobility, etc.) [14,52,50], possibly integrated with mobile/traffic/wearable sensors data [7,53,47,54] or online tools [51]. These data may be used to directly extract setting-specific contact matrices [14,8,52,50], or to recreate realistic instances of such social settings and synthetic agendas used to feed an agent-based simulator by which agent-to-agent interactions are reproduced [55,56,8,53,57]. It is worth noting that these studies target physical contacts rather than relationships. While they provide valuable information about social mixing patterns, they do not aim to define a network of strong social ties. Moreover, in a recent call to action several experts expressed criticisms about most already existing tools and raised the attention towards the need for accurate yet flexible and replicable approaches [58].

Properties of Empirical Spatial Networks
Spatial social networks have been found to form a large connected component on both mobile phone networks [12,10,16] and online social networks [11]. The average degree is typically small, albeit dependent on the type of network considered: it ranges from ≈ 5 in mobile phone and online social networks [12,11,16], to ≈ 10 (when only strong ties are considered) or ≈ 20 when the network is inferred from social surveys [59,60,61]. Despite the limited average degree, most spatial social networks show the typical small-world effect, with a short average path length close to the well-known "6 degrees of separation" rule [11,10]. Another typical feature of social networks is the high clustering coefficient (i.e., number of triangles) compared to random ER-like networks. Indeed, a clustering value of ≈ 0.2 has been found in mobile phone [11,10] online social [16] and survey-based [18] networks. A clear peculiarity of geographical social networks is instead the absence of very large hubs, which are widely observed across multiple fields, including web graphs [62], the Internet [63] and online social networks [46]. The maximum degree in geographical social networks is apparently limited to a few hundreds, as emerged from both mobile phone networks [12,17,10] and the LiveJournal Blog with geo-localization [16], and in line with sociological studies [64,65,5]. In [12,17] the authors found a power-law distribution with ≈ 8 and ≈ 5 exponent, respectively, whereas [10] and [16] only highlighted the occurrence of a long-tailed distribution. In [18] the degree distribution is well approximated by a lognormal with σ = 0.9 and µ = 2.6 A widely studied problem is characterizing the role of spatial proximity and population density in the edge creation process. The dependence of friendship on distance has been found empirically as an inverse power-law whose exponent β usually lies in the range [0.5, 2] [59,61,17,11,10,20,16,66,18,67,68]. The measured edge length distribution is however strongly dependent on the data source and on the size of the considered territory, e.g., it may be almost flat up to 10 km [11] or present a spike for lengths < 5 km [68]. Mobile phone networks, in particular, may cover hundreds or even thousands of kilometers and generally provide geographic locations, and thus distances, with a very variable granularity -from 1 km in densely populated urban areas to 10 km in rural regions [10]. Using the network reconstructed from a survey, in [18] the edge length distribution shows two distinct regimes: β = 0.6 for short range contacts (< 20km) and β = 1.8 for long range contacts. In general, a value β < 1 is not rare [16,66] and possibly associated with urban networks, as is the case for the value β = 0.44 obtained for the city of Dublin in [67]. Furthermore, many studies agree on the existence of a significant correlation between geographical proximity and community structure [10,67,68,45,9]. A recent analysis of the urban area of New York based on the geo-localization of Facebook accounts shows that most of the connections occur between adjacent zip codes and that the underlying transportation infrastructure has a visible impact [45]. Studying a phone call network, however, in [20] the authors found a threshold effect, with a regular increase in the geographical extension of the communities only observable for small clusters with up to 30 members. This was confirmed by a later work, that detected highly clustered connected components spanning across very large areas of the city [10]. Population density is also often considered a factor in the formation of social bonds. In his seminal work [64], Granovetter argued that personal networks in the cities -with respect to hinterland and peripheries -are characterized by low clustering values and by a higher number of weak ties. In [68] an extensive empirical phone call network has been analyzed finding that, while the density of individuals has a negligible impact on the size of each person's network, it induces a higher number of close-range contacts.
Despite the previous findings, there is wide evidence that the geographical factors alone cannot explain the structure of spatial social networks [11]. Other forms of social proximity were estimated to be responsible of ≈ 1/3 of friendships in the LiveJournal Blog network [16]. Studying a phone call network, the authors of [10] found that the number of intra-tower connections observed was higher than a pure geographical model would generate, and that the small-world nature of the network was mostly related to other forms of social cohesion, rather than geographical proximity.

Materials and methods
Our urban social network is obtained as the combination of a household network with a friendship network. The construction of these networks requires to first map census data to reconstruct the population and then model interpersonal connections. Formally, the network is represented by an unweighted undirected graph G = (V, E), where V is the vertex set of size N = |V | and E is the edge set. In particular, we have E = E H E F , where E H is the set of household edges, E F is the set of friendship edges and denotes the disjoint union. In the following, we explain how V , E H and E F are defined in our model. We will often use the expression household graph to denote the subgraph G H = (V, E H ) and the expression friendship graph to denote the subgraph G F = (V, E F ). The notation and the parameters used throughout this paper are summarized in Table 1.
The urban social graph whose edges represent generic "strong" social ties Obtained flattening the GH and GF layers onto a singlelayer graph N = |V | The number of nodes in the graph, equal to the population size Data-driven, based on the WorldPop database; see Table 3 for the population of all cities ν The average degree of the GH layer Data-driven, ≈ 2 for all cities (see Table 3) µ The average degree of the GF layer An input parameter, set to 1, 5 or 10 in the experiments (see Sections 3 and 4) The average degree of the graph G See above gu The age label of node u, taking value in {0, . . . , n − 1}, which determines a partition of the population based on age Drawn from a data-driven age distribution; the list of agebreaks is parametric; in the experiments we use ISTAT data with age-breaks (0, 18,35,65) Vi (and |Vi|) The set of nodes having age label i (and the number of such nodes) Deduced from the age labels assigned to the nodes of the graph mi,j The number of pairs of vertices (u, v), with u = v, such that u ∈ Vi and v ∈ Vj Computed based on |Vi| and |Vj| (see (2.1)) tu The tile label of node u, taking value in {1, . . . , T }, which determines a partition of the population based on the place of residence Drawn from a data-driven population density; the tessellation is parametric, the tile side is l, the grid is composed of T = T lat · T lon tiles; in the experiments we used WorldPop data, l = 1 and grids as reported in Table 3 d(u, v) The approximated euclidean distance between u and v Computed as max l 2 , d * (tu, tv) , where d * (tu, tv) is the euclidean distance between the center of the two tiles

D(u, v)
A non-increasing function of the distance d(u, v) An input parameter; in the experiments, we use The real-valued fitness score of node u, that quantifies its sociability Drawn from a probability density function φ specified as an input parameter; in the experiments, we consider a shifted Lognormal 1 + LN (ln(2), 0.25) or a constant distribution The real-valued social mixing matrix whose element si,j measures the frequency of social ties between age groups i and j Data driven; in the experiments, we compute S based on contact data from [14] extracted using [15]; we also consider a constant S

Territory, Population and Vertex Set
We define the territory of interest in terms of an origin and of a regular lattice of T = T lat · T lon square tiles, i.e., by establishing the South-West corner of our bounding box, the side l of each tile and the number of tiles T lat and T lon along the latitudinal and longitudinal axes, respectively. For the scope of this paper, l = 1 km for all cities. We then extract a geo-referenced population for the whole area of interest from the WorldPop Project 2 , which provides data of the world population for 100m × 100m square cells. We resort to the overpass API of the well known OpenStreetMap database 3 to find the minimal grid that contains the city's boundary; we then select only the tiles of the grid whose center lies inside it (see Appendix A). By mapping the WorldPop data to our tiles, we obtain a very precise estimate of the real population living in each tile.
Albeit WorldPop -as well as other data sources -directly provides population densities for different age groups, we believe that all demographic parameters should be acquired from the same data source in order to guarantee the intrinsic consistency of the model. Any desired age-stratification can thus be easily specified in the simulator's configuration file in the form of a set of age groups and their respective frequencies. For the scope of this paper, we decided to rely on the Italian Institute of Statistics (ISTAT) 4 , which provides age distribution at the provincial level. The United Nations Statistics Division (UNSD) makes available similar data for many other countries 5 .
Each vertex u ∈ V of our graph, which represents an individual of the population, is thus characterized by three attributes: Tile The tile label t u ∈ {0, . . . , T − 1} is set equal to the unique index of the tile where u resides.
Age The age label g u ∈ {0, . . . , n − 1} for vertex u is drawn independently at random from the given age-stratification. These labels determine a partition of the vertex set V into n disjoint subsets V 0 , . . . , V n−1 .
Fitness Inspired by previous work, that modelled degree heterogeneity by means of a vertex-intrinsic fitness [13], we extract a sociability fitness attribute f u > 0 for each u. The probability of a friendship edge between u and v is then set proportional to f u and f v (see Section 2.3).
Approximating the position of each vertex with its tile is instrumental in simplifying the computation of the household structure (see Section 2.2) and of pairwise distances. To this end, we use the approximation d(u, v) = max l 2 , d * (t u , t v ) , where t u is the tile of u and d * (t u , t v ) denotes the distance between the center of t u and the center of t v .
Our model does not put restrictions upon the choice of f u ; yet, the distribution of f u shall be chosen considering its impact on the degree distribution of the friendship graph. Possible choices include a Lognormal, a Pareto, a uniform and a constant distribution -all of which are already supported by our simulator.

Household edges
We group individuals into households according to a heuristics that uses the distribution of "household roles" per age and the distribution of the number of children per family, under the assumption that: (i) all members of a household live in the same tile; (ii) children are younger than their parents; (iii) partners have, on average, a similar age. The algorithm is based on the concept of household role, represented as a pair of the form (household-type, role), whose possible values, for Italy, are reported in Table 2. For instance, r u =(single-parent, parent) means that u is a parent in a household of type single-parent, where r u [0]=single-parent is the household-type and r u [1]=parent is the role. We make use of two conditional distributions: Pr[r | g] is the probability that an individual has role r given that she/he belongs to age group g; Pr[k | h] is the probability that a household of type h has k members. This information is made available, as aggregate national data, by the ISTAT for Italy and, e.g., by the UNSD for many other countries.
At a high level, the heuristics works as follows: • Based on the distribution Pr[r | g u ], extract a role r for each vertex u.
• For all u such that r u [0] ∈ {single-parent, two-parents}: ], extract the total number of members k u for the household of u (and, in case, of v) and compute their total number of children c u .
and assign w to the household of u.
• based on the distribution Pr[k | various], randomly compose the households of type various.
In our simulations, the number of individuals not assigned to any household by the heuristics is negligible, and the empirical distributions of household types and members per type almost perfectly match the expected ones (see Appendix A for details). The household edges E H are finally obtained as the union of all the cliques that connect all members of the same household. The average degree of the household edges, denoted ν, is thus entirely data-driven.

Friendship Edges
As said, the age labels define a partition of the vertex set V into n disjoint subsets V 1 , . . . , V n . For each possible pair of groups For the scope of the present paper, we estimate the coefficients s i,j relying on the recent SOCRATES project [15] that makes it possible to easily extract aggregated social contact matrices from public available and well-established datasets. In Appendix B, we describe how we constructed the matrix S = {s i,j } and we show that any constant S corresponds to the condition of age-homogeneous mixingi.e., to edge occurrence being independent of the age. To each pair (u, v) ∈ V × V , we finally associate a distance d(u, v) computed based on the tile labels t u and t v as explained in Section 2.1.
, is defined as follows: where the parameter µ is the average number of friends, i.e., the average degree of the graph G F .
A few features of our friendship graph deserve additional remarks: Mixing and attraction Let us define the mixing between age groups i and j as M (i, j) = m i,j · s i,j and the total mixing of the given population as M = i≤j M (i, j). Similarly, let us define the attraction between u and v as a(u, v) = D(u, v) · f u · f v and the total attraction between groups i and j as A(i, j) = u∈Vi,v∈Vj a(u, v). It is trivial to rewrite (2.2) as where D and f denote the mean value of the fitness score f u and of the distance function D(u, v), respectively. The approximation in (2.3) is obtained under the assumption that A(i, j) ≈ m i,j · a(u, v) = m i,j · D · f 2 for all i, j, which is accurate if all age groups are large enough.
Edge density and social mixing To each pair of age groups (i, j) with i ≤ j we can associate a subgraph as follows: if is the bipartite subgraph of G F induced by the vertices in the disjoint union V i V j and where only edges (u, v) such that u ∈ V i and v ∈ V j are admissible. It is easy to verify that G F is the disjoint union of these subgraphs, that is, The coefficient m i,j equals the maximum possible number of edges in each subgraph G F (i, j), whereas the mixing M (i, j) is the expected number of such edges in a network where the edge probability is uniquely determined by the cohesion s i,j . The expected value of E F,i,j in our networks is instead for all i ≤ j. The expected number of edges in the entire friendship graph is thus This shows that, thanks to the normalization by A(g u , g v ), the age-based social mixing is respected, up to a scaling factor. In particular, the dependence of Pr[u, v] upon d(u, v) does not affect the (expected) density of the friendship graph, which is entirely controlled by µ. By tuning D(u, v) we may adjust only the physical length distribution of the edges -arguably, in order to penalize long edges.
Expected degree Thanks to the relation u deg F (u) = 2 · |E F |, it is easy to verify that the expected friendship degree of a random u ∈ V is exactly equal to µ. To guarantee that In words, since by increasing µ we evenly increase the probability of all edges, the choice of µ is possibly constrained by the existence of pairs (u, v) for which the combined effect of the mixing M (i, j) and the attraction a(u, v) makes the existence of the edge (u, v) significantly more likely than the average. The expected degree of a vertex u having fitness f u , age g u and tile t u can instead be computed as µ u is thus proportional to u's social fitness f u and to µ, as expected. Equation (2.7) also shows that µ u depends upon the tile t u through a factor given by the average distance of t u from all other tiles (included t u itself), weighted by the (group-weighted) total sociability of each tile.
Scale invariance and parameter range It is easy to verify that Pr[u, v] is invariant under multiplication of s i,j , D(u, v), f u and/or f v by any positive constant. This means that any two approaches to derive the coefficients s i,j from real data that only differ by a constant factor are equivalent for our model. Now, for (2.2) to be meaningful, we need D(u, v) to be upper-bounded by a suitable constant D max . In our model, this is achieved by imposing that all vertices belonging to the same tile are at distance d min = l 2 > 0 and that (with a slight abuse of notation) D max = D(d min ) < +∞. On the other hand, to prevent the occurrence of vertices with expected degree ≈ 0 -at least, in a sufficiently large network -f u must be bounded away from 0. Summing up, without loss of generality, we may assume that D(u, v) ∈ (0, 1] for all u, v ∈ V , that f u is drawn from a probability distribution with support [1, +∞), and that i≤j s i,j = 1. The latter condition suggests to interpret the coefficients s i,j as the probability that a randomly chosen edge of the graph connects groups i and j.

Selected Configurations
Although our network model does not impose restrictions upon the choice of f u and of the function D, we believe that the following configuration is of special interest: The choice for the distribution of f u is motivated by the observation that the degree distribution of our friendship graph will, at least partially, follow the fitness distribution. Lognormally distributed data occur across different domains [69] and recent work suggests that the degree distribution of real-world social networks makes no exception [16,18]. In particular, a Lognormal degree distribution would fit the intuition that only a few people have very few social links. The specific parameters have been chosen to enforce limited skewness and variance, so as to guarantee that most vertices will have a degree close to the average and that the hubs will be limited in both number and size. It goes without saying that other choices may be preferred, some of which (e.g., a Pareto, a uniform and a constant distribution) are already supported by our simulator.
As to D, as extensively discussed in Section 1.1, there is a vast body of research supporting the choice of a power-law decay for the dependence of social ties upon the geographical distance between two individuals. There is no equal agreement regarding the exponent β, with empirical findings suggesting a significant variability according to the geographical scale of the analysis and to the type of interaction object of the study. We decided to compare β = 0.5 and β = 2, which seem to lie at the two ends of the spectrum of the values observed in the literature. In both cases, as already discussed, since d(u, v) is bounded, so is D.
It is worth to highlight that, for the considered set of models, the upper bound (2.6) for the choice of µ is indeed rather loose in most practical cases, as we experimentally verified.

Impact of System Parameters
To sort out the role of the different system parameters in shaping our urban social network we resort to an extensive experimental campaign focused on three Italian cities of different size: Florence (363060 residents), Viterbo (66598) and Sabaudia (21274). For the sake of readability, the results for Viterbo and Sabaudia are reported in Appendix C and Appendix D, respectively.
For each city we selected the administrative boundaries from OpenStreetMap and filtered the population inside the city shape as described in Section 2.1. For Sabaudia and Viterbo the smallest boundary available is the municipality which encompasses a large rural area with small urban agglomerates around the city. For Florence we were able to select the actual city boundaries which do not include neighboring areas. For each city, we extracted the age labels based on ISTAT data aggregated at the provincial level. We then built the households as described in section 2.1. The obtained territories are depicted in Figure 1 (and 12 in Appendix A). An overview of the population and the territory is reported in Table 3 for all cities. For further details see Appendix A. Table 3: Reconstructed parameters for the three cities. For the city of Florence we considered only the city shape whereas for Viterbo and Sabaudia all the municipalities, which include rural areas and small villages, have been used. ν is the average degree of the reconstructed household network G H , T lat and T lon are the number of tiles along the two axes of the grid used to cover the city territory, l is the side of each tile.

City
Boundary N children% young% adults% elderly% In this Section, we focus on a few basic properties of the network and on the design choices needed to achieve them. As a side result, we show that any attempt to significantly simplify our network model downgrades the resulting social graph, especially in terms of heterogeneity and connectivity. It is worth to underline that, to account for variance, hereafter any configuration is evaluated by considering the results of 10 independent runs.

Safeguarding connectivity
There are two main factors that impact on the overall connectivity of the graph: the average number of friends µ and the reconstructed households structure. In Section 2.3 we found that our efforts to introduce heterogeneity in the model  while preserving certain structural properties inferred from contact data yield an upper bound for µ. Albeit there is no analogous strict lower bound, µ must be chosen large enough to guarantee that the graph has the expected level of connectivity. To provide insights into the choice of µ, we compare the graphs obtained with µ = 1 and µ = 5, measuring the percentage of nodes belonging to the giant component for different configurations. We remind that the household layer G H is composed of a set of cliques and has average degree ν ≈ 2 for all cities -the average degree of the whole graph being instead K = ν + µ. In the following, we will focus on the friendship layer G F alone and on the entire urban social network G.
To prevent the results from being influenced by the composition of the population, we: (i) set the matrix S to a constant, to force homogeneous mixing between the different age groups (see Appendix B); (ii) ignore the data-driven population density, considering equally frequent age groups and distributing the individuals uniformly at random in the given territory 7 . For what concerns the parameters f u and D, we focus on the special configurations highlighted in Section 2.3.1 -i.e., f u ∼ 1 + LN (ln(2), 0.25) and D(u, v) = d(u, v) −β with β ∈ {0.5, 2} -and we additionally compare them with the choice of a constant fitness f u ≡ 1 for all u.
Ideally, we would like a graph where (almost) all nodes belong to a single giant component, in accordance with empirical findings from the literature [10]. In Table 4 we present the case of Florence -similar results hold for the other cities and are listed in Appendix C and Appendix D. Let us first focus on the friendship graph G F alone (i.e., no households). The case µ = 1 clearly shows that this graph is very poorly connected if the average degree is very low. When the fitness is Lognormally distributed, thanks to the increased degree variability, the giant component includes ∼ 20% of the network. When the fitness is constant, the giant component is instead significantly smaller, reaching a critical 1.9% in the configuration that maximizes the homogeneity of the network. When µ = 5, the giant component on the other hand, includes almost the entirety of the network. Quite interestingly, in this regime switching from a constant fitness to a Lognormal one yields a slight decrease in the coverage of the giant component of the friendship network. The rationale is that, in the latter case, most hubs occur in denser areas of the graph, thus increasing the internal cohesion of the giant component to the detriment of its size. By comparing the friendship graph with the whole graph, we finally see that when µ = 1 the households are capable to partially compensate for the limited number of friendship edges. When µ = 5, finally, the households make it possible to even improve the already almost perfect coverage of the giant component.

Accounting for age in friendships
One fundamental design choice of our friendship graph model consists in enforcing that the proportion of intra-and inter-age group connections adheres to publicly available contact data. This information is encoded in the age-mixing matrix S and, as emerged in Section 2.3, imposing this condition comes at the cost of additional constraints to the model. To verify the importance of having age-dependent edges, we compare three different configurations: (i) a homogeneous population connected using a constant matrix S; (ii) a homogeneous population connected based on the real (i.e., data-driven) S; (ii) a real (i.e., data-driven) population connected based on the real S. With homogeneous population we mean that both the age distribution and the spatial density are taken to be uniform. The constant matrix S corresponds to age-homogeneous mixing, as explained in Appendix B. Again, we consider the combinations of parameters selected in Section 2.3.1, plus a fitness-neutral model with f u ≡ 1 for all u. Finally, to guarantee that age-dependent preferences and patterns in the friendship edge distribution are well visible, we only consider our friendship graph and we fix µ = 5, to obtain a network that is well connected but not too dense.
For each of the four age groups, we computed both the average degree of the members of each group in the whole friendship graph G F , and their average degree considering friendship connections with their peers only, that is, within the subgraph of G F induced by the set of individuals belonging to that group. In Figure 2 we report these data for the city of Florence -the other cities show analogous patterns, see Appendices C and D for details. With homogeneous population and constant S (left panel) all age groups are, de facto, equivalent and the age-stratification is irrelevant for friendships. Introducing real nonuniform age-mixing coefficients s i,j , even with a homogeneous population (central panel), ensures that the data-driven internal cohesion of younger age groups is preserved, but adults are more likely to establish links with other age groups. Finally, when the population and the matrix S are both data-driven (right panel), we notice an increment in the overall average degree of children and young people, which are < 25% in the real population, and a decrease in the overall average degree of adults, which instead constitute ≈ 43% of the population of Florence. Significantly, Figure 2 confirms that neither f u nor β have any impact on the age-mixing patterns of the graph, as imposed by construction. For the sake of completeness, Figure 3 shows the percentage of the total number of edges that link each age group with all others, for the friendship graph and for the entire social graph obtained for the city of Florence using the real population and real S -see the Appendix, Figures 16 and 26, for Viterbo and Sabaudia. Figure 3a is essentially a visualization of expression (2.4), and the value in each cell is thus proportional to the size of the two groups and to their mixing coefficient s i,j . We see that, albeit the adults have a lower average number of friends than the younger groups, they are still involved, in total, in the majority of friendship edges.

Using real population density and distances
One of the main elements of novelty of our model resides in the use of a data-driven population, in contrast with a body of research work on spatial networks that assumes a certain regularity in the spatial distribution of the population (e.g., uniform or Gaussian). The positioning of the individuals of the population and the level of penalization imposed to geographically long edges impact on the final shape of our social graph at multiple levels. To assess the relevance of using real data for the spatial density of the population, we study the friendship network G F (i.e., no household edges, ν = 0 and K = µ) obtained with either a data-driven population or a homogeneous population where individuals are distributed uniformly at random in the given territory, but fixing in both cases the size of the groups to be uniform, i.e., |V i | = N n for all age groups. We consider three different options for D(u, v): D(u, v) = 1 for all u, v, D(u, v) = d(u, v) −0.5 and D(u, v) = d(u, v) −2 . Since one of the criteria for comparison will be the connectivity of the network, we let µ vary as µ = 5 or µ = 10. Conversely, to isolate the dependency of the edge distribution upon the spatial density, we take both S and f u to be constant and equal to 1. Table 5 and Figures 4, 5, 6 show the results of our tests for the city of Florence -for the other two cities, see Appendices C and D. We repeated the construction of the social graph 10 times for each configuration. Since the variance is negligible and adds no valuable element of discussion, for the sake of clarity we only included in Table 5 the average value for each metrics over each set of runs. Figures 4, 5 instead report the mean distributions over each set of runs with a 95% confidence interval. To measure the distance between the data distributions and the Poisson distribution, we compute the relative entropy or Kullback-Leibler (KL) distance -the values are reported in the caption of the figures. The KL distance D(p||q) between distribution p and q measures the amount of information lost when the distribution q is used to approximate the distribution p [70].
We can divide our tests in three main categories, based on the used combination of parameters: 1. Homogeneous/real population and D(u, v) = 1: Geographical distances and the distribution of the population over the territory do not affect the graph construction if we set D(u, v) = 1 and we impose a uniform size for all age groups. In this case, we expect to create an ER graph, because we are actually fixing the edge probability P r[u, v] to be the same for any node pair u, v. In an ER graph [71,63] the expected clustering coefficient is C = µ N −1 , where µ is the mean degree, the degree distribution is a Poisson distribution (for large N ), and the expected average shortest path is dist ≈ ln N ln µ . The graphs of our tests have 363060 nodes (the population of Florence), thus we expect the following values: C = 1.38e−0 for µ = 5, C = 2.75e−05 for µ = 10, dist ≈ 7.95 for µ = 5 and dist ≈ 5.56 for µ = 10. Our results (see Table 5, Figures 4 and 5) confirm the expected values and degree distributions. Therefore, if we do not consider real population density and distances we are de facto building an ER graph. Main features of the friendship graph G F as the population type, D(u, v) and µ vary, for constant f u and S. dist is the average path length, C is the global clustering coefficient, ρ is the degree assortativity, "# comp." denotes the number of connected components, "giant %" denotes the percentage of nodes in the giant component.     , even if we impose a uniform distribution of the population over the territory, other than equal size age groups. Our results (see Table 5, Figures 4  and 5) show that β = 0.5 has a major effect on the assortativity of the network, whereas β = 2 affects both the global clustering coefficient and the assortativity. The latter is always positive and for β = 2 the global clustering coefficient is about 5 times greater than in all graphs built in the previous cases. Again, these are desirable features in a social network model [72,43]. Geographical distances have also an effect on the degree distribution, but it is appreciable only for β = 2 (see Figures 4, 5).

Real population and D(u, v)
Finally, the combined use of the real population and of a penalization for long edges is clearly visible in our simulation, for both values of β, but especially when β = 2. Real density and distances mainly influence the number of connected components and the assortativity of the network. Notice that the number of connected components is also affected by the average degree of the network: the higher the degree, the lower is the number of connected components. Specifically, the use of real data seems to favor higher assortativity values and a higher number of connected components. While a uniform population density creates a giant component whose size is in line with the giant component of an ER graph, a real density creates a smaller giant component and a higher number of connected components (see Figure 6). For uniform densities the size of the connected components is almost always 1 (the same holds when we set D(u, v) = 1), instead for real densities we have more components whose size is more variable. Finally, as shown in Figures 4 and 5, the degree distribution is also affected by real densities and distances. Preventing a Poisson degree distribution is a further step toward the definition of a suitable model, because social networks are usually characterized by skewed degree distributions [43].
Summarizing, our tests show that penalizing long edges in (2.2) while using the real population density allows to construct a friendship graph deprived of the typical characteristics of ER-like homogeneous graphs. Even more, using real data we are able to simulate some of the features usually observed in social networks. These findings support the rationale that a data driven approach is important: real population density and distances must be taken into consideration for building urban social graphs. Finally, we observe that the impact on graph construction of geographical distances, when β = 0.5, is negligible if not combined with real densities, as shown by the degree distributions (Figures 4, 5) and values in Table 5.

Characterization of the Urban Social Network
In Section 3 we verified that each and every parameter of our model has a role in the definition of the resulting urban social graph, and that making proper use of the available data is paramount in order not to end up with a too simple network model (e.g., a ER-like graph). In the following, we therefore consider the fully data-driven version of our model: the synthetic population is constructed as described in Section 2.1 based on real age-stratification data, real spatial density estimates, and real household composition data; the dependence of the friendship edges upon the age of the individuals is inferred from publicly available contact matrices. We study the topological features of the graph and their relation with the imposed geographical constraints, as the remaining parameters vary as follows: (i) µ = 5 or µ = 10; (2), 0.25). Notably, we already know from Section 3 that all these configurations yield, albeit to different extents, desirable properties of social networks, including good connectivity, decent network assortativity and global clustering coefficient, and a non-Poisson degree distribution.

Global metrics
We assessed the impact of using the real population density and an inverse-power-law distance penalization in Section 3.3. In this section, we focus again on the main global features of the graph, but considering the fully data-driven model that includes the household layer G H and the age-based social mixing imposed through the matrix S (see Appendix B).
By using the data-driven values of S, we are adding to the model an age-specific homophily and we expect an increase in the assortativity of the resulting graph. Indeed, for the city of Florence, considering the other parameters as in Section 3.3 with β = 2 and µ = 10, the assortativity grows from 0.2 to 0.48. Moreover, by connecting with higher probability nodes belonging to specific subsets (i.e., those in the same age groups) we report a four-fold increase of the global clustering coefficient from 0.00012 to 0.0004. On the other hand, the layer G H introduces, by construction, a large amount of cliques -albeit small -so we can expect a great impact on the clustering coefficient (i.e., the number of triangles). In fact, considering again the city of Florence, with the same parameters of the previous example and adding the households, the global clustering coefficient increases ∼ 40 times from 0.0004 to 0.0151.
An overview of the resulting graph for the city of Florence with the parameter chosen as described at the beginning of Section 4 is reported in Table 6 -for Viterbo and Sabaudia see Appendices C and D. For each set of parameters, we repeated the construction of the social graph 10 times. The values reported in the table are the mean computed over each set of runs, the variance is negligible for all the values and thus omitted. By looking at Table 6 we see that the contribution of the fitness to the clustering coefficient, the assortativity and the connectivity is only marginal compared to S and to the households (for a direct comparison with the configurations considered in Section 3.3, see Table 5). The value of the global clustering is several orders of magnitude greater than the corresponding ER graphs, it shows a small decrease with the fitness and for higher values of β (and µ). Both circumstances favor the creation of hubs that attract edges from the rest of the graph thus making "local" triangles less likely to occur. Besides the presence of hubs, there will be a higher number of nodes at a lower degree and therefore a higher probability of having many connected components. The degree heterogeneity also penalizes the assortativity which is consistently lower for the run with the fitness but increases with β. It is likely that, on average, the higher value of β reinforces the group-to-group assortativity of S. On the other hand, the average local clustering has the opposite behaviour and constantly increases with the fitness, β and µ. Local connections are favored by a stronger dependence on distance and by the presence of local hubs. Almost regardless of the configuration, the average shortest path length is comparable to ln(N ) ln(K) , where K = µ + ν is the average degree of the graph G. This value is typical for small world networks. Main features of the social graph as f u , β and µ vary, for data-driven population and S. K = µ+ν is the average degree, dist is the average path length, C and C loc are the global and average local clustering coefficients, ρ is the degree assortativity, "# comp." denotes the number of connected components, "giant %" denotes the percentage of nodes in the giant component.

Connectivity, communities, degree
Hereafter, we focus on the graph we obtained for the city of Florence, for which we provide an overview of the main characteristics in Figure 7. Similar results hold for the other two cities and are reported in Appendix C and Appendix D.
First of all, by looking at Figure 7a we notice that the fragmentation of the graph into connected components is very stable across different simulations and different configurations. For all eight considered combinations of parameters, a single giant component exists, covering at least 97.3% of the whole graph, with a few additional components having tenths to hundreds of nodes and a constellation of isolated vertices.
We then focus on the modularity-based clustering obtained by applying to our network the well-known Louvain algorithm. From Figure 7b, we see that some parameters do have an impact on the organization of the graph into densely connected communities. In particular, a steeper cluster size distribution is associated with a greater average degree and with a stronger dependence of friendships on distance. These two conditions, in fact, favor the emergence of just a few giant clusters and of a multitude of clusters of variable size. When µ = 10 and β = 0.5 we notice a sudden drop of the cluster size, with no clusters of size ≈ 100. Quite surprisingly, taking f u ∼ 1 + LN (ln(2), 0.25) seems to have a minor impact from this point of view, albeit this choice is supposed to guarantee the existence of large hubs. For the sake of completeness, in Figure 8 we compare the modularity of the obtained clustering structure for different configurations. We see that the modularity ranges from less than 0.3, when µ = 10 and β = 0.5, to ≈ 0.5 when µ = 5 and β = 2. Not surprisingly and in line with previous findings, the density of the network has a major impact on the quality of the obtained communities. Especially with denser networks, the penalty applied to long edges also has a visible effect, contrarily to the fitness score.
In Figure 7c we finally show the degree distribution of the graph (we remind that household edges are included) for all configurations, in a log-log scale. At least when f u ∼ 1 + LN (ln(2), 0.25), we expect the right tail of the degree distribution of the graph to be heavy but not fat (i.e., subexponential but not power-law). To verify this insight, we included in the plot the result of a Lognormal fit of the portion of the distribution corresponding to degrees ≥ µ. The fit looks indeed accurate when f u ∼ 1 + LN (ln(2), 0.25), whereas when f u ≡ 1 the tail of the distribution is much shorter. The impact of the exponent β is slightly but consistently visible in all plots: when D = d −2 the distribution is more skewed than when D = d −0.5 , with a larger portion of loosely connected vertices compensated by the presence of greater hubs. This phenomenon is more visible in the other cities (see Appendices C and D), where the territory is proportionally wider. The rationale is that a weaker dependence on the distance pushes individuals living in central and denser areas to connect to peripheral vertices, that would otherwise remain isolated.

Geography of the graph
Since our model includes a factor purposely designed to penalize long friendship edges, it is especially worth assessing the correlation between topological properties and population density. To this end, we first show in Figure 7d the distribution of the edges' physical length. We immediately see that, as expected, the fitness function and the average degree appear to have a negligible impact on the distribution, contrary to the distance function. Our graph model indeed guarantees that both µ and f u only impact on the number of friends of u, whereas, regardless of µ and f u , is what ultimately determines the ratio of friends v that u will have at any given geographical distance. Of course, this works as long as the fitness is distributed independently of the location, which is true by design on a probability base. For what concerns D, with respect to β = 2, setting β = 0.5 significantly favors the creation of long edges at the expenses of the very short ones.
We now look into the obtained communities to see if they are geographically concentrated and/or bounded. To this end, Figure 7e shows the mean and max intra-cluster distance for the first 50 clusters of the graph. By comparing Figure 7e with Figure 7b, we realize that, albeit long intra-cluster edges tend to disappear as the cluster size gets smaller, only clusters that are at least 3 orders of magnitude smaller than the whole graph are geographically bounded -a phenomenon that is especially visible when µ = 10, that is, when the cluster size distribution is steeper. This is partially in line with a previous work showing a sudden rise in the geographical extension of the communities of empirical networks [20]. All sufficiently large clusters behave very similarly to each other and to the whole graph: most intra-cluster edges connect nearby vertices, yet very long edges do exist in each cluster. This plot underlines that the parameter β has a paramount role even in the formation of clusters, with both the mean and max intra-cluster distance being consistently greater when β = 0.5 compared to the case β = 2. In particular, the mean distance when β = 2 is often less than the tile side l (set to 1 km as per Section 2.1), meaning that most adjacent vertices are at one tile of distance or less; the mean distance is instead between 2l and 3l when β = 0.5.
Finally, in Figure 7f we plotted the average geographical distance of a vertex u from all others against the degree of the vertex u. Since we introduced a penalization for long edges, it is reasonable to expect that vertices that occupy a favorable position, closer, on average, to the other vertices, will generally have a greater degree. However, setting D = d −0.5 , seems to be enough to significantly dissipate this effect, which is instead clearly visible when D = d −2 . Moreover, we also see that all large hubs have a very small average distance from all other vertices, regardless of the specific configuration. In particular, when f u ∼ 1 + LN (ln(2), 0.25) the greatest hubs correspond always to vertices having both a large f u and a favorable position in the territory. Albeit the introduction of a social fitness in the model allows to have medium-large hubs in sparsely populated regions, the greater prevalence, on average, of hubs in densely populated areas may exacerbate the tendency of other vertices to establish links with individuals in those areas.

Network adjacency matrices and degree density plot
In Figure 9 for Florence (and in Figures 22 and 32 in the Appendix for the other cities) we report the scatter plot of the adjacency matrices relative to the obtained social graph. In the plots, nodes are ordered by their age and, within each age-group, by tile.  In all matrices the prevalence of intra-groups edges (assortativity by age) over inter-groups edges is clearly visible and in qualitative agreement with previous works on social mixing patterns [56,50,51,49,48]. The white stripes, clearly visible for β = 2 (Figure 9b), are a consequence of the sorting by tile (and of the use of tile-to-tile distances) and show the impact of distance on connection patterns. In [56,48] the authors also identified sub-diagonals which account for parent-children contacts. These sub-structures cannot be seen in our matrices because we used only four age-groups, while the aforementioned studies relied on a stratification of the population into 5-year segments. In contrast, we see that the adult group, which is the largest group in the population, dominates the inter-groups contacts.  To further investigate the role of population density in shaping the graph, in Figure 10 (and 23, 33 in the Appendix) we show the number of individuals living in each tile and their average degree, for two selected configurations both with µ = 10, so that the average degree of the whole graph is K ≈ 12. It is apparent that the average degree per tile is strongly influenced by the choice of the distance function. While in Figure 10b the situation is mostly homogeneous with few tiles above average, in Figure 10c most tiles are far below average while the tiles surrounded by a densely populated area have a high average degree. The increase of β has the effect of decreasing the probability of long contacts, a condition that favors the concentration of hubs in high density areas.
Finally, to assess the impact of f u , we consider two different configurations having both µ = 10 and β = 2, but one with f u ∼ 1 + LN (ln(2), 0.25) and the other with f u ≡ 1. Figure 11a shows the difference of the mean degree per tile between these two configurations, which results being negligible across the whole city. Figure 11b instead shows the difference of the maximum degree per tile, which is huge for central and densely populated areas. In essence, for fixed µ, the distance function D governs the average degree of each tile, whereas the fitness scores f u has a major impact on its variance, as already emerged from Figure 7c for the whole graph. If f u ≡ 1 the degree distribution within a single tile is almost flat, except for the age-based variance induced by the social mixing matrix S. With a long-tailed distribution for f u we instead obtain individuals with different sociability inside each tile.
(a) Difference of the mean degree of each tile between the configuration with fu = 1 + LN (ln(2), 0.25) and the configuration with fu ≡ 1.

Discussion and Conclusions
We have defined and implemented a probabilistic model of the strong social ties binding the population of a given territory, organized into households. Despite the abundance of related empirical and modeling studies, there is no general agreement concerning the mechanisms behind the formation of urban social networks. Our model is therefore designed on top of just a few clear assumptions: (i) not all individuals are equally sociable; (ii) the geographical distance and the age difference play a role in the probability that two individuals become friends; (iii) we shall make use of all available data, bearing in mind their shortcomings. An overview of the main features of the model can be found in [21], where the potential of the proposed framework is confirmed by the results of a set of epidemic simulations on the obtained network. In this paper we instead provided a detailed analysis and an extensive experimental validation to assess the robustness and the flexibility of our model.
The main goal of this work is permitting the recreation of synthetic social networks in the common circumstance where aggregated demographic data and some estimate of the age mixing patterns are the only available information. The ubiquitous use of online social networks and wearable sensors offers the opportunity to analyze networks that span the globe, but the extent to which they can be used to track real geographical networks and infer relationships is still an open question. On the other hand, social surveys are rarely available and intrinsically limited in the size of the reconstructed network. Data related to mobile (and/or landline) phone calls may seem a good compromise, but they are also difficult to acquire, often disaggregated among several operators, and variable in terms of pervasiveness and geographical resolution. Our tool addresses these issues providing a way to simulate a population within an arbitrary territory, whose individuals may be positioned with (almost) arbitrary precision, and for which the social mixing patterns can be inherited from any already existing dataset.
We extensively evaluated the resulting social network and its dependence on system parameters, considering three Italian cities that differ for both the size of the population and the geography of the territory. We found that setting the average number of friends to 5 is sufficient to obtain a giant component that almost spans the entire network, even in the absence of household edges. Age and proximity based homophily do provide the intended benefits: imposing data-driven age-based mixing patterns is critical to guarantee the internal cohesion of single age-groups -in particular, of young people; using the real population density and penalizing physically long edges prevents a Poisson-like degree distribution; both improve the assortativity and the transitivity of the network. The clustering coefficient, however, is rather low in the friendship layer of our graph, albeit the whole social network has strong clustering thanks to the complete subgraphs used to represent data-driven households. If we introduce a variable (specifically, Lognormally distributed) sociability we obtain the often desired heavy-tailed degree distribution. This helps connecting peripheral areas to the core of the network, especially when the average degree is small. However, sociable hubs tend to concentrate in densely populated areas, with the combined effect of exacerbating the correlation between favorable positioning and degree. The configurations that favor the rise of large hubs (i.e., Lognormal social fitness and strong penalization of long edges) slightly worsen the global transitivity, but they improve the average local transitivity. If we increase the average number of friends, most of the network distributes in just a few giant communities, a phenomenon that is amplified by a weak penalization of long edges and not affected by the distribution of the social fitness. Almost regardless of their size, the communities tend to have a large spatial extension, even though the average distance of their members is small. Finally, by controlling the penalization of long edges we can not only control the distribution of the physical distance between adjacent nodes, but also the dependence of the average degree in a tile upon the position of the tile. The variability of the degree internal to a tile is however entirely controlled by the social fitness.
Our model is intrinsically conditioned by the meaning we give to the word "friendship", which is not only related to the value we assign to the average number of friends µ. For instance, the experimentally measured dependence of u's average degree upon the density of vertices at different distances from u was already predicted by (2.7). However, we mostly designed this simulator as a tool to study and possibly forecast interactions, e.g. for use in computational epidemiology, even when accurate data upon such interactions are not available. When the friendship network is used to infer the likelihood of physical contacts, it may be reasonable to assume that living in a high density area implies being part of a dense subnetwork of friends that have a greater chance to physically meet. In any case, (2.7) also suggests that corrections to our model are possible to make it less sensitive to population density, for instance by imposing a limit to the total sociability of each tile or using a density-aware D such as the rank-based model defined in [16]. Another related element of discussion is the identification of the most suitable distribution for the social fitness parameter, or even whether this parameter should be used in the first place or not. While a heavy-tailed degree distribution characterizes many real-world networks [63], it is widely acknowledged that even sociable human beings can establish a limited amount of strong relationships [65,5]. In other words, since our urban social graph models strong ties, whether a heavy-tail effect should be present and to what extent, is open to interpretation. A Lognormal fitness seems to yield the typical degree distribution of many real-world networks, but setting f u ≡ 1 still guarantees that the maximum degree of the network is 5 to 6 times greater than the mode of the distribution, which may be preferable in many practical cases. Finally, by not incorporating any preferential attachment mechanism other than proximity and age-based homophily, our friendship graph model may fail to capture some of the typical features of friendship networks. In particular, our simulations highlight a limited tendency of friends to create triangles, which can be only partially ascribed to the low average degree considered. In this sense, the model may benefit from the usage of a fine-grained age-stratification, to intensify the internal cohesion of all age groups, or from the definition of age-specific penalties for long edges, to foster triangles in certain (e.g., school age) groups. Exploring all possible corrections to the model and providing a final answer to these and other similar questions is way beyond the scope of the present work. By making the simulator parametric and releasing it to the public as open source software, we hope to stimulate the interest of a wide audience of users which may adjust the simulator to their needs and possibly contribute to its further development.

Availability of data and materials
The code and data used in this paper for the creation of the urban social network are publicly available under the GPL v3 at gitlab.com/cranic-group/usn.

A Population Synthesis: Territory and Households
The first step to building a synthetic, but realistic, population for a territory of interest consists in the definition of the territory itself. We resorted to the well known OpenStreetMap database by means of the overpass API: for each city/municipality of interest we download the shape file with its boundary. This is used, at first, to find out the territory grid as the minimal rectangle that contains the shape and, later, to select only the tiles of the grid whose center actually falls inside the shape. This simple technique allows to reproduce with a high accuracy the actual number of people living in the selected area. Comparing our reconstructed population with the ISTAT data we observed a difference of ≈ 1% for all the three cities of Florence, Sabaudia and Viterbo. Figure 12a is the analogous to Figures 1a and 1b for Viterbo and shows the selected area with the city shape and a sample population of 1000 individuals.   As mentioned in Section 2.1, we aim at recreating a synthetic population that is statistically indistinguishable from the real one. To this end, we selected the following sources of information: (i) population density data from the WorldPop project [41]; (ii) ISTAT age distribution data aggregated at the provincial level (for Sabaudia, we used the Province of Latina, to which the city belongs); (iii) due to the absence of analogous data at any greater resolution, we use ISTAT household composition and frequency data aggregated at the national level. The reconstructed population density for the city of Florence is depicted in Figure 12b. The set of available household types and roles based on ISTAT data are reported in Table 2, where "various" comprehends all possible cases not included in the previous instances (e.g., non-partner adults living together). It is worth to underline that acquiring all data from the same source was unfortunately impossible. In particular, albeit WorldPop makes available an age-stratified population density, we verified that using those data induces an age distribution that is inconsistent with ISTAT data on household composition.
Based on the collected data we extract the tile label t u and the age label g u for each individual u of the population, as defined in Section 2.1. We also assign a role to each agent based on the joint-distribution of age and household roles provided by ISTAT, i.e., conditioning on g u when we draw u's role r u . We then generate the households with the algorithm described in Section 2.2.
We assess the robustness of the algorithm under perturbations to the age distribution, by considering independent relative variations -either positive or negative -to the probability of each age-group, controlled by a real-valued parameter ω. Formally, if π i = |Vi| N is the data-driven probability of age group i, the perturbed value π * i is obtained as follows: we draw ∼ 1 2 N (ω, ω 2 ) + 1 2 N (−ω, ω 2 ), where N (λ, σ 2 ) denotes the normal distribution with mean µ and variance σ 2 ; we set π * i = π i · (1 + ); we normalize π * i so that n i=1 π * i = 1. For both ω = 0.01 and ω = 0.1 we pick 20 different perturbed age distributions and we evaluate the quality of the resulting households under perturbation to the input data.
First, we address the problem that our heuristics does not guarantee that all individuals of the population are assigned to some household. For instance, this may happen if in any of the tiles the number of "single parents" exceeds the  number of "children of a single parent". We therefore count the number of unassigned people in two different sets of simulations: (i) we consider a single tile and vary the number of individuals in the tile (see Figure 13a, with the range for the population size taken from the city of Florence); or (ii) we consider the whole city and vary the side of the tiles used to define the grid (see Figure 13b, again for the case of Florence). Figure 13a shows that the number of people left out of all households is negligible, provided that the number of individuals in each tile is large enough to guarantee that all roles are sufficiently represented. Accordingly, Figure 13b shows that the number of unassigned people in the city drops very fast as soon as the tiles are large enough to contain, on average, a sufficient number of people; this number approaches zero as the tile size grows, i.e., as the population of each tile approaches the total population N . The quality of the heuristics decreases when the noise volume ω grows, thus highlighting the importance of internal consistency in the data. However, the plots show that the number of unassigned people stays safely below 1% of the total population even in the presence of (reasonable) fluctuations in the input data.  Next, we focus on the distribution of households by type and by number of children. In Figure 14a we plot the number of households of each type obtained for the city of Florence with our heuristics, on both clean and noisy data, compared with an estimate based on ISTAT data. In the plot, "ISTAT data" refers to the weighted average of the conditional probability of being the head of a specific household type given the age, computed through the law of total probability from ISTAT data. For instance, the number of households of type "1 Parent w/Children" is computed as the weighted average of the role (single-parent, parent). This approach allows to gain an estimate based on ISTAT data that takes into account the specific age distribution of Florence. The figure shows that our results do match the expected distribution almost perfectly and that the heuristics is very robust with respect to noise. In Figure 14b we instead show the distribution of the number of children for the types of household composed of parents and children. In this case, "ISTAT data" is obtained from the national aggregate distribution of the number of children per family; to the best of our knowledge, equivalent data are unfortunately unavailable for the province of Florence. The two distributions might significantly differ in light of the demographic differences between the residents of the city and the entirety of the Italian population. This difference may explain, at least in part, the divergence between the ISTAT data and the results of our simulations. In any case, the results look reasonable and are, again, stable with respect to perturbations in the input data.

B Social Contact Data
In our urban social graph, the probability Pr[u, v] of an edge connecting individuals u and v is given by (2.2). The dependence of Pr[u, v] upon the age groups g u and g v is determined by the matrix S which is defined by using the available social contact data as follows.
Initial data We assume that social contact patterns among individuals of our age-stratified population are available in the form of a n × n social contact matrix Γ = {γ i,j } i,j∈{0,...,n−1} , where n is the number of considered age groups. Following a systematic literature review, the project SOCRATES [15] recently released an online tool 8 to extract and analyze social contact patterns for a wide range of countries based on the best publicly available survey datasets. The tool allows to select a number of parameters upon which social contact matrices are usually dependent, such as age breaks, gender, day of the week, duration or location of the contact. Once the preferred dataset and the aforementioned parameters are selected, the tool produces the desired square matrix with γ i,j measuring the average number of daily contacts. For the scope of this paper, we relied upon the SOCRATES tool (v1.32), using the Polymod [14] dataset for Italy and the following set of parameters: • While we have no particular reason to recommend the use of the SOCRATES tool to determine Γ, we believe it is a valuable resource that perfectly fits our needs. More generally, the construction of a social contact matrix Γ is a typical problem in computational social science and our simulator simply assumes that γ i,j is an estimate of the volume of interactions/relationships that an individual of age group i has with any other individual of age group j within a given time span. It goes without saying that the type of contacts described by Γ is partially reflected on the final structure of the graph.
Data preprocessing In general the matrix Γ may be not symmetric, because γ i,j depends upon the habits of the individuals of group i, whereas γ j,i on the habits of group j. Using graph terminology, in the bipartite graph composed of individuals of groups i and j, γ i,j and γ j,i are an estimate of the average degree of vertices of type i and j, respectively. Even if asymmetric, the matrix should be consistent, in the sense that its entries should guarantee the reciprocity of contacts. At the population level, reciprocity means that in the bipartite graph the total number of edges that "exit" group i must be equal to the total number of edges that "enter" group j, i.e., that |V i | · γ i,j = |V j | · γ j,i . Since survey data rarely meet this requirement, it is standard practice to introduce a reciprocity correction [73, 51] by taking the arithmetic mean 1 2 (γ i,j · |V i | + γ j,i · |V j |) as an estimate of the total number of edges in the bipartite graph. When i = j, instead, the total number of intra-group contacts for group i is given by 1 2 (γ i,i · |V i |), because the graph is not bipartite. This leads to a group-adjacency matrix A = {α i,j } defined as follows: Finally, we divide each α i,j by the total number of potential edges m i,j (see (2.1) in Section 2.3) to obtain the matrix S = {s i,j } of inter-group edge frequencies: if i = j 8 https://lwillem.shinyapps.io/socrates_rshiny/ It is worth underlining that different sub-territories of the same country may have different age-group ratios and therefore require different corrections. The reciprocity correction, implicit in (B.1), is therefore necessary regardless of whether the matrix Γ had already been corrected beforehand. For instance, the SOCRATES tool implements the reciprocity correction but with national age-group statistics. If the local and national age-group statistics are identical, it is easy to verify that a double correction is useless yet harmless.
Age-homogeneous mixing In the absence of any age-based homophily, we have where h i is the average number of (e.g., daily) contacts of the individuals of group i. This leads to s i,j = hi N −1 for all j. If, additionally, all age-groups are "equally sociable", we have h i ≡ h for all i, leading to a constant s i,j = h N −1 for all i, j. This scenario represents a condition of age-homogeneous mixing, in which the age groups have no impact whatsoever on the edge probability -this condition is used as a benchmark in part of the analysis presented in Section 3.
As highlighted in Section 2.3, our model is invariant under multiplication of S by any positive constant. This means that the value h is irrelevant and that age-homogeneous mixing can be obtained by taking S to be any constant matrix, such as s i,j ≡ 1. If we choose to have s i,j represent the probability that a randomly chosen edge of the graph connects groups i and j, age-homogeneous mixing corresponds to s i,j = 1 n 2 for all i, j, so that i,j s i,j = 1.

C Results for the City of Viterbo
The same analysis discussed in Sections 3 and 4 for the city of Florence has been carried out for the cities of Viterbo and Sabaudia. Except for a few details, mostly ascribable to differences in the demographics and in the geography of the considered territories, the same observations and conclusions drawn for the case of Florence apply to these other two cities, as confirmed by the results shown in this and the next appendix. In the following, we report all plots and tables for the city of Viterbo, which are by all means analogous to those already presented for Florence.        Main features of the social graph as f u , β and µ vary, for data-driven population and S. K = µ + ν is the average degree, dist is the average path length, C and C loc are the global and average local clustering coefficients, ρ is the degree assortativity, "# comp." denotes the number of connected components, "giant %" denotes the percentage of nodes in the giant component.

D Results for the City of Sabaudia
In this section we report all plots and tables for the city of Sabaudia, which are by all means analogous to those already presented for Florence and Viterbo.     Main features of the friendship graph G F as the population type, D(u, v) and µ vary, for constant f u and S. dist is the average path length, C is the global clustering coefficient, ρ is the degree assortativity, "# comp." denotes the number of connected components, "giant %" denotes the percentage of nodes in the giant component.     Main features of the social graph as f u , β and µ vary, for data-driven population and S. K = µ + ν is the average degree, dist is the average path length, C and C loc are the global and average local clustering coefficients, ρ is the degree assortativity, "# comp." denotes the number of connected components, "giant %" denotes the percentage of nodes in the giant component.