Entropy as a Measure of Attractiveness and Socioeconomic Complexity in Rio de Janeiro Metropolitan Area

Defining and measuring spatial inequalities across the urban environment remains a complex and elusive task which has been facilitated by the increasing availability of large geolocated databases. In this study, we rely on a mobile phone dataset and an entropy-based metric to measure the attractiveness of a location in the Rio de Janeiro Metropolitan Area (Brazil) as the diversity of visitors’ location of residence. The results show that the attractiveness of a given location measured by entropy is an important descriptor of the socioeconomic status of the location, and can thus be used as a proxy for complex socioeconomic indicators.


INTRODUCTION
While cities have long been recognized as the cradle of modern civilization by providing a safe place for cultural development, the unequal distribution of wealth and services remains the main pressing issue threatening the sustainability of modern societies. Despite the large technological advances making our life apparently easier, economic inequality has been on the rise worldwide since 1980. This has become such an issue that the most recent datasets show that the top 1% of the wealthy population captures twice as much of the global income growth as the bottom 50% [1]. While such distributional disparity and social stratification among urbanites are currently under deep scrutiny among economists, including the spatial component in such descriptions imposes additional methodological difficulties given the vagility of human nature and the heterogeneity of the spatial distribution of resources.
While different views exist regarding the origins of socio-spatial inequalities across cities [2], the consequences of poorly integrated societies deeply affect opportunities in key realms of social life and hamper social cohesion at local and societal levels [3][4][5]. While some discuss causal factors behind socio-spatial inequalities, evidence coming from natural experiments has shown direct impacts on particularly vulnerable groups [6]. Such evidence, among others, has tied inequalities to societal imbalances leading to critical states in terms of security, health, and wealth distribution [2,6,7,8,9], threatening social cohesion and precluding possibilities of enriching the social capital at particular locations [10][11][12][13]. Defining and measuring spatial inequalities remains a complex and elusive task for which scientists have recognized several dimensions that are, so far, poorly integrated into a general conceptual framework [4,14,15]. For instance, its precise understanding is often linked to the study objects at hand and the particular methodology employed to study them. Dimensions of inequalities often include the localized concentration of particular groups within cities, the spatial homogeneity of social groups, their accessibility, or more particularly, their distance to downtown [16]. Hence, devising appropriate tools to characterize the spatial distribution of complex socioeconomic factors may contribute to the urgently needed development of integrative urban planning.
The explosive use of Information and Communication Technologies (ICT), such as cellphones and large databases of user spending behavior, has made huge volumes of non-conventional data available for urban research purposes [17][18][19][20][21]. Knowing the cellphone tower to which we connect permits the reconstruction of our daily trajectories, providing a surprisingly high spatio-temporal resolution of our social interactions [22,23]. This approach has been widely used recently to assess a variety of topics, ranging from individual mobility patterns [24] and land use patterns [25] to the detection of relevant places of high social activity within the city [26], thereby unveiling the structure and function of cities [25,27,28]. Devising an efficient mobility infrastructure has long been known as a means for city integration, and the increasing availability of ICT data allows for a new understanding of spatial integration patterns and their relationship to mobility and to socioeconomic and ethnic stratification [29]. Such highly resolved datasets provide a contextual understanding of land use that is readily available to derive new measures of social integration in its spatial context, thereby contributing to accurate, near-real-time descriptions of urban dynamics [30][31][32][33][34][35]. Many of these studies are based on the concept of activity space [31,36,37], defined as the set of locations visited by a traveler throughout their daily activities. Different measures describing the activity space have been studied to understand daily mobility patterns [21,38]. Among these, metrics based on the Shannon entropy are particularly interesting for studying human mobility patterns. Indeed, "Mobility Entropy" indicators have been widely used to measure the diversity of users' movement patterns [39][40][41][42]. They can be used at different scales to evaluate the diversity of trips made by an individual [40,43], or the diversity of locations visited by an individual [39,42] or a group of individuals [29,44].
In this work, we rely on the concept of "Mobility Entropy" from the point of view of visiting locations in order to deepen our understanding of human mobility in the context of urban computing by focusing on the concept of attractiveness. We particularly look into mapping the entropy of urban structure using increasingly available mobile phone datasets as a tool to provide highly resolved descriptions of the relationship between attractiveness and several key aspects of the urban environment such as productivity, education and ethnic origin in the Rio de Janeiro Metropolitan Area of Brazil. We focus here on the diversity of visitors' residence to measure the attractiveness of a location and then compare our results to economic and social indicators to assess how entropy effectively relates to socioeconomic indicators. We show that entropy is an important descriptor of socioeconomic complexity across this vastly populated area.

The study area and dataset
The study area is the Rio de Janeiro Metropolitan Area (RJMA), the second largest urban area in Brazil with 12,145,734 inhabitants. Administratively, the RJMA is a part of the Rio de Janeiro State, of which Rio de Janeiro city (Rio for short) is the state Capital and the largest municipality, with 6,320,446 inhabitants and an area of 1,200.177 km².
The organization responsible for the demographic census in Brazil is the Institute of Geography and Statistics (IBGE), which follows global standards to aggregate census tracts into sub-district, district, city, state, and country levels, such that this partitioning can be used for most regions in the world and at different scales. This study relies on such partitioning, dividing the study area into 49 locations (Figure 1), where the city of Rio is divided into 33 sub-districts aggregated into 5 districts. Districts are called Planning Areas (AP) and represent macro zones of the city: AP1, the center; AP2, the southern zone; AP3, the northern zone; AP4, Barra-Jacarepaguá; and AP5, the western zone.
Our analysis is based on mobile phone data provided by a Brazilian telecommunication operator. The dataset was collected during 363 days between January and December 2014 across the phone area code 21. We use 2.1 × 10^9 call records originating from 2.9 × 10^6 anonymized subscribers. Only outgoing voice call data were made available for this work. We first focused on the identification of users' places of residence. The algorithm to detect places of residence is based on the analysis of the most frequently visited locations on evenings and weekends (see the Appendix for more details). This step allows us to discard users not living in the RJMA and to remove users with no significant activity for the analysis. In total, 350,685 residences were identified.
We then aggregate the data in space and time. Aggregated records represent the number of users $v_{ij}(t)$ living in location $i \in [1, N]$ and visiting location $j \in [1, N]$ at time $t$. We spatially aggregate the antennas' Voronoi polygons in order to obtain N = 49 locations matching the 49 locations composing the RJMA shown in Figure 1. We also divide each day into four 6-hour shifts (Morning, Work, Afternoon and Night) and label each time period $t \in [1, 1452]$ as either weekday or weekend, the latter including holidays. More details regarding the data preprocessing are available in the Appendix.
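The aggregation step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the function names `time_index` and `aggregate`, the choice to count distinct users, and the shift boundaries (0, 6, 12 and 18 h) are all assumptions made for illustration, since the text does not specify them.

```python
from collections import defaultdict


def time_index(day, hour):
    """Map (day in 0..362, hour in 0..23) to a period t in 1..1452.

    Assumption: the four 6-hour shifts start at 0, 6, 12 and 18 h.
    The actual boundaries are not given in the text.
    """
    return day * 4 + hour // 6 + 1


def aggregate(records, home):
    """Count users living in i and observed in j for each period t.

    records: iterable of (user, location, day, hour) tuples (hypothetical layout).
    home: dict mapping a user to the location of residence.
    Returns a dict t -> {(i, j): number of distinct users}.
    """
    seen = defaultdict(set)
    for user, location, day, hour in records:
        if user not in home:
            continue  # users without an identified residence are discarded
        seen[(time_index(day, hour), home[user], location)].add(user)
    v = defaultdict(dict)
    for (t, i, j), users in seen.items():
        v[t][(i, j)] = len(users)
    return dict(v)
```

With 363 days and four shifts per day, the index runs from 1 to 363 × 4 = 1452, which matches the range of $t$ given in the text.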

Entropy as a measure of attractiveness
For each time interval $t$, the probability that a user living in location $i$ visits location $j$ is given by
$$p_{i \to j}(t) = \frac{v_{ij}(t)}{\sum_{k=1}^{N} v_{ik}(t)} \qquad (1)$$
This probability describes the production of visitors and is normalized by the total number of users living in location $i$. In this study, we are interested in the diversity of visitors' location of residence as a measure of the attractiveness of the destination. We therefore need to compute the probability $p_{j \leftarrow i}$ that a user visiting location $j$ lives in location $i$. To do so, we combine the probability $p_{i \to j}$ with census data to estimate $V_{ij}(t)$, the number of individuals living in location $i$ and visiting location $j$ at time $t$, using the following equation:
$$V_{ij}(t) = O_i \, p_{i \to j}(t) \qquad (2)$$
where $O_i$ is the population of location $i$ according to the 2010 IBGE census. We can now compute the probability $p_{j \leftarrow i}(t)$ that an individual visiting $j$ at time $t$ lives in $i$ (Equation 3):
$$p_{j \leftarrow i}(t) = \frac{V_{ij}(t)}{\sum_{k=1}^{N} V_{kj}(t)} \qquad (3)$$
This second probability is thus related to the attraction of visitors, normalized at destination, and allows us to compute the normalized Shannon entropy as follows:
$$S_j(t) = -\frac{\sum_{i=1}^{N} p_{j \leftarrow i}(t) \log p_{j \leftarrow i}(t)}{\log N} \qquad (4)$$
Large entropy values ($S_j(t) \approx 1$) mean that people visiting location $j$ at time $t$ are evenly distributed among all 49 locations, whereas smaller entropy values mean that people visiting location $j$ at time $t$ tend to be concentrated among a few residence locations. Entropy has been widely used to analyze and model human mobility patterns. It can be used in spatial analysis to describe the diversity of individual movement patterns [41], or in spatial interaction modeling to estimate trip distributions by entropy maximization [45], to name a few. It is worth noting that we focus in this work on the analysis of entropy as a measure of attractiveness that can be used as a proxy for complex socioeconomic indicators.
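The chain from observed user counts to the normalized entropy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `visitor_entropy` and the list-of-lists layout are hypothetical, the normalization by log N is inferred from the statement that the entropy approaches 1 when visitors are evenly spread over all N locations, and residents of j are counted among its visitors.

```python
import math


def visitor_entropy(v, population):
    """Normalized Shannon entropy of visitors' residences for one time period.

    v[i][j]: users living in i and observed in j (square N x N list, N > 1).
    population[i]: census population O_i of location i.
    Returns a list S with one entropy value per destination j.
    """
    n = len(v)
    # p_{i->j}: share of the users living in i that are observed in j
    p_out = []
    for row in v:
        total = sum(row)
        p_out.append([x / total if total else 0.0 for x in row])
    # V_ij = O_i * p_{i->j}: census-scaled number of visitors
    V = [[population[i] * p_out[i][j] for j in range(n)] for i in range(n)]
    S = []
    for j in range(n):
        visitors = sum(V[i][j] for i in range(n))
        s = 0.0
        for i in range(n):
            if V[i][j] > 0:
                p_in = V[i][j] / visitors  # p_{j<-i}
                s -= p_in * math.log(p_in)
        S.append(s / math.log(n))  # normalized so that S lies in [0, 1]
    return S
```

In this sketch a destination whose visitors all come from a single residence location gets an entropy of 0, while a destination drawing equally from every location gets 1.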
It is important to keep in mind that a given entropy value can cover a large variety of situations regarding the distance traveled by visitors. Here, we characterize the relationship between traveled distance and entropy by computing the radius of attraction of a location $j$ as the average distance traveled by people visiting $j$ at time $t$:
$$\delta_j(t) = \sum_{k=1}^{N} p_{j \leftarrow k}(t) \, d_{kj} \qquad (5)$$
where $d_{kj}$ is the distance from location $k$ to $j$ along the road network between the locations' centroids, computed using the Google Maps API [46]. Using road distances is particularly important in the case of Rio due to the presence of mountains, lakes and the Guanabara Bay, which make road distances between certain locations very different from the Euclidean distances.
Finally, we also consider the ratio between the number of visitors to a location and its population as a complementary measure of attractiveness.
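The radius of attraction and the attractiveness ratio can be computed from the same census-scaled visitor counts. The sketch below is only an illustration: the function name `radius_and_ratio` is hypothetical, and it assumes that residents of j observed in j count as visitors (with a zero distance), a point the text does not settle.

```python
def radius_and_ratio(V, dist, population):
    """Radius of attraction and attractiveness ratio for one time period.

    V[k][j]: estimated number of individuals living in k and visiting j.
    dist[k][j]: road distance between the centroids of k and j.
    population[j]: census population of location j.
    Assumption: residents of j seen in j are included among its visitors.
    """
    n = len(V)
    radius, ratio = [], []
    for j in range(n):
        visitors = sum(V[k][j] for k in range(n))
        travelled = sum(V[k][j] * dist[k][j] for k in range(n))
        # average distance travelled by the people visiting j
        radius.append(travelled / visitors if visitors else 0.0)
        # number of visitors divided by the resident population
        ratio.append(visitors / population[j])
    return radius, ratio
```

Under this convention, a ratio above one means that a location hosts more people during the period than it has residents.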

Entropy, economic and sociodemographic indicators
Because our entropy index provides a synoptic representation of mobility across the RJMA, we finally seek to describe how it relates to well-known economic, social and demographic indicators collected by the IBGE. We therefore evaluated how the diversity of visitors relates to the economic performance of the city by plotting the number of jobs and income levels against our entropy estimates. Sociodemography, in turn, was assessed through the relationship between entropy and the levels of primary and secondary (high school) education among the resident population of each partitioned area. Finally, two development indices were chosen to evaluate the performance of entropy across the RJMA.

Classification of locations according to their attractiveness
We start our analysis by performing a clustering analysis to group together locations exhibiting similar features regarding their attractiveness. As a first step, we focus on two features across the urban landscape, the diversity at the origin location and the attractiveness at work locations. This led us to average the three indicators for each location (Equations 4, 5 and 6) over the work shift time periods on weekdays. Locations are clustered using the k-means algorithm based on the three standardized averaged metrics. The number of clusters was chosen based on the ratio between the within-group variance and the total variance (see the Appendix for more details). We obtained four clusters. Clustering results and the relationships between the different metrics are shown in Figure 2. We observe a positive relationship between metrics, in which attractiveness and radius of attraction tend to increase with the entropy. There is nevertheless a strong dispersion around these tendencies, with attractiveness and radius of attraction values that can double for a given entropy value.
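The clustering procedure can be sketched with a plain implementation of Lloyd's k-means algorithm on standardized metrics, together with the within-group to total variance ratio used to choose the number of clusters. This is a generic illustration, not the authors' implementation: the function names, the random initialization and the seed are assumptions, and the sketch assumes that no metric is constant across locations.

```python
import random


def standardize(rows):
    """Center each column on zero and scale it to unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm with k initial centers sampled from the points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def within_to_total(points, labels, centers):
    """Ratio between the within-group variance and the total variance."""
    mean = [sum(col) / len(points) for col in zip(*points)]
    total = sum(sqdist(p, mean) for p in points)
    within = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
    return within / total
```

Running `kmeans` for increasing values of k and plotting `within_to_total` against k gives the kind of curve from which a number of clusters can be read off, for instance where the ratio stops decreasing sharply.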
Figure 3 shows the spatial distribution of the four resulting clusters across the whole studied area. Clusters are determined by a certain level of attractiveness and can be described as follows:
• C1 (red) represents a low attractive cluster composed of 17 locations. It is characterized by a low entropy, an attractiveness ratio lower than one and a low radius of attraction. Locations in C1 are far from the Rio city center or are segregated areas inside the Capital;
• C2 (green) is a cluster of 22 locations, mostly located inside the city. This cluster is characterized by medium values of entropy of visitors and radius of attraction, while having an attractiveness ratio close to one;
• C3 (blue) is an attractive group of 8 locations, mostly near the sea inside the Capital. This cluster shares high entropy values, an attractiveness ratio between 1 and 2 and a large radius of attraction;
• C4 (orange) is composed of only one location that can be considered as an outlier due to its very high attractiveness. The remaining three clusters do not change if this outlier is removed before clustering. This location is the business center (Centro) of the city, and is a very attractive cluster with a very large entropy ($S_{C4} \approx 0.9$), attractiveness ratio and radius of attraction ($\delta_{C4} \approx 12$). This location concentrates most of the jobs and attracts visitors from all over the RJMA.
Our methodology allows us to detect segregated areas with a very low diversity of visitors and attractiveness. Figure 4 shows the comparison of the clustering results with two social development indexes. We focus the discussion on five locations shown in Figure 4a. The term "favela" is used here in the sense of subnormal agglomerate as defined by IBGE [48]: "a form of irregular occupation of land usually characterized by an irregular urban pattern, with scarce essential public services and located in areas not proper or allowed for housing use". In a broad sense, favela also includes urbanized areas, areas that were once subnormal agglomerates but have been urbanized, and also housing estates. The favela sub-districts shown in purple in Figure 4a are defined, according to Rio City Hall, as the locations with more than 50% of the population living in subnormal agglomerates. In Cidade de Deus, only 13% of the population lives in subnormal agglomerates, as it is mostly composed of housing estate buildings, while its socioeconomic indexes are similar to those of the favela sub-districts.
Figure 4b shows a zoom on the clustering results. The main favela sub-districts were classified in the low attractive cluster (C1), as was Cidade de Deus. Complexo da Maré also has many housing estate buildings, and 54% of its population lives in subnormal agglomerates. It was classified in the medium attractive cluster (C2), possibly because it is crossed by two of the main expressways leading out of the city. In the dataset used in this work, a visitor is detected in a given location through a call recorded within the location, so that some detected visitors may simply be passing through the location to reach another destination.
Figures 4c and 4d show two social development indexes. Figure 4c shows the Municipal Human Development Index (MHDI), which is an adaptation of the Human Development Index (HDI) for municipalities. The MHDI data were obtained from the Atlas of Human Development in Brazil [49], where the MHDI computed in 2013 is available at the census tract level, as well as aggregated values for all municipalities and at the district level in metropolitan areas. In Rio, the MHDI is available for the macro zones shown in Figure 1, and the values for the five locations of interest in Figure 4a were obtained from the census tract level. The classes and colours used in Figure 4c were suggested by the Atlas. All five locations highlighted in Figure 4a were classified as medium MHDI, and many locations classified in the high attractive cluster (C3) have a very high MHDI.
The MHDI is a global index intended to compare social development across the whole country. The Rio City Hall has adopted the Social Progress Index (IPS), which is more focused on the city characteristics and is based on 32 indicators in three dimensions. The data used in this work were computed in 2016 and obtained from the open data portal of Rio City Hall [50]. The colours and levels presented in Figure 4d are the ones used by the Rio City Hall. It can be seen from Figure 4d that all four locations assigned to the low attractive cluster (C1) have a low IPS (IPS ≤ 50). The Complexo da Maré sub-district has a medium IPS (50 ≤ IPS ≤ 60) and was assigned to the medium attractive cluster (C2). Moreover, most locations assigned to the high attractive cluster (C3) have a very high IPS (IPS ≥ 70). There is a very good agreement between the clusters computed from mobility and the IPS, as cluster C1 corresponds to IPS ≤ 50, cluster C2 corresponds to 50 ≤ IPS ≤ 70 and cluster C3 corresponds to IPS ≥ 70.
In the next section, we discuss the relationship between the mobility indicators and the economic and social indicators selected for this study.

Economic activity and sociodemographic factors
While transportation mobility has largely been recognized as a major player in the urban economy [51], the recent scrutiny of Call Detail Records (CDR) has expanded our understanding of how mobility relates to economic activity across cities [42,52]. Here, we evaluate how entropy relates to officially reported job numbers and income levels (Figure 5). In spite of the large informal job market known to exist in the RJMA, our analysis shows a positive and exponential relationship between formal jobs and entropy (Figure 5a). Similar patterns emerge when relating income level to entropy (Figure 5b), as well as Gross Domestic Product (GDP) to entropy (see Appendix).
Interestingly, opposite trends emerge when entropy is plotted against demographic indices such as the percentage of the population having completed primary education and high school degrees. In Figure 6, "primary school" refers to the percentage of individuals having a primary school or lower education level and "high school" refers to individuals having a high school or higher education level. School degrees are positively correlated with income, meaning that higher income locations tend to have higher education levels. Race is similarly correlated with income: white individuals are more prevalent in higher income locations and black individuals in lower income locations. As entropy is related to income (Figure 5), locations where a large fraction of the population has a primary school diploma or lower exhibit lower entropy values (Figure 6a), while the proportion of the population with a high school or higher education level is positively associated with entropy (Figure 6b). This is strikingly similar to the pattern exhibited by ethnic origin. The percentage of black population, like the percentage of primary school education, shows a negative relation to entropy (Figure 6c), while areas with a larger percentage of white population tend to exhibit higher entropy values (Figure 6d).
The entropy of visitors, computed from CDR, reflects the complexity of indicators usually computed using classical approaches. In fact, entropy appears to be positively associated with socioeconomic indicators such as the MHDI and the IPS (Figure 7), in line with the relationships shown in Figures 5 and 6.

Temporal evolution of the attractiveness
To study the temporal evolution of entropy, attractiveness and radius of attraction, we plot the normalized average metric values for each cluster across time shifts (Figure 8). Normalizations are performed using the reference values obtained for the work shift time period on weekdays. We decided here to consider relative, instead of absolute, values in order to make the average attractiveness of clusters of locations comparable over time. Entropy tends to decrease globally along the day on both weekdays and weekends for every location, whatever the cluster it belongs to. It is, however, interesting to note that the entropy is relatively higher during weekday nights and weekends for locations classified as low attractive during weekday work shifts than for highly attractive locations. Indeed, while the location of cluster C4 exhibits an entropy index 50% lower than its reference value, the index remains at 80% of the reference value for clusters C2 and C3, and at more than 90% for locations belonging to cluster C1. A similar behavior is observed for the radius of attraction.
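The normalization against the weekday work shift can be illustrated with a short Python sketch. The function name, the dictionary layout and the order of operations (average by cluster first, then divide by the cluster's reference average) are assumptions; the text only states that values are averaged by cluster and normalized by the reference value.

```python
def normalized_cluster_means(metric, clusters, reference):
    """Average a metric by cluster for each period, relative to a reference period.

    metric[period][location]: metric value (e.g. entropy) for that period.
    clusters[location]: cluster label of the location.
    reference: key of the reference period (e.g. the weekday work shift).
    """
    means = {}
    for period, by_location in metric.items():
        sums, counts = {}, {}
        for location, value in by_location.items():
            c = clusters[location]
            sums[c] = sums.get(c, 0.0) + value
            counts[c] = counts.get(c, 0) + 1
        means[period] = {c: sums[c] / counts[c] for c in sums}
    # divide each cluster average by the same cluster's reference average
    return {
        period: {c: m / means[reference][c] for c, m in by_cluster.items()}
        for period, by_cluster in means.items()
    }
```

By construction every cluster equals 1 in the reference period, so a value of 0.5 in another period reads directly as a metric 50% lower than during the weekday work shift.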
The situation is slightly different, however, for the attractiveness, which increases during afternoon and night shifts on weekdays for the low attractive cluster C1 and then reaches a plateau during weekend days. The location of cluster C4 shows the opposite behavior, with an attractiveness that decreases along the day and reaches a plateau during weekend days. The attractiveness remains more or less constant for locations belonging to clusters C2 and C3.

DISCUSSION
The impact of socio-spatial inequalities on urban systems has largely been treated in the urban economics and sociological literature, but the increasing availability of large mobile phone databases has opened the possibility of providing a clearer picture of how different aspects of urban life impact economic and sociodemographic aspects of cities [19]. Going in this direction, this work presents the results of the processing of 2.1 billion records collected from 2 million users in the Rio de Janeiro Metropolitan Area, Brazil, during the whole year of 2014, placing this research among the largest analyses, to our knowledge, relating mobility to socioeconomic complexity in Brazil. We hereby illustrate the potential of combining mobile phone data with entropy-based metrics to measure the attractiveness of a location. This may prove useful to urban planners and managers when it comes to describing and planning for complex socioeconomic indicators. While it is known that mobility is in fact related to economic activity, this work presents an effective and simple way to measure such relationships from increasingly available ICT data such as mobile phone datasets.
While most capital cities in South America suffer from a disproportionate growth compared to other urban settlements [53], common patterns of spatial inequalities show that underprivileged populations establish themselves away from highly productive central zones [30,54], often with clear differences in the usage of urban infrastructure [55]. In this sense, the particular and complex topography of Rio de Janeiro would suggest the existence of shared usage patterns of the city among urbanites coming from different social contexts. Because the spatial partitioning employed in our study closely matches the IBGE delineation, we are able to compare official statistics with measures derived from CDR data and to offer specific insights regarding the usage of ICT data, such as mobile phone datasets, as proxies for the spatial distribution of complex socioeconomic indicators. Our analysis shows that the attractiveness of a district, measured with the diversity of visitors' place of residence, is correlated with the income and the number of jobs in spite of the large informal job market of Rio [32].
We also show that the attractiveness is lower in areas hosting a large percentage of the population of African descent and/or in locations where primary school training is prevalent (Figure 6a,c). While this points to previous descriptions showing how available schooling options closely reproduce residential patterns of socio-spatial segregation [56,57], the spatial mismatch with the highly productive Centro area, where work opportunities are concentrated in the RJMA, leads us to think that residential segregation of the poorest is reinforced by new inequalities when daily mobility opportunities are taken into account. Unfortunately, even when using state-of-the-art descriptors of urban diversity, we corroborate a well-known trend in which areas with large African-descendant populations still stand out as an indicator of social inequality. This poses important planning challenges to historical areas such as the RJMA, where almost one million enslaved Africans were estimated to arrive in the XVIIth century [58].
The observed results concur with recent developments in the scientific literature that show how mobile phone information can be used to evaluate the socioeconomic state of spatially heterogeneous regions [43,59,60], especially in developing countries. Moreover, the RJMA is a very particular case study where socioeconomically isolated districts are located in between richer areas as well as in the periphery, the latter being more common in large cities of developing countries. This particular characteristic of the city allows us to validate the results, as the clusters accurately identified favelas and other socioeconomically isolated districts, as shown in Figure 4.
In summary, this manuscript serves to illustrate the potential of mobile phone data combined with entropy-based metrics for measuring the attractiveness of a location that can be used as a proxy for complex socioeconomic indicators. Even if the spatial partitioning used in this study tends to reduce the level of spatial uncertainty inherent in this type of data source [61], it would be interesting to reproduce the results with different datasets coming from different sources of mobility information.

Identification of the users place of residence
The presumed residence of each user was computed as the most visited Voronoi cell between 08:00PM and 06:00AM during workdays and during the entire day on Sundays and holidays. We additionally required the user to be regularly detected in this cell (at least five times) and the number of visits to the most frequented cell to be always greater than the number of visits to the second most frequented cell. The final dataset, containing only users with an identified residence, comprises 350,685 mobile phone users. As mentioned above, the data were aggregated spatially by assigning each Voronoi cell to one of the 49 districts. The identification of the users' place of residence was then evaluated using data from the IBGE 2010 census. As can be observed in Figure S4, we obtained a good match between the census data and the residences identified with mobile phone data, with a Pearson correlation coefficient equal to 0.9.
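The residence rule described above can be sketched as follows. This is a simplified reading, not the authors' code: the function name `detect_residence` and the input layout are hypothetical, the caller is assumed to flag Sundays and holidays, and the requirement that the top cell "always" exceeds the second is reduced here to a strict comparison of overall counts.

```python
from collections import Counter


def detect_residence(calls, min_visits=5):
    """Return the presumed home cell of one user, or None if undetermined.

    calls: iterable of (cell, hour, is_rest_day) tuples, where is_rest_day is
    True on Sundays and holidays (hypothetical input layout).
    Only calls made between 20:00 and 06:00 on workdays, or at any time on
    rest days, are taken into account.
    """
    counts = Counter(
        cell for cell, hour, is_rest_day in calls
        if is_rest_day or hour >= 20 or hour < 6
    )
    ranked = counts.most_common(2)
    if not ranked or ranked[0][1] < min_visits:
        return None  # not detected regularly enough in any cell
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequented cells
    return ranked[0][0]
```

Users for whom the function returns `None` would be dropped from the analysis, which is how the filtering step reduces the subscriber base to the users with an identified residence.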

Figure 1. Rio de Janeiro Metropolitan Area (RJMA). The RJMA is composed of 49 locations: 16 municipalities outside the Capital, represented in grey, and 33 sub-districts inside the Capital, grouped into 5 districts.

Figure 2. Results of the clustering analysis. Log-log scatter plot of (a) the attractiveness and (b) the radius of attraction in terms of the entropy index. The inset in (a) shows the relationship after removing one outlier (cluster C4). Each dot represents a location within the study area. Indicators have been averaged over the work shift time period during weekdays.

Figure 3. Map of the RJMA displaying the spatial distribution of the four clusters.

Figure 4. Panel (b) shows a zoom on the clustering results, in which the main favela sub-districts were classified in the low attractive cluster (C1).

Figure 6. Sociodemographic analysis. Percentage of primary school level education (a), high school level education (b), black people (c), and white people (d) as a function of the entropy index. The entropy has been averaged over the work shift time periods on weekdays.

Figure 8. Temporal evolution of the three metrics. From top to bottom: entropy, attractiveness and radius of attraction as a function of time by cluster. The values are averaged by cluster and normalized by the value obtained for the work shift during weekdays. A similar plot displaying boxplots instead of average values is available in the Appendix.

Figure S3. Number of calls per hour and the partition of time shifts. Total number of calls made in the RJMA in 2014 (including weekdays and weekends).

Figure S4. Number of mobile phone users with an identified residence in the RJMA as a function of the number of inhabitants in the 49 locations.

Figure S5. Ratio between the within-group variance and the total variance as a function of the number of clusters.