Compositional Data Analysis Approach in the Measurement of Social-Spatial Segregation: Towards a Sustainable and Inclusive City

Abstract: The location and context in which people live influence and condition their opportunities in life. This becomes relevant in a world subject to rapid urban and demographic growth, in which different economic, social, and political forces generate and accentuate disparities in cities. These forces produce an unequal distribution of the different social groups across the territory, known as socio-spatial segregation. The study of this phenomenon incorporates a large number of variables belonging to different dimensions. Nonetheless, few studies have addressed socio-spatial segregation with a multivariate analysis approach. In addition, the existing studies may have obtained misleading outcomes by not acknowledging the inherent compositional nature of their variables. The objective of the present study is twofold: (i) To assess whether the phenomenon of socio-spatial segregation exists in Guadalajara, Mexico; and (ii) to introduce and stress the use of compositional techniques for the study of socio-spatial segregation. The study applied principal component analysis and cluster analysis considering the compositional nature of census variables, particularly economic and educative indicators. In addition, the study used geographical information tools to depict and interpret the results. The results are intended to support the fulfillment of the Sustainable Development Goals towards inclusive and sustainable cities.


Introduction
In September 2015, the member states of the United Nations approved 17 goals and 169 targets for sustainable development and committed to reaching them by 2030. These goals, known as the Sustainable Development Goals (SDGs), seek equality among people, the protection of the planet and prosperity through several actions under the motto of "leaving no one behind" [1]. In this sense, and recognizing the SDGs as interdependent, the present study focuses on Goals 10 and 11 (reduced inequalities and sustainable cities and communities, respectively). In a very broad sense, these goals seek to mitigate and eradicate problems such as poverty, inequality, and social exclusion. These problems are usually visible in the urban space, and the way in which the different social strata are grouped in the territory (depending on their capacity to acquire housing) influences their accentuation or eradication [2].
In this context, the existence of differentiation or unequal distribution of certain social groups within the urban space is known as urban segregation [3]. The term segregation emerged from urban [...] These studies yielded a reasonable interpretation (e.g., References [26-28]). Lloyd and co-workers [21] identified problems in the analysis of population studies using standard statistical methods and encouraged the use of the log-ratio approach to overcome the shortcomings of compositional data in population studies. Currently, examples of the application of compositional data methods can be found in a variety of fields, such as the study of ecosystem services [29], microbiomes [30], tourism [31,32], and business and finance [33], among others.
Therefore, the objective of this study is twofold: (i) To assess whether socio-spatial segregation exists in the city of Guadalajara, Mexico; and (ii) to introduce and stress the use of compositional techniques for the study of socio-spatial segregation. Specifically, this approach is applied to the socio-economic and socio-educative dimensions. Through these two dimensions, we aim to characterize the colonias of the city, which would help implement anti-segregation measures in the most vulnerable areas in compliance with the SDGs. In addition, the present study contributes a quantitative perspective on issues that have seldom been addressed quantitatively in Guadalajara.

Case study: Guadalajara, Mexico
Guadalajara is the capital of the state of Jalisco. It is the most populous city in the state and the second most populous in Mexico, with a population of 1.4 million, of which 52% are women and 48% are men; 64% of its inhabitants are between 15 and 59 years old. Moreover, 25% of the population lives in multidimensional poverty [34]. The city covers a surface of 13,421 hectares, giving an average density of 104 inhabitants per hectare [35]. Most of its land is for urban use, land tenure is mostly private property, and the city is totally urbanized. Industry and commerce stand out as the main economic activities of the city [36].
Urban policies, characterized by the hegemony of certain economic and political forces, have made Guadalajara a place where the interests of class, and the logic of wealth and accumulation, have interfered with rational urban planning. As mentioned by Marcuse and Van Kempen [37,38], cities are not naturally divided; rather, the division of a city is the product of an intentional and active act by those who have the power to do it. For Guadalajara, the division of space dates back to its foundation in 1542. The Spanish Crown, through its urban ordinances and know-how, established a defined geometric scheme within the urban fabric and a clear social hierarchy in the city [39].
Among the criteria of the Spanish Crown, cities should be located in the proximity of a river in order to guarantee access to water [40]. The city of Guadalajara settled on one side of the San Juan de Dios River, a natural border that internally divided the city [41]. The local bourgeoisie, the rich and renowned people, would concentrate on the west side of the river (except for the 200 indigenous allies who were responsible for the defense of the city). In contrast, the indigenous population (with no nobility titles) was located east of the river, establishing a clear division of the urban space (Figure 1A,B): A Spanish city versus the city of the indigenous [40]. Even though the San Juan de Dios River was forced underground in 1896, and the Independencia Causeway was built over it, the planned urban growth continued on the west side of the river, as exemplified by the so-called colonias in the period 1894-1924. The product of foreign capital and inhabited by foreigners and wealthy families from their beginning, colonias were homogeneous subdivisions that responded to commercial interests, which sought to increase the value of the land, and that accentuated the east-west and poor-rich dichotomy of the city [39,40,42-48].
In addition, structural adjustment policies were deepened in the country from 1980 onwards through constitutional reforms to Articles 27 and 115, as well as institutional reforms in housing agencies. These reforms granted municipal governments more authority in the management and administration of their territories, including urban planning. Moreover, through these reforms, communal lands were incorporated into the market under the logic of accumulation and wealth. With regard to the national housing agencies, the state stopped providing housing to the popular sectors, and the role was adopted by the private sector. This has created an even more significant social division and differentiation of the urban space in Mexican cities [44,49,50].
Accordingly, the city government has incorporated the SDGs into its most recent development plans (2016-2018 and 2018-2021) in order to mitigate and reverse the existing social division. According to the city government, the Guadalajara 500/2042 Vision Municipal Development and Governance Plan will guide the transformation of the city through six main axes: a prosperous and inclusive city; in the community; quality in public services; orderly and sustainable; well managed; and safe and peaceful [34,51].

Compositional Data Analysis
The census information obtained for the present study has been converted to percentages. This is a main feature of many quantitative analyses conducted in population, urban and geography studies [52]. The values normally used in such studies are parts of some whole and are commonly (although not necessarily) expressed in a closed form, summing to a constant such as 100 (i.e., percentages) or 1 (i.e., proportions). Hence, the compositional character of these data should be taken into account.
Compositional data describe parts of a whole. They are commonly presented as vectors of proportions, percentages, concentrations or frequencies [53]. Compositional data are inherently multivariate. Anomalies may appear if compositional data are treated using standard statistical techniques, the most prominent of which is the appearance of spurious correlations. As several multivariate analysis techniques are based on the variance-covariance structure of the data, this may lead to wrong conclusions.
The initial discussion about this and other consequences of using compositional data can be found in Chayes [54], Aitchison [55,56], Rock [57], and Rollinson [58]. The standard statistical techniques for unrestricted random values cannot be used to analyze compositional data in their natural or raw form, including univariate and multivariate statistical analyses [21,53,59,60].
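The spurious-correlation effect can be illustrated with a small simulation. The following is a minimal sketch in Python (the study itself used R), with hypothetical, independently generated counts rather than census data; the variable names are assumptions made for the illustration. Once each pair of counts is closed to proportions, the two parts are perfectly negatively correlated, regardless of how the original counts behave.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


rng = random.Random(0)  # fixed seed so the illustration is reproducible

# Hypothetical, independently drawn household counts for 200 areas.
with_item = [rng.uniform(50, 500) for _ in range(200)]
without_item = [rng.uniform(50, 500) for _ in range(200)]

# Closure: rescale each pair so that its two parts add up to 1.
share_with = [w / (w + o) for w, o in zip(with_item, without_item)]
share_without = [o / (w + o) for w, o in zip(with_item, without_item)]

r_counts = pearson(with_item, without_item)    # close to 0: counts are independent
r_shares = pearson(share_with, share_without)  # -1 up to rounding: forced by closure
```

The correlation of -1 between the shares carries no information about the households; it is a consequence of the constant-sum constraint alone.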

Log-Ratio Approach
To overcome the consequences of working with compositional data, and to validate statistical analyses, the log-ratio approach proposed by Aitchison [25] is applied. This approach allows us to work with log-ratios of compositions as real random variables, such that classical multivariate statistical tools can be applied. Moreover, the analysis is based on the relative information between the components rather than their absolute values [55]. An introduction to some key principles of compositional data analysis has been presented by Pawlowsky-Glahn and Egozcue [26], Egozcue [61], and Filzmoser, Hron and Templ [62].
The indicators used in this study are simple, two-part compositions, which are grouped and analyzed in different dimensions (i.e., socio-economic and socio-educative) (see Equation (1)). Data are transformed with the log-ratio approach. Hence, the value of the indicator and its complement are analyzed jointly, as given in Equation (2).

$$X = \mathcal{C}[x_1, x_2] \in S^2 \qquad (1)$$
where:
$X$ = compositional vector of two parts;
$x_1, x_2$ = parts of the compositional vector;
$\mathcal{C}$ = closure. The vector has been rescaled such that the components add up to 1. (A representative of the class of equivalence has been selected);
$S^2$ = simplex of two parts. Sample space for compositions, a subset of $\mathbb{R}^2$.
$$\ln\left(\frac{x_1}{x_2}\right) \qquad (2)$$
Special care should be taken for the values used in the numerator and the denominator of Equation (2). The above has a direct influence on the interpretation of the results obtained. Therefore, in this study, the aspects of the compositions that were considered as positive were placed in the numerator, and as negative, in the denominator (e.g., log-ratio of the educated population was divided by the uneducated population). Commonly, compositional vectors have more than two parts. Transformations such as the isometric log ratio (ilr) (also known as orthonormal log-ratio, olr), or the centered log-ratio (clr), also based on the log-ratio approach are more suitable in that case [53,62-64]. Moreover, components with zero values must be treated before the log-ratio approach, since zeros are not allowed in the method [65].
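The two-part log-ratio described above can be sketched as follows. This is an illustrative Python snippet, not the code used in the study; the function name `log_ratio` and the household counts are hypothetical. It shows that the result depends only on the ratio between the two parts (counts and proportions give the same value), that swapping numerator and denominator only flips the sign, and that zeros have to be handled before the transformation.

```python
import math


def log_ratio(positive, negative):
    """Log-ratio ln(positive / negative) of a two-part composition.

    Both parts must be strictly positive; zeros have to be treated
    beforehand because the logarithm of zero is undefined.
    """
    if positive <= 0 or negative <= 0:
        raise ValueError("both parts must be > 0; treat zeros before transforming")
    return math.log(positive / negative)


# Hypothetical colonia: 750 households with a refrigerator, 250 without.
y_counts = log_ratio(750, 250)
# The same composition after closure to proportions gives the same value,
# because only the ratio between the parts matters.
y_shares = log_ratio(0.75, 0.25)
# Swapping numerator and denominator only flips the sign.
y_swapped = log_ratio(250, 750)
```

A positive value means the part placed in the numerator dominates, a negative value means the denominator dominates, and zero means both parts are equal.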

Principal Component Analysis
Principal Component Analysis (PCA) provides information about how observations differ from each other by finding a low-dimensional description of the data variability while reducing the complexity of the raw data. High dimensionality in the analysis of multivariate data could generate multicollinearity problems, which could lead to erroneous analysis. PCA identifies a set of uncorrelated variables called principal components. PCA has as many components as the number of variables; however, a subset of the components captures most of the original variability [66]. PCA is widely used to identify observations (e.g., subgroups of populations) that share common characteristics. Aitchison [67] presented an approach to PCA for compositional data. PCA is based on either the variance-covariance matrix or the correlation matrix [66]. PCA using the variance-covariance matrix is more sensitive to the scale of the values on which it is based; therefore, the PCA performed in this study is based on the correlation matrix (standardized variables).
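As a rough illustration of PCA on the correlation matrix, the following Python sketch uses NumPy on a hypothetical table of log-ratios (random numbers, not the census data); the function name `pca_correlation` is an assumption for this example, and the study itself performed the analysis in R. It standardizes the variables, decomposes the correlation matrix, and reports the share of variance explained by each component.

```python
import numpy as np


def pca_correlation(data):
    """PCA of an (observations x variables) array via its correlation matrix.

    Returns the eigenvalues (descending), the eigenvectors as columns, the
    share of variance explained by each component, and the component scores.
    """
    data = np.asarray(data, dtype=float)
    # Standardize each variable: PCA on z-scores equals PCA on the correlation matrix.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    scores = z @ eigvecs
    return eigvals, eigvecs, explained, scores


# Hypothetical log-ratio table: 100 areas, 3 indicators, two of them related.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
table = np.column_stack([
    base + 0.1 * rng.normal(size=100),
    base + 0.1 * rng.normal(size=100),
    rng.normal(size=100),
])
eigvals, eigvecs, explained, scores = pca_correlation(table)
```

Because the variables are standardized, the eigenvalues add up to the number of variables, and `explained` gives the percentages of variation of the kind reported for PC1 and PC2.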

Hierarchical Cluster Analysis
Cluster analysis groups and classifies observations with similar characteristics, as well as reveals characteristics of observations that are relatively different from other sets of observations. This study uses a hierarchical cluster analysis, which builds a nested hierarchy of clusters. O'Sullivan and Unwin [68] present the procedure in which the observations merge into clusters until only one cluster, which contains all the observations, is left (bottom-up agglomerative clustering). The importance of the method lies in the distances used to form the clusters (e.g., minimum distance, maximum distance or average between observations). The present study considers Ward's hierarchical clustering method. Ward's method is the only agglomerative clustering method that is based on the sum of squares criterion, minimizing the dispersion within the group in each binary amalgamation [69].
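The sum of squares criterion behind Ward's method can be sketched in a few lines. The following is a naive, pure-Python illustration with hypothetical points; the function name `ward_clusters` is an assumption, and this is not the routine used in the study (in practice a library implementation, such as `hclust` in R or `scipy.cluster.hierarchy.linkage` with `method="ward"` in SciPy, would be preferable). At each step the two clusters whose merger produces the smallest increase in within-cluster sum of squares are joined.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion.

    At every step, merge the two clusters whose union yields the smallest
    increase in the within-cluster sum of squares, until `n_clusters` remain.
    Returns a list with one cluster label (0, 1, ...) per input point.
    """
    # Each cluster keeps the indices of its members and its centroid.
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares caused by merging a and b.
        na, nb = len(a["members"]), len(b["members"])
        dist2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
        return na * nb / (na + nb) * dist2

    while len(clusters) > n_clusters:
        # Find the cheapest pair to merge (exhaustive search, fine for a sketch).
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": a["members"] + b["members"],
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        # Remove j first (j > i) so that index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)

    labels = [None] * len(points)
    for label, cluster in enumerate(clusters):
        for index in cluster["members"]:
            labels[index] = label
    return labels


# Hypothetical log-ratio pairs for six areas forming two obvious groups.
points = [(2.0, 1.9), (2.1, 2.0), (1.9, 2.1), (-1.0, -1.1), (-1.2, -0.9), (-0.9, -1.0)]
labels = ward_clusters(points, n_clusters=2)
```

Stopping the merging at a chosen number of clusters corresponds to cutting the dendrogram at that level.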
As a next step, Geographic Information Systems (GIS) are used to visualize the distribution and the social division of space: based on the cluster analysis results, each colonia is assigned to a cluster and mapped for further analysis.
The study used R [70] and RStudio [71] for the multivariate descriptive statistical analyses.

Data Sets
The present study includes the analysis of census information of approximately 13,520 urban blocks, which are grouped into 395 colonias of the city (used as compositional vectors), which in turn are grouped into seven large urban districts (Centro, Cruz del Sur, Huentitán, Minerva, Oblatos, Olímpica and Tetlán) (Figure 2). Therefore, this study involves the use of different sources of information, as described below. The vector information corresponding to the GIS in its shapefile format was obtained from two sources. The vector information of the 13,520 urban blocks and the territorial limits of the city of Guadalajara was obtained from the National Institute of Statistics and Geography (INEGI) and was used as a cartographic base [72]. In this digitalized blueprint of the city, the INEGI assigns a unique code to each urban block. The territorial delimitation of the 395 colonias and the seven urban districts was obtained from GeoGDL [73].
Census information in Mexico is gathered every ten years. Thus, the most recent census information at the urban block level corresponds to the year 2010. Information corresponding to the indicators used in this study is obtained from the 2010 Population and Housing Census [74]. In this information, unique identification codes are presented for the different urban blocks. With ArcMap 10.2.2, the census data are matched and linked to the cartographic base from INEGI through these codes, which facilitates processing data at different urban scales (such as colonias and districts).
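The linkage between block-level census records and larger urban units can be pictured with the following Python sketch. All block codes, colonia names and counts are hypothetical, and the aggregation rule (summing block counts per colonia before taking the log-ratio) is an assumption made for illustration; the study carried out the matching in ArcMap.

```python
import math
from collections import defaultdict

# Hypothetical block-level census records keyed by the unique block code.
census = {
    "B001": {"with_pc": 40, "without_pc": 60},
    "B002": {"with_pc": 10, "without_pc": 90},
    "B003": {"with_pc": 80, "without_pc": 20},
}
# Hypothetical lookup from block code to colonia, as a spatial join would give.
block_to_colonia = {"B001": "Colonia A", "B002": "Colonia A", "B003": "Colonia B"}

# Sum the counts of all blocks that fall inside each colonia.
totals = defaultdict(lambda: {"with_pc": 0, "without_pc": 0})
for block_code, counts in census.items():
    colonia = block_to_colonia[block_code]
    for key, value in counts.items():
        totals[colonia][key] += value

# One log-ratio per colonia, computed from the aggregated counts.
rel_pc = {
    colonia: math.log(parts["with_pc"] / parts["without_pc"])
    for colonia, parts in totals.items()
}
```

The same lookup pattern would apply one level up, from colonias to the seven urban districts.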
As the national census did not include the income variable in the census of 2010, indicators of material goods and services were used as a substitute for income received to approximate the economic dimension and to determine the spatial structure of the population in the city (Table 1).

Table 1. Variables measuring the socio-economic dimension.

| Abbreviation | Definition |
|---|---|
| Rel_TV | Log-ratio of households with a television, divided by those without |
| Rel_Fridge | Log-ratio of households with a refrigerator, divided by those without |
| Rel_Washer | Log-ratio of households with a washing machine, divided by those without |
| Rel_Auto | Log-ratio of households with an automobile, divided by those without |
| Rel_PC | Log-ratio of households with a computer, divided by those without |
| Rel_Phone | Log-ratio of households with a telephone, divided by those without |
| Rel_Mobile | Log-ratio of households with a cellphone, divided by those without |
| Rel_Internet | Log-ratio of households with internet service, divided by those without |

Inequality is one of the great challenges in the 2030 agenda of the Sustainable Development Goals. In order to diminish existing gaps (and taking into account the economic system in which we live), individual skills based on technology, mathematics and science will be required [75]. In this sense, formal education provides individuals with a set of skills to enter the professional world and to grow in their personal and social development [76]. Therefore, indicators of school attendance, or completion of formal education, by age groups have been selected in the dimension of education (Table 2).

Table 2. Variables measuring the socio-educative dimension.

| Abbreviation | Definition |
|---|---|
| Rel3To5 | Log-ratio of children 3 to 5 years old who attend school, divided by children 3 to 5 years old who do not |
| Rel6To11 | Log-ratio of children 6 to 11 years old who attend school, divided by children 6 to 11 years old who do not |
| Rel12To14 | Log-ratio of adolescents 12 to 14 years old who attend school, divided by adolescents 12 to 14 years old who do not |

A Divided City
The principal component method has been applied to the socio-economic and socio-educative variables of the study, which are represented by the log-ratios of the considered indicators. Therefore, the compositional character of each bivariate composition is considered in a joint multivariate treatment.
As part of the results from the standard PCA conducted for the socio-economic variables, PC1 and PC2 explain 59.1% and 32.7% of the variation in the data, respectively. Together, PC1 and PC2 capture 91.84% of the variability (Table 3). More importantly, PC1 and PC2 give a reasonable description of the variation in the data through their correlation with the variables (Table 3). From the correlation table, PC1 has a large negative correlation with Rel_Phone and Rel_Auto, indicating that this first component primarily measures the financial capacity to acquire goods that are not considered basic necessities (such as an automobile and a telephone). On the other hand, PC2 has a large positive correlation with Rel_TV, Rel_Fridge and Rel_Washer, indicating that this component primarily measures the financial capacity to acquire material goods considered basic (such as a refrigerator, television and washing machine). Rather than examining the numbers, the principal component biplot is more helpful for understanding the transformed observations and the original variables (Figure 4). In the biplot, angles between the variable arrows give an idea of the correlation between the variables. The smaller the angle, the greater the correlation (e.g., Rel_Auto and Rel_PC are highly correlated, while Rel_Fridge and Rel_Internet are not correlated). Likewise, the PCA biplot gives a rough indication of how large or small the value of each variable is for each observation. (Note that, to better understand the behavior of the observations, colonias are color coded by urban district.) The biplot reveals that the colonias belonging to the Minerva district behave similarly to one another but opposite to the rest. Specifically, the highest proportions of households with cars, internet and computers are found in the colonias of the Minerva district (red).
On the other hand, the colonias belonging to the Oblatos district (blue) have the lowest percentage of households with cars, computers and internet access. Moreover, these three indicators have the strongest relationship with each other.
Although the remaining colonias are harder to distinguish, this first approach allows us to observe a clear differentiation between the Minerva and Oblatos districts, highlighting the west-east dichotomy in the city regarding the economic dimension.
The standard PCA results with regard to the socio-educative variables can be seen in Table 4. The first two components account for 59.1% and 25.3% of the variance, respectively, together explaining 84.4% of the variation. The variables with the largest positive correlations with the first component are Rel18HigherEducation, Rel15MaxElementarySchool and Rel15MaxMiddleSchool (Table 4); therefore, PC1 primarily measures the population with a higher educational level. Moreover, PC2 has a large positive correlation with Rel12To14 and Rel6To11, revealing that this component measures the basic years of education in the population. On the other hand, PC3 has a strong negative correlation with Rel3To5, indicating that this component primarily measures the absence of schooling in early childhood. To better understand the standard PCA results, the principal component biplot for this dimension was produced (Figure 5). It is evident that the angle between the variable arrows Rel18HigherEducation and Rel15MaxMiddleSchool is small, meaning that there is a high correlation between these two variables. The same can be said with respect to Rel15MaxElementarySchool and Rel15MaxMiddleSchool. In order to illustrate the behavior of the observations (colonias), these have been color coded by urban district. Colonias belonging to the Minerva district (red) have the highest level of persons with a formal education, as compared to the six other districts (Figure 5). On the other hand, colonias belonging to the Oblatos district (blue) have the highest level of persons lacking a structured education. As in the economic dimension, the west-east dichotomy is evident, especially in the polarization of the Minerva-Oblatos districts.
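The biplot reading used in this section (a small angle between two variable arrows indicates a high correlation) can be checked numerically. The following Python sketch uses NumPy on hypothetical data, not the study's tables; the names `var_pc_corr` and `cosine` are assumptions for the example. It computes the correlations between variables and components, as reported in a correlation table, and compares the cosine of the angle between two variable arrows with the correlation between those variables.

```python
import numpy as np

# Hypothetical log-ratio table: 200 areas, 3 indicators.
rng = np.random.default_rng(1)
shared = rng.normal(size=200)
table = np.column_stack([
    shared + 0.3 * rng.normal(size=200),
    shared + 0.3 * rng.normal(size=200),
    rng.normal(size=200),
])

corr = np.corrcoef(table, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Correlation of each variable (row) with each component (column).
var_pc_corr = eigvecs * np.sqrt(eigvals)


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Using all components, the cosine between two variable arrows equals their
# correlation; a two-dimensional biplot only approximates it.
cos_full = cosine(var_pc_corr[0], var_pc_corr[1])
cos_2d = cosine(var_pc_corr[0, :2], var_pc_corr[1, :2])
```

The two-dimensional value is an approximation whose quality depends on how much variation the first two components retain, which is why the text treats biplot angles as giving "an idea" of the correlations.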

West-East, Different Urban Fragments from the same City
As explained in Section 2.5, the selected clustering method is based on Ward's method. Through the dendrogram obtained from the bottom-up agglomerative clustering, we decided to classify the colonias into four clusters ( Figure 6). To examine differences in the territory, clusters are color coded.
In the socio-economic dimension, the colonias belonging to cluster four (red) are mostly grouped in the Minerva district. Using the Independencia Causeway (dashed line) as a reference, colonias classified in cluster four predominate to the west (Figure 6). In contrast, colonias to the east of the Independencia Causeway are mostly in clusters one (blue), two (green) and three (purple).
As evident from the cluster characterization (Figure 7), and based on the variables selected to measure the economic dimension, colonias in clusters two and three are the most disadvantaged. These clusters are characterized by dwellings that mostly lack material goods (e.g., television, refrigerator, washing machine, car, computer, telephone, cell phone and internet); of these, cluster two lacks most non-essential goods, especially computer and internet access. In cluster one, the proportion of dwellings with a given material good is similar to the proportion lacking it. Finally, cluster four stands out from the rest. Its ratio of households that own a car to those that do not is by far the highest. Likewise, this cluster has the highest ratios for the remaining goods, with the exception of television and refrigerator.
The cluster characterization shows that the colonias in cluster four stand out for having a highly educated population (e.g., Rel18HigherEducation), mainly in the population over the age of fifteen that has completed secondary education (e.g., Rel15MaxMiddleSchool) (Figure 9). On the other hand, the colonias in clusters one and two are characterized by a significant lack of formal education. In regard to this dimension, and according to the results, the colonias in cluster two present the greatest lack of formal education in all the age groups analyzed. Even more importantly, considering that poverty is an intergenerational phenomenon, this cluster is the most vulnerable of all.
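A cluster characterization of this kind can be sketched by averaging the log-ratios within each cluster and, for interpretation, converting the average back to a household share. The Python snippet below is illustrative only: the cluster labels and log-ratio values are hypothetical, and the function name `inverse_log_ratio` is an assumption. A mean log-ratio near zero corresponds to the situation described for cluster one, in which households with and without a good are roughly balanced.

```python
import math
from collections import defaultdict


def inverse_log_ratio(y):
    """Share of the numerator part implied by a two-part log-ratio y."""
    return 1.0 / (1.0 + math.exp(-y))


# Hypothetical (cluster label, Rel_PC log-ratio) pairs for six colonias.
observations = [(1, 0.1), (1, -0.1), (2, -1.5), (2, -1.3), (4, 1.2), (4, 1.0)]

by_cluster = defaultdict(list)
for cluster, value in observations:
    by_cluster[cluster].append(value)

# Mean log-ratio per cluster, and the household share it corresponds to.
profile = {}
for cluster, values in sorted(by_cluster.items()):
    mean_log_ratio = sum(values) / len(values)
    profile[cluster] = (mean_log_ratio, inverse_log_ratio(mean_log_ratio))
```

Note that the back-transformed value is the share implied by the average log-ratio, which in general differs from the arithmetic mean of the individual shares.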

Segregation and Urban Challenges
The analysis of the existence of socio-economic segregation in the city of Guadalajara has been carried out considering variables belonging to the socio-economic and the socio-educative dimensions. The results obtained show the existence of internal differentiation in the social structure of the city.
According to the analysis, the colonias belonging to the Minerva district have a higher education level and greater purchasing power compared to the rest of the city. On the other hand, the colonias that belong to the Oblatos district are those with the highest degree of vulnerability, reflected by lower purchasing power and the lowest educational levels.
The results are a consequence of the foundational processes of the city (Spanish city versus the city of the indigenous), which have been accentuated up to the present day by a multitude of economic, social and political forces. This highlights the still existing dichotomy between the west and east of the city. Moreover, the results obtained in the study are consistent with other studies from different periods of time which, through different approaches, also highlighted the west-east differences [39-43,45-48].
The results obtained in this study show the challenges that the city government will have to face, especially in the east of the city, in order to meet the SDGs by 2030, particularly Goals 10 (reduced inequalities) and 11 (sustainable cities and communities).
Among the challenges are the low schooling levels in the economically active population, which limit the competitiveness and productivity of the municipality and its ability to attract investment with high added value. Nearly 80% of schools (741 schools) are over 30 years old (technically, they have served their useful life) and are unused. Moreover, approximately 44.2% of high school students drop out of school during the first year for two main reasons: The need for income and the lack of motivation [34]. Furthermore, the rise in urban land costs, the increase in the cost of construction materials and the real estate sector have influenced the capacity to acquire housing.
Additionally, in 2018, according to the Federal Mortgage Society Price Index, the metropolitan area of Guadalajara presented the largest increase in the price of housing in the country, at 11% [34]. The lack of income and the increase in housing costs make it impossible for a sector of the population to access housing. Due to the increase in housing costs, the city lost about 11.5% of its population in the period 1990 to 2015 [34,35]. To mitigate this loss, the government aims to repopulate the city by increasing population density (vertically) in central areas and in the vicinity of public transport corridors. The 2018-2021 municipal urban development plan states that the repopulation of the city will be achieved through an inclusive and sustainable approach. Likewise, it is contemplated that one of every five houses produced in the city would be affordable housing [34,77]. However, very little information regarding the specific policies to achieve these objectives is available in the development plans.
Furthermore, fear, the presence of criminal organizations, insecurity and the low quality of public infrastructure have an impact on social cohesion and can influence the transmission of information regarding employment opportunities [76,78]. In the Guadalajara metropolitan area, the crime rate is approximately 37.9 crimes per 100,000 inhabitants. However, only 4.9% of crimes are reported, and 81.2% of the population feels unsafe [79]. Likewise, 50% of the parks and gardens are in poor condition, and 37% of municipal sports units have a significant level of deterioration [34]. Consequently, there is an impact on the opportunities of people to fully develop and to access better jobs and income, which leaves them in a state of vulnerability.

Lessons and Methodological Implications
The results obtained in this study with the log-ratio approach for compositional data are congruent with those of the urban marginalization index [22]. However, it should be noted that the application of standard methods may cause spurious correlations and wrong conclusions. Even when the application of standard methods to compositional data yields interpretable and apparently reasonable results, the application of standard methods to compositional data is, at best, inappropriate [26]. As explained by Aitchison [25], Pawlowsky-Glahn and Egozcue [26], Filzmoser, Hron, and Reimann [60], Marcillo-Delgado, Ortego and Pérez-Foguet [80], and Cruz-Sandoval, Ortego and Roca [29], the use of standard statistical methods with compositional data gives rise to problems such as the prediction of values outside the sample space, spurious correlations and sub-compositional incoherence, among others.
Accordingly, ignoring the compositional nature of the data in any urban context might lead to the implementation of mitigation measures and urban policies in places where they are not strictly necessary, making them less effective. This becomes relevant given the low tax collection of the city (a common phenomenon in the Latin American context) and, consequently, the low budget allocated to anti-segregation interventions and programs (41% of the city budget is assigned to security, and the rest is allocated to other areas). Thus, the present method (through the statistical analyses) helps to identify the vulnerable areas in any urban context with the purpose of implementing anti-segregation actions towards more sustainable and inclusive cities, such as the provision and recovery of public spaces, the integral rehabilitation of disadvantaged neighborhoods, residential diversification, social innovation, anti-speculation taxes, land use controls, provision of infrastructure, social rented dwellings and social housing, among others [2,37,81-83].

Limitations and Future Research
The scale and the lack of detailed variables in the population census pose a challenge in this type of study. Regarding the former, confidentiality restrictions, zeros and missing values become more frequent as the scale of the analysis is reduced, while the latter is reflected in the use of two-part compositions in this study. If the variables of the economic dimension in this study (with the existing information available) were considered as a composition of eight parts, many households would have been counted more than once in the same ratio, and the results would thereby have been skewed. In other words, the available census information from Mexico has only allowed for two-part compositions to be considered in this study. More parts could only be considered in the compositions if the population census disaggregated the variables in more detail. In that case, other transformations (clr, ilr/olr) would have to be taken into account.
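For compositions with more than two parts, the centered log-ratio (clr) mentioned above can be sketched as follows. This is a minimal Python illustration with a hypothetical three-part composition whose categories are assumed to be mutually exclusive; it is not part of the analysis carried out in this study.

```python
import math


def clr(parts):
    """Centered log-ratio of a composition with strictly positive parts."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be > 0; treat zeros before transforming")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]


# Hypothetical three-part composition (shares of three exclusive categories).
coords = clr([0.2, 0.3, 0.5])
# The same composition expressed as counts gives the same coordinates.
coords_rescaled = clr([20, 30, 50])
# With two parts, each clr coordinate is half of the simple log-ratio.
coords_two_part = clr([3, 1])
```

The clr coordinates always add up to zero, which should be kept in mind when applying methods that require a non-singular covariance matrix.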
Additionally, it would be interesting for future research to know whether public expenditure by the city government has a direct influence on differences in the social structure and to know whether there is an association between the expenditure and socio-economic strata. It would also be relevant to know if the government has caused gentrification processes through these urban interventions.
Finally, this study is intended to serve as a tool for decision-makers in identifying areas that need interventions towards the SDGs fulfillment, as well as in the proper management of compositional data.

Conclusions
A descriptive, multivariate statistical analysis of different indicators based on socio-economic and socio-educative dimensions was performed. Results from PCA and cluster analyses based on a log-ratio approach with two-part compositions showed a first picture of the existing socio-economic and socio-educative segregation pattern in the city of Guadalajara. This study supports the existence of patterns of segregation that have persisted since the colonial period in the city of Guadalajara. This might mean that the old natural barrier of the San Juan de Dios River that separated the Spanish city from the city of indigenous persons, and which is now physically represented by the Independencia Causeway, has become an "imaginary" barrier that continues to divide the rich and the poor.
Such a division is clearly seen in the polarization between the Minerva district and the rest of the districts. Moreover, recognizing segregation as a complex and multidimensional phenomenon, different indicators and dimensions should be incorporated into the study, considering the compositional nature and the log-ratio transformation of the data.
Funding: This work was developed in the framework of a grant received from the Mexican government through a CONACYT fellowship (Reference 612800) and partially supported by grants RTI2018-095518-B-C22 and PCI2019-103674 (MCIU/AEI/FEDER) of the Spanish Ministry of Science, Innovation and Universities, and the European Regional Development Fund.

Conflicts of Interest:
The authors declare no conflict of interest.