Decision Model for Predicting Social Vulnerability Using Artificial Intelligence

Abstract: Social vulnerability, from a socio-environmental point of view, focuses on the identification of disadvantaged or vulnerable groups and the conditions and dynamics of the environments in which they live. To understand this issue, it is important to identify the factors that explain the difficulty of facing situations with a social disadvantage. Due to its complexity and multidimensionality, it is not always easy to point out the social groups and urban areas affected. This research aimed to assess the connection between certain dimensions of social vulnerability and its urban and dwelling context as a fundamental framework in which it occurs using a decision model useful for the planning of social and urban actions. For this purpose, a holistic approximation was carried out on the census and demographic data commonly used in this type of study, proposing the construction of (i) a knowledge model based on Artificial Neural Networks (Self-Organizing Map), with which a demographic profile is identified and characterized whose indicators point to a presence of social vulnerability, and (ii) a predictive model of such a profile based on rules from dwelling variables constructed by conditional inference trees. These models, in combination with Geographic Information Systems, make a decision model feasible for the prediction of social vulnerability based on housing information.


Introduction
Vulnerability is often defined as the potential for physical or economic loss or damage [1] located in a specific territory. When vulnerability is approached from a social point of view, the focus falls squarely on its human dimension, that is, on the population affected [2]. In recent years, many lines of work have emerged around the concept of social vulnerability, some linked more closely to natural risks and disasters [3,4], some to environmental factors [5], and others to the concept of poverty [4]. An approach that initially treated natural events as the main focus has gradually given way to one in which the effects on the population are conditioned by its own mitigation capacity [6]. Mitigation, in this sense, is the ability of an individual or community to anticipate, resist, and overcome the impact of unforeseen events [7]. The concept of social vulnerability has thus been broadened by placing people at the center: it concerns whether people have the capacity to overcome [8] or adapt to adverse events, which are not exclusively linked to environmental risks, and it even incorporates a spatial aspect [9].
Social vulnerability presents various challenges, such as multidimensionality [4,5,10-12], or the fact that many of the variables or dimensions to be evaluated are not generally directly observable [10]. Among the studies that have tried to identify indicators of social vulnerability, the one by Cutter et al. (2003) stands out, in which the authors incorporated, as variables of the so-called Social Vulnerability Index (SoVI), a whole series of indicators including socio-economic factors, age, commercial or industrial development indicators, unemployment, rurality indicators, residential property, level of infrastructure, level of income, occupation, access to medical services, gender factors, race and ethnicity, family structure, educational level, vegetative growth, dependence on social services, and the presence of a population with special needs [5]. Some studies integrated similar techniques in data interpretation methodologies [13,14], and many others [14-19] integrated or compiled indicators with the same objective.
To develop a decision model for predicting social vulnerability, this research adopts the concept of the decision support system (DSS) because of its capacity, beyond the use of information technologies (ITs) [20], to amplify the capacities of decision-makers [21]. The DSS concept was introduced by George Anthony Gorry and Michael S. Scott Morton [22]. Applied here to social vulnerability, it is proposed as a tool with which a Geographic Information System (GIS) must be fully integrated.
The integration of massive data, in what some call the new quantitative geography based on GIS tools, is elevating the granularity of geographic data to the extreme, in an authentic "n-dimensionality" of the data [23]. In most cases, according to Pragya Agarwal and André Skupin, GISs have focused on traditional statistical analysis to solve spatial autocorrelation problems, leaving many other areas totally unexplored. Some of these spaces are being addressed by emerging approaches such as artificial intelligence (AI) or artificial neural networks (ANN), machine learning (ML), or specifically geo-computing. ANNs are usually included as a category of ML methods frequently used for prediction, classification, and pattern recognition [24], with multiple practical applications, e.g., monitoring and control of industrial or medical instrumentation, telecommunication networks, etc. [25]. These new techniques and approaches are bringing about a paradigm shift in DSSs, which can now be useful for understanding reality, detecting its problems, and, in short, formulating new hypotheses, not only for verifying previously established ones.
Within this conceptual framework, the main aim of the research is the construction of a predictive model that allows the identification of territories with high social vulnerability from a limited set of variables that are easy for the decision-maker to access. Specifically, the creation of a model based exclusively on residential information, as the basis and fundamental support of social reality, is proposed. The main contribution to the field of social vulnerability consists of demonstrating the viability of such a model, constructed from residential indicators that are simple to obtain. The residential information can be obtained by means of a visual inspection in situ, in contrast to the information needed to evaluate social vulnerability directly, which is notably more complex and costly to obtain. To this end, techniques for interpreting reality using artificial intelligence and machine learning, supported by a geographic information system, are used. Specifically, as a case study of the proposed methodology, the social vulnerability of the population of Andalusia (Spain) is characterized on the basis of information about the dwellings in which the population resides, and the model is validated by comparing its predictive performance with existing research on social vulnerability in the region.

Social Vulnerability
Social vulnerability is a complex concept, which requires an approach that includes multiple dimensions and factors. With the intention of synthesizing some of the main contributions with respect to it, a review of the literature is carried out, organizing its approaches and indicators of measurement or evaluation, highlighting among all of them the indicators of social vulnerability (SoVI) [5]. In order to carry out a systematic approach, the main references are organized around the classification recently proposed by Lee [15], all of which are summarized in Table 1.

Table 1. Integrated factors of social vulnerability and resources. Classification based on [5,15]. Source: Compiled by the authors based on cited references.
[15,18,31,33,41,43,45,50]
Family and social structure [16,18,34,43,45,46]
Public resource provision and public security (public infrastructure and resources that belong to inhabitants and its safety)
  Infrastructure and lifelines [16,18,32,41,43,50,56,57]
  Medical services [17,18,27,31,32,43,45,57]

It is important to highlight the existence of a reference work on social vulnerability in Andalusia, which is the area where the methodological proposal is assessed. This research identifies deprived urban areas [58] throughout the Andalusian region. Deprived urban areas are understood as those areas that present a series of weaknesses in their socio-demographic structure and/or in the environmental qualities of their physical space. The authors explain that these are neighborhoods with a structural social and economic weakness in which any threat, external risk or even social intervention without prior analysis can turn them into a vulnerable area.

Decision Model and Decision Support System
The DSS assists and guides decision making [59], allowing the amplification of the decision maker's abilities to interpret information and knowledge [21], reducing improvisation and indeterminacy [60]. DSSs have been used in multiple fields such as business intelligence [61], health [62], and fleet management [63], and with the help of a GIS, for the management of means of transport [64]. On the other hand, GISs have been used assiduously by governments, researchers, and companies as a decision-making tool in which the spatial dimension reaches a certain repercussion and influence [65].
GIS emerged at the end of the 1960s, developed with particular intensity in the 1980s [66], and came into general use from the 1990s, coinciding with the arrival of GPS technologies to the civilian population in 1993 [67]. Today, GIS has been fully integrated into social media [68], in a space-time integration [69]. It has become established in society in what is called "spatial thinking" [70], making it easier for citizens such as urban planners to become important actors in planning.
It must be borne in mind that GISs have not always been prepared to act as a DSS, as this requires the integration of complex realities and problems for decision support [71] and their consolidation as flexible and resilient systems. Likewise, DSSs have recently moved from a focus on technology and systems to one focused on decision-makers [20], with the intention of helping them to process knowledge [21] and facilitating decision-making based on technology [59] in an accessible and affordable way.
Moreover, since their origin, the informational reality on which GISs are based has also changed rapidly, with a growing presence of freely accessible spatial data and information. The field has practically become a discipline in itself, which some call GIScience [72-75], a term introduced by Goodchild in 1992 [76] based on the idea of a new quantitative geography, fundamentally spatial, tending towards planning and management. Over time, GISs have evolved from an emphasis on the "S", for the computational problems (1960s-1970s), to the "I", for the interest in the information (1980s-1990s), and, from 2000, to the "G", due to a need for geographical interpretation, materializing in the "society of the geographic information" and opening a new stage for the history of geography [77]. It is at this point that the GIS approaches the DSS concept.
In order to achieve the aim of the research, a decision framework or decision model is created. This decision framework draws conceptually on the idea of the DSS, moving between a view of the DSS centered on the IT techniques that support the decision [78] and a view centered on broadening the capacity of decision-makers [20,21]. In the first view, a DSS could be described as "a set of interactive and expandable IT techniques and tools for data processing and analysis, that supports managers in decision making" [78]. In the second view, it can be described in the words of Power, Sharda, and Burstein [20]: "More broadly, DSS is not exclusively based on the use of the computing technologies, instead it is focused on the 'ability to relax cognitive, temporal, and economic limits of decision-makers - amplifying decision-makers' capacities for processing knowledge which is the lifeblood of decision making' [21]". This research is framed in two of the different types of DSS described by Power et al. [20]: (1) knowledge-focused and (2) model-oriented. The first type focuses on the construction of a knowledge discovery system based on institutional databases on the demographic and social qualities of Andalusia. It allows the identification of socially vulnerable areas. The second type focuses on the creation and management of a quantitative model of social reality. It is aimed at providing decision support, and it is built from the dwelling properties of the territories under study.
Initially, DSSs were not thought of as an autonomous discipline but rather as a method for bringing intelligence to decisions at the productive and environmental level [79]; the field has since undergone important development, in parallel with the emergence of data science. DSSs use data and parameters provided by decision-makers to help them analyze a situation, although they do not have to be based on massive data [20]. In our case, the model was built with massive data (both demographic and residential) but can be used for decision making with very limited, even scarce, information.
The research, as it has progressed, consists of the construction of two models linked to tools from information technologies [80], as described in the following sections.
The study used the self-organizing map (SOM) methodology, initially proposed by Teuvo Kohonen [81,82]. The SOM methodology is a knowledge discovery or data mining technique consisting of an artificial neural network.
It is based on non-supervised learning: from the input data (input layer), it organizes the data in a representation space of M neurons arranged in a lattice of size a · b, where M = a · b. This lattice reveals the topological relations and similarity between the subjects under study, locating instances with more similar properties or attributes closer to each other. The topological distance between the neurons is evaluated by means of an iterative process. Each neuron holds a prototype representing a cluster of input samples. At each time step, a new sample is presented to the network, a winner neuron is declared, and the prototypes are adjusted. The process stops after a predefined number of iterations or when the decaying learning rate reaches a set value.
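The iterative process described above can be illustrated with a minimal sketch. It is not the implementation used in the study (which relied on commercial software); the linear decay schedules, the Gaussian neighborhood, and every name and default value (`train_som`, `lr0`, `n_iter`, the toy data) are assumptions chosen only to make the mechanism visible.

```python
import math
import random


def best_matching_unit(protos, x):
    """Lattice position of the neuron whose prototype is closest to sample x."""
    return min(protos, key=lambda pos: sum((a - b) ** 2 for a, b in zip(protos[pos], x)))


def train_som(samples, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a small SOM; returns {(row, col): prototype vector}."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # One prototype per neuron of the rows x cols lattice (M = rows * cols).
    protos = {(r, c): [rng.random() for _ in range(dim)]
              for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0  # assumed initial neighborhood radius
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighborhood
        x = rng.choice(samples)                  # present one sample
        win = best_matching_unit(protos, x)      # declare the winner neuron
        for (r, c), w in protos.items():
            # Distance on the lattice (not in data space) sets the pull strength.
            d2 = (r - win[0]) ** 2 + (c - win[1]) ** 2
            h = math.exp(-d2 / (2.0 * sigma ** 2))
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])   # adjust the prototype
    return protos


# Toy data: two well-separated groups of two-dimensional samples.
data_rng = random.Random(1)
low = [[data_rng.uniform(0.0, 0.2), data_rng.uniform(0.0, 0.2)] for _ in range(30)]
high = [[data_rng.uniform(0.8, 1.0), data_rng.uniform(0.8, 1.0)] for _ in range(30)]
som = train_som(low + high, rows=3, cols=3)
bmu_low = best_matching_unit(som, [0.1, 0.1])
bmu_high = best_matching_unit(som, [0.9, 0.9])
```

After training, samples from the two groups are expected to map to different neurons, which is the property that later allows neurons to be grouped into profiles.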
The SOM methodology comes from the field of artificial intelligence and has proved very effective and robust in numerous disciplines. Two of its capacities stand out: (i) it shows and visualizes the starting information in a clear and ordered way, and (ii) it allows the clustering and, therefore, labeling of study subjects in classes that do not need to be defined, characterized, or named beforehand (non-supervised learning). Compared to other pattern discovery methodologies, such as cluster analysis, the SOM methodology has the advantage of (i) allowing a large set of statistical data to be visualized [83], (ii) showing the topological relationships of similarity or difference among the items under study, (iii) being graphically interpretable, and (iv) constituting by itself a knowledge system of a DSS for the analysis and visualization of statistical indicators [83].
By means of these techniques, the different fragments of the Andalusian territory studied are labeled as classes or profiles on the basis of a multi-variable analysis of the demographic and social attributes of the study. Based on the SOM methodology, the profiles obtained are analyzed and interpreted, which materializes in thematic cartographies of the different attributes included in the neural network and in tables and statistical data that reveal the differentiating characteristics of each profile. To facilitate their use as part of a DSS, these classes, and in particular the social vulnerability profile, are represented through GIS.
The SOM methodology has been widely used in numerous fields. In the field of image interpretation, we can highlight its use for the analysis and classification of multiple satellite images of 200 different bands [84] and the classification of soils and minerals using spectral radio images and GIS [85]. It has also been used in transport for the graphical analysis of spatial interactions and for obtaining patterns in US air transport structures [86], and for the classification of the sustainability of transport in cities according to TOD (Transit-Oriented Development) criteria [87]. The use of SOM to classify and recognize patterns of epidemiology with the help of GIS, or the distribution of ecological risk by contamination, is described in [88]. It has been frequently used to classify and locate, for example, patterns of pesticide contamination in the Adour-Garonne river basin in France [88], to create models of plant location for the treatment of wood residues [89], and to assess the environmental quality of soils [90]. SOM and GIS have also been used together to classify community health based on environmental condition variables [91] and to show quality of life trends in the neighborhoods of Charlotte (USA) [92].
In areas of knowledge with a more social and demographic aspect, certain studies stand out that use SOM for the representation of data. Examples include the joint visualization with GIS of the demographic changes of the counties of Texas (USA) over time [93], the SOM visualization of spatial-temporal patterns of geographical variables in the USA [94], and the use of SOM to build a holistic representation, alternative and complementary to the spatial representation of the GIS, in which 69 census attributes in the USA are combined with information on climate, topography, soil, geology, land use, and population [95]. SOM have been used as a classifier to determine homogeneous demographic regions from Athens census data [96], to classify European adaptation strategies [97], to characterize neighborhoods by tagging New York census sections from 79 geo-demographic attributes [98], and for non-supervised classification of geospatial data from German communities in terms of population, migration, taxes, residence, employment, and transportation [99]. SOM have also been used to make a semantic representation and characterization of exemplary neighborhoods from the recent history of urbanism [100], to identify and characterize the urban sprawl of Milan (Italy) [101], and to analyze the residential market of Finland, Hungary, and The Netherlands from variables such as prices, qualities and characteristics of housing, density, and inhabitants [24].
As indicated, the state of the art shows that SOM is very often used as a methodology for reduction and classification [102] and also for entity labeling [103]. Compared to other dimensional reduction methods such as PCA (principal component analysis) or MDS (multidimensional scaling), the ability of SOM to preserve the topology of the data results in more efficient use of the available space in the map representation, at the cost of greater distortion in relative distances [104]. SOM also has notable advantages over other techniques or methods. It is relatively insensitive to missing values and tolerates data with a non-normal distribution, which makes it possible to dispense with checks that are difficult to satisfy and makes it valid for any data distribution. In addition, as a clustering method, SOM is more robust than, for example, K-means, although it requires more computing time [90,105].

Construction of Predictive Models through Supervised Learning-Decision Trees
By means of a machine learning process, a series of rules was obtained that allows for the prediction of one of the profiles determined with the SOM model, using only attributes of the dwelling reality of the territories under study. These dwelling variables were not used to train the SOM neural network and therefore played no part in defining the profiles it produced. The learning problem is approached through the "divide and conquer" paradigm, which, when applied to a set of independent instances, naturally leads to a style of representation called the decision tree [106]. Each node of the tree involves a particular attribute, typically comparing the value of that attribute in each instance with a constant and usually generating two branches, one for the instances that fulfill the rule and one for those that do not. Decision trees provide a simple, user-friendly representation for interpreting and predicting the demographic and social reality of a territory and are consequently useful for decision making about it. The model uses a limited portion of the available features, and the residential reality of the place under study is generally easy and economical to obtain. Likewise, when evaluating the "value" of the profiles reached in Phase 1, in their spatial characterization by means of GIS, the usefulness of the proposed methodology is verified.
Decision trees are machine learning techniques that generate models that are very easy to understand and use. Decision trees are (i) models insofar as they construct a hypothesis or representation of the regularity of the data, (ii) understandable by symbolically expressing a set of conditions, and (iii) propositional by establishing "attribute-value" rules in their construction in which the conditions are expressed over the value of a single attribute [107].
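The "attribute-value" structure of such rules can be sketched as follows. The tree below is purely hypothetical: the attribute names, thresholds, and leaf labels are invented for illustration and are not the rules obtained in this research.

```python
def predict(node, instance):
    """Walk a binary tree of attribute-value tests until a leaf is reached."""
    while "leaf" not in node:
        # Each internal node compares ONE attribute with ONE constant.
        branch = "yes" if instance[node["attribute"]] <= node["threshold"] else "no"
        node = node[branch]
    return node["leaf"]


# Hypothetical two-level tree; names and thresholds are illustrative only.
tree = {
    "attribute": "pct_dwellings_poor_condition", "threshold": 10.0,
    "yes": {"leaf": "not vulnerable"},
    "no": {
        "attribute": "mean_dwelling_area_m2", "threshold": 70.0,
        "yes": {"leaf": "vulnerable"},
        "no": {"leaf": "not vulnerable"},
    },
}

section = {"pct_dwellings_poor_condition": 25.0, "mean_dwelling_area_m2": 60.0}
label = predict(tree, section)
```

Because every test involves a single attribute and a constant, the path followed by any instance can be read aloud as a short conjunction of conditions, which is what makes the representation easy to apply by hand.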
There are many variants of decision tree algorithms. Here, we highlight just a few of them. Historical antecedents of the most widely used decision trees include the algorithms CHAID, CART, ID3, and C4.5. CHAID stands for Chi-squared automatic interaction detection. It was originally proposed by Kass [108], is based on Bonferroni's significance test, and is characterized by its ease of graphic interpretation and by being a non-parametric analysis. CHAID is a multivariate technique with a single variable to explain and several explanatory variables. At each branch, it selects the explanatory variable that discriminates most and the combination of its categories that provides the greatest discrimination in the dependent variable under analysis, i.e., it detects the interactions that most discriminate. CART stands for classification and regression trees, proposed by Breiman, Friedman, Olshen and Stone [109]. ID3 later gave rise to the C4.5 algorithm; both are very popular for their non-parametric approach and their interpretability [110].
Among the algorithms that generate decision trees, we highlight the recursive partitioning methods that have become very popular and widely used in recent years for nonparametric regression and for classification in many scientific fields [111].
The method specifically chosen is that of conditional inference trees, which is based on the permutation test [112] and uses nonparametric tests as the criteria for branch division. This method was selected mainly for its high comprehensibility and the ease of interpretation of the rules obtained. Other methods, such as random forests, are more precise. Random forests show great precision in their predictions, often greater than that of other recent statistical learning techniques such as support vector machines (SVMs) or boosting [111], but their rules are so complex and hard to read that it is very difficult for the analyst to reproduce the model by hand.
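The role of the permutation test in choosing a split variable can be sketched as below. This is a simplified illustration, not the procedure of [112]: it assumes a binary response, uses the absolute difference in class means as the test statistic, estimates the p-value by Monte Carlo shuffling, and omits any adjustment for multiple testing. All function and variable names are hypothetical.

```python
import random


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Monte Carlo p-value for independence between numeric x and binary y."""
    rng = random.Random(seed)

    def stat(labels):
        ones = [v for v, lab in zip(x, labels) if lab == 1]
        zeros = [v for v, lab in zip(x, labels) if lab == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

    observed = stat(y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # break any association between x and y
        if stat(shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)     # add-one correction keeps p above zero


def select_split_variable(columns, y, alpha=0.05):
    """Return the covariate with the smallest p-value, or None if none is significant."""
    p_values = {name: permutation_p_value(col, y) for name, col in columns.items()}
    best = min(p_values, key=p_values.get)
    return (best if p_values[best] <= alpha else None), p_values


# Toy example: one covariate separates the classes, the other carries no signal.
y = [0] * 10 + [1] * 10
columns = {
    "informative": list(range(10)) + list(range(20, 30)),
    "noise": [0, 1] * 10,
}
chosen, p_values = select_split_variable(columns, y)
```

When no covariate reaches the significance level, the function returns `None`, which corresponds to stopping the growth of the tree at that node; this built-in stopping rule is why such trees do not require pruning.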

Hybrid Model-SOM and Decision Trees
This section provides a few examples of research with a methodological framework similar to the one used in this paper: a hybrid model in which a decision tree is applied to the SOM, i.e., the decision tree uses the non-supervised clustering provided by the SOM as the information to be predicted.
With this approach, we can find some methodological work [113,114] and others related to biology and medicine, such as a mining study on biological data [115]. Concerning engineering, there is a selection of variables to group road samples [116] and post-processing of accident scenarios [117]. Related to economics and business, there is a discovery of preferences in stock trading [118], focused on this approach of "SOM + decision trees". To conclude this section, we find some similarities in the approach with our research for the selection of properties in the analysis of census data by SOM and decision trees [80,119].

Materials and Methods
For an optimal understanding of the methodology (Figure 1) and to obtain the best results from the DSS, the following phases [120] were followed: (i) information and processing functions, (ii) data sets, (iii) models, and (iv) visual representations.

Materials. Processing Information, and Functions
The information used in this research came from the 2001 Population Census of Andalusia provided by the regional government of Andalusia through the "Instituto de Estadística y Cartografía de Andalucía" (IECA). Data from 2011 were not used, inasmuch as that update was based on interpolations rather than survey data. Intense data preparation was carried out on this information, including data integration and cleaning and the transformation of attributes through the creation of aggregated indicators that synthesize the main demographic qualities of the original data in an objective and compact way. Due to the robustness of the SOM, it is not necessary to standardize or normalize the indicators [121] prior to aggregation and incorporation into the model.

1. Instances: The unit of territory on which the data were obtained is the Census Section, reaching the totality of the 5381 census sections of Andalusia, representing the totality of the surface and population censused in the Andalusian region, not initially carrying out any kind of sampling.
2. Attributes: Table 2 lists the indicators elaborated from the Andalusian Population Census and used in Modeling Phase 1 as measuring instruments to identify the factors and concepts of the state-of-the-art of social vulnerability with which they are related. The attributes used in Modeling Phase 2 (Table 3) were composed of variables of the residential dimension that were not used in Modeling Phase 1.

Data Warehouse
Initially, the system operates with two disconnected databases: one for Modeling Phase 1, with a mainly demographic and social dimension, and another for Modeling Phase 2, based on the dwelling dimension. The two function essentially independently and are connected only after Modeling Phase 1, to assess how the social vulnerability profiles fit and to construct the decision tree.
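The point at which the two databases meet can be pictured with a small sketch. It assumes, hypothetically, that both are keyed by a census-section identifier and that one SOM profile has been designated as the vulnerability profile; the identifiers, field names, and profile names below are invented for illustration.

```python
def build_phase2_table(profile_by_section, dwelling_by_section, target_profile):
    """Join Phase 1 profile labels to dwelling attributes by census-section id."""
    table = {}
    for section_id, dwelling in dwelling_by_section.items():
        if section_id not in profile_by_section:
            continue  # no Phase 1 label, so the section cannot be used for training
        row = dict(dwelling)  # copy, so the dwelling database is left untouched
        row["vulnerable"] = int(profile_by_section[section_id] == target_profile)
        table[section_id] = row
    return table


# Hypothetical extracts of the two databases.
profiles = {"S001": "profile_3", "S002": "profile_1"}
dwellings = {
    "S001": {"pct_rented": 41.0},
    "S002": {"pct_rented": 12.5},
    "S003": {"pct_rented": 30.0},
}
phase2 = build_phase2_table(profiles, dwellings, target_profile="profile_3")
```

Keeping the join as the only contact between the two sources makes it easy to confirm that no dwelling variable reaches the clustering step.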

Methods-Models
As progress was made, we distinguished between two phases of modeling:
1. Modeling Phase 1: Clustering and knowledge model. Its objectives include the clustering and labeling of demographic and social dimension data, as identified above. In this phase, an artificial neural network was used, specifically, the SOM methodology. Since it is non-supervised, this methodology allows clustering without assigning, a priori, labels with predefined definitions and meanings, which is useful to reduce the enormous complexity of the data [98].
The clustering of the entities was carried out by means of additional Ward-cluster analysis on the map. In this way, profiles or prototypes are generated by modeling patterns and trends in information [122]. To choose the number of clusters or profiles to be reached, there are multiple different methods and criteria, sometimes used in combination [123]. There are statistical approaches that use validation metrics, such as measures of "sums of squares" or dispersion. These include the Ball and Hall indices [124], Calinski and Harabasz [125], the Davies-Bouldin (DB) [126], the Silhouette Coefficient [127], the Cubic Clustering Criterion (CCC) [123], and the method based on the observation of dendrograms [123]. There are also approaches that are not strictly based on statistical criteria, such as the a priori method described by Joseph F. Hair Jr. et al. [128], in which an adjusted range is initially defined within which the groupings are expected to be interpretable, based on criteria of manageability and efficiency in the communication and interpretation of the results. After that, by means of practical judgment based on common sense and theoretical foundations, the researcher increases or reduces the final number based on conceptual aspects of the problem. According to the author, this methodology is more likely to yield a good solution than those based exclusively on statistical criteria [128].
In view of the above, and considering the clearly descriptive intention of this research, it was considered appropriate to choose the number of profiles by an exclusively conceptual criterion, aiming for a number of profiles that allows a relevant and useful interpretation of the data. To this end, an iterative process was carried out in which the number of profiles was increased step by step while evaluating the meaning and relevance of each new profile. The process ends when it is no longer possible to explain, with the necessary clarity, the meaning of a new profile, or when its fragmentation adds little practical or conceptual value to the research.
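Ward clustering of the map prototypes, repeated for a growing number of profiles, can be sketched as follows. This is a naive, quadratic-time illustration of Ward's criterion, not the routine of the software used in the study; the prototype coordinates are invented, and the judgment of whether each additional profile is meaningful remains with the analyst.

```python
def ward_clusters(points, k):
    """Greedy Ward agglomeration: repeatedly merge the pair of clusters whose
    fusion increases the within-cluster sum of squares the least."""
    clusters = [{"members": [i], "centroid": list(p)} for i, p in enumerate(points)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                d2 = sum((u - v) ** 2 for u, v in zip(ca["centroid"], cb["centroid"]))
                cost = na * nb / (na + nb) * d2  # Ward's merge cost
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters.pop(b)    # b > a, so index a stays valid
        na, nb = len(ca["members"]), len(cb["members"])
        ca["centroid"] = [(na * u + nb * v) / (na + nb)
                          for u, v in zip(ca["centroid"], cb["centroid"])]
        ca["members"] += cb["members"]
    return [sorted(c["members"]) for c in clusters]


# Hypothetical SOM prototypes (two attributes each).
prototypes = [[0.10, 0.10], [0.15, 0.12], [0.90, 0.90], [0.85, 0.95], [0.50, 0.10]]

# Grow the number of profiles step by step for inspection by the analyst.
solutions = {k: ward_clusters(prototypes, k) for k in (2, 3)}
```

Here the step from two to three profiles splits off the fifth prototype; whether such a split deserves its own profile is the conceptual decision described in the text.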
In order to facilitate the understanding of the profiles obtained, each cluster was characterized with its basic statistics, such as the Mean, Standard Deviation, Maximum and Minimum [88], with the main aim of obtaining two additional results: (i) the factor or variable that is most important for the effect and (ii) the value of such a factor [129]. In addition to the statistical information that defines the profiles, monovariable SOM maps are valuable for the analysis of the profiles, since the distribution of values in the same position allows relationships and correlations between variables to be evaluated. Following the recommendations of the American Statistical Association [130], for each variable and profile, the effect size (ES) was calculated in addition to the statistical significance [131]. Statistical significance was calculated using the two-tailed Student's t-test (p-value ≤ 0.05). The ES measures how strongly the values of a variable are influenced by whether or not they fall within the profile in question. It is calculated as the difference between the mean of the experimental group (profile) and the mean of the control group (population mean), divided by the standard deviation of the population [131]. In the corresponding tables, the effect sizes are indicated for each attribute/variable that intervenes in the construction of the profile: +++ large positive effect, ++ medium positive effect, + low positive effect, − low negative effect, − − medium negative effect, − − − large negative effect [132]. This provides very relevant information on the effect that the variables have on the definition and singularity of each profile. The Viscovery SOMine 5.0.2.t. software was used in this work for the construction of the SOM model, due to its good results at the visual representation level [133].
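The effect size defined above can be computed as in the following sketch. The cut-offs used to translate it into the +/− notation are an assumption: the text cites [132] without stating them, so the conventional values of 0.2, 0.5, and 0.8 are used here purely for illustration, and an ASCII hyphen stands in for the minus sign.

```python
import statistics


def effect_size(profile_values, population_values):
    """(profile mean - population mean) / population standard deviation."""
    return ((statistics.mean(profile_values) - statistics.mean(population_values))
            / statistics.pstdev(population_values))


def effect_label(es, small=0.2, medium=0.5, large=0.8):
    """Map an effect size to the +/- notation. The thresholds are ASSUMED
    conventional cut-offs, not values taken from the source."""
    sign = "+" if es > 0 else "-"
    size = abs(es)
    if size >= large:
        return sign * 3
    if size >= medium:
        return sign * 2
    if size >= small:
        return sign
    return ""  # negligible effect


# Hypothetical indicator values: whole population and one profile within it.
population = [10, 12, 14, 16, 18, 20, 22, 24]
profile = [20, 22, 24]
es = effect_size(profile, population)
label = effect_label(es)
```

In this toy case the profile mean (22) lies a little more than one population standard deviation above the population mean (17), so the indicator would be marked as a large positive effect under the assumed thresholds.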
2. Modeling Phase 2: Prediction model. For the construction of the model that allows for the prediction of social vulnerability, a rule-based decision tree was evaluated, which identifies, through a sequence of successive conditions, the probability that the vulnerability pattern obtained in Modeling Phase 1 is present. The data were partitioned 70/30 (training/test), and conditional inference trees based on permutation tests [112] were used; these use non-parametric tests as the splitting criterion and do not require pruning. The "rpart" package of the statistical software R-Project [134] was used, with minimum split = 20, maximum depth = 2, and minimum bucket = 7 as the parameters.
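As a rough illustration of the two ingredients named above, the following Python sketch performs a 70/30 partition and a Monte Carlo permutation test of the association between one predictor and a binary class label. It does not reproduce the R implementation used in this study: the function names, the test statistic (absolute difference in group means), and the data are all hypothetical, and real conditional inference trees use a more general permutation-test framework to select split variables.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and cut them into training and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Monte Carlo permutation test for association between x and binary y.

    Statistic: absolute difference in the mean of x between y == 1 and y == 0.
    """
    def statistic(labels):
        ones = [xi for xi, yi in zip(x, labels) if yi == 1]
        zeros = [xi for xi, yi in zip(x, labels) if yi == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

    rng = random.Random(seed)
    observed = statistic(y)
    labels = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)  # break any real link between x and y
        if statistic(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical data: (average year of construction, belongs to vulnerable profile).
rows = [(1930 + 2 * i, 1) for i in range(10)] + [(1980 + 2 * i, 0) for i in range(10)]

train, test = train_test_split(rows, train_fraction=0.7, seed=0)
x_train = [year for year, _ in train]
y_train = [label for _, label in train]
p_value = permutation_p_value(x_train, y_train)
print(len(train), len(test), p_value)
```

In a tree-growing procedure of this kind, a predictor with a small p-value would be a candidate for the next split, and growth stops when no predictor is significantly associated with the response, which is why no separate pruning step is needed.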

Visual Representations
One of the main qualities of the SOM is its ability to represent the resulting information in a very powerful and synthetic way that is, at the same time, relatively simple to understand and interpret. It shows a two-dimensional representation of the starting instances in which each instance has as its "neighbor" the instance with the most similar qualities. The same map usually represents the grouping of the instances into the different profiles. This representation is usually completed with a map for each of the attributes or variables that helped build the ANN of the SOM.
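The neighborhood property mentioned above follows from the way a SOM is trained. The sketch below shows, under simplifying assumptions, a single generic training step on a tiny 2 × 2 grid: the best matching unit (BMU) for an input is found, and the BMU and its grid neighbors are pulled toward that input. It is not the algorithm of the Viscovery SOMine software used in this work; the function names, grid, learning rate, and Gaussian neighborhood width are hypothetical.

```python
import math


def best_matching_unit(weights, x):
    """Return (row, col) of the grid node whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, x)
            if d < best_dist:
                best, best_dist = (r, c), d
    return best


def som_update(weights, x, learning_rate=0.5, sigma=1.0):
    """One SOM step: move the BMU and its grid neighbors toward x (in place)."""
    br, bc = best_matching_unit(weights, x)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_dist_sq = (r - br) ** 2 + (c - bc) ** 2
            # Gaussian neighborhood: nodes far from the BMU on the grid move less.
            influence = math.exp(-grid_dist_sq / (2 * sigma ** 2))
            for i in range(len(w)):
                w[i] += learning_rate * influence * (x[i] - w[i])
    return br, bc


# Hypothetical 2 x 2 map with two-dimensional weight vectors.
weights = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.8, 0.9]],
]
x = [1.0, 1.0]

before = math.dist(weights[0][0], x)
bmu = som_update(weights, x)
after = math.dist(weights[0][0], x)
print(bmu, weights[1][1], before, after)
```

Repeating this step over many instances, while shrinking the learning rate and the neighborhood width, is what causes similar instances to end up on the same or adjacent map nodes.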
As each evaluated instance has its own identity and shape in space, in our case, the areas that make up the profiles in Modeling Phase 1 are represented through a GIS. Returning the instances to the GIS once they have been classified is a frequent practice, for example, in medical research involving non-linear analysis of multiple variables in certain diseases [91], in the representation of SOM clustering results on the ecological risk of contamination [88], or in experimental applications to official socio-demographic data from the Lisbon Metropolitan Area [135].
Finally, once the graphs of the decision trees from Modeling Phase 2 were obtained, it was possible to move on to the Application Phase of the learned models, enabling the prediction, from certain residential variables, of whether there is a greater or lesser probability of social vulnerability.

Results
According to the methodology, two independent databases were obtained. A descriptive synthesis of these baseline variables can be seen in the first two columns of data (population) in Table 4 for the main demographic data and Table 5 for the dwelling dimension. Following the next step of the methodology, Modeling Phase 1 was carried out using 66 variables from the demographic, social, labor, and facilities and services dimensions, among others. The profiles that characterize the demographic reality and the main dimensions of social vulnerability were obtained.
Once the profiles were obtained, they were spatially represented (Figure 2). A simple observation of the figure shows that there are several areas on the map that are known and recognized as areas of some degree of vulnerability [58]. Among them, the areas of Northeast Granada, North of Huelva, and Interior of Almería can be highlighted.
Beyond this rough verification, Model 1 was validated by comparing it with the results of an investigation in which deprived urban areas of Andalusia are identified. The authors of that study state that any threat, in certain circumstances, could make such deprived areas vulnerable [58]. To identify these areas, the authors aggregated indicators in a simple way (sum) based on their typification, without any type of weighting or complex statistical analysis. It can, therefore, be assumed that the areas proposed by that work are exposed to a degree of vulnerability, although a certain weakness in the methodological approach could be criticized. In any case, its results are considered adequate to validate the results of Model 1.
In order to validate Model 1, the data presented in the tables in [58] are used, which are presented in aggregate form by municipality. This means that only data from municipalities with a single census section, and those that can be extrapolated to each census section without the possibility of error, can be used for the validation. Table 6 shows the confusion matrix of Model 1.
Table 6. Confusion matrix of Model 1. This does not include all the data incorporated in the construction of Model 1. It excludes the data of the document used [59] for which the true condition cannot be guaranteed for each specific census section, because the information in that document is aggregated at the municipal level and not at the census section level as in our models. Normally excluded are municipalities with more than one census section, i.e., the bigger ones [58]. Source: Compiled by the authors.
Because the data are imbalanced, the indicator considered most suitable for evaluating performance is (4) balanced accuracy (bACC). A bACC = 0.8610 is obtained, which is considered good. A (1) recall = 0.9375 is also obtained, a quite good result showing that nearly all actual positive cases are detected. On the other hand, the indicator (2) precision = 0.1271 denotes that the model predicts many more cases of social vulnerability than the reference considers as such. This weakness is considered tolerable, since the model detects practically all "real" cases, along with others in which situations tending towards social vulnerability are, with a certain probability, taking place.
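The relationship between these indicators can be made explicit with a short Python sketch that derives recall, precision, specificity, and balanced accuracy from the four cells of a confusion matrix. The counts below are hypothetical and are not the values of Table 6; they only mimic an imbalanced situation in which recall is high while precision is low.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard indicators from the cells of a binary confusion matrix."""
    recall = tp / (tp + fn)          # share of actual positives that are detected
    precision = tp / (tp + fp)       # share of predicted positives that are correct
    specificity = tn / (tn + fp)     # share of actual negatives that are detected
    balanced_accuracy = (recall + specificity) / 2
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts (NOT from Table 6): few actual positives, many false positives.
metrics = classification_metrics(tp=9, fn=1, fp=30, tn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Balanced accuracy averages the detection rates of both classes, so it is not inflated by the majority class in the way plain accuracy would be.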
If we analyze the five profiles obtained in Clustering Model 1, comparing both the statistical information that characterizes each of them (Table 3) and the spatialization of the profiles in the Andalusian region (Figure 2), we obtain the following results:
• Profile 1: Statistically, it is verified that the census sections contained in this profile present, compared with the other profiles, a greater presence of delinquency, a greater number of persons per building, a greater share of employment in services, and a lower number of dwellings per occupied household. The spatial representation in the GIS shows that the profile coincides with the main urban areas and their closest conurbations throughout the region. This profile shows the urban connotations of a well-consolidated city.
• Profile 2: A clear diversification of employment is observed, with little presence of the service sector, an eminently Spanish population with few immigrants, a high number of illiterates, and few households with only one adult and minors. This profile is spatially identified with a population located in rural environments. It differs from the other rural profile (Profile 4) in that its population is younger, with a larger active population engaged in activities typical of that reality, such as a greater dedication to construction or industry, and with households with a greater number of inhabitants.
• Profile 3: It stands out for a greater number of births and a greater number of immigrants of provincial origin and, to a lesser extent, of regional or national origin. They usually work in the province, with a high percentage of employed people (a low unemployment rate). The population is below average age, with few single-person households and a low level of rootedness. Spatially, they are located on the outskirts of the main cities.
• Profile 4: The statistical analysis reveals that this population profile presents a high average age, a large number of households with a single occupant, an abundance of empty dwellings, and issues such as a greater proportion of dwellings lacking running water than the rest. Statistical data reveal that they live in settlements with good ratios of cultural and welfare facilities per inhabitant, probably derived from the low number of inhabitants of such settlements and an acceptable distribution of such functions. Spatially, it is observed that they correspond to the most isolated rural sites, at a greater distance from the main cities. Comparing this profile with Profile 2, it is observed that it coincides with an older rural population, which often lives alone in urban environments with a small population, with little occupation of the dwellings and with high rates of illiteracy, unemployment, and inactivity. We can locate this profile, among other areas, prominently in Hoya de Baza (Granada), in Campos de Tabernas (Almería), in Altos de Sierra de Gádor (Almería), or in Sierra de Aracena (Huelva). As observed in the review of the state of the art, this profile is identified with most of the factors that trigger social vulnerability.
• Profile 5: It stands out for a high number of dwellings occupied by one person, on many occasions with a minor in their care, and a high presence of immigrants from the rest of Andalusia, the rest of Spain, and especially from abroad, with the consequent low rootedness of its population. They have a high employment rate, low unemployment, and low inactivity, working primarily in the service sector or in agriculture. They are spatially identified with well-known urban areas with a strong and unique presence of foreign residents. The profile appears in tourist enclaves, such as the coast of Málaga and Granada, and in a very intensive agricultural production zone, such as the greenhouse area of the coast of Almería (Campo de Dalías).
Then, in Modeling Phase 2, a tree was obtained that allows Profile 4 to be "predicted" from the variables introduced as predictors (dwelling variables). In other words, the aim is to identify membership, or the probability of membership, in the socially vulnerable profile from certain dwelling qualities that can be observed with relative ease within the corresponding census section (Figure 3).
The conditional inference tree in Figure 3 shows, at the bottom, the probability (ratio: 1 = 100%) of presenting the profile with social vulnerability (marked in black) as opposed to the probability of belonging to other profiles, which, as previously verified, show other well-differentiated characteristics. In the tree obtained, it was observed that with only two variables observable in situ, the average age of construction and the percentage of housing buildings, it is possible to predict whether or not a census section belongs to the socially vulnerable profile, whose direct identification, as we verified, requires the use of numerous variables and indicators that are often difficult and costly to access.
To evaluate the predictive capabilities of the model obtained by means of a conditional classification tree, the ROC (receiver operating characteristic) curve was calculated (Figure 4), obtaining AUC = 0.78 (area under the curve).
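For readers less familiar with the AUC, the following Python sketch computes it in its rank-based form: the probability that a randomly chosen positive instance receives a higher predicted probability than a randomly chosen negative one, with ties counted as one half. The scores and labels are hypothetical and are not the predictions of Model 2; the function name is likewise an illustrative choice.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie, e.g., both instances fall in the same tree leaf
    return wins / (len(positives) * len(negatives))


# Hypothetical leaf probabilities from a shallow tree and the true class labels.
scores = [0.6, 0.6, 0.3, 0.3, 0.1, 0.1, 0.1, 0.6]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.4f}")
```

A shallow tree produces only a handful of distinct probabilities (one per leaf), so many pairs are tied; the half-credit for ties keeps the estimate consistent with the area under the piecewise-linear ROC curve.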
Finally, Table 7 provides a preview of the predictive capabilities of Models 1 and 2 when predicting the presence of municipalities with more than 50% of the population in deprived areas. Table 6 shows the four municipalities with a false negative (Table 5) extracted from the municipalities with more than 50% of the population in deprived areas. Two municipalities with an imprecise Model 1 prediction are added to the previous ones. It can also be observed that most of the probability predictions of Model 2 are close to those of Model 1 and the reference data [58]. It should be noted that in Model 2, the highest predicted probabilities are 60%.

Discussion
The main contribution of this research is that it has been possible to predict, with a certain level of precision both in Model 1 (Balanced Accuracy = 0.8610) and for Model 2 (AUC = 0.78), the probability of social vulnerability based on such simple residential indicators as the age of the buildings (year of construction) and the percentage of residential housing. The indicators that can be used to predict social vulnerability are (1) P01 Average age of constructions (year of construction), and (2) T07 Percentage of housing buildings. There is no doubt that such immediate approaches to such complex problems can have weaknesses, but they also allow us to have an almost immediate first approximation that can be extremely useful when carrying out approaches with greater depth of analysis and knowledge for the development and implementation of social and urban policies.
From the analysis of the state of the art on the application of clustering and knowledge models based on the SOM methodology, corroborated by our own experience, it can be concluded that the SOM methodology is useful for carrying out exploratory analysis [98], for making descriptive classifications more powerful, robust, and complete [102], and for helping to understand patterns of spatial distribution [88], facilitating visual exploration and evaluation [88,100] and effectively analyzing complex geographic and demographic data sets. It also allows spatial considerations to be inferred from the taxonomic groups found [88] and classifications to be coded in a GIS to bring them closer to a wider audience not familiar with AI [24], overcoming the traditional challenges associated with studies of the complexity of environmental communities and showing the value of integrating SOM and GIS [91]. This study verified the ability to label geographic reality without the need to name such categories, avoiding the inherent problems of factor analysis [98] and making it possible to evaluate the effects of the concurrence of certain variables under study [88]. The approach constitutes a powerful alternative in a time characterized by information technologies and data proliferation [96], and it can be used as a decision support system to analyze and visualize sets of statistical indicators for various applications [83].
Moreover, the methodology based on decision trees built on SOM clustering proved useful for describing, in a very simple way, behavior patterns that can be very complex, and for effectively predicting variables that are costly or difficult to evaluate, such as demographic or social variables, from other variables that are less complex and less costly to evaluate, such as residential variables. Its usefulness was verified for generating and testing hypotheses on complex realities and behaviors without requiring the user to formulate them, making decision support systems accessible to a non-expert public and allowing the identification of variables that are significantly related, together with their weight or effect size on the studied reality.
However, it is necessary to bear in mind certain precautions and limitations in the use of these methodologies and in their concrete implementation. These include the fact that the data used may already be obsolete, and that not all the dimensions of social vulnerability [5] were represented, such as rural/urban differentiation, although, as we have seen, it was in some way implicit in the rest of the indicators. In addition, an analysis of the population of a census section is not an analysis of the individuals themselves; extreme caution should be exercised, and inference should be limited to the scale of observation, not directly reaching individuals [98], i.e., the conclusions obtained from the study of groups of individuals should not be extrapolated to individuals. Moreover, the complete integration between SOM and GIS is complex [136], being limited to a more or less manual connection. Except for a few attempts, no "friendly" direct connection between any of the main GIS and SOM software packages has been implemented to date, which requires a combination of expert knowledge and creativity [24]. Likewise, the methodologies based on knowledge-based systems are not developed for direct integration into urban and territorial development and planning processes [99,137], which suggests, together with the previous point, that there is an important technological gap that can become a space for technical and technological development and for research and/or business opportunities. Another limitation that should be highlighted is that the results of the "Prediction model" are specific to the territory under study, i.e., Andalusia. They will probably not fit the specific features of other regions. However, the methodology for obtaining such a model can be used and applied in other geographical contexts.

Conclusions
Through research applied to the case study of the region of Andalusia, we obtained a decision tree oriented to the prediction of a model of social vulnerability. This model was constructed using an unsupervised clustering methodology based on Self-Organizing Maps. Both techniques proved to be simple to use, as well as useful and able to predict, with relatively low error (Model 1: Balanced Accuracy = 0.8610; Model 2: AUC = 0.78), complex and relevant demographic phenomena, such as social vulnerability. For such a prediction, once the models were trained, only residential information was used.
In the methodological process, a series of socio-demographic profiles were obtained in Andalusia. Among them, an eminently urban profile was distinguished (Profile 1), together with two suburban profiles: Profile 3, in which there abounds a young and active population with families, made up of short-distance (provincial) immigrants with housing and work in the province, and Profile 5, characterized fundamentally by the abundance of long-distance immigrants (regional, national, or foreign), who are very active in jobs linked to agriculture or services and who predominantly live in rented housing. Finally, two eminently rural profiles stood out, one in which a certain vitality, youth, and economic activity were observed (Profile 2) and another in clear depression, with an ageing population and in recession (Profile 4), in which a whole series of indications were evidenced that, according to the state of the art, predict a high social vulnerability.
Together with this statistical approximation, by representing the spatial profiles in the region, the areas that could be affected by social vulnerability were detected, evidencing what could be called "another Andalusia", an eminently rural Andalusia, with signs of isolation from the opportunities for employability, etc., offered by cities. Urban areas framed in the social vulnerability profile are certainly scarce. This could be a weakness of the model, and it would be advisable to adjust it to modify the vulnerability threshold and thus encompass areas that the state-of-the-art identifies as such.
Nevertheless, the decision tree obtained was interesting and relevant in that it allowed, in a simple way and with a certain level of precision, prediction of the probability that the inhabitants of an area are socially vulnerable, using a small number of variables that could be observed practically in situ without costly analysis or surveys. Specifically, in the region evaluated, it was observed that with only the age of the buildings and the amount of single-family housing in the place under study, it was possible to predict belonging to an urban profile related to situations of social vulnerability, with a predictive capability that can be evaluated with the indicator AUC = 0.78.
Therefore, it can be concluded that there is a connection between demographic and social vulnerability phenomena and the residential configuration of Andalusia. Caution is needed, however, to avoid establishing a priori a cause-effect relationship between such phenomena, which would require other, differentiated tests that are beyond the objective of this research. In summary, the main contribution of this work to the field of social vulnerability is the prediction, with a certain level of precision, of this complex phenomenon from easily obtained dwelling information, almost by means of a simple visual inspection.