Variable Selection for Meaningful Clustering of Multitopic Territorial Data
Abstract
1. Introduction
- We introduce the semantics of the variables through a new formalism because the meaning of the variables needs to be carried through the analysis process into the interpretation of the results. Our solution is the thermometer.
- We design a new methodology that automatically builds a conceptual interpretation of profiles based on the traffic light panel (TLP, presented in Section 2.4), which helps the final user understand the results, because there is a need for automatically building TLPs that include the semantics of the variables. Our solution is the creation of a TLP based on the thermometer.
- We create new data-driven variables to sum up variables from the same topic because there is a need to balance the number of variables per thematic block for the final global analysis. Our solution is the data-driven 2nd generation indicator.
- We discover which variables should be included in the global clustering to guarantee clusters with territorial coherence because there is a need for objective criteria to choose the most appropriate variable from each thematic block. Our solution is the index of potential explainability.
- The territorial feature selection method (TFSM): This methodology is used to discover which clusters have territorial coherence. It is the main contribution of this paper and includes the identification of the most significant variable across a set of territories.
- The thermometer: This is a new tool that assigns basic traffic light colors (green, yellow, or red) to ranges of values of the numerical variables or to the modalities of qualitative variables so that colors are associated with the semantics of the variable. It is a knowledge acquisition tool that allows domain experts to transfer semantics to the machine. It is enhanced with a fourth color (violet), which is used to represent missing values.
- TLP based on the thermometer: This is a new method to automatically determine the color of each cell using the knowledge and semantics given by the thermometer.
- Data-driven 2nd generation indicators (DD2gI): This is a new methodology that enriches the data-driven 3rd generation variable creation presented in [1] with the introduction of the thermometer, combined with clustering and the traffic light panel.
- Index of potential explainability: This is a new index based on the Lebart test values for qualitative variables computed versus location. It is used as the metric to select candidate variables inside TFSM.
2. Materials and Methods
2.1. State of the Art
2.1.1. Multiview Clustering
2.1.2. Feature Selection
- The Laplacian score was used in [8], where the methodology is based on the observation that data points from the same class are close to each other. It was compared with one supervised and one unsupervised procedure. ReliefF, which was used in [9], is a similarity-based method for conventional data. Ref. [9] investigated and discussed how and why ReliefF methods work and all their characteristics (properties, parameters, dependencies, scalability, robustness, etc.).
- Mutual information feature selection was used in [10], which investigated the application of this kind of feature selection method to evaluate a set of variables and select an informative subset to be used by a neural network classifier. The minimum redundancy maximum relevance method was used in [11], which studied how to select good features according to the maximal statistical dependency criterion based on mutual information. Conditional mutual information maximization was used in [12], where features were selected if they maximized their mutual information with the class to predict, conditional on any feature already selected. This method ensures that selected variables are individually informative. A fast correlation-based filter was used in [13], where the authors proposed a filter method that identified relevant features as well as redundancy among relevant features without a pairwise correlation analysis. All these methods are information-theoretic methods for conventional data.
- Feature selection with an lp-norm regularizer was used in [14], where the authors proposed a new method for making estimations in linear models by minimizing the residual sum of squares. This method is a sparse-learning-based method for conventional data.
- In Ref. [15], which presents statistical techniques applied in geology, the T-score appears as a statistical-based feature selection method for conventional data.
- Group lasso was used in [16], where efficient methodologies were proposed for extensions of the lasso for variable selection and were shown to improve performance. The overlapping sparse group lasso method was used in [17], which proposed a new penalty that leads to sparse estimators when it is used as a form of regularization to minimize the empirical risk. Both are methods of creating group feature structures for structured features.
2.1.3. Selecting Discriminant Variables
2.2. Creation of New Variables
2.3. Clustering
2.4. TLP (Traffic Light Panel)
- Represent all variables versus the discovered classes in a class panel graph (CPG), a compact graphic tool with the conditional distributions of the variables vs. classes where the particularities of each class are easily shown.
- Calculate the mean, standard deviation, minimum, and maximum for numerical variables and the absolute and relative frequencies for qualitative variables.
- Using the results from Steps 1 and 2, identify variables or combinations of variables with specific ranges of values that distinguish the class from others.
- Assign qualitative levels (high values, medium values, and low values) to the variables identified in Step 3 by detecting the area where the mass of the distribution is placed.
- Perform a TLP for the variables using the qualitative values assigned in Step 4.
- Show the TLP to an expert and ask them to select a label for the class, which could be good or positive, neutral or medium, and bad or negative. The expert is conceptualizing the class up to this step on the basis of the traffic light panel. Colors are assigned in accordance with the interpretive codes of the expert. Green is assigned to positive or good values of the variable, yellow is assigned to medium or neutral values, and red is assigned to negative or bad values. The context and meaning of each color must be related to some latent concept of the domain that allows the association between the variable polarity and the idea of improvement or worsening. The authors of [33] propose two basic ways to assign colors to the scale of the variable.
- Direct color coding (red–yellow–green) for low–medium–high values.
- Inverse color coding (green–yellow–red) for low–medium–high values.
- Perform significance tests (ANOVA, Kruskal–Wallis, or χ2 independence tests) to assess the relevance of differences for the variables implied in the above steps.
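The descriptive summaries computed in Step 2 of the procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names summarize_numeric and summarize_qualitative are hypothetical, missing numerical values are assumed to be encoded as None, and the sample standard deviation is used because the text does not specify which estimator is intended.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def summarize_numeric(values, classes):
    """Per-class mean, standard deviation, minimum, and maximum."""
    groups = defaultdict(list)
    for value, cls in zip(values, classes):
        if value is not None:  # assumption: None encodes a missing value
            groups[cls].append(value)
    return {
        cls: {
            "mean": mean(vals),
            # assumption: sample standard deviation; 0.0 for a single value
            "sd": stdev(vals) if len(vals) > 1 else 0.0,
            "min": min(vals),
            "max": max(vals),
        }
        for cls, vals in groups.items()
    }


def summarize_qualitative(values, classes):
    """Per-class (absolute, relative) frequency of each modality."""
    counts = defaultdict(Counter)
    for value, cls in zip(values, classes):
        counts[cls][value] += 1
    return {
        cls: {m: (n, n / sum(cnt.values())) for m, n in cnt.items()}
        for cls, cnt in counts.items()
    }
```

Both functions take one column of the dataset and the class label of each individual, which mirrors the conditional distributions displayed in the CPG.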
2.5. Annotated TLP (aTLP)
2.6. Methodological Contributions
2.6.1. Thermometer
- (a) There is a latent reference concept that can guide the evaluation of the variable values as promoting or not promoting the individuals regarding these latent concepts (i.e., water quality, the goodness of a care system, the availability of services…). This latent reference concept is aligned with the goals of the analysis.
- (b) A set of traffic light colors (red (r), yellow (y), and green (g)) is associated with the semantics of the values of the variables according to the latent reference concept. For example, variables indicating dirty water will be associated with red for water quality problems, and clean water will be associated with green. Violet will be used for missing values.
- is a set of individuals described by K variables {X1, X2,…, Xj,…XK}.
- Dk = {m1, m2, m3,…, mk} is the set of modalities for a qualitative variable Xk.
- T = {t1, t2, t3, ..., tK} is the available thermometer panel, where tk, k ∈ {1:K} is the thermometer for the variable Xk ∈ K.
- When Xk is qualitative, tk = {(m1,q1),(m2,q2),…,(mk,qk)}, where:
- mk ∈ Dk is a modality of variable Xk;
- qk is the color assigned to mk.
- When Xk is quantitative: tk = {r1, r2, o}, where:
- r1 is a numerical value for Xk such that min(Xk) ≤ r1 ≤ max(Xk);
- r2 is a numerical value for Xk, such that r1 ≤ r2 ≤ max(Xk);
- o is the semantic polarity of the variable (o ∈ {direct, inverse}). It represents a direct association of the variable’s meanings with the traffic light colors or their inverse meanings (high values of numerical variables can link to red if they measure water pollutants or can link to green if they measure, for example, biodiversity in water quality problems).
- A. NUMERICAL VARIABLES
- The first color zone ranges from min(Xk) to r1;
- The second color zone ranges from r1 to r2;
- The third color zone ranges from r2 to max(Xk).
- Direct: low values are red, and high values are green.
- Inverse: low values are green, and high values are red.
- Variable name: Name of the quantitative variable represented in the thermometer.
- Minimum: Minimum value of Xk observed in the sample.
- Maximum: Maximum value of Xk observed in the sample.
- Scale: A graduated axis with possible values of Xk.
- r1: The upper bound of the first color zone.
- r2: The lower bound of the third color zone.
- B. QUALITATIVE VARIABLES
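As an illustration of the structure defined in this section, the following Python sketch encodes a thermometer and returns the color of a single value. The names NumericThermometer, numeric_color, and qualitative_color are hypothetical and not part of the original method. Missing values are assumed to be encoded as None, and the boundary convention (a value equal to r1 falls in the first zone, a value equal to r2 in the second) follows the rules given in Section 2.6.2. The qualitative example reuses the color assignment that the expert gives to the family modalities in Section 3.2.2.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NumericThermometer:
    r1: float      # upper bound of the first color zone
    r2: float      # lower bound of the third color zone
    polarity: str  # semantic polarity o: "direct" or "inverse"

    def __post_init__(self) -> None:
        if self.r1 > self.r2:
            raise ValueError("r1 must not exceed r2")
        if self.polarity not in ("direct", "inverse"):
            raise ValueError("polarity must be 'direct' or 'inverse'")


def numeric_color(x: Optional[float], t: NumericThermometer) -> str:
    """Color code ("r", "y", "g", or "v") of one numerical value."""
    if x is None:  # assumption: None encodes a missing value
        return "v"
    if x <= t.r1:  # first zone: red if direct, green if inverse
        return "r" if t.polarity == "direct" else "g"
    if x <= t.r2:  # second zone is always yellow
        return "y"
    return "g" if t.polarity == "direct" else "r"  # third zone


def qualitative_color(x: Optional[str], t: Dict[str, str]) -> str:
    """Color code of one modality; t maps each modality to its color."""
    if x is None:
        return "v"
    return t[x]


# Thermometer for the family-type variable, as described in Section 3.2.2.
family_thermometer = {
    "1.Alone": "r", "2.MM-MP": "g", "3.Nucleus": "g",
    "4.Regrouped": "y", "5.Extended": "y", "6.NoFamily": "r",
}
```

Keeping the polarity inside the thermometer, rather than in the coloring function, means that the same two cutoffs can serve a pollutant (inverse) and a biodiversity measure (direct) without changing any other code.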
2.6.2. Creation of TLP based on Thermometer
- 1. The discretization/re-coding phase: Create Zk, a new qualitative variable resulting from the re-coding or discretization of Xk according to its original type and the thermometer information:
- If Xk is quantitative, create Zk = dis(Xk, tk) by discretizing Xk according to the cutoff values indicated in the thermometer and the associated colors.
- If Xk is qualitative, create Zk = rec(Xk, tk) by re-coding Xk according to the colors given in the thermometer for each modality.
- If tk ∉ T, then assign yellow to all values of Xk; the user can manually edit this.
- If xi is a valid value of Xk,
- If xi ≤ r1:
- If o = direct, then zi = “r”;
- If o = inverse, then zi = “g”;
- If (xi > r1) ∧ (xi ≤ r2), then zi = “y”;
- If xi > r2:
- If o = direct, then zi = “g”;
- If o = inverse, then zi = “r”.
- If xi is missing, then zi = “v”.
- If (xi = mm), then zi = qm;
- If xi is missing, then zi = “v”.
- 2. The cross-matrix creation phase can be represented as follows:
- 3. The color assignment phase. Let Fc ∈ Mk be a row of the above matrix. The cell color is denoted as SC and is expressed as follows:
- Binary qualitative variables:
- If argmax(Fc) = 4, then SC = v.
- If argmax(Fc) = 3, then SC = y.
- If argmax(Fc) = 1:
- 1. If nc1/nc ≥ γ, then SC = r;
- 2. If nc1/nc < γ, then SC = y.
- If argmax(Fc) = 2:
- 1. If nc2/nc ≥ γ, then SC = g;
- 2. If nc2/nc < γ, then SC = y.
where γ ∈ [0,1] is the threshold proportion that a modality must reach for the cell to take a non-yellow color. This is required because the binary variables have only two modalities and represent a basic dichotomy, and in a given class, the number of red or green elements has to be transformed into a single-colored cell. The default value for γ should be 0.5 so that a class in which at least 50% of the elements are assigned the green color yields a green cell in the TLP. The parameter γ allows more flexibility to deal with false positives and false negatives and provides the possibility of continuing to assign the yellow color to a cell up to a higher proportion of green (e.g., for γ = 0.7, the algorithm will not assign the green color if the class is composed of less than 70% green elements).
- Other variables:
- If card[argmax(Fc)] = 1, then SC = q_argmax(Fc).
- If card[argmax(Fc)] > 1 ∧ (argmax(Fc) = 3), then SC = y.
- If nc*/nc > γ, then SC = v.
where q_argmax(Fc) is the color assigned according to the position of q in Dz = {r, g, y, v}.
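The color assignment rules above can be sketched as a small Python function. This is a sketch under stated assumptions rather than the authors' implementation: the name cell_color is hypothetical, the counts of a row Fc are assumed to be ordered as in Dz = {r, g, y, v}, and every tie for the maximum is resolved to yellow, although the text only specifies the tie that involves yellow. The last rule for non-binary variables (based on nc*/nc) is left out because its terms are not fully defined in the text.

```python
from typing import Sequence

COLORS = ("r", "g", "y", "v")  # positions 1..4 of Dz = {r, g, y, v}


def cell_color(counts: Sequence[int], binary: bool, gamma: float = 0.5) -> str:
    """Color of one TLP cell from the row Fc = (n_r, n_g, n_y, n_v)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("the class has no individuals")
    top = max(counts)
    winners = [i for i, n in enumerate(counts) if n == top]
    if len(winners) > 1:
        # assumption: any tie yields yellow (the text covers ties with yellow)
        return "y"
    k = winners[0]
    color = COLORS[k]
    if binary and color in ("r", "g"):
        # binary variables need the dominant share to reach gamma
        return color if counts[k] / total >= gamma else "y"
    return color
```

For instance, cell_color((6, 4, 0, 0), binary=True, gamma=0.7) keeps the cell yellow because the red share (60%) does not reach the threshold, whereas the default γ = 0.5 would color it red.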
2.6.3. Creation of Data-Driven Second-Generation Indicators (DD2gI)
- 1. Select the component variables to be synthesized in a single data-driven variable: Experts should identify the subsets of variables in this situation and determine the variables that are to be components of the new variable.
- 2. Cluster the selected component variables: Cluster the selected variables with Ward’s aggregation criterion [26], the chained reciprocal neighbor algorithm presented in [36], and Gibert mixed metrics. Let P be the new class variable obtained. The resulting classes are identified by the software with a number. In this form, the class variable cannot be interpreted by itself; a postprocessing technique is used to induce labels for each cluster and create the new qualitative variable with meaningful modalities.
- 3. Create a CPG: Build a CPG from all components chosen in Step 1 versus P, the class variable identified in Step 2.
- 4. Create a thermometer for the component variables selected in Step 1: This should be accomplished together with the expert in the field, according to Section 2.6.1.
- 5. Create a TLP based on the thermometer: This should be accomplished using the methodology described in Section 2.6.2.
- 6. Create the interpreted new indicator: This should be accomplished by labelling the modalities of P according to the information provided by the TLP/aTLP and the joint representation of TLP/aTLP over the CPG. These tools show the differences and commonalities of the different variables in each class, so the domain experts can induce proper labels for all the classes in P according to their main characteristics and can summarize the main concept behind each class. There is a bijective relationship between P and the new indicator. Thanks to the TLP and the thermometer, the class variable becomes an interpretable new qualitative variable with modalities having semantic meaning.
- 7. Name the new indicator: Associate a label to the variable (often the name of the concept) and provide a description of the variable itself and each of its modalities to fix interpretation.
- 8. Add the new indicator to the general database: This step enlarges the set of variables with this new variable. The new indicator becomes a new column of the dataset, indicating a labeled cluster for each individual with the structure of an ordinary qualitative variable.
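Steps 6 to 8 above amount to replacing the numeric class identifiers of P with expert labels and storing the result as a new column. A minimal Python sketch follows; the function name label_classes and the example identifiers and labels are hypothetical. It checks that the mapping is bijective, as the procedure requires, before recoding.

```python
from typing import Dict, Hashable, List, Sequence


def label_classes(class_ids: Sequence[Hashable],
                  labels: Dict[Hashable, str]) -> List[str]:
    """Recode the class variable P into a labeled qualitative variable."""
    if set(class_ids) != set(labels):
        raise ValueError("every class needs exactly one label")
    if len(set(labels.values())) != len(labels):
        raise ValueError("labels must be distinct (bijective mapping)")
    return [labels[c] for c in class_ids]


# Hypothetical example: three individuals assigned to two numbered classes.
new_column = label_classes([1, 2, 1], {1: "Nuclear", 2: "Alone"})
```

The returned list has one labeled modality per individual, so it can be appended to the dataset as an ordinary qualitative variable.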
2.6.4. Territorial Feature Selection Method (TFSM)
- 1. Select the location variable: In this case, the territorial variable should be selected from the original database. This is an informative variable in the original dataset that does not belong to any block. The territorial variable XLoc is a qualitative variable whose modalities are the L locations, DLoc = {l1, l2,…, l,…lL}, Loc ϵ 1:K.
- 2. Select the block: In Step 1 of the creation of new data-driven third-generation indicators, some blocks were created. Now, one block should be selected.
- 3. Select the candidate variables: At this moment, the block might contain original variables, second-generation variables, and the data-driven third-generation indicator. Only the third-generation variable and its component variables are selected as candidate variables. The candidate variables are χ = {Xk such that k ϵ 1:K} with modalities Dk = {m1,m2,…,m,…mnk}, m ϵ Dk.
- 4. Compute a ranking of variables according to their capacity to explain the territorial distribution: Evaluate the candidate variables to determine their selection. Repeat the following steps for each candidate variable pre-selected in Step 3.
- a. Calculate Lebart test values: Using the methodology from Section 2.1.2, all Lebart test values are calculated for each qualitative candidate variable against XLoc. For each qualitative Xk, the output of this process is a table with l ∈ DLoc in rows and nk columns, named Mm (m = 1:nk), where vlm is the p-value of the Lebart test value of modality m of Xk in location l.
- b. Create a new variable with l ∈ DLoc in rows that records whether location l has significant modalities of Xk.
- c. Calculate the ratio of significant locations for Xk using the information provided by the variable created in Step b. This indicator assesses the percentage of locations that can be characterized by some modalities of Xk. We are interested in the variable that provides significant modalities to as many locations as possible, meaning that the same variable can explain a bigger part of the territory.
- d. Given Xk and its corresponding table of p-values, create a table with l ∈ DLoc in rows and nk columns, named Mm (m = 1:nk), where cell lm contains the empirical probability of modality m in location l for significant cells. This helps to understand whether the significant modalities of a certain location cover a big portion of the location population or, on the contrary, target a limited group of individuals. In fact, having a significant modality that covers a minority is not sufficient to explain a cluster (or a location).
- e. Create a new variable with l ∈ DLoc in rows that indicates the proportion of individuals involved in locations with significant modalities of Xk according to the Lebart test value.
- f. Calculate the average of the variable created in Step e over all locations so that we obtain an estimate of the average contribution of the significant modalities of Xk to the location population. This indicator is an estimate of the average coverage of significant modalities of Xk along the territory. For low values of this average, the significant modalities of Xk correspond to a minority of the different locations, and the variable is not as informative as required.
- g. Calculate the index of potential explainability of Xk in the territory. The index weights the average proportion of individuals involved in significant territories for a given variable according to the proportion of significant locations in the entire territory. We are interested in variables with the capacity to characterize as many locations as possible, with as big an area of coverage as possible. The relevance of the variable decreases if it is significant in many locations with sparse coverage or in few locations even though it involves a large number of individuals. This correction is important in order to balance the impact on the analysis of big cities, which concentrate a large part of the population in a relatively small area, as is the case with Barcelona in the Catalan territory. The index is defined to gain robustness with regard to an unbalanced distribution of the population in the territory or modalities pointing out exceptional groups of individuals with low presence. This prevents big cities and exceptional minorities from dominating the entire analysis. By computing the value of the index for all candidate variables, a ranking can be built regarding the potential of each Xk to explain the territorial distribution.
- 5. Select the variables: For each block, the variable with the maximum index of potential explainability is selected to represent the block in further analysis.
- 6. Cluster the selected variables: The set of variables selected in Step 5 has one representative variable for each block in the dataset. This set of variables is the input to a conditional clustering method using a hierarchical algorithm based on Ward’s aggregation criterion presented in [23], Gibert mixed metrics, and XLoc as a conditional variable. A new class variable results from this process. The resulting classes are identified by the software with a number. In this form, the class variable cannot be interpreted by itself; a postprocessing technique is used to induce labels for each cluster and make the new qualitative variable with meaningful modalities.
- 7. Interpret and label the obtained classes: Repeat Steps 3 to 8 of the data-driven second-generation variable creation process.
- 8. Profile the classes: Analyze the significance of input variables in the classes and the conditional distributions to identify the relevant characteristics of each class so that a short description of the characteristics of each class can be provided.
- 9. Create maps to visualize classes: This step is carried out to visualize the final classification.
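The computations of Step 4 above can be sketched in Python for one candidate variable. This is a hedged illustration, not the authors' formula: the function name potential_explainability is hypothetical, a cell is assumed to be significant when its p-value is below a level alpha = 0.05, a location is assumed to be significant when it has at least one significant modality, and the index is assumed to be the product of the ratio of significant locations and the average coverage, which is one reading of "weights ... according to" in Step 4g.

```python
from typing import Dict, List, Tuple


def potential_explainability(
    p_values: Dict[str, List[float]],
    probs: Dict[str, List[float]],
    alpha: float = 0.05,  # assumption: significance level for a cell
) -> Tuple[float, float, float]:
    """Return (ratio of significant locations, average coverage, index).

    p_values[l][m]: p-value of the Lebart test value of modality m in l.
    probs[l][m]:    empirical probability of modality m within location l.
    """
    locations = list(p_values)
    if not locations:
        raise ValueError("at least one location is required")
    coverage = {}
    n_significant = 0
    for loc in locations:
        # Steps d and e: share of the location covered by significant cells
        coverage[loc] = sum(
            pr for p, pr in zip(p_values[loc], probs[loc]) if p < alpha
        )
        # Step b: the location counts if any of its modalities is significant
        if any(p < alpha for p in p_values[loc]):
            n_significant += 1
    ratio = n_significant / len(locations)                  # Step c
    avg_coverage = sum(coverage.values()) / len(locations)  # Step f
    # Step g, assumption: the index is the product of both indicators
    return ratio, avg_coverage, ratio * avg_coverage
```

Under these assumptions, Step 5 reduces to calling the function once per candidate variable of a block and keeping the variable with the largest third component.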
2.6.5. Validation Methodology
- To validate the introduction of the thermometer into the generation of automatic TLPs: The thermometer method is used to improve the TLP. Thus, the validation methodology proposed is based on the comparison of two TLPs. One is built traditionally, where the color of a cell is decided by a human based on CPG analysis; the second, the thermometer-based TLP, is where the color of the cells is decided by the method proposed in Section 2.6.2. Both TLPs and their interpretation are shown to a group of experts in the field, who discuss which of the two is more believable or which of them provides a better understanding of the target domain.
- To validate the DD2gI (data-driven second-generation indicator) methodology: This is a semantically enriched methodology based on the data-driven methods presented in [1]. The validation methodology proposed is the same defined in the previous work.
- To validate the territorial feature selection method (TFSM): Here, two aspects are validated:
- The pertinence of the index of potential explainability that we propose: The results of using the proposed TFSM methodology are compared with the state of the art using a χ2 test of the territorial variable (XLoc) versus each single variable Xk as a tool to select the most discriminant variable for further clustering steps. However, all Xk provide a significant χ2 independence test p-value and are not helpful in reducing the number of variables to be used to represent the blocks in the clustering process, whereas the proposed index allows a ranking of variables from most to least discriminant and allows the variables to undergo a selection process.
- The final classification obtained: Two classifications are compared. One results from the application of TFSM. The other results from clustering the set of third-generation variables created for each block. There are two ways of comparing them and validating whether the classification obtained through TFSM is better; one is based on graphical tools, and the other is based on numerical tools.
- Numerical validation: We start by calculating the Lebart test values. Using the methodology from Section 2.1.2, all Lebart test values are obtained for each class of both partitions versus XLoc, which yields tables analogous to those built in Step 4 of the TFSM. Here, since the clustering has been conditioned to the locations, all individuals of a location are clustered into a single class. Therefore, by construction, the proposed index of potential explainability cannot be used in conditional clustering methods. For this reason, we just compute the global proportion of significant cells in each table to see which of the two partitions can explain a bigger part of the territory. Given a partition P (either of the two being compared), the index is the proportion of significant cells in its table. The greater the value of the index, the better P distributes along the locations. We propose comparing the index values of the two partitions to see which of them distributes over the territory in a more significant way.
- Graphical validation: One map per classification should be drawn. Each location should be colored according to the class to which it belongs. The territorial cohesion of the classes will be considered for the evaluation.
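The numerical validation index described above, the proportion of significant cells in a table of Lebart test p-values, can be sketched as follows. The function name significant_cell_proportion and the significance level alpha = 0.05 are assumptions made for illustration; the text does not state the level used.

```python
from typing import Sequence


def significant_cell_proportion(
    p_table: Sequence[Sequence[float]],
    alpha: float = 0.05,  # assumption: significance level for a cell
) -> float:
    """Share of cells whose p-value is below alpha (locations x classes)."""
    cells = [p for row in p_table for p in row]
    if not cells:
        raise ValueError("the table has no cells")
    return sum(p < alpha for p in cells) / len(cells)
```

The two partitions are then compared by computing this value on each of their tables; under the stated assumptions, the partition with the larger value distributes over the territory in a more significant way.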
3. Case Study and Results
3.1. INSESS-COVID19
3.1.1. The Project
3.1.2. INSESS-COVID19 Database
3.2. Creating New Variables
3.2.1. Creation of Location Variable
3.2.2. Creation of Data-Driven Second-Generation Indicators (DD2gI)
- 1. Select the component variables to be synthesized
- 2. Cluster the selected component variables
- 3. Create the CPG
- 4. Create the Thermometer
- Green was assigned to the modalities “2.MM-MP” and “3.Nucleus” because the expert considered that someone living with their nuclear family and people who love them is a positive situation. In this instance, an individual has people who can help them in case of necessity, and difficulties are easy to overcome.
- Yellow was assigned to the modalities “4.Regrouped” and “5.Extended” because the expert considered that living with people that belong to one’s family group is positive. On the other hand, these kinds of families are usually composed of many members, and conflicts between members may emerge. Therefore, as this possible aspect is negative, the modality was marked yellow.
- Red was assigned to the modalities “1.Alone” and “6.NoFamily”. Living with people who are not one’s own family where there is, therefore, no emotional attachment is akin to living alone. It is more difficult to find help. Moreover, if one lives with people that are not one’s family, it is easier for conflicts to appear.
- 5. Create the TLP based on the Thermometer
- Nuclear: People in this group answered “3.Nucleus” for the 3 periods of time.
- MM-MP: People in this group answered “2.MM-MP” for the 3 periods of time.
- Regrouped: People in this group answered “4.Regrouped” for the 3 periods of time.
- Extended: People in this group answered “5.Extended” for the 3 periods of time.
- Coexistence: People did not know what to do in January 2021 (note that the questionnaire was answered in 2020, during the first COVID-19 wave). During 2020, they had different situations.
- Alone: People in this group answered “1.Alone” for the 3 periods of time.
- No family: People in this group answered “6.Non-Family” for the 3 periods of time.
- NA: People in this group answered “7.No Answer” for the 3 periods of time.
- 6. Create the interpreted new indicator and name it:
- 7. Add to the general database:
3.2.3. Feature Selection and Final Clustering
- 1. Select the location variable:
- 2. Select the block
- 3. Select the candidate variables
- 4. Compute a ranking of variables according to their capacity to explain the territorial distribution:
- 5. Variable selection:
- 6. Clustering selected variables:
- 7. Creating CPG
- 8. Creating Thermometer
- 9. Create aTLP based on a Thermometer
- 10. Profile the classes
- Vilassar de Mar: This group does not present with relationship problems within their units of coexistence, and members of this group comprise families living in houses that they own or rent. Members of this group did not need psychological support due to COVID-19. They are mostly of Catalan origin. The people in charge of families in this group do not waste any time caring for their unit. They did not need any aid due to COVID-19. Some have pending lawsuits related to the economic field. Some did not need social assistance during COVID-19. This group is comprised of working people as well as people with unconventional work situations. Their socialization with the environment has diminished or is nonexistent.
- C35: This group does not present with relationship problems regarding their unit of coexistence, and members of this group comprise familial cohabitation units (extensive, traditional, or single-parent) in houses they own or rent. They did not need psychological support due to COVID-19. They are mostly of Catalan origin. The people in charge of families in this group do not waste any time caring for their units. They did not need any aid due to COVID-19. They have no pending economic decisions, and some people need help with food. This group is comprised of people with other unconventional work situations and people who do not work. They do not participate in their environment.
- Pallars: This group does not present with relationship problems within their units of coexistence, and members of this group comprise familial coexistence units (extensive, traditional, or single-parent) in houses that they own or rent. They did not need psychological support due to COVID-19. They are mostly of Catalan origin. The people in charge of families in this group do not waste any time caring for their family units. They did not need any aid due to COVID-19. They have no pending economic situations, and some of them need help with food. This group is comprised of people who work and who have other unconventional work situations. They do not participate in their environment.
- Garrigues: This group does not present with relationship problems within their units of coexistence, and members of this group comprise familial coexistence units (extensive, traditional, or single-parent) in houses that they own or rent. They did not need psychological support due to COVID-19. They are mostly of Catalan origin. The people in charge of families in this group do not waste any time caring for their family units. Their situation of dependency has not diminished due to COVID-19. Some members of this group need help from social services due to COVID-19. They have pending economic decisions; furthermore, some people in this group have not needed food-related help, while others have needed food-related help. This group is comprised of people who work and people who have other unconventional work situations. Their participation in their social environment is varied. Some participate a lot, some demonstrate decreased participation, and some do not participate at all.
- C30: This group does not present with relationship problems within their units of coexistence, and members of this group comprise familial coexistence units (extensive, traditional, or single-parent) in houses that they own or rent. They did not need psychological support due to COVID-19. They are mostly of Catalan origin. The members of this group have no people in charge of their familial units. Some of them need help from social services due to COVID-19. They have pending financial decisions, and some people have needed food-related assistance and other specific types of aid that arose during the pandemic. This group is comprised of people who work and people who have other unconventional work situations. They do not participate in their environment and have not cared for people with COVID-19.
- C25: This group does not present with relationship problems within their units of coexistence, and members of this group comprise familial coexistence units (extensive, traditional, or single-parent) in houses that they own or rent. They did not need psychological support due to COVID-19. They are mostly of Catalan origin. Most of them have dependent children who take up part of their time. They did not need any aid due to COVID-19. They have no pending economic decisions, and some people need help with food. This group is comprised of people who work and people who have other unconventional work situations. They do not participate in their environment and have not cared for people with COVID-19.
- C31: This group does not present with relationship problems within their units of coexistence, and members of this group comprise familial coexistence units (extensive, traditional, or single-parent) in houses that they own or rent. They did not need psychological support due to COVID-19. They are of foreign origin. The people in charge of families in this group do not waste any time caring for their family units. They did not need any aid due to COVID-19. They have no pending economic decisions, and some people need help with food. This group is comprised of people who work and people who have other unconventional work situations. They do not participate in their environment and have not cared for people with COVID-19.
- C40: This group does not present with relationship problems within their units of coexistence, and members of this group comprise familial coexistence units (extensive, traditional, or single-parent) in houses that they own or rent. Some people have needed psychological support due to experiences related to COVID-19 within their personal network. They are mostly of Catalan origin. The people in charge of families in this group do not waste any time caring for their family units. They have no pending economic decisions, and some people need help with food. Some are people with other unconventional work situations, and others are working people. They do not participate in their environment and have not cared for people with COVID-19.
- Sant Cugat del Vallès: This group does not present with relationship problems within their units of coexistence. In fact, they have good coexistence units, as they live mainly with their nuclear families or are single-parent families. Some people have needed psychological support due to experiences related to COVID-19 within their personal network. They are mostly of Catalan origin. They take care of children, who consume part of their time. They did not need to attend reception services during the COVID-19 pandemic. They have pending economic decisions, and most people need help with food. Some people worked before the pandemic but lost their jobs, while others are self-employed. Some people do not participate in their environment at all, and in other cases, their participation in their environment has decreased.
- Sant Joan Despí: This group does not present with relationship problems within their units of coexistence, and members of this group comprise familial coexistence units (extensive, traditional, or single-parent) in houses that they own or rent. They have not needed psychological support due to experiences related to COVID-19 within their personal network. They are of Spanish origin and are suspected of being victims of violence. They are in charge of children, as well as grade III dependents, who take up their time. They have pending lawsuits related to economic matters, and all people need help related to food. Some people worked before the pandemic but lost their jobs. Not everyone is involved in their environment.
- o176: This group is composed of people who suffered verbal violence from their units of cohabitation but now live alone in a flat. These individuals do not need emotional support but may suffer physical violence at work and psychological violence at rest. These individuals are dependent people who live in Catalonia and have accepted the measures proposed by the Spanish government to solve the COVID-19 pandemic. They have pending lawsuits on economic issues and have received multiple grants. They worked before the pandemic and are very involved people.
- 11. Creating maps to visualize classes
- C25: Alt Empordà, Amposta, Baix Penedès, Calafell, Mollet del Vallès, Sant Vicenç dels Horts, Solsonès, Tarragona, and Vilafranca del Penedès.
- C30: Lleida, Montcada i Reixac, Sant Andreu de la Barca, and Sant Pere de Ribes.
- C31: Barcelona, Manresa, el Masnou, Rubí, and Tarragonès.
- C35: Alt Penedès, Bages, Baix Llobregat, CAS Garrotxa, Girona, Maresme, Osona, Pallars Jussà, Reus, Ribera d’Ebre, and Vilanova i la Geltrú.
- C40: Baix Empordà, Barberà del Vallès, Figueres, Gironès-Salt, Noguera, Pla de l’Estany, Sant Feliu de Guíxols, Selva, and Vallès Oriental.
- Garrigues: Garrigues.
- Pallars-Sobira25: Pallars Sobirà.
- Sant-Cugat-del-Valles7: Sant Cugat del Vallès.
- Sant-Joan-Despi33: Sant Joan Despí.
- Vilassar-de-Mar20: Vilassar de Mar.
3.2.4. Validation
Validating the DD2gI
Validating the TFSM
- a. Numerical validation
- b. Graphical validation
4. Discussion, Conclusions, and Future Work
- Territorial feature selection method: This methodology builds groups of territorial locations that are territorially coherent. The groups are obtained by clustering, can be interpreted through the TLP based on the thermometer method, and provide groups of BASS that can be managed in a common way from the application domain point of view. This method is the main contribution of this paper and includes the identification of the variable with the best performance in the global clustering (see details in Section 2.6.4).
- Thermometer: This is a new tool that assigns basic traffic light colors (green, yellow, and red) to ranges of values for numerical variables or to the modalities of qualitative variables so that colors are associated with the semantics of the variable. It is a knowledge acquisition tool that allows domain experts to transfer semantics to the machine. It is formalized in this paper and extended with a fourth color (violet) used to represent missing values (see Section 2.6.1) and the DD2gI.
- TLP based on the thermometer method: This is a new method to automatically determine the color of each TLP cell using the knowledge and semantics formalized in a thermometer (see Section 2.6.2). It has been validated by experts based on the results of a real application on the INSESS-COVID19 database and by a comparison with the traditional methods of building TLPs. This significantly increases the potential of the TLP tool, which was traditionally built by visual inspection of the conditional distributions of the variables with regard to a class variable.
- Data-driven second-generation indicators (DD2gI): This is a new methodology that enriches the data-driven third-generation variable creation method presented in [1] with the introduction of the thermometer method combined with clustering and traffic light panels. It is described in Section 2.6.3.
- Index of potential explainability: This is a new index based on the Lebart test values for qualitative variables computed versus location. It is used as the metric for selecting candidate variables inside TFSM. It is described in Step 4 of Section 2.6.4.
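As an illustration of the thermometer and of the automatic coloring of a TLP cell summarized above, the following is a minimal sketch, not the implementation used in the paper. The thresholds, the names `THERMOMETER`, `thermometer_color`, and `tlp_cell`, and the use of the class median as the summary statistic are all assumptions made for the example; only the four colors and the use of violet for missing values come from the text.

```python
from statistics import median
from typing import Optional, Sequence

# Hypothetical thermometer for one numerical variable: (upper bound, color)
# pairs in ascending order. The thresholds are illustrative only.
THERMOMETER = [(0.33, "green"), (0.66, "yellow"), (float("inf"), "red")]


def thermometer_color(value: Optional[float], thermometer=THERMOMETER) -> str:
    """Return the traffic-light color of a value; violet marks a missing value."""
    if value is None:
        return "violet"
    for upper, color in thermometer:
        if value <= upper:
            return color
    return "violet"  # not reached while the last upper bound is infinite


def tlp_cell(values: Sequence[Optional[float]]) -> str:
    """Color one TLP cell (one variable within one class).

    Assumption: the class is summarized by the median of its observed values.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        return "violet"
    return thermometer_color(median(observed))
```

A qualitative variable could be handled in the same spirit by replacing the list of ranges with a dictionary from modalities to colors supplied by the domain expert.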
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Angerri, X.; Gibert, K. Preprocessing and Artificial Intelligence for increasing explainability in Mental Health. Int. J. Artif. Intell. Tools 2023, 32, 2.
- Gibert, K.; Angerri, X. The INSESS-COVID19 Project. Evaluating the impact of the COVID19 in social vulnerability while preserving privacy of participants from minority subpopulations. Appl. Sci. 2021, 11, 3110.
- Gibert, K.; Codina, T.; Angerri Torredeflot, X. Informe INSESS-COVID19: Identificació de Necessitats Socials Emergents Com a Conseqüència de la COVID19 i Efecte Sobre els Serveis Socials del Territori; Intelligence Data Science and Artificial Intelligence Research Center (IDEAI): Barcelona, Spain, 2020.
- Sevilla-Villanueva, B.; Gibert, K.; Sànchez-Marrè, M. Identifying nutritional patterns through integrative multiview clustering. Artif. Intell. Res. Dev. 2015, 277, 185.
- Sevilla-Villanueva, B.; Gibert, K.; Sànchez-Marrè, M. A methodology to discover and understand complex patterns: Interpreted Integrative Multiview Clustering (I2MC). Pattern Recognit. Lett. 2017, 93, 85–94.
- Bickel, S.; Scheffer, T. Multi-view clustering. In Proceedings of the 4th IEEE International Conference on Data Mining, Brighton, UK, 1–4 November 2004; pp. 19–26.
- Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR) 2017, 50, 1–45.
- He, X.; Cai, D.; Niyogi, P. Laplacian score for feature selection. Adv. Neural Inf. Process. Syst. 2005, 18.
- Robnik-Šikonja, M.; Kononenko, I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69.
- Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
- Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 2004, 5, 1531–1555.
- Yu, L.; Liu, H. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), Washington, DC, USA, 21–24 August 2003; pp. 856–863.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 1996, 58, 267–288.
- Davis, J.C.; Sampson, R.J. Statistics and Data Analysis in Geology; Wiley: New York, NY, USA, 1986.
- Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 2006, 68, 49–67.
- Jacob, L.; Obozinski, G.; Vert, J.-P. Group lasso with overlap and graph lasso. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 433–440.
- Núñez, H.; Sànchez-Marrè, M. Instance-based learning techniques of unsupervised feature weighting do not perform so badly! ECAI 2004, 16, 102.
- Lebart, L.; Morineau, A.; Fénelon, J.P. Traitement Statistique Des Données; Dunod: Paris, France, 1990; p. 34.
- Gibert, K.; Sevilla-Villanueva, B.; Sànchez-Marrè, M. The role of significance tests in consistent interpretation of nested partitions. J. Comput. Appl. Math. 2016, 292, 623–633.
- Gibert, K.; Sànchez-Marrè, M.; Izquierdo, J. A survey on pre-processing techniques: Relevant issues in the context of environmental data mining. AI Commun. 2016, 29, 627–663.
- Torres, P.; Cruz, C.H.; Patiño, P.J. Índices de calidad de agua en fuentes superficiales utilizadas en la producción de agua para consumo humano: Una revisión crítica. Rev. Ing. Univ. Medellín 2009, 8, 79–94.
- Vergara, C.; Arregui, I.; Balaguer, A.; Gómez, T.; Sandoval, C.; Sànchez-Marrè, M.; Gibert, K. Learning on the relationships between respiratory disease and the use of traditional stoves in Bangladesh households. In Proceedings of the 8th International Congress on Environmental Modelling and Software Met, Toulouse, France, 10–14 July 2016.
- Lu, Z.-H.; Cai, C.-H.; Zhao, Y.-G.; Leng, Y.; Dong, Y. Normalization of correlated random variables in structural reliability analysis using fourth-moment transformation. Struct. Saf. 2020, 82, 101888.
- Gibert, K. The use of symbolic information in automation of statistical treatment for ill-structured domains. AI Commun. 1996, 9, 36–37.
- Ward, J.H., Jr. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58, 236–244.
- Glielmo, A.; Husic, B.E.; Rodriguez, A.; Clementi, C.; Noé, F.; Laio, A. Unsupervised learning methods for molecular simulation data. Chem. Rev. 2021, 121, 9722–9758.
- Li, M.; Ferretti, M.; Ying, B.; Descamps, H.; Lee, E.; Dittmar, M.; Lee, J.S.; Whig, K.; Kamalia, B.; Dohnalová, L.; et al. Pharmacological activation of STING blocks SARS-CoV-2 infection. Sci. Immunol. 2021, 6, eabi9007.
- Murtagh, F.; Contreras, P. Algorithms for hierarchical clustering: An overview. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2012, 2, 86–97.
- Gibert, K.; Cortés García, C.U. Weighting quantitative and qualitative variables in clustering methods. Mathw. Soft Comput. 1997, 4, 1997.
- Lefkovitch, L.P. Conditional clustering. Biometrics 1980, 36, 43–58.
- Gibert, K.; Garcia-Rudolph, A.; Garcia-Molina, A.; Roig-Rovira, T.; Bernabeu, M.; Tormos, J. Response to TBI-neurorehabilitation through an AI & Stats hybrid KDD methodology. Med. Arch. 2008, 62, 132–135.
- Gibert, K.; Conti, D.; Vrecko, D. Assisting the end-user in the interpretation of profiles for decision support. An application to wastewater treatment plants. Environ. Eng. Manag. J. 2012, 11, 931–944.
- Gibert, K.; Conti, D. aTLP: A color-based model of uncertainty to evaluate the risk of decisions based on prototypes. AI Commun. 2015, 28, 113–126.
- Gibert, K.; Conti, D.; Sànchez-Marrè, M. Decreasing uncertainty when interpreting profiles through the traffic lights panel. In Advances in Computational Intelligence, Proceedings of the 14th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU 2012, Catania, Italy, 9–13 July 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 137–148.
- De Rham, C. La classification hiérarchique ascendante selon la méthode des voisins réciproques. Cah. De L’analyse Des Données 1980, 5, 135–144.
- COVID-19. Web del Project INSESS-COVID19. Available online: https://insess-covid19.upc.edu/ (accessed on 27 April 2023).
- DIXIT Centre de Documentació de Serveis Socials. Available online: https://dixit.gencat.cat/ca/detalls/Noticies/tsf_presenta_eina_cribratge_ajudar_identificar_gestionar_casos_socials_complexos.html (accessed on 15 February 2021).
- Pla Estratègic de Serveis Socials. Available online: https://treballiaferssocials.gencat.cat/web/.content/03ambits_tematics/15serveissocials/pla_estrategic_serveis_socials/Pla_estrategic_serveis_socials_catalunya_NOU/01_Plana_principal/1.-2020-12-29-Pla-estrategic-de-serveis-socials-2021-2024.pdf (accessed on 15 February 2021).
- Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods 1974, 3, 1–27.
Block Number | Block Name | Original Variables | Preprocessed Variables | DD2gI | KB2g | 3gV |
---|---|---|---|---|---|---|
B02-B03-B04 | Origin | 1 | 1 | |||
B05 | Trials | 2 | 8 | 1 | ||
B06 | LivingCoexistance | 5 | 1 | 1 | ||
B07 | DigitalGap | 4 | 6 | 1 | 1 | |
B08 | DependentEcolution | 4 | 1 | 1 | ||
B09 | UseofTimes (peopleincharge) | 15 | 4 | 1 | ||
B10 | UsdelTempsDINt1T2 | 16 | 9 | 1 | ||
B11 | ConvRel | 3 | 1 | |||
B11 | Violence | 19 | 5 | 4 | 36 | 1 |
B12 | Allparticipation | 12 | 3 | 1 | ||
B13 | Labour | 17 | 1 | 1 | ||
B13 | LaborBussiness | 5 | 1 | |||
B14 | Telework | 8 | 1 | |||
B15 | Economy | 18 | 1 | 14 | 1 | |
B15 | Gethelp | 14 | 2 | |||
B18 | Health | 9 | 1 | |||
B18 | Addiction | 12 | 1 | |||
B18 | MentalHealth | 9 | 1 | |||
Indicator/ Component | Block XI: Use of Time (Caretakers) | Children | GI | GII | GIII |
---|---|---|---|---|---|
 | 0.7 | 0.5 | 0.5 | 0.43 | 0.45 |
 | 0.47 | 0.56 | 0.58 | 0.54 | 0.57 |
 | 0.329 | 0.28 | 0.29 | 0.23 | 0.26 |
Indicator/ Component | Block XI: Use of Time (Caretakers) | Children | GI | GII | GIII |
---|---|---|---|---|---|
χ² p-value | 1.37 × 10⁻¹¹ | 4.27 × 10⁻²⁶ | 2.88 × 10⁻²⁶ | 1.83 × 10⁻²² | 1.57 × 10⁻²² |
Indicator/ Component | B11Rel Conv | R1.RelU ConvG20 | R2.RelU ConvJ20 | R3.RelU ConvG21 |
---|---|---|---|---|
χ² p-value | 1.37 × 10⁻¹¹ | 2.95 × 10⁻⁶ | 1.58 × 10⁻¹⁷ | 1.88 × 10⁻¹⁰ |
Indicator/ Component | B11Rel Conv | R1.RelU ConvG20 | R2.RelU ConvJ20 | R3.RelU ConvG21 |
---|---|---|---|---|
 | 0.61 | 0.66 | 0.68 | 0.66 |
 | 0.38 | 0.33 | 0.34 | 0.39 |
 | 0.23 | 0.22 | 0.23 | 0.26 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Angerri, X.; Gibert, K. Variable Selection for Meaningful Clustering of Multitopic Territorial Data. Mathematics 2023, 11, 2863. https://doi.org/10.3390/math11132863