Semantics-Constrained Advantageous Information Selection of Multimodal Spatiotemporal Data for Landslide Disaster Assessment

Abstract: Although abundant spatiotemporal data are collected before and after landslides, the volume, variety, intercorrelation, and heterogeneity of multimodal data complicate disaster assessments, so it is challenging to select information from multimodal spatiotemporal data that is advantageous for credible and comprehensive disaster assessment. In disaster scenarios, multimodal data exhibit intrinsic relationships, and their interactions can greatly influence selection results. Previous data retrieval methods have mainly focused on candidate ranking while ignoring the generation and evaluation of candidate subsets. In this paper, a semantics-constrained data selection approach is proposed. First, multitype relationships are defined and reasoned through a heterogeneous information network. Then, relevance, redundancy, and complementarity are redefined to evaluate datasets in terms of semantic proximity and similarity. Finally, the approach is tested using Mao County (China) landslide data. The proposed method can automatically and effectively generate suitable datasets for a given task rather than simply ranking candidates by similarity, and the selection results are compared with manual results to verify their effectiveness.


Introduction
Landslides have become increasingly common in recent years as a result of global climate change, and they have caused large losses in terms of property and lives [1][2][3]. Post-disaster assessments must be performed immediately to collect information on damage and to support emergency rescue and subsequent reconstruction [4][5][6][7][8]. When a landslide occurs, disaster-related organizations first collect and prepare the related datasets, including basic geographic data, economic data, remote sensing data, and so on. Then, different models and approaches are adopted to determine the affected area, damage extent, and direct losses. Many issues, such as the number of dead or injured people, affected crops, collapsed houses, and economic losses, need to be assessed. Finally, the accuracy and precision of the assessment results can be improved by integrating field investigation, remote sensing monitoring, model estimation, and local-level reporting of disaster impact data. Datasets from
Thus, filter methods are more applicable to high-dimensional datasets related to emergency situations. The minimum-redundancy-maximum-relevance (mRMR) filter criterion is considered one of the most frequently used methods for reducing dimensionality owing to its high accuracy [34,35], and it is applied in this research to find the optimal datasets. However, most existing filter feature selection approaches use sample-based statistical metrics to calculate the relevance and redundancy of data attributes [33,36,37], so relevance and redundancy must be properly redefined for data selection. Data similarity is appropriate for creating these filter criteria, but for specific assessment tasks, additional inter-data relationships, such as complementary or homogeneous relations extracted from domain knowledge and experience, must be considered. Neglecting feature interaction or dependence may lead to suboptimal selection results [38,39].
Overall, geographic information retrieval focuses on finding relevant datasets and ranking them using a semantic matching score; however, the retrieval process does not fully consider redundancy, and a ranked list of candidates cannot directly meet the analysis requirements. The study of feature selection methods provides a possible way of extending the retrieval process through subset generation and evaluation, but a comprehensive definition of subset evaluation criteria still needs to be studied. In this research, the fittest datasets with minimum redundancy are termed advantageous information. We expand the similarity-based data retrieval method and construct an advantageous information selection approach, with a focus on giving a comprehensive definition of subset evaluation criteria. Relevancy, redundancy, and complementarity evaluation indicators are combined to find the best subset. These evaluation indicators are defined based on both data-level relations, such as spatial coverage similarity, and task-level relations, such as complementary relations extracted from existing models. For landslide assessment, the proposed method can select and group the candidates to satisfy the analytical requirements rather than simply ranking them by a measure score. Effectiveness is verified through a comparison with the manual selection result. Moreover, the method is more reliable than other methods in terms of precision and recall.

Framework of the Methodology
As shown in Figure 1, the advantageous information selection approach consists of three components: construction of semantic relationships, definition of evaluation indicators, and selection strategy. The inputs are predefined models and massive datasets. The models can be ontological or simply conceptual, but their construction is beyond the scope of this research.
The first component, the construction of semantic relationships, is part of the preprocessing for information selection and covers the two levels of relationships. The datasets, task, and variables are described by a unified set of metadata, which is addressed elsewhere [11,24]. The data-level relationships are based on semantic similarity and are used to link the datasets, tasks, and variables. Then, all these relations are organized as heterogeneous information networks (HINs), and the task-level relationships are reasoned through a meta-path method, which is used to constrain the selection process. In the second component, three evaluation indicators (relevance, redundancy, and complementarity) are defined based on the relationships to measure the datasets as a whole. The relevance indicator is used to filter irrelevant datasets; the redundancy indicator can be used to eliminate redundant datasets; and the complementarity indicator guarantees effectiveness for a given task. Finally, the selection strategy component provides a detailed approach flow, which mainly comprises dataset filtering by relevance, subset generation, and subset evaluation.

Unified Description of Multiple-Association Relationships
This paper abstracts and defines the multiple-association relationships as two levels according to their roles in data selection (shown in Figure 1). Data-level relationships (denoted as DRs) include data attribute relationships (ARs), scale relationships (SRs), and spatiotemporal relationships (STRs), which indicate the intrinsic characteristics of geospatial data. Task-level semantic relationships (denoted as KRs) include causal relationships (CauRs), homogeneous relationships (HomoRs), and complementary relationships (ComRs), which represent the specific 'knowledge' for each task and application [40,41]. These semantic relationships are difficult to obtain using data alone and must be developed from an existing domain ontology, models, or expert experience. Given two datasets, Di and Dj, the various relationships between them are defined in Equation (1), where DR = <STR, SR, AR> and KR = <CauR, HomoR, ComR>.
The spatiotemporal relationship is a principal characteristic of earth observation data, as it is the essential filter condition for geographical data retrieval [42,43]. Scale relationships show the autocorrelations and interactions among scales. These relationships exist within geographic spatiotemporal variables, and variables at different scales can depict a geographic phenomenon or process at micro-, meso-, and macro-levels. Attribute relationships express the similarity between variables in terms of data content, spatial precision, and temporal granularity.
Causal relationships refer to the objective relationships that describe and analyze cause and effect. This type of relationship is defined here to express the hierarchical relationships between parameters or data. During implementation, causal relationships are divided into mapping relationships and consist-of relationships. Homogeneous relationships are defined to show the synonymy between heterogeneous data from different domains at the application level. Datasets from different sources that reflect the same variable can share such a relationship. For example, an earthquake intensity map and a Weibo or Twitter post distribution map may have the same effect in disaster assessment. Complementary relationships show variables that complement each other; considerable research has contributed to the discovery and utilization of this type of relationship [44,45]. The calculation methods for these relationships are given in the next section.

Calculation of Data-Level Relationships Based on Data Similarity
Data-level relationships can be quantitatively measured using similarity, according to previous studies [46][47][48][49]. Since calculating the data-level relationships is the foundation of the following approach, we briefly introduce the formulas in this paper. Some modifications and simplifications have been made based on recent research, and more detailed information can be found in the literature [29]. Three elementary similarities between the characteristics can be calculated using the following methods. Before the calculation, the metadata of the datasets and the description of the task or variables are formalized and unified through a series of description factors such as title, keywords, data resolution, spatial coverage, and so on [11,24].
(1) Spatiotemporal similarity Spatiotemporal similarity comprises spatial coverage similarity and temporal coverage similarity; the former can be determined using the overlapping area, and the latter can be determined by the length of the time overlap or the distance of the analysis, usually with a preference for the latest data. The spatiotemporal similarity is calculated using Equation (2), in which the spatial coverage similarity is $S_S = \mathrm{Area}(SD \cap TD)/\mathrm{Area}(TD)$, where SD represents the source data; TD represents the task requirement or target data; $W_S$ and $W_T$ refer to the weights of spatial coverage and temporal coverage, respectively, which are both set to 0.5 because space and time are tightly coupled during disasters; Time(SD) and Time(TD) indicate the middle time; and δ represents the adjustment parameter used to control deceleration, which is set to 0.9 when the time interval between Time(SD) and Time(TD) is less than a year and a half and to 0.6 otherwise. The product in Equation (2) may be transformed into a sum when space and time are not tightly coupled.
(2) Scale similarity Scale similarity refers to the similarity of spatial and temporal granularity/resolution. If the scales of SD and TD are the same, then their similarity is equal to 1. Furthermore, a reasonable interval is defined as $I = [scale_{TD} - \sigma, scale_{TD} + \sigma]$, where σ represents the acceptable error threshold. If the scale of SD differs from that of TD but lies in interval I, the data can be converted, although the difficulty of conversion depends on the direction: the similarity is set to 0.875 when the conversion from SD to TD is fine-to-coarse and to 0.125 when it is coarse-to-fine. Otherwise, if $scale_{SD}$ falls outside interval I, the similarity is equal to 0. Scale similarity is calculated using Equation (5).
(3) Attribute similarity Attribute similarity can be determined by the matching degree of the keywords. Keywords mainly include category, data type, and usage. The keyword type can be predefined for the convenience and accuracy of computing. Given keyword sets $KW_{SD}$ and $KW_{TD}$, similarity is calculated using Equation (6). String and linguistic similarity calculation methods can be used here when keywords have not been extracted from text [44], but these methods are not included in this research.
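The parts of these rules that are stated explicitly in the text can be sketched in code. The following Python sketch covers only the spatial coverage ratio, the choice of δ, and the piecewise scale similarity; it does not reproduce the full Equations (2) to (6). The function names, the use of axis-aligned bounding boxes for coverage, the 365.25-day year, and the representation of scale as a cell size (smaller means finer) are assumptions made for illustration.

```python
from datetime import datetime

def spatial_similarity(sd_bbox, td_bbox):
    """Area(SD ∩ TD) / Area(TD) for boxes given as (xmin, ymin, xmax, ymax).

    Bounding boxes are an assumption; real coverages may be arbitrary polygons.
    """
    overlap_x = min(sd_bbox[2], td_bbox[2]) - max(sd_bbox[0], td_bbox[0])
    overlap_y = min(sd_bbox[3], td_bbox[3]) - max(sd_bbox[1], td_bbox[1])
    if overlap_x <= 0 or overlap_y <= 0:
        return 0.0  # no overlap between source and target coverage
    td_area = (td_bbox[2] - td_bbox[0]) * (td_bbox[3] - td_bbox[1])
    return (overlap_x * overlap_y) / td_area

def adjustment_parameter(time_sd, time_td):
    """delta: 0.9 if the middle times are less than 1.5 years apart, else 0.6."""
    gap_days = abs((time_sd - time_td).days)
    return 0.9 if gap_days < 1.5 * 365.25 else 0.6

def scale_similarity(scale_sd, scale_td, sigma):
    """Piecewise scale similarity; scales are cell sizes, smaller = finer (assumed)."""
    if scale_sd == scale_td:
        return 1.0
    if abs(scale_sd - scale_td) <= sigma:  # scale_sd lies in I
        # SD finer than TD: fine-to-coarse conversion, the easier direction
        return 0.875 if scale_sd < scale_td else 0.125
    return 0.0
```

For example, a source box covering one quarter of the target area gives a spatial coverage similarity of 0.25, and images taken one day apart use δ = 0.9.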

Task-Level Relationship Discovery Based on Meta-Paths
Task-level relationships are reasoned through the meta-path method based on the HIN, which can powerfully represent the essential information and links among various objects [50,51]. The HIN is first built by integrating the related datasets and a predefined model (such as that in Figure 2) that describes the relations between terms or variables. As shown in Figure 3, the heterogeneous information network is composed of task, model, variable, and data nodes, the first three of which can be considered term nodes. Specifically, the heterogeneous information network is defined as G = (N, E, D, W), where N indicates the set of nodes; E is the set of links between nodes and consists of triplets (u, v, d), where u, v ∈ N and d ∈ D; D is a set of relationship types, each member of which represents a different type of link; and W is the set of the weights of the links between nodes u and v in dimension d and represents the strength of the links. The link types can be abstracted as two relationships: mapping and consist-of.
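As an illustration of the definition G = (N, E, D, W), the following minimal Python sketch stores the network as plain dictionaries. The class name, method names, and the example node names are hypothetical, and treating the data-to-variable link as a mapping link is an assumption; only the four components and the two link types come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class HIN:
    """Minimal container for G = (N, E, D, W)."""
    nodes: dict = field(default_factory=dict)    # N: node id -> node type
    link_types: set = field(
        default_factory=lambda: {"mapping", "consist-of"})  # D
    weights: dict = field(default_factory=dict)  # W: (u, v, d) -> strength

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_link(self, u, v, d, weight=1.0):
        """Add a triplet (u, v, d) to E together with its weight in W."""
        if u not in self.nodes or v not in self.nodes:
            raise KeyError("both endpoints must already be in N")
        if d not in self.link_types:
            raise ValueError(f"unknown link type: {d}")
        self.weights[(u, v, d)] = weight

    def links(self):
        """E: the set of (u, v, d) triplets."""
        return set(self.weights)

# Hypothetical example: one task, one variable, one dataset.
g = HIN()
g.add_node("housing_loss", "task")
g.add_node("housing_distribution", "variable")
g.add_node("DLG", "data")
g.add_link("housing_distribution", "housing_loss", "consist-of")
g.add_link("DLG", "housing_distribution", "mapping", weight=0.8)
```

Keeping E as the key set of W guarantees that every link has exactly one weight per link type, which matches the description of W as the strength of the link between u and v in dimension d.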
To complete the network, the datasets must be linked with the corresponding variables (or task) by matching their metadata with the description factors of the requirements. The data similarity method described in Section 2.2.2 is used to quantitatively calculate the strength of the relations. The relations between the task node and the data nodes can be used to roughly filter the datasets, and the relations between the variable nodes and the data nodes are used to further filter irrelevant data and to support the discovery of the following task-level relations.
After the HIN is completed, the task-level relationships can be reasoned by a meta-path-based approach. A meta-path is defined as the shortest typed sequence that connects two or more objects in an HIN. The nodes of the sequence are object types, and the edges are the relationships between object types. A meta-path is a widely used description of how two objects are related in a network, and the relationships among similar types of links usually share similar semantic meanings. Once standard meta-paths are defined, the relationships between two objects can be found in a network by matching the meta-path query [52].
In this paper, the hierarchical structure of the terms is organized as a tree graph, and the term node with minimum depth in the shortest path between two datasets is defined as the middle node (MN). The relationship types on both sides of this node determine the association type of the datasets. As shown in Figure 3, three types of semantic relationships are defined as follows:
(1) Causal relationship The causal relationship reflects data that affect the analysis results in a certain way. This type of relationship between a task and data is constructed by using an existing model as a bridge. The corresponding meta-path is indicated in Figure 3.

Figure 3. Relationship discovery based on heterogeneous information networks (HINs).
(2) Homogeneous relationship The homogeneous relationship is defined for dataset pairs with the same physical meaning. Two datasets have a homogeneous relationship when not all relationships linked to the MN are consist-of relationships in the shortest path between them. In the corresponding path indicated in Figure 3, the relationships r1 and r2 are not both consist-of relationships.
(3) Complementary relationship A complementary relationship refers to the group capacity of datasets, and a group of datasets with a complementary relationship always has a synergistic effect during analysis. Two datasets have a complementary relationship when all relationships linked to the MN are consist-of relationships in the shortest path between them. The corresponding path is indicated in Figure 3.
For example, suppose one dataset (denoted as DS1) is linked to variable D9 and another dataset (denoted as DS2) is linked to variable D5 in Figure 2. The relationship between dataset DS1 and variable D9 is treated as a causal relationship, and data with no corresponding variables are removed from the candidates. The shortest path between DS1 and DS2 matches the meta-path of the complementary relationship. The homogeneous and complementary semantic relationships are used to group the datasets automatically and to find the best datasets for the execution of the analysis.
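The middle-node rule for homogeneous and complementary relationships can be sketched as follows. This is a simplified illustration rather than the meta-path matching used in the paper: it assumes the term hierarchy is a tree stored as a child-to-parent map, that each dataset is attached to one variable by a mapping link, and that the MN is therefore the lowest common ancestor of the two datasets. The function names and the example tree are hypothetical.

```python
def chain_to_root(node, parent):
    """List (node, link type to its parent) from node up to the root."""
    chain = []
    while node in parent:
        up, link = parent[node]
        chain.append((node, link))
        node = up
    chain.append((node, None))  # the root has no parent link
    return chain

def classify_pair(ds_a, ds_b, parent):
    """Return (middle node, relationship type) for two dataset nodes."""
    chain_a = chain_to_root(ds_a, parent)
    chain_b = chain_to_root(ds_b, parent)
    index_b = {name: j for j, (name, _) in enumerate(chain_b)}
    for i, (name, _) in enumerate(chain_a):
        if name in index_b:  # first shared node = minimum-depth node on the path
            j = index_b[name]
            break
    else:
        raise ValueError("the datasets share no term node")
    if i == 0 or j == 0:
        raise ValueError("inputs must be two distinct dataset leaves")
    middle_node = chain_a[i][0]
    link_a = chain_a[i - 1][1]  # link entering the MN from the ds_a side
    link_b = chain_b[j - 1][1]  # link entering the MN from the ds_b side
    both_consist_of = link_a == "consist-of" and link_b == "consist-of"
    return middle_node, "complementary" if both_consist_of else "homogeneous"

# Hypothetical term tree: child -> (parent, link type).
parent = {
    "housing_distribution": ("housing_loss", "consist-of"),
    "housing_type": ("housing_loss", "consist-of"),
    "DLG": ("housing_distribution", "mapping"),
    "image": ("housing_distribution", "mapping"),
    "survey": ("housing_type", "mapping"),
}
```

In this toy tree, two datasets attached to the same variable meet at that variable through mapping links and are classified as homogeneous, whereas datasets attached to sibling variables meet at the parent through two consist-of links and are classified as complementary.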

Semantics-Concerned Evaluation Indicators
Advantageous information selection methods aim to find a suitable and reliable subset of datasets that minimizes redundancy and maximizes relevance. In this paper, relevance, completeness, and redundancy are defined for the evaluation of selection results.
Suppose X = (X1, X2, ..., Xn) is a subset with n datasets; the indicators can then be defined as follows.
Relevance is used to evaluate and rank datasets individually. We apply data similarity, including spatiotemporal similarity, scale similarity, and attribute similarity, to the quantitative calculation of relevance. Relevance, which quantifies the total similarity between X and the task T, is defined in Equation (7).
Completeness is used to evaluate all candidates as a whole, and it is ignored in general selection methods, which are usually not constrained by the analytical model. Completeness consists of two parts in this study, i.e., spatiotemporal integrity and data category integrity. The former is a basic component of spatiotemporal analysis, and the latter is necessary for a specific analytical model. Suppose that the analytical model needs n essential variables. Then, completeness is defined in terms of VarNum(X), which denotes the number of corresponding variables in the candidate set X. Multiple variables that have causal relationships are counted only once.
Redundancy refers to information that is expressed more than once. Examples of redundancy include multiple datasets with homogeneous relationships or multiple datasets with a repeating spatiotemporal range. However, redundancy is a relative concept, and it changes according to the situation [34]. For example, some datasets may be redundant with regard to the spatiotemporal range but complementary with regard to resolution. Therefore, we define a redundancy-complementarity indicator to consider the intercorrelations between datasets. Given datasets Xi and Xj, which have a homogeneous relationship, redundancy and the related definitions are given through Equation (11), where C_SCA represents the constraints derived from semantic relationships. Datasets with high similarity scores may introduce little redundancy when they represent different variables or semantic meanings. Take two images of the same area as an example, one obtained on 23 June and the other on 24 June. These two datasets may be redundant because they are so similar. However, if the landslide occurred in the early hours of 24 June, they become complementary because both are necessary to detect changes. R(Xi, Xj) reflects the amount of repetitive information. Here, we use multiplication to emphasize the potential information gains in space, time, scale, and attributes.
Local detailed information may improve global accuracy through methods such as data fusion and transfer learning. Finally, based on the evaluation indicators above, the overall evaluation function is given in Equation (12), where Rel(X) is the total relevance of the candidates, RedComy(X) is the sum of the redundancy of each pair of datasets in the candidate set, and n is the number of candidates. This equation is used as the utility function in the selection process to effectively evaluate the subsets.

Semantics-Constrained Advantageous Information Selection Strategy
The proposed selection approach is divided into three steps:
(1) Dataset ranking by relevance The relevance ranking process aims to rank the datasets based on the relevance between the task requirements and the available datasets. As shown in the previous section, relevance can be calculated by Equation (7). Datasets are ranked by the final relevance score, and datasets without spatiotemporal similarity are removed.
(2) Subset generation Random generation and sequential selection are adopted to generate subsets. First, n datasets are randomly selected from the candidates as the original subset, and then all datasets that are complementary to them are added to the subset.
(3) Subset evaluation The indicators of completeness and redundancy are first calculated for the subset. Then, sequential backward selection is adopted, which sequentially removes datasets whose removal increases the value of the overall evaluation function defined in Equation (12). The subset generation and evaluation processes are repeated until the value of the overall evaluation function can no longer be increased.
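The three steps above can be sketched as a short Python routine. This is an illustrative sketch under stated assumptions, not the authors' implementation: the utility function is supplied by the caller because Equation (12) is not reproduced here, the outer loop runs a fixed number of rounds instead of stopping when the utility stops improving, and all function names and the toy data are hypothetical.

```python
import random

def rank_by_relevance(relevance, st_similarity):
    """Step 1: drop datasets without spatiotemporal similarity, then rank."""
    kept = [d for d in relevance if st_similarity.get(d, 0.0) > 0.0]
    return sorted(kept, key=lambda d: relevance[d], reverse=True)

def generate_subset(ranked, complementary, n_seed, rng):
    """Step 2: random seed subset plus every dataset complementary to it."""
    seed = rng.sample(ranked, min(n_seed, len(ranked)))
    subset = set(seed)
    for d in seed:
        subset |= complementary.get(d, set()) & set(ranked)
    return subset

def backward_selection(subset, utility):
    """Step 3: repeatedly remove the dataset whose removal raises utility most."""
    current = set(subset)
    improved = True
    while improved and len(current) > 1:
        improved = False
        best_score = utility(current)
        for d in sorted(current):
            score = utility(current - {d})
            if score > best_score:
                best_score, best_drop, improved = score, d, True
        if improved:
            current.remove(best_drop)
    return current

def select(relevance, st_similarity, complementary, utility,
           n_seed=2, rounds=20, seed=0):
    """Repeat generation and evaluation; keep the best subset found."""
    rng = random.Random(seed)
    ranked = rank_by_relevance(relevance, st_similarity)
    best, best_score = set(), float("-inf")
    for _ in range(rounds):  # fixed round count is a simplification
        cand = backward_selection(
            generate_subset(ranked, complementary, n_seed, rng), utility)
        score = utility(cand)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

# Toy data: "a" and "b" are redundant, "a" and "c" are complementary.
relevance = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.5}
st_similarity = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.0}
complementary = {"a": {"c"}, "c": {"a"}}

def toy_utility(subset):
    """Hypothetical stand-in for Equation (12): relevance minus a pair penalty."""
    penalty = 2.0 if {"a", "b"} <= subset else 0.0
    return sum(relevance[d] for d in subset) - penalty

best, best_score = select(relevance, st_similarity, complementary, toy_utility)
```

With the toy utility, dataset "d" is removed in step 1 because it has no spatiotemporal similarity, and backward selection drops "b" whenever it appears together with the redundant "a", so the complementary pair "a" and "c" is kept.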

Case Study
To verify the effectiveness of the proposed approach, we adopt landslide loss assessment as the target, and the case study demonstrates the automatic and adaptive selection of data for an assessment. Post-disaster damage and loss assessment of landslides is a complex process that involves many factors, such as the landslide area and strength and the distribution and property value of housing and infrastructure.
The studied landslide occurred on 24 June 2017 in Diexi Town, Mao County, Sichuan Province in south-western China (Figure 4). It destroyed 40 homes in Xinmo Village and killed 10 people, with a further 73 people missing [53]. It was a high-speed and long-distance landslide. The volume of this landslide was about 8 million m³, and the sliding rocks buried the village and blocked the river near the toe of the slope [54]. The direct economic loss was about 300 million Yuan according to the assessment reports. Pre-disaster information collection is necessary for the loss assessment because nearly all the buildings were buried and even field investigations could not determine the exact losses. Therefore, the assessment of housing damage and losses was selected for the implementation of the proposed method in the case study. The assessment can be carried out using the following steps.
(1) The landslide area must be identified according to airborne or satellite images before and after the disaster. The area can be extracted through image interpretation or change detection.
(2) Housing distribution should be determined based on Digital Line Graphics (DLGs), high-resolution images, or local statistical data.
(3) Housing type is also used in the assessment to determine the cost of a building; this information can be generated from field survey data, local statistical data, or other in situ data.
The presented implementations do not focus on the entire assessment process to generate the final assessment results, but rather concentrate on automatically finding the best datasets for the assessment. The feasibility and applicability of the proposed method are verified by demonstrating the use cases.

Test Data Description
To test the effectiveness of the proposed approach, test data are collected from the National Catalogue Service for Geographic Information of China (http://www.webmap.cn) and the Sichuan Geomatics Center. The following datasets are used in the implementation:
(1) Basic geographic data. Basic geographic data mainly cover DLG, DOM (Digital Orthophoto Map), DEM (Digital Elevation Model), DRG (Digital Raster Graphics), and massive original images from satellites or UAVs (unmanned aerial vehicles). Two hundred DLG data points, ten thousand DEM data points, and twenty thousand images (with resolutions ranging from 0.1 m to 15 m) are included in the test datasets.
(2) Emergency thematic data. Emergency thematic data contain historical case data, socioeconomic data, real-time field data, and crowd-sourced geographic data. Some historical case data, Sichuan Province census data, and more than 5000 pieces of multimedia data obtained via crowdsourcing are included. In addition, four pieces of data from UAV images and two pieces of interpreted data are included in the test datasets.

Advantageous Information Selection Process
The overall workflow of the advantageous information selection approach for geographic analysis has four steps (Figure 5):

Data Filtering
This step removes obviously irrelevant data and reduces the overall amount of data using the relevance (Equation (7)) between the task requirement and the available data. Data with a relevance score below the threshold, which is set to 0.3, are removed from the original datasets. The filter results are listed in Table 1. Some datasets (i.e., items 16 and 17) are irrelevant; they are in Mao County but far from the landslide area.

Task-Level Relationship Construction
A conceptual model of landslide assessment, as shown in Figure 2, is used to provide hierarchical associations between variables. Then, the datasets are linked with the corresponding variables, and datasets with no corresponding variable are removed from the candidates. For example, the radar data (item 3 in Table 1) are removed because they cannot meet the optical image requirement of the variable pre-disaster image (D3). Finally, the meta-path approach introduced in Section 2.2.3 is used to discover task-level relationships.
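The linking step can be illustrated with a short Python sketch. The data structures are hypothetical: the `required_type` and `data_type` fields, the exact-match rule, and the data type assigned to item 2 are assumptions made for the example; only the variable D3 and the rejection of the radar data (item 3) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Variable:
    code: str           # variable identifier in the conceptual model, e.g. "D3"
    name: str
    required_type: str  # hypothetical field: the data type the variable accepts


@dataclass(frozen=True)
class Dataset:
    item: int           # item number as listed in the results table
    data_type: str      # hypothetical field: the data type of the dataset


def link_datasets(datasets: List[Dataset],
                  variables: List[Variable]) -> Tuple[Dict[str, List[int]], List[int]]:
    """Link each dataset to the variables it can serve; report unlinked datasets."""
    links: Dict[str, List[int]] = {v.code: [] for v in variables}
    for ds in datasets:
        for var in variables:
            if ds.data_type == var.required_type:
                links[var.code].append(ds.item)
    linked = {item for items in links.values() for item in items}
    unlinked = [ds.item for ds in datasets if ds.item not in linked]
    return links, unlinked


variables = [Variable("D3", "pre-disaster image", "optical")]
datasets = [Dataset(2, "optical"), Dataset(3, "radar")]
links, removed = link_datasets(datasets, variables)
```

In this toy run, item 3 ends up in `removed` because no variable accepts radar data, which mirrors the example given in the text.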

Selection and Optimization of Datasets
The candidates are ranked by the relevance calculated by Equation (7), as shown in Table 1. Random generation and sequential selection are adopted to generate subsets. First, a dataset is randomly selected, and all datasets that are complementary to it are added to form a subset. Then, sequential backward selection is adopted, which sequentially removes datasets whose removal increases the value of the overall evaluation function defined in Equation (12). This process is repeated until the value of the overall evaluation function can no longer be increased.
Taking the candidates S = {2, 4, 5} for the landslide area as an example, the selection and evaluation processes proceed as follows. First, we calculate relevance (Rel(X_2) = 1, Rel(X_4) = 0.738, Rel(X_5) = 0.685). Then, redundancy is computed (Red_Comy(X_4, X_5) = 0.12, Red_Comy(X_2, X_4) = 0, Red_Comy(X_2, X_5) = 0) because the ComR relationship exists. Finally, completeness is calculated (Coms(S) = 1). Therefore, the overall evaluation function can be computed (f(S) = 0.768). Given the subsets St1 = {2, 4} and St2 = {2, 5}, with completeness Coms(St1) = 0.651 and Coms(St2) = 0.469, the overall evaluation function is computed as f(St1) = 0.565 and f(St2) = 0.440. Although St1 and St2 contain fewer datasets, their overall evaluation scores are lower because their spatiotemporal coverage is incomplete.

Selection Results
Although the initial datasets used in this implementation are massive, most of them are removed during filtering. The best candidate subset, with which all indicators are satisfied, consists of five datasets (Nos. 1, 4, 5, 6, and 11), as indicated with a green background in Table 1. No. 1 represents the landslide area; Nos. 4 and 5 correspond to house distribution and are complementary in spatial range; and Nos. 6 and 11 represent house use and are complementary variables in the assessment model. Datasets with a blue background (Nos. 2, 7-10, and 12-15) are redundant, and datasets with a red background (Nos. 3, 16, and 17) are removed due to low relevance or a lack of causal relationships. To verify the effectiveness of the selection results, ten students with professional backgrounds manually picked datasets for the same task; the selections of eight of the students were consistent with the results of the proposed method.

Result Analysis and Discussion
The proposed method, based on an integrated evaluation index of relevance, redundancy, and completeness, can help in the data collection process and improve automation and efficiency. Table 2 shows the statistics of the final results, which were evaluated in terms of Precision and Recall. In information retrieval, Precision and Recall are widely used to evaluate query methods. High Precision means that an algorithm returns more useful datasets than useless ones, while high Recall means that an algorithm returns most of the useful results, without regard to redundancy. Three methods are compared in the table: the keyword-based retrieval method, the relevance-based retrieval method, and the semantics-constrained advantageous selection method. The proposed approach, which considers more semantic constraints and integrates the subset evaluation process into the retrieval method, is more reliable and has higher Precision and Recall values than the other two methods.
Keyword-based retrieval returns the most irrelevant and redundant items, and its Precision is the lowest (33%) of the three methods. As shown in Table 1, the datasets with a red background (No. 16 and No. 17) are retrieved by the keyword "Mao County" even though they are far from the landslide area, which is an effect of semantic ambiguity. Moreover, multimedia datasets may not be retrieved by keywords because they come from different domains and do not contain labels related to "Mao County" or "landslides", although they are helpful for restoring the scene before the disaster when the village is completely buried.
The relevance-based method usually filters datasets only by similarity. However, a high similarity value does not mean that a dataset is useful for the assessment task. For example, the ranking of radar images drops significantly once the gain in relevance from causal relationships is considered, because radar images are of no use in the housing loss assessment. The results recommended by the relevance-based method must be further selected by operators for every variable, as the method still ignores redundancy. Furthermore, as the change in the statistical indicators shows, this method is sensitive to the relevance threshold; large threshold values improve Precision but worsen Recall. There is an inverse relationship between Precision and Recall, in which one can be increased at the cost of reducing the other. Defining the threshold value remains a problem for relevance-based methods [11].
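The trade-off between Precision and Recall under a relevance threshold can be illustrated with a short Python sketch. The dataset labels, relevance scores, and the set of useful datasets below are invented for illustration and are not taken from Table 1 or Table 2; the convention of returning 0.0 when nothing is retrieved is also an assumption.

```python
from typing import Dict, Iterable, Set, Tuple


def precision_recall(retrieved: Iterable[str], useful: Set[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved, Recall = hits / useful (0.0 if a denominator is 0)."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & useful)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(useful) if useful else 0.0
    return precision, recall


def sweep(scores: Dict[str, float], useful: Set[str],
          thresholds: Iterable[float]) -> Dict[float, Tuple[float, float]]:
    """Evaluate a relevance-threshold filter at several threshold values."""
    return {
        t: precision_recall([d for d, s in scores.items() if s >= t], useful)
        for t in thresholds
    }


# Hypothetical relevance scores and ground truth, for illustration only.
scores = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.4, "e": 0.2}
useful = {"a", "b", "d"}
result = sweep(scores, useful, [0.3, 0.6])
```

With these toy values, raising the threshold from 0.3 to 0.6 drops the irrelevant dataset "c" but also the useful dataset "d", so Precision rises while Recall falls.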
In the proposed method, relevance is used to generate the original candidates, and then the subset generation and evaluation processes are used to select the final results. Therefore, a lower relevance threshold (0.3) is adopted to guarantee that all useful datasets are included in the original candidates. In the subset generation and evaluation process, redundancy is removed while the completeness of the datasets is ensured. Hence, the proposed method can achieve higher Precision and Recall at the same time. The improvements depend on the domain knowledge and the new evaluation indicators. In this research, domain knowledge is represented by the predefined conceptual model (Figure 2), which gives the relationships between assessment factors or variables. However, these hierarchical relations cannot be adopted directly in the selection process. Thus, task-level relationships are defined in this paper, and the hierarchical relationships between concepts are converted into semantic constraints between datasets by the meta-path-based reasoning method. The predefined conceptual model can be derived from existing assessment models and case data, and an automatic transformation tool should be developed in the future.
Comprehensive disaster assessment of the affected area, damage extent, and direct economic losses is urgent work for the government in making rescue and reconstruction plans. The proposed selection approach can improve the automation and accuracy of loss assessment and can be deployed on a disaster management platform as a service to improve the efficiency of response and assessment work. Experts or operators can then concentrate on the decision-making process rather than on time-consuming data collection and selection. The selection method is task-oriented, and operators can change the input parameters, such as keywords, in the task requirements when a new landslide happens. The method can then retrieve and select the advantageous information for loss assessment. During the selection process, the semantic relationships between datasets are built dynamically, whereas the relationships between the concepts do not change from one landslide to another.
This research concentrates on landslide assessment because specific knowledge input is required for implementation. We believe that the approach can be adopted for other disaster scenarios, and even for other data selection tasks, such as visualization. Users can establish different task requirements and corresponding concept association graphs as the predefined input model for other cases. However, the thresholds used for the indicators may need to be adjusted for ideal selection when a new task is given.

Conclusions and Future Studies
Determining optimal datasets for complex geographic analysis has become an urgent but time-consuming issue in the age of "big data".Therefore, it is important to rapidly and accurately identify appropriate input data to ensure the timeliness and reliability of disaster assessment.
This study proposes a framework based on semantic relationships for the automatic recommendation of the best dataset for the assessment model according to a group evaluation index that includes relevance, redundancy, and completeness.The employment of semantics can make the selection process smarter based on an understanding of the concepts underlying geo-information.In the framework, semantic relationship types are formalized into two levels that contain a similarity relationship based on geospatial data characteristics and an interaction relationship based on their roles in analytical models.Furthermore, a reasoning method is adopted to qualitatively and quantitatively indicate associations based on HIN and similarity calculation.Thus, dataset evaluation indices are defined and calculated based on these relationships.Through this index, the advantageous information selection method is adopted for choosing the best subset.As illustrated in the case study, this approach effectively contributes to the discovery of reliable and high-quality datasets for the analytical model.The results cover all the needs of the dataset, and users do not need to manually choose and combine datasets based on their experience and knowledge because all this knowledge is transformed into semantic relationships and used in the selection process.
Threshold values and weight parameters appear to play important roles in the assessments considered in this research, and the values of these parameters often vary with the task. Further work will employ data mining methods to adjust the parameters based on historical case data.

Figure 1. Computational framework of semantics-constrained advantageous information selection.

Figure 2. Flowchart of advantageous information selection from the original datasets.

Figure 3. Relationship defined by a path in which the relationships r1 and r2 do not both belong to the consist-of relationship.

Figure 4. Case study area: a landslide in Diexi Town, Mao County.

Figure 5. Flowchart of advantageous information selection from the original datasets.

Table 1. Data selection results of the proposed method.

Table 2. Evaluation of selection methods.