Next Article in Journal
A GIS-Based Tool for Automatic Bankfull Detection from Airborne High Resolution Dem
Previous Article in Journal
Tree-Based and Optimum Cut-Based Origin-Destination Flow Clustering
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Hierarchical Semantic Correspondence Analysis on Feature Classes between Two Geospatial Datasets Using a Graph Embedding Method

Geospatially Enabled Society Research Division, Korea Research Institute for Human Settlements, Sejong 30147, Korea
ISPRS Int. J. Geo-Inf. 2019, 8(11), 479; https://doi.org/10.3390/ijgi8110479
Submission received: 19 August 2019 / Revised: 9 October 2019 / Accepted: 21 October 2019 / Published: 24 October 2019

Abstract

:
A method to find corresponding feature class pairs, including hierarchical M:N pairs between two geospatial datasets is proposed. Applying an overlapping analysis to the object sets within the feature classes, the similarities of the feature classes are estimated and projected onto a lower-dimensional vector space after applying the graph embedding method. In this space, conventional mathematical tools—agglomerative hierarchical clustering in this study—could be used to analyze semantic correspondences between the datasets and identify their hierarchical M:N corresponding pairs. The proposed method was applied to two cadastral parcel datasets; one for latest land-use records in an urban information system, and the other, for original land-use categories in the Korea land information system. To quantitatively assess identified feature pairs, F-measures for each pair are presented. The results showed that it was possible to find various semantic correspondences of the feature classes and infer regional land development characteristics.

Graphical Abstract

1. Introduction

Establishing data integration between different geospatial information systems is necessary in order to set up geospatial data infrastructures for collecting and disseminating the data from different systems [1]. As each dataset belonging to these systems represents similar real-world entities or phenomena according to their own abstraction models and surveying rules, syntactic, structural, semantic, and geometric heterogeneities occur between corresponding objects of different datasets [2]. Among the mentioned heterogeneities, the first two can be addressed by applying well-known knowledge representation techniques, such as the web ontology language (OWL) or resource description framework (RDF), while remaining semantic; geometric ones, however, are still complicated problems [3]. This is due to the fact that corresponding spatial objects of different datasets, which represent the same real-world entity, have their own conceptual meanings and geometric representations according to the application purposes of datasets. For example, a small narrow road connecting a main road and a parking lot in a large commercial center may be represented as a polyline object attributed as “road” in one dataset, whereas it may be represented as a polygon object attributed as “auxiliary facility” in another dataset.
In the field of the map conflation, various methods have been developed to address the aforementioned semantic and geometric heterogeneity problems [4,5]. Authors in [5] proposed a conceptual framework for a general process to address these problems, as shown in Figure 1. In this process, a pre-processing step is performed to transform two geospatial datasets (GeoDSs) to have a uniform format, scale, reference system, and so on. Then, a semantic filter step is applied to identify the corresponding feature class pairs, which represent the same geographic entities or phenomena. While these two steps are related to model- (or dataset)-oriented analysis, the remaining steps correspond to the object-oriented analysis used to identify matching object pairs, and then, to address the geometric discrepancies between them.
When the geospatial datasets to be integrated originate from a similar domain, a simple comparison of feature class names would provide the desired results for the semantic filter step. However, in the case when they are from different domains, the names can be the same or similar, even though the feature classes represent substantially different real-world entities or phenomena. Moreover, the corresponding relations may vary from 1:1 to 1:N or M:N. In these cases, detailed data specifications of the datasets to be compared are necessary. However, most of the datasets do not provide such information [6].
To address this problem, various object-based analysis techniques have been proposed. These techniques use matching objects between two datasets to identify corresponding feature classes. They assume that, if spatial objects of a certain feature class in one geospatial dataset correspond to spatial objects in another feature class in the other dataset with a high probability, there is high semantic similarity between the two feature classes [7]. Uitermark et al. [2] extended this method by introducing taxonomical and partonomical relationships of feature classes within each dataset, so that relations of feature classes between datasets, as well as within each dataset, can be obtained. Similarly, authors in [8,9] have proposed an ontology integration method based on searching for relations between objects, which are able to infer taxonomic relations between the feature classes. Cruz and Sunna [10] applied a graph-matching method, where a graph is constructed for each geospatial dataset, and the taxonomic and partonomical relations of feature classes of one geospatial dataset are represented as nodes and edges, respectively. This graph model has been adopted in some studies that proposed their own similarity measurement methods. Khatami et al. [3] combined several similarities for feature class name pairs and derived the overall object correspondence between feature classes (consequently, between geospatial datasets), as well as the semantic structure among feature classes (within a geospatial dataset). Buccella et al. [11] propose a novel system that manually creates domain ontologies and automatically enriches domain ontologies with standard information using semantic, syntactic, and structure analyses. Then, ontology integration is carried out with the information. Bhattacharjee and Ghosh [12] proposed the semantic hierarchy-based similarity measurement for semantic similarity between land cover feature classes, which considers a hop number from the top feature class node to a certain node in their graph. Kuo and Hong [13] proposed a conceptual framework for semantic integration of geospatial datasets, which allows identifying matching geospatial feature classes. In this framework, hierarchical semantic relations between the datasets such as “is_subset_of”, “is_superset_of”, or “is_same_to” were determined by analyzing intersection relations of objects belonging to feature classes. Kuai et al. [14] focused on natural language barriers for semantic matching between feature classes in different geospatial datasets. Recently, Zhang et al. [15] proposed a multi-feature-based similarity measurement based on geospatial relationships, feature catalog, and name tag, and then applied a supervised machine learning process to identify corresponding pairs.
Although the above studies showed good results, there is room for improvement by applying recent semantic analysis techniques [16,17,18] and developing new approaches to obtain hierarchical corresponding relations of feature classes between geospatial datasets, as well as within each dataset. These techniques begin from a co-occurrence matrix in which rows and columns represent individual entities used for analysis; in this study, feature classes are these entities. Considering the aforementioned object-based methods [7,8,9,10,11,12,13,14,15], these co-occurrence values could be measured by degrees of object sharing or intersection between feature classes from two geospatial datasets. This matrix representation easily shows overall degrees between feature classes—conventional mathematical tools—which are suitable for feature vector data but not matrix data and cannot be easily applied to the matrix for identifying corresponding feature class pairs. To address this problem, several dimensionality reduction techniques, such as latent semantic analysis or graph embedding, are employed to define a new vector space where individual entities are represented as feature vectors to which conventional mathematical tools can be easily applied [17,19,20].
In this study, the Laplacian graph embedding proposed in [20] was applied to address the above issue. This method was developed to identify the multi-level corresponding object–set pairs between two remote sensing data. It constructed a bipartite graph representing each object as a node, and node pairs’ similarities between datasets as an edge with a weighted value. Thereafter, by applying Laplacian graph embedding, objects with higher similarity were distributed on closer coordinates in the embedding space. Finally, a clustering analysis on the projected nodes in the space was conducted, and the hierarchical corresponding object–set pairs could be found. In this study, nodes are used to represent feature classes rather than individual objects, so that the feature class pairs between datasets with a greater number of shared objects have close coordinates in the embedding space. Thus, this space can be understood as a semantic feature space, where two feature classes representing similar real-world entities or phenomena have geometrically close embedding coordinates. Therefore, with the knowledge of these coordinates and their distances, which are proportional to semantic dissimilarity, the previously mentioned complicated correspondence relationship between the feature classes of the two geospatial datasets can be found, and the semantic relationships of the feature classes can also be compared and inferred.
In this paper, the proposed method is applied to cadastral parcels’ latest land-use records obtained from the urban information system (UIS) and their original land-use categories obtained from the Korea land information system (KLIS). These two systems have the same parcel dataset, however, attributes of their parcels could be different; a land-use category is assigned in the perspective of taxation, whereas the land-use record is assigned in the perspective of urban management. Consequently, even for the same parcels, their categories and records can be different, so that corresponding relations between these feature classes cannot be properly derived without having background information. These relations include not only M:N corresponding relations, but also their nested hierarchies. Moreover, these relations can be distinctive for specific areas due to unique geographical conditions typical for areas in question. The proposed method defines a semantic feature space where feature classes (in this study, the land-use category or land-use record) are represented as vectors. As conventional mathematical tools can be easily applied to vectors, and the distance between vectors in this study is proportional to semantic dissimilarity, the complicated relationships could be identified using proper mathematical tools such as clustering analysis.
The rest of the paper is organized as follows. In the subsequent section, an explanation of Laplacian graph embedding is given; in Section 3, the proposed method is explained; and in Section 4, it is applied for two areas, Seoul city to represent an urban area, and the Jeonnam Province to represent a rural area; then, their results are compared. Finally, in Section 5, the conclusions of this study are discussed.

2. Laplacian Graph Embedding

2.1. One-dimensional Embedding

In this paper, we assume an undirected and connected graph. The graph G = ( V , E ) is represented by sets of vertices V = { v i | i = 1 , , N } and edges E = { ( v i , v j ) | v i , v j V } . Given a weighted graph, edge weights are represented as a weight matrix W R N × N . One-dimensional graph embedding finds a configuration of embedded vertices in one-dimensional space, such that the vertices’ proximities from the edge weights are preserved as the embedded vertices’ distances. Assuming each entry of a column vector x = ( x ( 1 ) , , x ( n ) ) T as coordinates of the embedded vertices, this problem can be solved through minimization of the following objective function [21].
( i , j ) E ( x ( i ) x ( j ) ) 2   w i , j
This function could be minimized when vertices i and j with large w i , j are embedded at close coordinates, whereas vertices with small w i , j are embedded into distant coordinates. In this study, this mathematical property is applied as follows: feature classes (e.g., land-use category and record) with a greater degree of object sharing have close coordinates in their embedding space and feature classes with a lesser degree of object sharing have distant coordinates. Equation (1) can be expressed in a matrix operation form with a Laplacian matrix L , and can be represented as Equation (2) [19,20,21].
1 2 ( i , j ) E ( x ( i ) x ( j ) ) 2   w i , j = x T   L   x
where, the Laplacian matrix L is defined as Equation (3) with a vertex degree matrix D whose diagonal entries are obtained as d i , i = j i w   i , j and the remaining entries are 0.
L = D W
Now, the problem can be changed to find a vector x that minimizes x T L x , and can be represented as Equation (4).
x = arg min   x T L x
Since the value of x T L x is vulnerable to the scaling of a vector x , a constraint x T   B   x = 1 is imposed to remove any such arbitrary scaling effect [17]. The diagonal matrix B provides weights on the vertices, so that the higher b i , i is, the more important is that vertex [21]. Equation (4) with the constraint can be solved by the Lagrange multiplier method as in Equations (5)–(7).
F ( x ) = L x ,   x λ ( B x , x 1 )
F ( x ) x = x T ( L + L T ) λ x T ( B + B T )
2 ( L x ) T = 2 λ ( B x ) T L x = λ B x
Thus, the solution of one-dimensional embedding, x , is obtained by solving the eigenproblem L x = λ B x . However, according to the rank of matrix L , there could be more than one eigenvector. In the field of graph spectral theory, the eigenvector corresponding to the smallest eigenvalue larger than 0 is the proven solution, which is called a Fiedler vector. Thus, the coordinates of vertices in one-dimensional embedding are obtained as components of the Fiedler vector as represented by Equation (7).

2.2. k-dimensional Embedding

Now, consider k-dimensional graph embedding. These embedded coordinates are represented as an n × k matrix X = [ x 1 , , x k ] , so that the ith row of X , x ( i   ) = ( x 1 ( i ) , , x k ( i ) ) , contains the k-dimensional coordinates of vertex v i . Now, an objective function is defined as Equation (8) with the constraint, X T   B   X = I .
1 2 i , j x ( i   ) x ( j   )     2   w i , j = t r a c e   ( X T L X )
Sameh and Wisniewski [22] proved that the solution to this trace minimization problem is obtained by the k-eigenvectors of L X = λ B X that correspond to its smallest eigenvalues other than 0. Thus, the solution of Equation (8) is obtained by a matrix X = [ x 1 , , x k ] , where x i represents an eigenvector corresponding to eigenvalue λ i under the condition 0 = λ   0 < λ   1 λ k .
However, the constraint X T   B   X = I normalizes the scales of the coordinates in each dimension. Thus, it is necessary to rescale them according to each dimension’s relative importance. Sameh and Wisniewski also proved that the minimum value of X T L   X in Equation (8) equals the sum of the corresponding eigenvalues, as shown by Equation (9) [22].
min   t r a c e   ( X T L   X ) = i = 1 k λ i
Accordingly, we can assume the eigenvalue λ i as the amount of either the penalty or the cost caused by the ith dimensional space in the embedding problem. So, when k < l , it is appropriate to apply more weight to | x k ( i ) x l ( j ) | than | x k ( i ) x l ( j ) | in measuring the proximity for a clustering analysis. Based on these mathematical properties, we determined the embedded coordinates as Equation (10), because the increase in distance is proportional to that of the root-squared coordinate difference [20].
X = [ x 1 λ 1 , , x k λ k ]

3. Proposed Method

The proposed method begins with an edge weight matrix whose cells represent the degree of object sharing between two feature classes (Step 1). From this matrix, k-dimensional feature vectors for each feature class are obtained by the Laplacian graph embedding technique (Step 2). Then, agglomerative hierarchical co-clustering is applied to find hierarchically corresponding feature class–set pairs (Step 3). Figure 2 presents a pseudocode of the proposed method and details of each step are explained in the following sections.

3.1. Step 1: Constructing Edge Weight Matrix W

The proposed method begins with a weighted bipartite graph represented by a similarity matrix S R n × m , where n and m stand for the numbers of feature classes in two datasets A and B respectively, and cell values are calculated by Equation (11).
s ( i , j ) = N (   f i f j ) min ( N ( f i ) , N ( f j ) )
where, N (   ) is a function that returns the number of spatial objects, f i and f j represents feature class i and j in two datasets A and B , respectively. This similarity measure effectively explains a partial and complete relationship of two feature classes, which is necessary to find complicated corresponding pairs such as N:1, 1:M, or N:M [23,24].
Since Laplacian graph embedding assumes a normal graph, an edge weight matrix W R N × N , where N = n + m , is obtained by Equation (12). With this matrix W , its Laplacian matrix L is obtained by Equation (3).
W = [ 0 S S T 0 ]

3.2. Step 2: Solving Eigenproblem and Obtaining K-dimensional Coordinates

The process of Laplacian graph embedding in Section 2 considered each vertices’ weight using a diagonal matrix B . However, in this study, each feature class has the same importance so that B is set to an identity matrix and Equation (13) is applied instead of Equation (7).
L   x = λ x
Although all the eigenvectors of Equation (13) are orthogonal and convey distinct information, we need to determine the optimal dimensionality k, because eigenvectors corresponding to small eigenvalues are appropriate for the embedding problem, as shown in Equation (9). The optimal dimensionality k for an expected number of clusters was proposed by [25]. Assuming each eigenvector has information to partition vertices into at least two clusters, he determined k as log 2   c , where c is the expected number of clusters. ⌈ ⌉ is a function to present the minimum integer larger than a given value. Similarly, we determine k with Equation (14), because the maximum number of corresponding feature class pairs could not exceed the numbers of feature classes in either of two datasets.
k = log 2   ( min ( n , m ) )
Accordingly, the embedded coordinates of the vertices in datasets A and B are obtained by k-rescaled eigenvectors corresponding to the k smallest eigenvalues other than 0, as in Equation (10).

3.3. Step 3: Agglomerative Hierarchical Clustering Analysis and Assessment of Clusters

Given clusters (at the initial condition, each feature class are considered as clusters), the agglomerative hierarchical clustering method searches the two closest clusters and merges them into one cluster. These searching and merging steps are repeated until all entities are merged into a single cluster. Thus, it presents a sequence of nested partitions of hierarchical cluster structure in the form of a dendrogram [26]. To apply the method, it is necessary to determine the criteria to measure the distance between two clusters. Among the several criteria, a single-link measure which considers the average distance of all entity pairs between clusters, as given in Equation (15), is chosen. The single-link measure defines the dissimilarity as the minimum distance among all the entity distances between two entity clusters and tends to find elongated clusters.
D ( C a ,   C b ) = min i C a   j C b d ( e i ,   e j )
where, D   ( C a , C b ) is a cluster distance of cluster C a ,   C b , d ( e i ,   e j ) is an entity distance between embedded coordinates of feature class f i ,   f j . A dendrogram is a tree diagram that shows a structure of clusters where the bottom row of nodes represents individual entities (in this study, feature classes of two datasets) and the remaining nodes represent the merging of their sub-nodes. Thus, by analyzing the feature types in the remaining nodes, semantically corresponding feature classes between two datasets could be obtained.
The clustering analysis in the above step presents a clustering sequence, but not obtained are clusters from which semantically corresponding feature class–set pairs are determined. Thus, statistical assessment of these clusters is necessary. Given an lth cluster, C ( l ) needs to be divided into two feature class–sets, C a   ( l )   and   C b   ( l ) , according to the datasets to which the feature classes belong. Then, a criterion could be applied to assess the pairs C a   ( l )   and   C b   ( l ) with the F-measure of Equation (16), which is often used in the field of semantic engineering and information retrieval [27]. F-measures of each and every cluster are calculated, and then the clusters whose F-measure is higher than a threshold are determined as semantically corresponding feature class–set pairs.
F m e a s u r e = 2 × P a   ( l ) × P b   ( l ) P a   ( l ) + P b   ( l )
where, P a ( l )   and   P b   ( l )   are   obtained   by   i C a ( l ) ,   j C b ( l )   N ( f i f j ) i C a ( l )   N ( f i )   and i C a ( l ) ,   j C b ( l )   N ( f i f j ) j C b ( l )   N ( f j ) , respectively.

4. Experiment and Results

4.1. Experimental Dataset

To evaluate the proposed method, two representative areas have been chosen, Seoul city and the Jeonnam Province, as shown in Figure 3, as the first one is the most urbanized area in the country, and the other is the Southwestern part of the country, which is well-known for fertile farmlands with vast plains. Land parcel datasets of the two areas were extracted from UIS and KLIS. Thereafter, each parcel’s land-use record and category were compared as shown in Table 1 and Table 2. In these tables, 1 to 21 refer to the record index, and A to T refer to the category index. The values of cells in the tables represent the number of land parcels having a certain index pair with record and category. There are several pairs whose record and category index names are the same, such as (9, A) of a dry paddy field, (11, B) of a paddy field, (20, H) of a parking lot, (17, N) of a river, which seem to be 1:1 corresponding pairs. However, for the land parcels of “N (River)”, there are 1416 parcels with “17 (river)” and 1345 parcels with “Road (16)”, which means that in terms of the land-use category in KLIS, the land parcels with “N (river)” are currently used for hydrology or transportation purpose with similar proportions. In terms of the land-use record in UIS for “17 (river)”, the land parcels were mainly registered as “N (River)” (1416 parcels); however, significant number of parcels (756 parcels) were registered as O (Ditch). This demonstrated that the corresponding land-use record and category pairs can be unexpectedly expanded according to concatenated one-to-many corresponding relations; consequently, a new method is required to identify complicated M:N corresponding feature class pairs between geospatial datasets. This method also needs to be based on the data itself, not on the geographic background knowledge of the area under consideration. In Table 2, the above relations are not valid and show completely different relations. This means that a data-driven learning method such as the one proposed in the present paper is required to obtain distinctive results for each area, for example, such as Seoul city and the Jeonnam Province.

4.2. Results and Discussion

Figure 4 shows the projection of the data provided in Table 1 onto the three-dimensional space using the proposed method. Although the projection was originally done onto five-dimensional space, the coordinates of up to three principle dimensions are used for the visual analysis. As described above, the land-use record and category that are close to each other in this space share more land parcels, such as (11, B), as can be seen at the bottom left in the figure (this cell corresponds to a paddy field).
Figure 5 shows the dendrogram of agglomerative hierarchical clustering on the embedded coordinates of the data provided in Table 1. In the dendrogram, nodes and links represent the process used to identify the clusters. For example, “8 (industrial building)” and “J (Warehouse site)” first constitute a cluster C1, to which “F (Factory site)” is clustered sequentially to transform the cluster into C26. According to this clustering process, the corresponding land-use record and category pairs between UIS and KLIS were analyzed, and subsequently, the corresponding feature class clusters could be derived and analyzed accordingly. This clustering process allows the identification of not only 1:1 correspondences (at the right side of the dendrogram), but also complex correspondences. In addition, clusters such as C18 and C19 are combined to define a supercluster for higher-level geographic concepts for a so-called trans-hydro network. From the clustering results provided in Figure 5, it can be seen that the following feature correspondences could be obtained:
  • C1 (8:J): Although a small portion of “8 (industrial building)” are located in “J (Warehouse site)”, these two feature classes have the closest embedded coordinates, as shown in Figure 4. This is because the proposed method performs data normalization in the form of relative frequency, as in Equation (11). Thereafter, “F (Factory site)” is clustered sequentially to process the cluster into C26
  • C8 ({2,3,4,6,7}:E): Seoul city is a typical megacity and the capital of the Republic of Korea with a population equal to approximately 10 million, therefore, there are so many residential buildings constructed on the land with land-use category registered as “E (Building site)”. It should be noted that, according to its high land price, detached houses are not popular in the city, except suburban areas. Thus, C8 represents this residence characteristic of the city.
  • C21 ({2,3,4,5,6,7}:E), C22 ({1,2,3,4,5,6,7}:E): During the clustering process, “5 (commercial building)” and “1 (detached house)” are sequentially merged into C8. As previously explained, “1 (detached house)” is subsequently merged into the cluster after “5 (commercial building)”.
  • C27 ({1,2,3,4,5,6,7,8}:{E,F,J,I,S): C27 is a combination of C24 and C26, which together constitute the main urban development area. Then, “I (Gas station)” is merged into this cluster, which seems to be an isolated land-use category in the urban development area. This is because the safety regulation and high land prices of Seoul city lead to the fact that gas stations are located at a significant distance from central residential and/or commercial sites.
  • C10 (17:M), C16 (17:{M, N, O, P}): “17 (river)” and “M (Bank)” are firstly clustered and then, “N (River)”, “P (Marsh)”, and “O (Ditch)” are clustered to form the water-system area. In Seoul city, central and local governments have constructed the banks along most of rivers and streams to prevent flood damage, which explains why “17 (river)” and “M (Bank)” are firstly clustered together, rather than remaining as three considered land-use categories.
  • C5 (16:K): This cluster shows that in the two datasets of the land-use record and category, feature classes named “Road” represent nearly the same real-world entity, which means that they have similar geographic concepts for roads.
  • C14 ({16, 21}:{K, L}): “21 (miscellaneous)” and “L (Railway site)” are then merged into C5.
  • C20 ({16, 17, 21}:{G, K, L, Q, M, N, P, O}): C20 is a combination of C18 and C19 which represents a so-called trans-hydro network. In an urban area such as Seoul city, many small streams have been covered to construct more roads as a part of the continuous urbanization process. In this process, the original land-use category of many land parcels have not been properly changed according to the substantive land-use condition. The inclusion of “G (School site)” seems to be erroneous. In Table 1, there is no proper land-use record class for educational facilities, and this means that the UIS does not manage these facilities. This is due to the fact that according to the Korean administrative legal system, the management of elementary school, middle school, and high school should be governed by local education offices, and not by the local government; therefore, the relevant data is not sufficiently reflected in the UIS, which is managed by local governments.
  • C15 ({12,13,14,15}:D): This cluster represents the forest area.
  • C36 (11:B), C29 (18:R), C35 (10:C), C34 (20:H): These clusters represent paddy fields, parks, orchards, and parking lot areas, respectively.
Table 3 shows the clusters in Figure 5 and their F-measure with Equation (16). The above cluster analysis does not consider a quantitative criterion. In Table 3, some feature class–set pairs such as C1, C8, and C21 have low F-measure values; meanwhile, other pairs such as C5, C9, C12 have high values. When the proposed method is applied to identify exact corresponding feature class–set pairs, a proper F-measure threshold needs to be determined. In the case of Table 3, 0.700 seems to be such a threshold, considering the above analysis. However, the determination of this threshold requires sufficient statistical experiments. The following feature class–set pairs have been identified for the Jeonnam Province:
  • C’1 (17:N): In the clustering process, the first pair of feature classes identified is “17 (river)” and “N (River)”. In the previous clustering analysis performed for Seoul city, it had a low weight (125/577 = 0.22) according to Equation (11), however, it has a high weight (6083/8225 = 0.74) for the Joennam Province. This is because, in urban areas such as Seoul city, many roads are constructed along rivers or banks; however, in rural areas such as the Jeonnam Province, river-side areas are reserved undeveloped; consequently, the above feature classes are clustered firstly.
  • C’21 (17:{M, N, O, P}): Although the order of clustering is different, the result of the analysis is similar to that of Seoul city. It can be confirmed that the 1:N feature class correspondence is the same for the city and the province; however, there is a difference in the correspondence priority of the sub-feature class depending on the regional characteristics.
  • C’16 (11:B), C’14 (9:A): Unlike for Seoul city, the cluster order of the feature class related to the agricultural land was higher than that of Seoul city owing to the characteristics of the Jeonnam Province, which has a very high proportion of agricultural land. In other words, it can be confirmed that the actual land-use is performed in the same form as the land plan related to agriculture.
  • C’17 (19:G): It shows that various physical education facilities other than educational buildings are installed and operated on the school site. It can be confirmed that physical education facilities are being promoted in connection with the development of school grounds being driven by the welfare projects organized by the local community.
  • C’25 ({1, 9, 11, 19, 21}:{A, B, G, T}): This cluster represents the suburban and agriculture area, where “G (School site)” and “T (Miscellaneous)” are included. This is explained by the data management problem similar to that of Seoul city, or by the fact that many sports or agricultural facilities are constructed in the closed school sites in old villages.
  • C’13 (16:{K, L}), C’15 (16:{K, L, Q}): Similar to the aforementioned analysis result for Seoul city, “16 (road)” and “K (Road)” were firstly clustered; however, unlike the result for Seoul city, “21 (miscellaneous)” was clustered in the suburban and agriculture area, not the transportation area.
  • C’18 ({12,13,14,15}:D): This cluster represents the forest area, similarly to that in the aforementioned case for Seoul city.
  • C’36 (8:F), C’33 (10:C), C’34 (18:R): These clusters represent industrial/factory, orchard, park areas, respectively.
The above clustering sequence describes local characteristics because even though they have the same land category, the results of substantive land development or land-use could be different across the regions. Figure 6 shows such a difference clustering result of the Jeonnam Province data in Table 2.
Table 4 shows F-measure values of the clusters in Figure 6, similar to Table 3. Comparing to Table 3, the clusters related to suburban and agriculture areas, such as C’14 and C’16, have high F-measure values. Meanwhile, those related to the development area, such as C’5, C’6, C’7, and C’8, have low values.
Considering the above analysis results provided for Seoul city and the Jeonnam Province, the characteristics of the proposed method could be identified as follows. First, it is possible to explore the various semantic correspondences of the feature classes through analyzing the clustering order in the embedded space. Adjacent feature classes in the space share more spatial objects, which means that they have a high probability to represent the same real-world entity or phenomena. According to the assumptions of this research and many previous related studies, these feature classes can be classified as semantically corresponding pairs. Therefore, applying agglomerative hierarchical clustering, hierarchical semantic relations of the feature classes such as “is_subset_of”, “is_superset_of”, or “is_same_to” could be obtained, similarly to [13].
Second, it is possible to infer regional characteristics of the feature classes. For example, the lands for which the land-use category is T (Miscellaneous) were generally used for the transportation area in Seoul city, and for the suburban and agricultural areas in the Jeonnam Province, as shown in Figure 4 and Figure 5, respectively. This is because there is high land development demand for transportation services in urban areas such as Seoul city. However, in the Jeonnam Province, where only a small part of its area is urbanized, there is no specific land development demand, and thus, the lands for which the land-use category is T (Miscellaneous) were developed in various forms. However, the water-system area and the forest area showed very similar clustering results. This can be explained by the natural environment protection due to the intervention of the central government, which results in similar land development tendencies for both urban and rural areas.

5. Conclusions

In this article, we proposed a new method to identify semantic correspondences between two datasets by means of finding hierarchical M:N corresponding feature class–set pairs. Applying the overlapping analysis to the object sets within the feature classes, the similarities of the feature classes are estimated and projected onto a lower-dimensional vector space after applying the graph embedding method. Thereafter, as the feature classes of high similarity are distributed close to each other in the projection space, distance-based clustering is conducted to identify the semantically corresponding feature class pairs. The above method was applied to the cadastral parcels’ land-use record in UIS and the corresponding land-use category in KLIS for two different test sites, Seoul city and the Jeonnam Province. As a result, it was possible to find various semantic correspondences of the feature classes between UIS and KLIS. In addition, hierarchical structures of the correspondences could be obtained. Moreover, upon analyzing these structures to obtain sequential clustering orders, regional characteristics of the feature classes were also inferred.
The proposed method is based only on the results of the overlay analysis between datasets. Therefore, aside from the location information, other prior information related to the construction of similarity measures was not required. This is an advantage in terms of generality as the proposed method can be applied to various geospatial datasets. Moreover, an advanced method could be developed by combining various similarity measures, such as lexical similarity, structural similarity, category similarity, shape similarity, and so on [18,28,29] into the co-occurrence matrix, in which rows and columns represent entities under analysis, such as feature classes in this study. To combine these various similarity measures between these entities, it is necessary to determinate their weight. We will consider these aspects to improve the proposed method in future studies.

Author Contributions

Conceptualization, Yong Huh; Methodology, Yong Huh; Software, Yong Huh; Validation, Yong Huh; Formal Analysis, Yong Huh; Investigation, Yong Huh; Resources, Yong Huh; Data Curation, Yong Huh; Writing-Original Draft Preparation, Yong Huh; Writing-Review & Editing, Yong Huh; Visualization, Yong Huh.

Funding

This research received no external funding.

Acknowledgments

The author express sincere gratitude to the Journal Editor and the anonymous reviewers who spent their valued time to provide constructive comments and assistance to improve the quality of this paper.

Conflicts of Interest

The author declare no conflicts of interest.

References

  1. Vaccari, L.; Shvaiko, P.; Marchese, M. A geo-service semantic integration in spatial data infrastructures. Int. J. Spat. Data Infra. 2009, 4, 24–51. [Google Scholar]
  2. Uitermark, H.T.; van Oosterom, J.M.; Mars, N.J.I.; Molenaar, M. Ontology-based integration of topographic data sets. Int. J. Appl. Earth Obs. 2005, 7, 97–106. [Google Scholar] [CrossRef]
  3. Khatami, R.; Alesheikh, A.A.; Hamrah, M. A mixed approach for automated spatial ontology alignment. J. Spat. Sci. 2010, 55, 237–255. [Google Scholar] [CrossRef]
  4. Ruiz-Casado, M.; Alfonseca, E.; Castells, P. From Wikipedia to Semantic Relationships: A Semi-automated Annotation Approach. In Proceedings of the Third Annual European Semantic Web Conference ESWC’06, Budva, Montenegro, 12 June 2006. [Google Scholar]
  5. Ruiz, J.J.; Ariza, F.J.; Ureña, M.A.; Blázquez, E.B. Digital map conflation: A review of the process and a proposal for classification. Int. J. Geogr. Inf. Sci. 2011, 25, 1439–1466. [Google Scholar] [CrossRef]
  6. Kokla, M. Guidelines on geographic ontology integration. In Proceedings of the ISPRS Technical Commission II Symposium, Vienna, Austria, 12–14 July 2006. [Google Scholar]
  7. Rodríguez, M.A.; Egenhofer, M.J.; Rugg, R.D. Assessing Semantic Similarities among Geospatial Feature Class Definitions. In Lecture Notes in Computer Science Vol. 1580; Včkovski, A., Brassel, K.E., Schek, H.J., Eds.; Springer: Berlin, Germany, 1999; pp. 189–202. [Google Scholar]
  8. Duckham, M.; Mason, K.; Stell, J.; Worboys, M. A formal approach to imperfection in geographic information. Comput. Environ. Urban Syst. 2001, 25, 89–103. [Google Scholar] [CrossRef] [Green Version]
  9. Duckham, M.; Worboys, M. An algebraic approach to automated geospatial information fusion. Int. J. Geogr. Inf. Sci. 2005, 19, 537–557. [Google Scholar] [CrossRef]
  10. Cruz, I.F.; Sunna, W. Structural Alignment Methods with Applications to Geospatial Ontologies. Trans. GIS 2008, 12, 683–711. [Google Scholar] [CrossRef]
  11. Buccella, A.; Cechich, A.; Gendarmi, D.; Lanubile, F.; Semeraro, G.; Colagrossi, A. Building a global normalized ontology for integrating geographic data sources. Comput. Geosci. 2011, 37, 893–916. [Google Scholar] [CrossRef]
  12. Bhattacharjee, S.; Ghosh, S.K. Measuring semantic similarity between land-cover classes for spatial analysis: An ontology hierarchy exploration analysis. Innov. Sys. Softw. Eng. 2016, 12, 193–200. [Google Scholar] [CrossRef]
  13. Kuo, C.L.; Hong, J.H. Interoperable cross-domain semantic and geospatial framework for automatic change detection. Comput. Geosci. 2016, 86, 109–119. [Google Scholar] [CrossRef]
  14. Kuai, X.; Li, L.; Luo, H.; Hang, S.; Zhang, Z.; Liu, Y. Geospatial Information Categories Mapping in a Cross-lingual Environment: A Case Study of “Surface Water” Categories in Chinese and American Topographic Maps. ISPRS Int. J. Geo. Inf. 2016, 5, 90. [Google Scholar] [CrossRef]
  15. Zhang, Y.; Yang, P.; Li, C.; Zhang, G.; Wang, C.; He, H.; Hu, X.; Guan, Z. A multi-feature based automatic approach to geospatial record linking. Int. J. Semt. Web Inf. Sys. 2018, 14, 73–91. [Google Scholar] [CrossRef]
  16. Huang, Y. Conceptual categorizing geographic features from text based on latent semantic analysis and ontologies. Ann. GIS 2016, 22, 113–127. [Google Scholar] [CrossRef]
  17. Sedoc, J.; Gallier, J.; Ungar, L.; Foster, D. Semantic word clusters using signed normalized graph cuts. arXiv 2016, arXiv:1601.05403. [Google Scholar]
  18. Sun, K.; Zhu, Y.; Song, J. Progress and Challenges on Entity Alignment of Geographic Knowledge Bases. ISPRS Int. J. Geo. Inf. 2019, 8, 77. [Google Scholar] [CrossRef]
  19. Hendrickson, B. Latent semantic analysis and Fiedler embedding. Linear. Algebra Appl. 2007, 421, 345–355. [Google Scholar] [CrossRef]
  20. Huh, Y.; Kim, J.; Lee, J.; Yu, K.; Shi, W. Identification of multi-scale corresponding object-set pairs between two polygon datasets with hierarchical co-clustering. ISPRS J. Photo. Rem. Sens. 2014, 88, 60–68. [Google Scholar] [CrossRef]
  21. Belkin, M.; Niyoki, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comp. 2003, 15, 1373–1396. [Google Scholar] [CrossRef]
  22. Sameh, A.; Wisniewski, J. A trace minimization algorithm for the generalized eigenvalue problem. SIAM J. Num. Anal. 1982, 19, 1243–1259. [Google Scholar] [CrossRef]
  23. Min, D.; Zhilin, L.; Xiaoyong, C. Extended Hausdorff distance for spatial objects in GIS. Int. J. Geogr. Inf. Sci. 2007, 21, 459–475. [Google Scholar] [CrossRef]
  24. Li, L.; Goodchild, M. An optimisation model for linear feature matching in geographical data conflation. Int. J. Image Data Fus. 2011, 2, 309–328. [Google Scholar] [CrossRef]
  25. Dhillon, I. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001. [Google Scholar]
  26. Cho, M.; Lee, J.; Lee, K. Feature correspondence and deformable object matching via agglomerative correspondence clustering. In Proceedings of the 12th IEEE International Conference on Computer Vision, Kyoto, Japan, 27 September–4 October 2009. [Google Scholar]
  27. Euzenat, J.; Shavaili, P. Ontology Matching, 2nd ed.; Springer: New York, NY, USA, 2013; p. 304. [Google Scholar]
  28. Van den Brink, L.; Janssen, P.; Quak, W.; Stoter, J. Towards a high level of semantic harmonisation in the geospatial domain. Comp. Environ. Urban Sys. 2017, 62, 233–242. [Google Scholar] [CrossRef]
  29. Yu, L.; Qiu, P.; Liu, X.; Lu, F.; Wan, B. A holistic approach to aligning geospatial data with multidimensional similarity measuring. Int. J. Dig. Earth 2018, 11, 845–862. [Google Scholar] [CrossRef]
Figure 1. Conceptual framework of a general process to integrate two geospatial databases (adapted from [5]).
Figure 1. Conceptual framework of a general process to integrate two geospatial databases (adapted from [5]).
Ijgi 08 00479 g001
Figure 2. Pseudocode of the proposed method in this study.
Figure 2. Pseudocode of the proposed method in this study.
Ijgi 08 00479 g002
Figure 3. Experimental areas of Seoul city and the Jeonnam Province considered in this study.
Figure 3. Experimental areas of Seoul city and the Jeonnam Province considered in this study.
Ijgi 08 00479 g003
Figure 4. Projection of the land-use record and category of Table 1 onto a 3-dimensional space with the proposed method for visual presentation. Feature class pairs that share more objects, such as (11, B) and (20, H) in Table 1, have closer coordinates.
Figure 4. Projection of the land-use record and category of Table 1 onto a 3-dimensional space with the proposed method for visual presentation. Feature class pairs that share more objects, such as (11, B) and (20, H) in Table 1, have closer coordinates.
Ijgi 08 00479 g004
Figure 5. Dendrogram constructed based on agglomerative hierarchical clustering of the coordinates of the land-use record and category of Seoul city as per Table 1 using the proposed method.
Figure 5. Dendrogram constructed based on agglomerative hierarchical clustering of the coordinates of the land-use record and category of Seoul city as per Table 1 using the proposed method.
Ijgi 08 00479 g005
Figure 6. Dendrogram constructed based on agglomerative hierarchical clustering of the coordinates of the land-use record and category in the Jeonnam Province data as per Table 2, using the proposed method.
Figure 6. Dendrogram constructed based on agglomerative hierarchical clustering of the coordinates of the land-use record and category in the Jeonnam Province data as per Table 2, using the proposed method.
Ijgi 08 00479 g006
Table 1. Comparison between land-use record in urban information system (UIS) and land-use category in the Korean land information system (KLIS) of about 800,000 land parcels in Seoul city.
Table 1. Comparison between land-use record in urban information system (UIS) and land-use category in the Korean land information system (KLIS) of about 800,000 land parcels in Seoul city.
ABCDEFGHIJKLMNOPQRST
1110121331329364,662625173295527224851260161157307
2851805711,36401101456002000023
310128011269,11801002130521180120236
421545038220,06591611062929228301091151
5309139018097,0891065358432313148732388799281
6714702511,85229841411201218260115144
71456615779,8641264202783121825230686
815811811738709191105521741444301189
9839256903414669010011381534186123920321167
1022422801284000001000510002
1162333307100000604119580075
12174016121000000000602600
1342843010,258645221000974296102817214156
142893801254395110000241081611501208
15266041535080005012140991722
16273218651262430,767110197135269,10214603461345297780301291451764
17219261010465210001201934814167565964097
181197902471834131101613010753813122008109
191208902911451000441221471115014
204610403638110387204523630507052
2161100361440501019163157123310166
1 (detached house), 2 (row house), 3 (multiplex house), 4 (apartment house), 5 (commercial building), 6 (business building), 7 (multipurpose building), 8 (industrial building), 9 (dry paddy field), 10 (orchard), 11 (paddy field), 12 (forestation field), 13 (natural forest field), 14 (grass field), 15 (bare soil field), 16 (road), 17 (river), 18 (park), 19 (gymnasium), 20 (parking lot), 21 (miscellaneous). A (Dry paddy field), B (Paddy field), C (Orchard), D (Forestry), E (Building site), F (Factory site), G (School site), H (Parking lot), I (Gas station site), J (Warehouse site), K (Road), L (Railway site), M (Bank), N (River), O (Ditch), P (Marsh), Q (Water supply site), R (Park). S (Gymnasium site), T (Miscellaneous).
Table 2. Comparison between land-use record in UIS and land-use category record in KLIS of about 4,000,000 land parcels in Jeonnam Province.
Table 2. Comparison between land-use record in UIS and land-use category record in KLIS of about 4,000,000 land parcels in Jeonnam Province.
ABCDEFGHIJKLMNOPQRST
129,88613,2701874603537,12925218969235011267355110612383684
2109923157781130040000100012
3753406700010022001010015
41921311671289340001700015100013
5125715671526933,2488814389368511427029102072273
629723737129784461243022402201010242
7100011481114827,6331591151322422048191019976
88756887257197246340214125400797221701970
9962,64350,20193446,48212,99536135476310384011744811562357131164726
1011,37443577009238233718301315225221106161
1127,8231,170,6163384468273623433048105373516563021613504347872
12320266027,706170000336004 020960
132442103430505,1694300220123467552710314347719159
14791944373079,6187220150008026214747407378
1516983551556429101911371022177998255
1614,35720,97647806610,02768179832490,89716051219167744597518391301
1731828267817524505110023432014746083934291109712792
1816914001091940010064043100240369
191142010833110212000232902415700
2012415114523119214502020032202152
218858895313713212301719181755221180201343
1 (detached house), 2 (row house), 3 (multiplex house), 4 (apartment house), 5 (commercial building), 6 (business building), 7 (multipurpose building), 8 (industrial building), 9 (dry paddy field), 10 (orchard), 11 (paddy field), 12 (forestation field), 13 (natural forest field), 14 (grass field), 15 (bare soil field), 16 (road), 17 (river), 18 (park), 19 (gymnasium), 20 (parking lot), 21 (miscellaneous). A (Dry paddy field), B (Paddy field), C (Orchard), D (Forestry), E (Building site), F (Factory site), G (School site), H (Parking lot), I (Gas station site), J (Warehouse site), K (Road), L (Railway site), M (Bank), N (River), O (Ditch), P (Marsh), Q (Water supply site), R (Park). S (Gymnasium site), T (Miscellaneous).
Table 3. Feature class–set pairs in Figure 5 and their F-measures.
Table 3. Feature class–set pairs in Figure 5 and their F-measures.
NoFeature Class-Set PairF-MeasureNoFeature Class-Set PairF-Measure
C1{8}:{J}0.003C20{16,17,21}:{G,K,L M,N,O,P}0.772
C2{3,6}:Null C21{2,3,4,5,6,7}:{E}0.586
C3{2,3,6}: Null C22{1,2,3,4,5,6,7}:{E}0.964
C4{13,15}: Null C23{16,17,21}:{G,K,L M,N,O,P,Q,T}0.773
C5{16}:{K}0.734C24{1,2,3,4,5,6,7}:{E,S}0.964
C6{4}:{E}0.056C25{9,19}: Null
C7{4,7}:{E}0.251C26{8}:{F,J}0.283
C8{2,3,4,6,7}:{E}0.433C27{1,2,3,4,5,6,7,8}:{E,F,J,S}0.966
C9{16,21}:{K}0.732C28{1,2,3,4,5,6,7,8}:{E,F,I,J,S}0.967
C10{17}:{M}0.163C29{18}:{R}0.569
C11{Nan}:{35,37} C30{1,2,3,4,5,6,7,8,16,17,21}:{E,F,G,I,J,K,L,M,N,O,P,Q,S,T}0.987
C12{13,15}:{D}0.703C31{1,2,3,4,5,6,7,8,9, 16,17,19,21}:{E,F,G,I,J,K,L,M,N,O,P,Q,S,T}0.979
C13{13,14,15}:{D}0.730C32{1,2,3,4,5,6,7,8,9,16,17,19,21}:{A,E,F,G,I,J,K,L,M,N,O,P,Q,S,T}0.987
C14{16,21}:{K,L}0.739C33{1,2,3,4,5,6,7,8,9,12,13,14,15,16,17,19,21}:{A,D,E,F,G,I,J,K,L,M,N,O,P,Q,S,T}0.992
C15{12,13,14,15}:{D}0.735 C34{20}:{H}0.493
C16{17}:{M,N,P}0.464C35{10}:{C}0.288
C17{16,21}:{G,K,L}0.740C36{11}:{B}0.424
C18{16,21}:{G,K,L,Q}0.741
C19{17}:{M,N,O,P}0.422C40All:All1.000
Table 4. Feature class–set pairs in Figure 6 and their F-measures.
Table 4. Feature class–set pairs in Figure 6 and their F-measures.
NoFeature Class-Set PairF-MeasureNoFeature Class-Set PairF-Measure
C’1{17}:{N}0.247 C’18{12,13,14,15}:{D}0.933
C’2{6}:{E}0.009 C’19{11,21}:{B}0.937
C’3{15}: {D}0.015 C’20{11,19,21}:{B,G}0.936
C’4Null:{M,O} C’21{17}:{M,N,O,P}0.702
C’5{4,6}:{E}0.013 C’22{2,3,4,6,7,20}:{E}0.100
C’6{2,4,6}:{E}0.016 C’23{11,19,21}:{B,G,T}0.934
C’7{2,3,4,6}:{E}0.018 C’24{1,11,19,21}:{B,G,T}0.768
C’8{2,3,4,6,7}:{E}0.099 C’25{1,9,11,19,21}:{A,B,G,T}0.864
C’9{12,14}:Null C’26{1,2,3,4,7,6,9,11,19,20,21}:{A,B,E,G,T}0.965
C’10{12,13,14}:Null
C’11{17}:{N,P}0.493 C’33{10}:{C}0.408
C’12{16}:{K}0.742 C’34{18}:{R}0.384
C’13{16}:{K,L}0.747 C’35{1,2,3,4,5,6,7,9,11,12,13,14,15,16,19,20,21}:{A,B,D,E,G,H,I,J,K,L,Q,ST}0.993
C’14{9}:{A}0.897 C’36{8}:{F}0.596
C’15{16}:{K,L,O}0.750
C’16{11}:{B}0.938 C’40All:All1.000
C’17{19}:{G}0.207

Share and Cite

MDPI and ACS Style

Huh, Y. Hierarchical Semantic Correspondence Analysis on Feature Classes between Two Geospatial Datasets Using a Graph Embedding Method. ISPRS Int. J. Geo-Inf. 2019, 8, 479. https://doi.org/10.3390/ijgi8110479

AMA Style

Huh Y. Hierarchical Semantic Correspondence Analysis on Feature Classes between Two Geospatial Datasets Using a Graph Embedding Method. ISPRS International Journal of Geo-Information. 2019; 8(11):479. https://doi.org/10.3390/ijgi8110479

Chicago/Turabian Style

Huh, Yong. 2019. "Hierarchical Semantic Correspondence Analysis on Feature Classes between Two Geospatial Datasets Using a Graph Embedding Method" ISPRS International Journal of Geo-Information 8, no. 11: 479. https://doi.org/10.3390/ijgi8110479

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop