Hierarchical Semantic Correspondence Analysis on Feature Classes between Two Geospatial Datasets Using a Graph Embedding Method

: A method to ﬁnd corresponding feature class pairs, including hierarchical M:N pairs between two geospatial datasets is proposed. Applying an overlapping analysis to the object sets within the feature classes, the similarities of the feature classes are estimated and projected onto a lower-dimensional vector space after applying the graph embedding method. In this space, conventional mathematical tools—agglomerative hierarchical clustering in this study—could be used to analyze semantic correspondences between the datasets and identify their hierarchical M:N corresponding pairs. The proposed method was applied to two cadastral parcel datasets; one for latest land-use records in an urban information system, and the other, for original land-use categories in the Korea land information system. To quantitatively assess identiﬁed feature pairs, F-measures for each pair are presented. The results showed that it was possible to ﬁnd various semantic correspondences of the feature classes and infer regional land development characteristics.


Introduction
Establishing data integration between different geospatial information systems is necessary in order to set up geospatial data infrastructures for collecting and disseminating the data from different systems [1]. As each dataset belonging to these systems represents similar real-world entities or phenomena according to their own abstraction models and surveying rules, syntactic, structural, semantic, and geometric heterogeneities occur between corresponding objects of different datasets [2]. Among the mentioned heterogeneities, the first two can be addressed by applying well-known knowledge representation techniques, such as the web ontology language (OWL) or resource description framework (RDF), while remaining semantic; geometric ones, however, are still complicated problems [3]. This is due to the fact that corresponding spatial objects of different datasets, which represent the same real-world entity, have their own conceptual meanings and geometric representations according to the application purposes of datasets. For example, a small narrow road connecting a main road and a parking lot in a large commercial center may be represented as a polyline object attributed as "road" in one dataset, whereas it may be represented as a polygon object attributed as "auxiliary facility" in another dataset.
In the field of the map conflation, various methods have been developed to address the aforementioned semantic and geometric heterogeneity problems [4,5]. Authors in [5] proposed a conceptual framework for a general process to address these problems, as shown in Figure 1. In this process, a pre-processing step is performed to transform two geospatial datasets (GeoDSs) to have a uniform format, scale, reference system, and so on. Then, a semantic filter step is applied Figure 1. Conceptual framework of a general process to integrate two geospatial databases (adapted from [5]).
When the geospatial datasets to be integrated originate from a similar domain, a simple comparison of feature class names would provide the desired results for the semantic filter step. However, in the case when they are from different domains, the names can be the same or similar, even though the feature classes represent substantially different real-world entities or phenomena. Moreover, the corresponding relations may vary from 1:1 to 1:N or M:N. In these cases, detailed data specifications of the datasets to be compared are necessary. However, most of the datasets do not provide such information [6].
To address this problem, various object-based analysis techniques have been proposed. These techniques use matching objects between two datasets to identify corresponding feature classes. They assume that, if spatial objects of a certain feature class in one geospatial dataset correspond to spatial objects in another feature class in the other dataset with a high probability, there is high semantic similarity between the two feature classes [7]. Uitermark et al. [2] extended this method by introducing taxonomical and partonomical relationships of feature classes within each dataset, so that relations of feature classes between datasets, as well as within each dataset, can be obtained. Similarly, authors in [8,9] have proposed an ontology integration method based on searching for relations between objects, which are able to infer taxonomic relations between the feature classes. Cruz and Sunna [10] applied a graph-matching method, where a graph is constructed for each geospatial dataset, and the taxonomic and partonomical relations of feature classes of one geospatial dataset are represented as nodes and edges, respectively. This graph model has been adopted in some studies that proposed their own similarity measurement methods. Khatami et al. [3] combined several similarities for feature class name pairs and derived the overall object correspondence between feature classes (consequently, between geospatial datasets), as well as the semantic structure among feature classes (within a geospatial dataset). Buccella et al. [11] propose a novel system that manually creates domain ontologies and automatically enriches domain ontologies with standard information using semantic, syntactic, and structure analyses.
Then, ontology integration is carried out with the information. Bhattacharjee and Ghosh [12] proposed the semantic hierarchy-based similarity measurement for semantic similarity between land cover feature classes, which considers a hop number from the top feature class node to a certain node in their graph. Kuo and Hong [13] proposed a conceptual framework for semantic integration of geospatial datasets, which allows identifying matching geospatial feature classes. In this framework, hierarchical semantic relations between the datasets such as "is_subset_of", "is_superset_of", or "is_same_to" were determined by analyzing intersection relations of objects belonging to feature classes. Kuai et al. [14] focused on natural language barriers for semantic matching between feature classes in different geospatial datasets. Recently, Zhang et al. [15] proposed a multi-feature-based similarity measurement based on geospatial relationships, feature catalog, and name tag, and then applied a supervised machine learning process to identify corresponding pairs.
Although the above studies showed good results, there is room for improvement by applying recent semantic analysis techniques [16][17][18] and developing new approaches to obtain hierarchical corresponding relations of feature classes between geospatial datasets, as well as within each dataset. These techniques begin from a co-occurrence matrix in which rows and columns represent individual entities used for analysis; in this study, feature classes are these entities. Considering the aforementioned object-based methods [7][8][9][10][11][12][13][14][15], these co-occurrence values could be measured by degrees of object sharing or intersection between feature classes from two geospatial datasets. This matrix representation easily shows overall degrees between feature classes-conventional mathematical tools-which are suitable for feature vector data but not matrix data and cannot be easily applied to the matrix for identifying corresponding feature class pairs. To address this problem, several dimensionality reduction techniques, such as latent semantic analysis or graph embedding, are employed to define a new vector space where individual entities are represented as feature vectors to which conventional mathematical tools can be easily applied [17,19,20].
In this study, the Laplacian graph embedding proposed in [20] was applied to address the above issue. This method was developed to identify the multi-level corresponding object-set pairs between two remote sensing data. It constructed a bipartite graph representing each object as a node, and node pairs' similarities between datasets as an edge with a weighted value. Thereafter, by applying Laplacian graph embedding, objects with higher similarity were distributed on closer coordinates in the embedding space. Finally, a clustering analysis on the projected nodes in the space was conducted, and the hierarchical corresponding object-set pairs could be found. In this study, nodes are used to represent feature classes rather than individual objects, so that the feature class pairs between datasets with a greater number of shared objects have close coordinates in the embedding space. Thus, this space can be understood as a semantic feature space, where two feature classes representing similar real-world entities or phenomena have geometrically close embedding coordinates. Therefore, with the knowledge of these coordinates and their distances, which are proportional to semantic dissimilarity, the previously mentioned complicated correspondence relationship between the feature classes of the two geospatial datasets can be found, and the semantic relationships of the feature classes can also be compared and inferred.
In this paper, the proposed method is applied to cadastral parcels' latest land-use records obtained from the urban information system (UIS) and their original land-use categories obtained from the Korea land information system (KLIS). These two systems have the same parcel dataset, however, attributes of their parcels could be different; a land-use category is assigned in the perspective of taxation, whereas the land-use record is assigned in the perspective of urban management. Consequently, even for the same parcels, their categories and records can be different, so that corresponding relations between these feature classes cannot be properly derived without having background information. These relations include not only M:N corresponding relations, but also their nested hierarchies. Moreover, these relations can be distinctive for specific areas due to unique geographical conditions typical for areas in question. The proposed method defines a semantic feature space where feature classes (in this study, the land-use category or land-use record) are represented as vectors. As conventional mathematical tools can be easily applied to vectors, and the distance between vectors in this study is proportional to semantic dissimilarity, the complicated relationships could be identified using proper mathematical tools such as clustering analysis.
The rest of the paper is organized as follows. In the subsequent section, an explanation of Laplacian graph embedding is given; in Section 3, the proposed method is explained; and in Section 4, it is applied for two areas, Seoul city to represent an urban area, and the Jeonnam Province to represent a rural area; then, their results are compared. Finally, in Section 5, the conclusions of this study are discussed.

One-dimensional Embedding
In this paper, we assume an undirected and connected graph. The graph G = (V, E) is represented by sets of vertices V = {v i |i = 1, · · · , N} and edges Given a weighted graph, edge weights are represented as a weight matrix W ∈ R N×N . One-dimensional graph embedding finds a configuration of embedded vertices in one-dimensional space, such that the vertices' proximities from the edge weights are preserved as the embedded vertices' distances. Assuming each entry of a column vector x = (x(1), · · · , x(n)) T as coordinates of the embedded vertices, this problem can be solved through minimization of the following objective function [21].
This function could be minimized when vertices i and j with large w i, j are embedded at close coordinates, whereas vertices with small w i,j are embedded into distant coordinates. In this study, this mathematical property is applied as follows: feature classes (e.g., land-use category and record) with a greater degree of object sharing have close coordinates in their embedding space and feature classes with a lesser degree of object sharing have distant coordinates. Equation (1) can be expressed in a matrix operation form with a Laplacian matrix L, and can be represented as Equation (2) [19][20][21].
where, the Laplacian matrix L is defined as Equation (3) with a vertex degree matrix D whose diagonal entries are obtained as d i,i = j i w i,j and the remaining entries are 0.
Now, the problem can be changed to find a vector x that minimizes x T Lx, and can be represented as Equation (4).
Since the value of x T Lx is vulnerable to the scaling of a vector x, a constraint x T B x = 1 is imposed to remove any such arbitrary scaling effect [17]. The diagonal matrix B provides weights on the vertices, so that the higher b i,i is, the more important is that vertex [21]. Equation (4) with the constraint can be solved by the Lagrange multiplier method as in Equations (5)- (7).
Thus, the solution of one-dimensional embedding, x, is obtained by solving the eigenproblem Lx = λBx. However, according to the rank of matrix L, there could be more than one eigenvector. In the field of graph spectral theory, the eigenvector corresponding to the smallest eigenvalue larger than 0 is the proven solution, which is called a Fiedler vector. Thus, the coordinates of vertices in one-dimensional embedding are obtained as components of the Fiedler vector as represented by Equation (7).

k-dimensional Embedding
Now, consider k-dimensional graph embedding. These embedded coordinates are represented as an n × k matrix X = [x 1 , · · · , x k ], so that the ith row of X, x(i ) = (x 1 (i), · · · , x k (i)), contains the k-dimensional coordinates of vertex v i . Now, an objective function is defined as Equation (8) with the constraint, X T B X = I.
Sameh and Wisniewski [22] proved that the solution to this trace minimization problem is obtained by the k-eigenvectors of LX = λBX that correspond to its smallest eigenvalues other than 0. Thus, the solution of Equation (8) is obtained by a matrix X = [x 1 , · · · , x k ], where x i represents an eigenvector corresponding to eigenvalue λ i under the condition 0 = λ 0 < λ 1 ≤ · · · ≤ λ k .
However, the constraint X T B X = I normalizes the scales of the coordinates in each dimension. Thus, it is necessary to rescale them according to each dimension's relative importance. Sameh and Wisniewski also proved that the minimum value of X T L X in Equation (8) equals the sum of the corresponding eigenvalues, as shown by Equation (9) [22].
Accordingly, we can assume the eigenvalue λ i as the amount of either the penalty or the cost caused by the ith dimensional space in the embedding problem. So, when k < l, it is appropriate to apply more weight to x k (i) − x l ( j) than x k (i) − x l ( j) in measuring the proximity for a clustering analysis. Based on these mathematical properties, we determined the embedded coordinates as Equation (10), because the increase in distance is proportional to that of the root-squared coordinate difference [20].

Proposed Method
The proposed method begins with an edge weight matrix whose cells represent the degree of object sharing between two feature classes (Step 1). From this matrix, k-dimensional feature vectors for each feature class are obtained by the Laplacian graph embedding technique (Step 2). Then, agglomerative hierarchical co-clustering is applied to find hierarchically corresponding feature class-set pairs (Step 3).   where, N( ) is a function that returns the number of spatial objects, i f and j f represents feature class i and j in two datasets A and B , respectively. This similarity measure effectively explains a partial and complete relationship of two feature classes, which is necessary to find complicated corresponding pairs such as N:1, 1:M, or N:M [23,24].
Since Laplacian graph embedding assumes a normal graph, an edge weight matrix where N n m = + , is obtained by Equation (12). With this matrix W , its Laplacian matrix L is obtained by Equation (3).

Step 2: Solving Eigenproblem and Obtaining K-dimensional Coordinates
The process of Laplacian graph embedding in Section 2 considered each vertices' weight using a diagonal matrix B . However, in this study, each feature class has the same importance so that B is set to an identity matrix and Equation (13) is applied instead of Equation (7).

Step 1: Constructing Edge Weight Matrix W
The proposed method begins with a weighted bipartite graph represented by a similarity matrix S ∈ R n×m , where n and m stand for the numbers of feature classes in two datasets A and B respectively, and cell values are calculated by Equation (11). (11) where, N( ) is a function that returns the number of spatial objects, f i and f j represents feature class i and j in two datasets A and B, respectively. This similarity measure effectively explains a partial and complete relationship of two feature classes, which is necessary to find complicated corresponding pairs such as N:1, 1:M, or N:M [23,24]. Since Laplacian graph embedding assumes a normal graph, an edge weight matrix W ∈ R N×N , where N = n + m, is obtained by Equation (12). With this matrix W, its Laplacian matrix L is obtained by Equation (3).

Step 2: Solving Eigenproblem and Obtaining K-dimensional Coordinates
The process of Laplacian graph embedding in Section 2 considered each vertices' weight using a diagonal matrix B. However, in this study, each feature class has the same importance so that B is set to an identity matrix and Equation (13) is applied instead of Equation (7).
Although all the eigenvectors of Equation (13) are orthogonal and convey distinct information, we need to determine the optimal dimensionality k, because eigenvectors corresponding to small eigenvalues are appropriate for the embedding problem, as shown in Equation (9). The optimal dimensionality k for an expected number of clusters was proposed by [25]. Assuming each eigenvector has information to partition vertices into at least two clusters, he determined k as log 2 c , where c is the expected number of clusters.
is a function to present the minimum integer larger than a given value. Similarly, we determine k with Equation (14), because the maximum number of corresponding feature class pairs could not exceed the numbers of feature classes in either of two datasets.
Accordingly, the embedded coordinates of the vertices in datasets A and B are obtained by k-rescaled eigenvectors corresponding to the k smallest eigenvalues other than 0, as in Equation (10).

Step 3: Agglomerative Hierarchical Clustering Analysis and Assessment of Clusters
Given clusters (at the initial condition, each feature class are considered as clusters), the agglomerative hierarchical clustering method searches the two closest clusters and merges them into one cluster. These searching and merging steps are repeated until all entities are merged into a single cluster. Thus, it presents a sequence of nested partitions of hierarchical cluster structure in the form of a dendrogram [26]. To apply the method, it is necessary to determine the criteria to measure the distance between two clusters. Among the several criteria, a single-link measure which considers the average distance of all entity pairs between clusters, as given in Equation (15), is chosen. The single-link measure defines the dissimilarity as the minimum distance among all the entity distances between two entity clusters and tends to find elongated clusters.
where, D (C a , C b ) is a cluster distance of cluster C a , C b , d e i , e j is an entity distance between embedded coordinates of feature class f i , f j . A dendrogram is a tree diagram that shows a structure of clusters where the bottom row of nodes represents individual entities (in this study, feature classes of two datasets) and the remaining nodes represent the merging of their sub-nodes. Thus, by analyzing the feature types in the remaining nodes, semantically corresponding feature classes between two datasets could be obtained. The clustering analysis in the above step presents a clustering sequence, but not obtained are clusters from which semantically corresponding feature class-set pairs are determined. Thus, statistical assessment of these clusters is necessary. Given an lth cluster, C (l) needs to be divided into two feature class-sets, C  (16), which is often used in the field of semantic engineering and information retrieval [27]. F-measures of each and every cluster are calculated, and then the clusters whose F-measure is higher than a threshold are determined as semantically corresponding feature class-set pairs.

Experimental Dataset
To evaluate the proposed method, two representative areas have been chosen, Seoul city and the Jeonnam Province, as shown in Figure 3, as the first one is the most urbanized area in the country, and the other is the Southwestern part of the country, which is well-known for fertile farmlands with vast plains. Land parcel datasets of the two areas were extracted from UIS and KLIS. Thereafter, each parcel's land-use record and category were compared as shown in Tables 1 and 2. In these tables, 1 to 21 refer to the record index, and A to T refer to the category index. The values of cells in the tables represent the number of land parcels having a certain index pair with record and category. There are several pairs whose record and category index names are the same, such as (9, A) of a dry paddy field, (11, B) of a paddy field, (20, H) of a parking lot, (17, N) of a river, which seem to be 1:1 corresponding pairs. However, for the land parcels of "N (River)", there are 1416 parcels with "17 (river)" and 1345 parcels with "Road (16)", which means that in terms of the land-use category in KLIS, the land parcels with "N (river)" are currently used for hydrology or transportation purpose with similar proportions. In terms of the land-use record in UIS for "17 (river)", the land parcels were mainly registered as "N (River)" (1416 parcels); however, significant number of parcels (756 parcels) were registered as O (Ditch). This demonstrated that the corresponding land-use record and category pairs can be unexpectedly expanded according to concatenated one-to-many corresponding relations; consequently, a new method is required to identify complicated M:N corresponding feature class pairs between geospatial datasets. This method also needs to be based on the data itself, not on the geographic background knowledge of the area under consideration. In Table 2, the above relations are not valid and show completely different relations. This means that a data-driven learning method such as the one proposed in the present paper is required to obtain distinctive results for each area, for example, such as Seoul city and the Jeonnam Province. To evaluate the proposed method, two representative areas have been chosen, Seoul city and the Jeonnam Province, as shown in Figure 3, as the first one is the most urbanized area in the country, and the other is the Southwestern part of the country, which is well-known for fertile farmlands with    0  28  172  14  156  14  289  38  0  1254  395  1  10  0  0  0  24  1  0  8  16  1  1  50  1  208  15  26  6  0  415  35  0  8  0  0  0  5  0  1  2  14  0  9  9  17  22  16  2732  1865  1  2624 30,767 110  197  13  5  2  69,102 1460  346  1345  2977  80  301  291  45  1764  17  219  261  0  104  65  2  1  0  0  0  120  19  348  1416  756  59  6  4  0  97  18  119  79  0  247  1834  1  3  1  1  0  161  30  10  75  38  1  31  2200  8  109  19  120  89  0  29  114  5  1  0  0  0  44  12  2  14  7  1  1  15  0  14  20  46  104  0  36  381  1  0  387  2  0  45  2  3  6  30  5  0  7  0  52  21  61  10  0  36  144  0  5  0  1  0  19  16  3  157  12  3  3 0  22  0  1  2  346  75  5  27  103  143  47  7  19  159  14  7919  4437  30  79,618 722  0  15  0  0  0  80  26  2  14  7  47  4  0  7  378  15  169  83  5  5155  64  2  9  1  0  19  11  37  1  0  2  21  77  9  98  255  16  14,357 20,976  47  8066 10,027  68  179  8  3  24  90,897 1605  121  916  774  459  751  8  39  1301  17  3182  8267  8  1752  450  5  11  0  0  2  343  20  1474  6083  9342  9110  97  1  2  792  18  169  140  0  109  194  0  0  1  0  0  6  4  0  4  3  10  0  240  3  69  19  114  201  0  83  311  0  212  0  0  0  23  29  0  2  4  1  5  7  0  0  20  124  151  1  45  231  1  9  214  5  0  20  2  0  0  3  2  2  0  2  152  21  885  889  5  313  713  2  123  0  1  7  19  18  17  5  5  221 18 0 20 1343 Figure 4 shows the projection of the data provided in Table 1 onto the three-dimensional space using the proposed method. Although the projection was originally done onto five-dimensional space, the coordinates of up to three principle dimensions are used for the visual analysis. As described above, the land-use record and category that are close to each other in this space share more land parcels, such as (11, B), as can be seen at the bottom left in the figure (this cell corresponds to a paddy field).  Figure 4 shows the projection of the data provided in Table 1 onto the three-dimensional space using the proposed method. Although the projection was originally done onto five-dimensional space, the coordinates of up to three principle dimensions are used for the visual analysis. As described above, the land-use record and category that are close to each other in this space share more land parcels, such as (11, B), as can be seen at the bottom left in the figure (this cell corresponds to a paddy field).  Table 1 onto a 3-dimensional space with the proposed method for visual presentation. Feature class pairs that share more objects, such as (11, B) and (20, H) in Table 1, have closer coordinates. Figure 5 shows the dendrogram of agglomerative hierarchical clustering on the embedded coordinates of the data provided in Table 1. In the dendrogram, nodes and links represent the process used to identify the clusters. For example, "8 (industrial building)" and "J (Warehouse site)" first constitute a cluster C1, to which "F (Factory site)" is clustered sequentially to transform the cluster into C26. According to this clustering process, the corresponding land-use record and category pairs between UIS and KLIS were analyzed, and subsequently, the corresponding feature class clusters could be derived and analyzed accordingly. This clustering process allows the identification of not only 1:1 correspondences (at the right side of the dendrogram), but also complex correspondences. In addition, clusters such as C18 and C19 are combined to define a supercluster for higher-level geographic concepts for a so-called trans-hydro network. From the clustering results provided in Figure 5, it can be seen that the following feature correspondences could be obtained:  Table 1 onto a 3-dimensional space with the proposed method for visual presentation. Feature class pairs that share more objects, such as (11, B) and (20, H) in Table 1, have closer coordinates. Figure 5 shows the dendrogram of agglomerative hierarchical clustering on the embedded coordinates of the data provided in Table 1. In the dendrogram, nodes and links represent the process used to identify the clusters. For example, "8 (industrial building)" and "J (Warehouse site)" first constitute a cluster C 1 , to which "F (Factory site)" is clustered sequentially to transform the cluster into C 26 . According to this clustering process, the corresponding land-use record and category pairs between UIS and KLIS were analyzed, and subsequently, the corresponding feature class clusters could be derived and analyzed accordingly. This clustering process allows the identification of not only 1:1 correspondences (at the right side of the dendrogram), but also complex correspondences. In addition, clusters such as C 18 and C 19 are combined to define a supercluster for higher-level geographic concepts for a so-called trans-hydro network. From the clustering results provided in Figure 5, it can be seen that the following feature correspondences could be obtained:  Table 1 using the proposed method.

Results and Discussion
• C1 (8:J): Although a small portion of "8 (industrial building)" are located in "J (Warehouse site)", these two feature classes have the closest embedded coordinates, as shown in Figure 4. This is because the proposed method performs data normalization in the form of relative frequency, as in Equation (11). Thereafter, "F (Factory site)" is clustered sequentially to process the cluster into C26 • C8 ({2,3,4,6,7}:E): Seoul city is a typical megacity and the capital of the Republic of Korea with a population equal to approximately 10 million, therefore, there are so many residential buildings constructed on the land with land-use category registered as "E (Building site)". It should be noted that, according to its high land price, detached houses are not popular in the city, except suburban areas. Thus, C8 represents this residence characteristic of the city. • C21 ({2,3,4,5,6,7}:E), C22 ({1,2,3,4,5,6,7}:E): During the clustering process, "5 (commercial building)" and "1 (detached house)" are sequentially merged into C8. As previously explained, "1 (detached house)" is subsequently merged into the cluster after "5 (commercial building)". • C27 ({1,2,3,4,5,6,7,8}:{E,F,J,I,S): C27 is a combination of C24 and C26, which together constitute the main urban development area. Then, "I (Gas station)" is merged into this cluster, which seems to be an isolated land-use category in the urban development area. This is because the safety regulation and high land prices of Seoul city lead to the fact that gas stations are located at a significant distance from central residential and/or commercial sites.   Table 1 using the proposed method.
• C 1 (8:J): Although a small portion of "8 (industrial building)" are located in "J (Warehouse site)", these two feature classes have the closest embedded coordinates, as shown in Figure 4. This is because the proposed method performs data normalization in the form of relative frequency, as in Equation (11). Thereafter, "F (Factory site)" is clustered sequentially to process the cluster into C 26 • C 8 ({2,3,4,6,7}:E): Seoul city is a typical megacity and the capital of the Republic of Korea with a population equal to approximately 10 million, therefore, there are so many residential buildings constructed on the land with land-use category registered as "E (Building site)". It should be noted that, according to its high land price, detached houses are not popular in the city, except suburban areas. Thus, C 8 represents this residence characteristic of the city.  26 , which together constitute the main urban development area. Then, "I (Gas station)" is merged into this cluster, which seems to be an isolated land-use category in the urban development area. This is because the safety regulation and high land prices of Seoul city lead to the fact that gas stations are located at a significant distance from central residential and/or commercial sites. : "17 (river)" and "M (Bank)" are firstly clustered and then, "N (River)", "P (Marsh)", and "O (Ditch)" are clustered to form the water-system area. In Seoul city, central and local governments have constructed the banks along most of rivers and streams to prevent flood damage, which explains why "17 (river)" and "M (Bank)" are firstly clustered together, rather than remaining as three considered land-use categories.
• C 5 (16:K): This cluster shows that in the two datasets of the land-use record and category, feature classes named "Road" represent nearly the same real-world entity, which means that they have similar geographic concepts for roads. In an urban area such as Seoul city, many small streams have been covered to construct more roads as a part of the continuous urbanization process. In this process, the original land-use category of many land parcels have not been properly changed according to the substantive land-use condition. The inclusion of "G (School site)" seems to be erroneous. In Table 1, there is no proper land-use record class for educational facilities, and this means that the UIS does not manage these facilities. This is due to the fact that according to the Korean administrative legal system, the management of elementary school, middle school, and high school should be governed by local education offices, and not by the local government; therefore, the relevant data is not sufficiently reflected in the UIS, which is managed by local governments.  Table 3 shows the clusters in Figure 5 and their F-measure with Equation (16). The above cluster analysis does not consider a quantitative criterion. In Table 3, some feature class-set pairs such as C 1 , C 8 , and C 21 have low F-measure values; meanwhile, other pairs such as C 5 , C 9 , C 12 have high values. When the proposed method is applied to identify exact corresponding feature class-set pairs, a proper F-measure threshold needs to be determined. In the case of Table 3, 0.700 seems to be such a threshold, considering the above analysis. However, the determination of this threshold requires sufficient statistical experiments. The following feature class-set pairs have been identified for the Jeonnam Province: • C'1 (17:N): In the clustering process, the first pair of feature classes identified is "17 (river)" and "N (River)". In the previous clustering analysis performed for Seoul city, it had a low weight (125/577 = 0.22) according to Equation (11), however, it has a high weight (6083/8225 = 0.74) for the Joennam Province. This is because, in urban areas such as Seoul city, many roads are constructed along rivers or banks; however, in rural areas such as the Jeonnam Province, river-side areas are reserved undeveloped; consequently, the above feature classes are clustered firstly. Although the order of clustering is different, the result of the analysis is similar to that of Seoul city. It can be confirmed that the 1:N feature class correspondence is the same for the city and the province; however, there is a difference in the correspondence priority of the sub-feature class depending on the regional characteristics.   Table 3. Feature class-set pairs in Figure 5 and their F-measures. The above clustering sequence describes local characteristics because even though they have the same land category, the results of substantive land development or land-use could be different across the regions. Figure 6 shows such a difference clustering result of the Jeonnam Province data in Table 2. Table 4 shows F-measure values of the clusters in Figure 6, similar to Table 3. Comparing to Table 3, the clusters related to suburban and agriculture areas, such as C' 14 and C' 16 , have high F-measure values. Meanwhile, those related to the development area, such as C' 5 , C' 6 , C' 7 , and C' 8 , have low values.
Considering the above analysis results provided for Seoul city and the Jeonnam Province, the characteristics of the proposed method could be identified as follows. First, it is possible to explore the various semantic correspondences of the feature classes through analyzing the clustering order in the embedded space. Adjacent feature classes in the space share more spatial objects, which means that they have a high probability to represent the same real-world entity or phenomena. According to the assumptions of this research and many previous related studies, these feature classes can be classified as semantically corresponding pairs. Therefore, applying agglomerative hierarchical clustering, hierarchical semantic relations of the feature classes such as "is_subset_of", "is_superset_of", or "is_same_to" could be obtained, similarly to [13].
Second, it is possible to infer regional characteristics of the feature classes. For example, the lands for which the land-use category is T (Miscellaneous) were generally used for the transportation area in Seoul city, and for the suburban and agricultural areas in the Jeonnam Province, as shown in Figures 4 and 5, respectively. This is because there is high land development demand for transportation services in urban areas such as Seoul city. However, in the Jeonnam Province, where only a small part of its area is urbanized, there is no specific land development demand, and thus, the lands for which the land-use category is T (Miscellaneous) were developed in various forms. However, the water-system area and the forest area showed very similar clustering results. This can be explained by the natural environment protection due to the intervention of the central government, which results in similar land development tendencies for both urban and rural areas.  Table 2, using the proposed method. Table 4. Feature class-set pairs in Figure 6 and their F-measures.   Table 2, using the proposed method. Table 4. Feature class-set pairs in Figure 6 and their F-measures.

Conclusions
In this article, we proposed a new method to identify semantic correspondences between two datasets by means of finding hierarchical M:N corresponding feature class-set pairs. Applying the overlapping analysis to the object sets within the feature classes, the similarities of the feature classes are estimated and projected onto a lower-dimensional vector space after applying the graph embedding method. Thereafter, as the feature classes of high similarity are distributed close to each other in the projection space, distance-based clustering is conducted to identify the semantically corresponding feature class pairs. The above method was applied to the cadastral parcels' land-use record in UIS and the corresponding land-use category in KLIS for two different test sites, Seoul city and the Jeonnam Province. As a result, it was possible to find various semantic correspondences of the feature classes between UIS and KLIS. In addition, hierarchical structures of the correspondences could be obtained. Moreover, upon analyzing these structures to obtain sequential clustering orders, regional characteristics of the feature classes were also inferred.
The proposed method is based only on the results of the overlay analysis between datasets. Therefore, aside from the location information, other prior information related to the construction of similarity measures was not required. This is an advantage in terms of generality as the proposed method can be applied to various geospatial datasets. Moreover, an advanced method could be developed by combining various similarity measures, such as lexical similarity, structural similarity, category similarity, shape similarity, and so on [18,28,29] into the co-occurrence matrix, in which rows and columns represent entities under analysis, such as feature classes in this study. To combine these various similarity measures between these entities, it is necessary to determinate their weight. We will consider these aspects to improve the proposed method in future studies.