Centrality as a Method for the Evaluation of Semantic Resources for Disaster Risk Reduction

Clear and straightforward communication is a key aspect of all human activities related to crisis management. Since crisis management activities involve professionals from various disciplines using different terminology, clear and straightforward communication is difficult to achieve. Semantics as a broad science can help to overcome communication difficulties. This research focuses on the evaluation of available semantic resources, including ontologies, thesauri, and controlled vocabularies, for disaster risk reduction as part of crisis management. The main idea of the study is that the most appropriate source of broadly understandable terminology is a semantic resource that is accepted by, or at least connected to, the majority of other resources. Not only the number of interconnected resources is important, but also the concrete position of the resource in the complex network of Linked Data resources. Although such an evaluation usually relies on user experience, objective methods based on the semantic centrality of a resource can be applied. This position can be described by centrality methods used mainly in graph theory. This article describes the calculation of four types of centrality (Outdegree, Indegree, Closeness, and Betweenness) applied to 160 geographic concepts published as Linked Data and related to disaster risk reduction. Centralities were calculated for graph structures containing particular semantic resources as nodes and identity links as edges. The results show that (with some discussed exceptions) the datasets with high values of centrality not only serve as important information resources, but also include more of the 160 preselected geographic concepts. Therefore, they could be considered the most suitable resources of terminology to make communication in the domain easier. The main research goal is to automate the evaluation of semantic resources and to apply a well-known theoretical method (centrality) to the semantic issues of Linked Data.
It is necessary to mention the limits of this study: the number of tested concepts and the fact that centrality represents just one view on the evaluation of semantic resources.


Introduction
Disaster risk reduction activities consist of collecting, processing, and visualizing large spatial data sets [1][2][3][4], which can be created as a combination of existing data with links to other data (the Linked Data approach [5][6][7]). The Linked Data approach is one of the most efficient ways to deal with spatial data in terms of data volume, speed of processing, or intelligibility of data presentation and visualization. Linked Data, semantics (which is an integral part of Linked Data), and relevant tools (thesauri, ontologies, knowledge bases, controlled vocabularies, etc.) can contribute to one of the main tasks of disaster risk reduction as well as early warning activities. This task is connected with the necessity of fast communication, intelligibility, and common understanding of essential concepts, including their machine processing, or the development of advanced tools such as decision support systems [8,9].
This study focuses on geographic and geography-related concepts [10] used in the disaster risk reduction domain. Geography and related disciplines dealing with spatial information (including geoinformatics, geomatics, and cartography) play a crucial role in crisis management and disaster risk reduction [1][2][3][4][8,9], because knowledge related to localization or position is crucial for all crisis management and risk reduction activities, and geography is essential in the Linked Data space [11]. Moreover, Reference [7] mentions that "geography is another factor that can often connect information from varied topical domains". Geographical data are also a very important part of the Linking Open Data cloud diagram, which contains specific resources of spatial and geographic data (such as GeoNames.org or LinkedGeoData.org), but other very important Linked Data resources (such as DBpedia, AGROVOC, or Wikidata) also include spatial components (for example, data with coordinates or geographical concepts).
The objective of this research is to analyse identity links (details in [7]) in Linked Data resources containing terms from the disaster risk reduction domain and to identify suitable semantic resources. The process of finding a suitable semantic resource is important not only from the communication point of view, but also from the metadata description point of view. Identity links represent the highest level of Linked Data according to the 5-star ranking system [5]. These links enable the interconnection of independent data resources and construct a network of identical objects. This approach is very important from the point of view of data sharing, understanding, common communication among subjects participating in disaster risk reduction activities, automated data processing, or the derivation of new information or consequences in crisis management (detailed information on the importance of links between Linked Data resources is published in [6,12,13]). As the quantitative criteria for the evaluation of identical links, various types of centrality [14] (details in Sections 2 and 3) were chosen. The particular types of centrality evaluate resources based on their position in the Linked Data space. This is the main benefit of this research, because the selection of suitable semantic resources is usually driven by the subjective opinion of users, national priorities, or the number of terms published in a resource. The interconnection of resources to semantic information in other Linked Data databases can provide a complex view of the Linked Data structure and help to choose an appropriate resource for a concrete type of information or data.
The article is structured as follows. The introduction to Linked Data and semantics, including their benefits for disaster risk reduction, is given in the first part. This section also contains the constraints of the described research and the detailed structure of the article. The Materials and Methods section describes related works, details of selected metrics, and ways of collecting and processing sample data. The Results section focuses on the implementation of particular metrics on a selected sample of geographical concepts used in the disaster risk reduction domain. The results are commented on in the Discussion section. This part contains recommendations for appropriate thesauri or other semantic resources, which is the main goal of this paper. The last part summarizes the conclusions and introduces opportunities for further studies and research.

Materials and Methods
The research was realized in the following steps (workflow in Figure 1):

1. Selection of concepts and objects from keywords of publications focused on disaster risk reduction.
2. Downloading identical representations of geographic objects from various Linked Data resources.
3. Development of Data Networks representing particular concepts (see an example in Figure 2).
4. Application of centrality metrics for resources evaluation.
5. Summarizing information from particular Data Networks.
6. Recommendation of thesauri or other semantic resources based on the results of the quantitative evaluation.

Steps 1, 2, and 3 are described in this section. The implementation of metrics on sample data (statistical evaluation of particular Data Networks) and the summarizing of the acquired data (Steps 4 and 5) are the crucial part of this article, and they are published in the Results section. Recommendations (Step 6) are mentioned in the Discussion.

Ad 1. The data for the research were selected from keywords of relevant articles focused on disaster risk reduction. The publications were chosen by a method based on Snowball sampling (details and mathematical background of this method in [15]). This method is based on depth-first search of tree or graph structures. In the case of this study, the structure is composed of important publications and their references (bibliography). As the first-level input, the publication "Three-dimensional maps for disaster management" [16] (recommended as a reference paper for the journal special issue) was selected. Three iterations of searching were used. The second level consists of publications [17][18][19][20][21][22][23][24][25][26] (the references of [16] were more numerous, but only publications with keywords were taken into consideration).
Finally, a set of 160 items related to disaster risk reduction and its interconnection with geoinformatics, geomatics, cartography, and similar disciplines was created. These items are divided into concepts (from very general items such as collaboration or usability to specific issues such as Web Map Service or participatory GIS) and concrete objects and terms (such as Germany, M. F. Goodchild, Rhinopithecus bieti, Twitter, or Oder). Originally, the set of concepts and objects selected from the keywords was larger (about 350 terms), but the items on the list which were not represented in DBpedia (see below) or did not contain any identical links were removed.
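The keyword collection described above can be illustrated by a minimal sketch of a depth-limited, depth-first walk over a citation structure. It is not the authors' procedure: the publication identifiers, the citation table `REFERENCES`, the keyword table `KEYWORDS`, and the function `collect_keywords` are hypothetical, and the sketch assumes that the seed publication counts as level 1 of the three levels.

```python
# Hypothetical citation data: publication -> publications it cites.
REFERENCES = {
    "P16": ["P17", "P18"],
    "P17": ["P30", "P31"],
    "P18": ["P31"],
    "P30": ["P40"],
    "P31": [],
    "P40": [],
}

# Hypothetical keyword lists; P18 has no keywords and contributes nothing.
KEYWORDS = {
    "P16": ["3D maps", "disaster management"],
    "P17": ["flood", "GIS"],
    "P18": [],
    "P30": ["GIS", "Oder"],
    "P31": ["usability"],
    "P40": ["Twitter"],
}


def collect_keywords(seed, references, keywords, max_level=3):
    """Depth-first walk from the seed, limited to max_level levels."""
    best_level = {}  # publication -> shallowest level at which it was reached

    def visit(publication, level):
        if level > max_level:
            return
        # Skip a publication already reached at the same or a shallower level.
        if publication in best_level and best_level[publication] <= level:
            return
        best_level[publication] = level
        for cited in references.get(publication, []):
            visit(cited, level + 1)

    visit(seed, 1)
    terms = set()
    for publication in best_level:
        terms.update(keywords.get(publication, []))
    return terms


TERMS = collect_keywords("P16", REFERENCES, KEYWORDS)
```

In this toy structure the publication P40 sits on the fourth level, so its keyword is not collected unless `max_level` is raised.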
Ad 2. The search for identical links and semantic resources containing equivalent (or very similar) representations of the same object or concept was realized by a script developed by the authors. The script is written in the Bash scripting language. It uses the XSLT language for data transformations and open software components: Saxon (XSLT processor), wget (file retrieving and downloading), grep (text processing), xmlstarlet (transformation between CSV, comma-separated values, and XML, extensible markup language, formats), and Graphviz (export of graphic schemas generated in the DOT graph description language). The input is a table containing the name of each data item and the identifier of the representation of the concept or object in the DBpedia knowledge base. DBpedia was used as the starting point of all searching processes because DBpedia is the crucial central point of the Linked Data space (see the Linking Open Data cloud diagram; http://lod-cloud.net/). The script produces an XML file for each item. This file contains all identical links between particular representations, including acronyms of the subject and object of the relation (in the terminology of RDF, resource description framework, triples), the type of the relation, and a possible error influencing the object of the relation. Moreover, it produces a graphic schema for all objects and concepts (Figure 2). The searching of the Linked Data network is realized by the "Follow Your Nose" approach (mentioned for example in [7] or [27]), which is based on sequentially scanning standardized identical links.
The script collected 1171 identical links, which were divided into 3 groups (Figure 3):
1. Links leading to correct nodes (Linked Data resources);
2. Links directed to data resources influenced by a semantic error (e.g., an HTML view of data instead of real RDF data);
3. Links targeting data resources containing a technical error (usually a non-working URI, uniform resource identifier).
Further processing concerns only correct links and resources as well as links and resources affected by a semantic error (777 links in total). Although resources in the latter category did not lead to any other interconnected resources, they are taken into consideration, because these resources can provide interesting new information, which is the reason for using semantic resources.
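The "Follow Your Nose" traversal and the three link groups can be sketched as follows. This is only an illustration under stated assumptions, not the authors' Bash/XSLT script: the table `FAKE_WEB` replaces real HTTP retrieval, all identifiers in it are invented, and the names `fetch` and `follow_your_nose` are hypothetical.

```python
from collections import deque

# Hypothetical stand-in for retrieving a resource over HTTP:
# identifier -> (status of the resource, identity links found in it).
FAKE_WEB = {
    "dbpedia:Oder": ("ok", ["wikidata:X1", "geonames:X2", "yago:X3"]),
    "wikidata:X1": ("ok", ["dbpedia:Oder", "dnb:X4"]),
    "geonames:X2": ("semantic_error", []),   # e.g., HTML instead of RDF
    "yago:X3": ("technical_error", []),      # e.g., non-working URI
    "dnb:X4": ("ok", []),
}


def fetch(uri):
    """Return (status, identity link targets); unknown URIs count as broken."""
    return FAKE_WEB.get(uri, ("technical_error", []))


def follow_your_nose(start, fetch_resource):
    """Scan identity links resource by resource, starting from one URI."""
    links = []           # (subject, object, status of the object)
    seen = {start}
    queue = deque([start])
    while queue:
        subject = queue.popleft()
        _, targets = fetch_resource(subject)
        for obj in targets:
            status, _ = fetch_resource(obj)
            links.append((subject, obj, status))
            # Only correct resources can be scanned for further links.
            if status == "ok" and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return links


LINKS = follow_your_nose("dbpedia:Oder", fetch)
# Keep correct links and links affected by a semantic error.
KEPT = [link for link in LINKS if link[2] != "technical_error"]
```

Resources with a semantic error stay in `KEPT` as end points of the network, while links with a technical error are dropped before further processing.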
Ad 3. The Data Network [13] (alternatively SameAs Network, e.g., in [28]) is developed by a script created in R with the igraph library. The script transforms the input CSV file containing particular identical links (coming from the XML file generated in the previous step) into a directed graph. Then, the script processes the Data Network and computes the quantitative metrics based on centrality described in the Results section.
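The transformation of a CSV list of identical links into a directed graph can be sketched in a few lines. The authors use R with igraph; the Python sketch below only mirrors the idea, and the column names `subject` and `object`, the sample rows, and the function `read_data_network` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical CSV content: one identity link per row.
CSV_TEXT = """subject,object
DBpedia,Wikidata
DBpedia,GeoNames
Wikidata,DBpedia
Wikidata,DNB
"""


def read_data_network(csv_text):
    """Build a directed graph as a mapping: node -> sorted list of successors."""
    successors = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = row["subject"].strip()
        obj = row["object"].strip()
        successors.setdefault(subject, set()).add(obj)
        # Register the target as a node even if it has no outgoing links.
        successors.setdefault(obj, set())
    return {node: sorted(targets) for node, targets in successors.items()}


NETWORK = read_data_network(CSV_TEXT)
```

Storing successors in a set means that a link repeated in the CSV file yields a single edge, which matches an unweighted graph.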
The authors realize that centrality is just one method supporting the selection of relevant semantic resources for disaster risk reduction activities. The research will continue by comparing the explicit semantics contained in particular resources (together with domain experts; principles are mentioned in [29,30]) or by testing metrics for the whole network or its edges. The achieved results could be improved by processing a larger number of concepts and objects.

Results
Ad 4. Centrality could in general be described as the importance of the position of a node in a graph [13,31,32]. Therefore, this approach could be used to find the most relevant semantic resources for selected concepts and objects. Examples illustrating the application of centrality in the domain of data semantics are available in [13] (which mentions the degree, closeness, and betweenness centrality) and in [13,33,34,35,36] (which deal with the indegree, closeness, and betweenness centrality). References [14,37] mention the history of graph centrality research.
Four types of centrality (outdegree, indegree, closeness, and betweenness) are computed for each Data Network representing a particular tested concept selected from the disaster risk reduction domain. The following mathematical formulas (adopted from [37]) illustrate the particular types of centrality of a vertex v, which is a part of the directed graph G = (V, E), where V is a set of nodes (vertices) v and E is a set of edges e (for the Linked Data purposes, the weight assigned to each edge is 1; the graph is not weighted).
Outdegree centrality (a part of degree centrality, which is described in [37][38][39][40][41][42][43]) of the vertex v is measured as the number of edges leading from the node v. In the network of semantic resources built on the basis of identity links, the value of the outdegree centrality gives the number of other semantic resources that are linked from the resource represented by the vertex v.

$$\mathrm{Outdegree}(v) = \deg_{\mathrm{out}}(v)$$

Indegree centrality is similar to the previous type of centrality. It is computed as the number of edges leading to the node v from other nodes of the graph G. In the described case, this type of centrality shows how many semantic data resources refer to the resource represented by the vertex v.

$$\mathrm{Indegree}(v) = \deg_{\mathrm{in}}(v)$$
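A minimal sketch of counting both degree centralities from a list of directed edges follows. The edge list `EDGES` is invented for illustration and the function `degree_centralities` is a hypothetical helper, not part of the authors' R script; it returns raw counts without normalization.

```python
# Hypothetical identity links: (linking resource, linked resource).
EDGES = [
    ("DBpedia", "Wikidata"),
    ("DBpedia", "GeoNames"),
    ("Wikidata", "DBpedia"),
    ("Wikidata", "DNB"),
    ("Yago", "DBpedia"),
]


def degree_centralities(edges):
    """Return (outdegree, indegree) dictionaries with raw edge counts."""
    outdegree, indegree = {}, {}
    for source, target in set(edges):  # a repeated link counts once
        for node in (source, target):
            outdegree.setdefault(node, 0)
            indegree.setdefault(node, 0)
        outdegree[source] += 1  # edge leading from the source
        indegree[target] += 1   # edge leading to the target
    return outdegree, indegree


OUTDEGREE, INDEGREE = degree_centralities(EDGES)
```

In this toy network DBpedia links to two resources and is referenced by two resources, while Yago only links out and GeoNames is only referenced.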
Closeness centrality [37][38][39][40][41][42][43][44] is defined as the reciprocal of the sum of the shortest path lengths between a particular vertex v and the other nodes in the graph G (after normalization, the inverse of the average shortest path length). High values of closeness centrality in the case described in this text mean that the concrete semantic resource is close to other resources. This makes it easy to move through the network of resources and to acquire new information.

$$\mathrm{Closeness}(v) = \frac{1}{\sum_{y \in V,\, y \neq v} d(y,v)}$$

where d(y,v) is the length of the shortest path between nodes y and v in the graph G.
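The closeness of a single vertex can be sketched with a breadth-first search. The sketch assumes that every other node can reach v (otherwise the sum of distances is undefined and an error is raised), which may differ from how a library such as igraph treats unreachable nodes. The edge list and the function `closeness` are hypothetical; the second returned value applies the multiplication by n − 1 as one possible normalization.

```python
from collections import deque

# Hypothetical identity links forming a small, strongly connected network.
EDGES = [
    ("DBpedia", "Wikidata"),
    ("Wikidata", "DBpedia"),
    ("Wikidata", "DNB"),
    ("DNB", "DBpedia"),
]


def closeness(edges, v):
    """Return (raw, normalized) closeness of v based on distances d(y, v)."""
    nodes = set()
    predecessors = {}
    for source, target in edges:
        nodes.update((source, target))
        predecessors.setdefault(target, set()).add(source)
    # Breadth-first search against edge direction gives d(y, v) for all y.
    distance = {v: 0}
    queue = deque([v])
    while queue:
        current = queue.popleft()
        for previous in predecessors.get(current, ()):
            if previous not in distance:
                distance[previous] = distance[current] + 1
                queue.append(previous)
    if len(nodes) < 2 or len(distance) < len(nodes):
        raise ValueError("closeness is undefined: some node cannot reach v")
    raw = 1 / sum(distance.values())
    return raw, raw * (len(nodes) - 1)


CLOSENESS_DBPEDIA = closeness(EDGES, "DBpedia")
CLOSENESS_DNB = closeness(EDGES, "DNB")
```

Here both other resources reach DBpedia in one step, so its normalized closeness is 1.0, whereas DNB is two steps away from DBpedia and scores lower.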
Betweenness centrality [13,31,37,39,40,41,43,44] is defined in terms of how "in-between" a vertex is among the other vertices in the graph [14]. High values of the betweenness centrality in the network of semantic resources mean that the node could be a "bridge" among several independent (not directly interconnected) parts of the network.

$$\mathrm{Betweenness}(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the total number of shortest paths in the graph G from node s to node t and $\sigma_{st}(v)$ is the number of those shortest paths passing through the vertex v.
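Betweenness can be computed for all vertices at once with the accumulation scheme commonly attributed to Brandes; the sketch below is one such implementation for directed, unweighted graphs and is not taken from the article. The edge list `EDGES` and the function name `betweenness` are hypothetical, and the values are raw (not normalized).

```python
from collections import deque

# Hypothetical network: B is the only bridge from A and D to C.
EDGES = [("A", "B"), ("B", "C"), ("D", "B")]


def betweenness(edges):
    """Raw betweenness of every node in a directed, unweighted graph."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
        successors.setdefault(target, set())
    score = {node: 0.0 for node in successors}
    for s in successors:
        # Breadth-first search from s counts shortest paths (sigma).
        sigma = {node: 0 for node in successors}
        sigma[s] = 1
        distance = {s: 0}
        parents = {node: [] for node in successors}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in successors[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    sigma[w] += sigma[v]
                    parents[w].append(v)
        # Walk back from the farthest nodes and accumulate dependencies.
        delta = {node: 0.0 for node in successors}
        for w in reversed(order):
            for v in parents[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


SCORES = betweenness(EDGES)
DIAMOND = betweenness([("S", "X"), ("S", "Y"), ("X", "T"), ("Y", "T")])
```

In the first network both shortest paths ending in C pass through B, so only B has a non-zero score; in the diamond the two equally short paths from S to T split the score between X and Y.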
An optimal semantic resource from the view of centrality has the following properties:
• It is connected to many other resources.
• It is referenced from many other resources.
• It is close to other resources.
• It interconnects independent subgraphs of the network.
Table 1 summarizes the properties of the particular types of centrality. It is not possible to rank the described centralities against each other, because they are not variants of one method. Instead, they are complementary and express different kinds of position of a node in a graph.

Indegree: It shows the normalized number of nodes of the graph connected to the vertex for which the centrality is computed by a directed edge leading to this vertex. In the case of this article, a high value of indegree centrality means that the resource is directly referenced by many other resources.
Outdegree: It expresses the normalized number of nodes of the graph connected to the vertex by a directed edge leading from the node for which the centrality is computed. In the case of this article, a high value of outdegree centrality means that the resource contains many links to other resources.
Closeness: This type of centrality shows how close the node is to the other vertices in the graph. In the case of Linked Data, it does not play a very important role, because the Data Networks are not very large (tens of nodes).
Betweenness: It identifies weak positions of the graph, that is, nodes (resources) representing the bridges among independent parts of the Data Network.
The centrality values are computed, and the Data Network is developed, in R with the use of the igraph library. The normalization is performed by multiplying the raw values by n − 1, where n is the number of vertices in the graph.
Ad 5. Centrality values for particular Data Networks representing the occurrence of concepts and objects are summarized by computing the average of each centrality value. This step is realized by XSLT (Extensible Stylesheet Language Transformations) templates, which are able to find the relevant values for each semantic resource as well as to compute the averages.
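The summarizing step can be sketched as a per-resource average over Data Networks. The authors implement it with XSLT templates; the sketch below is an illustration only and assumes that each resource is averaged over the networks in which it occurs, which the text does not state explicitly. The sample values in `NETWORKS` and the function `average_by_resource` are hypothetical.

```python
# Hypothetical centrality values: concept -> {resource: centrality value}.
NETWORKS = {
    "Oder": {"DBpedia": 0.5, "Wikidata": 1.0},
    "Twitter": {"DBpedia": 1.0, "Wikidata": 0.5, "DNB": 0.25},
}


def average_by_resource(networks):
    """Average one centrality type per resource over its Data Networks."""
    sums, counts = {}, {}
    for per_resource in networks.values():
        for resource, value in per_resource.items():
            sums[resource] = sums.get(resource, 0.0) + value
            counts[resource] = counts.get(resource, 0) + 1
    # Assumption: divide by the number of networks containing the resource.
    return {resource: sums[resource] / counts[resource] for resource in sums}


AVERAGES = average_by_resource(NETWORKS)
```

Dividing by the total number of tested concepts instead would penalize resources that occur in few Data Networks, so the choice of denominator affects the ranking.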
Table 2 shows the results of the centrality computation. The first column contains the acronyms of the semantic resources. The other columns contain the average values of the four used types of centrality calculated for each selected term related to disaster risk reduction. The highest values in each category (type of centrality) are emphasized in Table 2. These results are discussed in the Discussion section.

Discussion
The results published in the previous section indicate the following information related to suitable semantic resources for disaster risk reduction:
• Disaster risk reduction is a very large and multi-disciplinary field. Therefore, the portfolio of tested terms (keywords) is very heterogeneous. It contains specific terms (e.g., disaster response), general terms (e.g., accessibility, attention), geographical or personal names, and many concepts from other domains (information technologies, cartography, economics).
• The automated searching process found 30 relevant semantic resources using the Linked Data approach (Figure 4 and Table 2). However, the average resource contains just 24 of the 160 tested concepts or objects. Only eight resources have better-than-average values of occurrence of the tested keywords.
• Only two semantic resources contain all tested terms (Figure 4). In the case of DBpedia, this fact is given by the selected system of searching the Linked Data network (see Materials and Methods). Wikidata is the second most important resource from the view of occurrence of concepts or objects related to disaster risk reduction. This information shows that the role of Wikidata in the world of Linked Data is increasingly significant and that it competes with DBpedia [45]. Both resources (DBpedia and Wikidata) represent the most complex semantic knowledge bases for disaster risk reduction purposes. This is evident not only from Figure 4, but also from Table 2, where Wikidata and DBpedia have the highest values in all types of centrality.
• Because all values in Table 2 are normalized, a simple sum can be used as the overall indicator. In addition to DBpedia and Wikidata mentioned above, there are other interesting semantic resources: Biblioteca Nazionale Centrale di Firenze, Yago, Deutsche Nationalbibliothek, Library of Congress Name Authority File, and NDL (National Diet Library). In addition to high centrality values (especially closeness centrality), these resources have a better-than-average occurrence of tested concepts. It is also interesting to note that all of these resources (except Yago) come from the domain of libraries.
• There are important data sets missing in the set of semantic resources, such as AGROVOC, EuroVoc, GEMET (GEneral Multilingual Environmental Thesaurus), NAL (National Agricultural Library), or STW (Standard-Thesaurus Wirtschaft) Thesaurus for Economics. This is caused by the selected method of data exploitation, because none of them is connected to DBpedia or other resources related to DBpedia. This isolation of the group of the above-mentioned thesauri or ontologies is also evident from other research (e.g., [29]). The authors tested a searching process starting in AGROVOC, but the results were not satisfactory due to the low number of tested terms contained in AGROVOC.
• Geographical concepts [10] and objects represent a specific case of disaster risk reduction terms. In addition to the above-mentioned semantic resources, they are contained in specific thesauri, ontologies, or gazetteers such as GeoNames.org, LinkedGeoData (a Linked Data version of OpenStreetMap), or the FAO Geopolitical Ontology (which is not covered in this research).




Conclusions
Linked Data are very important for all disciplines related to spatial data and geographic concepts. Linked Data in general (through the explicit semantics quite often provided by identical links between various semantic resources) support better and more intelligible communication. Fast and clear communication is very important for disaster risk reduction and early warning activities to prevent risk situations or minimize the impact of a risk situation. Therefore, the presented research is focused on evaluating identical links between semantic resources in the Linked Data space to find the most suitable resources for disaster risk reduction purposes.
As the quantitative criteria for the evaluation of identical links, various types of centrality (indegree, outdegree, closeness, and betweenness centrality) were chosen. Centrality is able to find a node (representing a semantic resource) in a graph (a Data Network in the case of this study) with the most advantageous position with regard to other vertices of the graph. The developed scripts, coded in the R language and XSLT, search for identity links of relevant concepts and objects connected to disaster risk reduction, compute values of centrality for particular concepts and objects, and summarize these values for semantic resources. The authors found more than 350 concepts and objects from keywords of essential publications dealing with the topic domain of this article; 160 relevant concepts and objects were selected and processed by the above-mentioned scripts.
The wide scope of the disaster risk reduction domain includes not only specific terms, but also concepts from information technologies, management, demography, and geomorphology, as well as geographical, biological, or personal names.
There are four essential conclusions following from this study:
1. DBpedia and Wikidata (as the most important resources in the Linked Data space) are the most relevant resources for the studied domain as well. Wikidata plays the role of a hub (a resource interlinked to other resources) and a bridge (a component connecting not-interlinked groups of resources). These conclusions follow from the values of the outdegree and betweenness centrality. DBpedia represents an authority among Linked Data resources in the field of disaster risk reduction (derived from the indegree centrality values). Based on the closeness centrality, DBpedia is also a central node of the Linked Data space in the case of the domain processed in this article.
2. There are several interesting resources (e.g., Biblioteca Nazionale Centrale di Firenze, Deutsche Nationalbibliothek, or Library of Congress Name Authority File), usually coming from library science.
3. Many interesting semantic resources related to agriculture or environmental protection (e.g., AGROVOC or GEMET) contain several disaster risk reduction concepts, but they are not linked to DBpedia.
4. There are several specific semantic resources for geographical objects, such as GeoNames.org or LinkedGeoData.
Information from Linked Data is undoubtedly useful. However, its reliability was found to be low (e.g., missing identical links between identical objects, technical errors of semantic resources, and missing explicit semantics such as definitions and descriptions). This fact should be interpreted not as a problem of the Linked Data approach, but as an opportunity for domain experts to participate in Linked Data initiatives and improve shared information as well as awareness of their domain.

Figure 1. The workflow of the research.

Figure 2. Data Network of the term "M. F. Goodchild".

Figure 4. Occurrence of the tested concepts and objects in semantic resources.

Table 2. Average centrality values of semantic resources.