Tsunami-Related Data: A Review of Available Repositories Used in Scientific Literature

Abstract: Various organizations and institutions store large volumes of tsunami-related data, whose availability and quality should benefit society, as it improves decision making before the tsunami occurrence, during the tsunami impact, and when coping with the aftermath. However, the existing digital ecosystem surrounding tsunami research prevents us from extracting the maximum benefit from our research investments. The main objective of this study is to explore the field of data repositories providing secondary data associated with tsunami research and analyze the current situation. We analyze the mutual interconnections of references in scientific studies published in the Web of Science database, governmental bodies, and commercial organizations.


Introduction
Tsunamis are long waves with periods ranging from a few minutes to about an hour and wavelengths from tens to hundreds of kilometers, depending on the type and dimensions of the causative source [1]. Various sources can produce tsunamis as long-propagating waves. Seismically triggered tsunamis represent approximately 80% of all tsunamis worldwide [2]. This means that most sudden displacements of the water column are associated with earthquakes as the main trigger. Furthermore, volcanic activity, submarine and subaerial mass wasting, atmospheric disturbances, and cosmic impacts can generate tsunamis [3,4]. Once generated, tsunamis travel at high speed and spread over a large area of water. In deep water, the tsunami wave amplitude may remain small, typically ranging up to a few meters. The waves become higher and shorter in shallow water and may have run-up heights exceeding several tens of meters. After reaching coastal areas, waves inundate land up to several kilometers in the case of large tsunamis.
Consequently, casualties and damage to or destruction of infrastructure and built-up areas occur. Moreover, environmental issues, such as the destruction of the alongshore topography, erosion of the sea bottom, and devaluation of terrestrial soil, are associated with this event [5]. Figure 1 depicts the three major tsunami occurrences that played a significant role in tsunami research. The devastation of coastal areas during the 2004 Indian Ocean tsunami can be considered the first event, which activated intensive research on the phenomenon. Then, an increase in scientific publications can be identified in connection with the 2009 South Pacific tsunami, the 2010 Maule tsunami, and especially the 2011 Tohoku-oki tsunami, which triggered another wave of interest [6]. The last occurrence is associated with the tsunamis on the island of Sulawesi in Indonesia in 2018. Anpalagan and Woungang [7] stated that almost 90% of the geological records of tsunamis in the Mediterranean Sea area were incorrectly interpreted. Because historical records in that area were accurately recorded [8], they were used to analyze three types of predictions focused on the probability of tsunami occurrence, fatal casualties, and financial losses. The National Oceanic and Atmospheric Administration (NOAA) datasets served as sources of data. Studies of this type confirm how the availability of proper datasets and processable formats represents the primary assumption of research. Additionally, various theoretical models must be quantified to obtain the expected added value (e.g., [9][10][11]).
The topics and use cases range from the analysis of impacts in the case of prevalent ex-post assessment of tsunami hazards and related damages [12] to the maximum earthquake magnitude scenario [13], the analysis of historical events such as the 1755 CE Lisbon earthquake and the largest historical tsunami ever to impact Europe's Atlantic coasts [14], or confirming the existence of low-level tsunamis [15].
The remainder of this paper proceeds as follows. Section 2 introduces the general context of this study and the identification of the main research issue. Section 3 describes the methodology applied in this study. Section 4 presents the achieved results with emphasis on the identified organizations and attributes of the provided repositories. Moreover, a focus on the existing data formats is provided. Section 5 discusses the main findings and outlines possible solutions in the form of an ontology. The final section concludes the study.

The Issue Description
There are two types of data: primary and secondary. The former is linked to existing monitoring or surveillance systems such as the Global Navigation Satellite System (GNSS), which is used in seismology to study ground displacements [16], or the Seafloor Observation Network for Earthquakes and Tsunamis along the Japan Trench (S-net), which is currently the world's largest network of ocean bottom pressure sensors for real-time tsunami monitoring [17]. These data are considered primary, as they are original and collected from primary sources with the help of sensors. Although web-based services provide these data in real time, their availability and the ability to process them are not straightforward for researchers. Therefore, secondary data, that is, data collected by someone else and stored in a repository, are used. These data can be either experimental or empirical. The former relate to experiments and their acquired results. For instance, Mulia and Satake [17] analyzed the efficacy of tsunami forecasting through exhaustive synthetic experiments. They considered 1500 hypothetical tsunami scenarios from megathrust earthquakes with magnitudes ranging from 7.7 to 9.1. These types of data are associated with published papers and studies. They enable the testing of various scenarios without the necessity of possessing empirical data. Empirical data are collected in the environment of interest, for instance, in the form of field surveys.
For certain types of digital objects, there are well-curated, deeply integrated, special-purpose datasets, such as those provided by NOAA. Various organizations and institutions store large volumes of data, the availability of which should benefit society, as it can improve decision-making before the tsunami occurrence, during the tsunami impact, and when coping with the aftermath. Nevertheless, the existing digital ecosystem surrounding tsunami research prevents researchers from extracting the maximum benefit from their research investments. We see the emergence of numerous general-purpose data repositories, at scales ranging from institutional collections to open, globally scoped datasets. Furthermore, other specifics, such as geographical location, data formats, or applied data models, make the situation even more complicated from a technical perspective. The wide scale and multipurpose nature of repositories is understandable. Multidisciplinary research has long been perceived as a mode of exploration or investigation with great potential to uncover new knowledge, understanding, and insight. In tsunami research, benefits are expected from bridging different disciplines, helping advance disaster-related science. Tsunami research is multidisciplinary, as it is explored from the perspectives of not only natural science disciplines (e.g., geology, geomorphology, volcanology, meteorology, seismology, and geochemistry), but also technical disciplines (e.g., civil engineering or computer science) and social science disciplines (e.g., psychology or decision science). Thus, repositories associated with tsunamis are plentiful from various perspectives, such as research domain, data format, access mode, and type of institution.
This study emphasizes the existence of data-related issues in tsunami research. Pararas-Carayannis [18] provides a brief insight into the history of tsunami research, showing the significant role of data generation, storage, and sharing. Regardless of the existing volume of data, several studies stress limitations in data availability and the effectiveness of their handling. To provide two illustrations, the studies by Behrens et al. [3] and Trinaistich, Mulligan, and Take [19] are reviewed. Behrens et al. [3] explored existing research gaps in the field of probabilistic tsunami hazard and risk analysis. They prioritized research gaps and evaluated whether closing a gap is a data-related issue or a problem of missing theoretical understanding. Several findings have been reported. For instance, the lack of tsunami exposure data is considered just as important as modeling complicated aspects of inundation, but the former gap is assumed to be easier to close. Several other similar examples are found in that study. The second example is associated with the run-up of landslide-generated waves, which can significantly damage the environment. Although data focused on the run-up of non-breaking waves are available, there is a lack of data on the run-up of waves at the point of breaking before interaction with the opposing shore [19].
The availability of sufficient data in the required quality remains a principal bottleneck in tsunami research. There is an urgent need to improve the infrastructure to support data reuse [2]. What constitutes "good data management" is, however, largely undefined, and is left as a decision for the data or repository owner. The main objective of this study is to explore the environment of datasets and data repositories associated with tsunami research, analyze the current situation in associated data management, and propose possible ways of coping with identified issues.

Process of Selection
As this study intends to review data repositories and their datasets used in scientific research, papers published in the Web of Science database were analyzed. Using the keyword "tsunami", the database returned an initial set of papers that could be further filtered. First, the search term "water" had to be added to exclude papers dealing with tsunamis in different contexts, for example, the tsunami of obesity among children. Further, filters including language (English), document type (article), and publishing date (the last five years from the search date, i.e., 1 August 2020) were applied. This procedure returned 1047 research papers from Web of Science categories such as Geosciences Multidisciplinary, Engineering Multidisciplinary, Civil Engineering, Meteorology, Atmospheric Sciences, Oceanography, Engineering Ocean, Geochemistry Geophysics, Engineering Marine, and Multidisciplinary Sciences. References or acknowledgments of any data repository were identified and collected. A data source was not used when the found data were no longer available, when a dataset was stored on a private server without any further description, or when a citation led only to another article. When a found citation pointed at a dataset in a data repository, this source was added, explored, and described by the criteria presented in the section below. When the paper did not provide a direct link to a data repository, information was searched for in the relevant resources (e.g., government agencies, national and international institutions, including universities, or NGOs).
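The exclusion rules above can be sketched as a simple screening function; this is a minimal illustration of the stated criteria, and the record fields are hypothetical, not part of the study's actual protocol.

```python
# Sketch of the screening rules for cited data sources (the dictionary
# keys are invented illustrations, not the study's real data model).

def keep_data_source(citation):
    """Return True if a cited data source qualifies for the review.

    A source is rejected when the data are no longer available, when a
    dataset sits on a private server without further description, or
    when the citation leads only to another article.
    """
    if not citation.get("data_available", False):
        return False  # data no longer available
    if citation.get("private_server") and not citation.get("description"):
        return False  # undocumented private storage
    if citation.get("target") == "article":
        return False  # citation points only to another paper
    return True

# Example: a citation pointing at a described, available repository passes.
ok = keep_data_source({"data_available": True, "private_server": False,
                       "description": "NOAA water-level catalog",
                       "target": "repository"})
```

A real screening workflow would of course involve manual inspection of each citation; the function only encodes the decision rule itself.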

Evaluation of Data Repositories
Different repositories provide different sets of features for browsing and searching datasets. This section lists and explains all the parameters used to describe and compare repositories. Since the availability and general overview of repositories vary from well-structured catalogs to obsolete web pages without any searchability, some details about repositories could not be conclusively acquired.
The parameters used for the repository comparison were selected based on several aspects. First, parameters were collected either from databases focusing on data repositories or from our experience with the found repositories. Some of these characteristics were excluded, as they were unhelpful to users searching for datasets (e.g., institution type and funding). Second, additional parameters were added during the search. Some repositories offer features specific to this field of research. Filters based on the time or location of the event that a dataset describes can serve as an example. A few repositories help users by offering a preview of data before downloading or even by manually rating dataset content regarding its openness and usability.
After identifying features and criteria for comparing dataset repositories, some were removed because they were either unobtainable or excessively time-consuming to acquire. Comparing the overlap of datasets across all repositories is one such example. Because the varied formats and availability of metadata hinder the automatic processing of most repositories, manual searching and comparison of individual datasets would be required. The evaluation of datasets and data repositories was based on the attributes presented in the following subsections.

Repository Content
Datasets total: The total number of datasets in a repository. The number is shown when accessing the repository or is often listed as the number of results for an empty search. (Note that the number of datasets differs from the number of files, as datasets usually contain more than one file.)
Tsunami datasets: The number of datasets that the repository lists as search results for the term tsunami. Therefore, only datasets with this word in the title, description, or keywords were counted. Note that many datasets valuable for tsunami research are not tagged with the tsunami keyword (e.g., general bathymetry data). Hence, this number is not conclusive, because none of the repositories offer a usability-for-tsunami filter. It serves solely as an approximate indicator of the orientation of the repository. If a repository lacks a search feature, then the number is not investigated unless the repository contains only a handful of datasets, allowing a manual count.
Last update: The year of the last update or addition of a dataset directly related to tsunamis (i.e., the set of datasets described in the previous paragraph). It provides a rough estimate of the tsunami-related activity of content creators on a repository. The condition for acquiring this repository parameter is a sorting feature based on the last updates or additions. Alternatively, it was collected for repositories with a limited number of datasets, thus allowing a manual check.
Repository domain: Repositories have varying breadths of focus. This characteristic describes a preference for a specific area of interest if there is one. Generally, catalog repositories are either completely general or have broad topics covering various research areas to encourage public contributors to share data on their platforms. Databases and web presentations typically keep the narrow focus of the organization responsible for the data, which is also enabled by the unified format of all presented data.
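The counting rule described under "Tsunami datasets" can be sketched as follows; the sample catalog records are invented for illustration, and the sketch deliberately mirrors the caveat that relevant but untagged datasets (e.g., bathymetry) go uncounted.

```python
# Sketch of the counting rule: a dataset is counted only when the word
# "tsunami" appears in its title, description, or keywords.

def is_tsunami_tagged(ds):
    """Check whether a dataset record mentions 'tsunami' in its
    title, description, or keyword list (case-insensitively)."""
    text = " ".join([ds.get("title", ""),
                     ds.get("description", ""),
                     " ".join(ds.get("keywords", []))]).lower()
    return "tsunami" in text

# Two invented records: the bathymetry grid is useful for tsunami
# research but is not counted, because it lacks the keyword.
catalog = [
    {"title": "2011 Tohoku-oki tsunami run-up survey", "keywords": []},
    {"title": "Global bathymetry grid", "description": "Sea-floor depth"},
]
tsunami_count = sum(is_tsunami_tagged(ds) for ds in catalog)  # -> 1
```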

General Availability
Availability: Online availability of repositories. Most of them had free access without restrictions. Some repositories may be behind a paywall or require logging in. For restricted repositories, there may be different levels of access to the data: while searching and viewing dataset metadata may be free, downloading the data may require a paid account.
Downloadable data: Datasets usually need to be downloaded to obtain the data. However, some listed repositories are structured collections of links to datasets on other platforms. Here, the repositories contain only metadata. To download data, a user must acquire them from the linked storage.
Data usability rating: Some repositories rate the usability of stored datasets based on various criteria. The objective of this rating is to indicate the level of potential obstacles during data processing. Standardized rating schemas exist (e.g., Tim Berners-Lee's Five Stars of Openness), but repositories often create their own systems to incorporate their specific features.
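As a rough sketch, Berners-Lee's Five Stars of Openness can be expressed as a cumulative rating, where each star presupposes the previous ones; the boolean criteria below paraphrase the scheme (open license, structured data, non-proprietary format, use of URIs, links to other data) and are our own framing, not a repository's implementation.

```python
# Minimal sketch of the Five Stars of Openness as a cumulative score.

def openness_stars(open_license, structured, non_proprietary,
                   uses_uris, linked):
    """Return 0-5 stars; each star requires all previous criteria."""
    stars = 0
    for criterion in (open_license, structured, non_proprietary,
                      uses_uris, linked):
        if not criterion:
            break
        stars += 1
    return stars

# A CSV file published under an open license scores three stars.
rating = openness_stars(True, True, True, False, False)  # -> 3
```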
Metadata: Metadata is a structured description of datasets. It may be downloadable as a JSON or XML file or listed in a table on the dataset profile page. This feature enables the automatic processing of datasets outside a repository or reading of additional details that are not explicitly stated in a repository's dataset profile.
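A hypothetical metadata record of the kind a repository might expose can illustrate this; the field names below are illustrative only, not any repository's actual schema.

```python
import json

# Invented dataset metadata record: a structured description that a
# repository could serve as a downloadable JSON file or render as a
# table on the dataset profile page.
record = {
    "title": "Water-level time series, hypothetical gauge",
    "keywords": ["tsunami", "water level"],
    "format": "NetCDF",
    "license": "CC-BY-4.0",
    "updated": "2021-08-08",
}

serialized = json.dumps(record, indent=2)  # the downloadable JSON view
parsed = json.loads(serialized)            # machine processing outside the repository
```

It is this machine-readable form, rather than a human-oriented profile page, that enables automatic processing of datasets outside the repository.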
Dataset preview: Some repositories allow users to preview data before downloading them. This feature enables users to investigate the structure of the data and makes the selection of appropriate data easier. However, this is not possible for every format.

Search:
The ability to search for terms in the title or metadata is an essential feature of all databases and online catalogs. However, some of the included repositories were in the form of a simple list of links to the datasets. These repositories usually contain only a few datasets and focus on presenting a project or organization rather than offering a structured catalog of datasets.
Dataset filter: If the repository contains items other than datasets, this parameter describes whether it is possible to filter results to datasets only. This feature simplifies the search for users interested in data. Repositories with sections dedicated to datasets and repositories containing only datasets are marked accordingly (e.g., datasets only), as they do not need this filter.
Location filter: Unlike repositories with a geological focus, general repositories usually lack a location filter, because the validity of their data is not usually limited to a location. However, this distinction is important when searching for historical data about tsunamis in a specific region.
Field/topic filter: The ability to filter by selected topics or research fields is useful for general repositories with a great variety of datasets. It allows users to browse among possibly related datasets without necessarily knowing the exact keywords that label the required data.
Format filter: Datasets contain data in various formats; some are proprietary, and some are not suitable for automatic processing. Therefore, this filter helps to narrow a set of datasets to one fitting an intended use.
License filter: Datasets are shared under various licenses. Research or educational use is usually not limited and requires only a citation, but some datasets may be restricted to noncommercial use. Thus, some repositories offer a license filter along with a license statement in the dataset details.
Year/Date filter: Whether the user searches for data from a specific historical event or prefers to look through recent data only, a time-based filter is a useful tool to narrow datasets to the most relevant set. Note that only repositories focusing on historical events offer this filter based on the date of an event. The vast majority of repositories that allow this filtering consider only dates of addition or updates of datasets, which does not correspond to the date of the event that the dataset may be describing.
Tsunami magnitude filter: When browsing tsunami data, some tsunami databases (other types of repositories do not offer such specific features) allow users to filter results based on the magnitude of tsunamis. All magnitude and intensity scales were included in this characteristic, because this feature is rare and, in some cases, repositories do not specify the exact scale they used.

Ontological Contribution into the Data Repositories and Datasets
To outline possible solutions to the identified data-related issues, an approach based on ontological engineering, a subarea of artificial intelligence, is provided. An ontology represents a well-defined collection of concepts that describe a specific domain. Concepts are abstractions of a particular set of instances; that is, in ontology engineering, concepts are regarded as classes and instances as individuals. An example of a class is a data repository, and an example of an individual is the Japan Tsunami Trace Database. An ontology also encompasses the links between individuals, which are described as special types of concept properties. A dataset containing wave parameters can thus be linked to a dataset containing geospatial parameters. Even more advanced hierarchical relations can be represented, allowing concepts to be generalized or aggregated into other concepts to decompose even the most complex domain. Concepts and properties can be associated with various types of logical constraints that enable the inference of facts not explicitly stated in the ontology. An ontology is expressed as a graph-based structure in which the nodes represent concepts and the edges represent relations. In this respect, an ontology can be viewed as a semantic map of a given domain that can serve to navigate that domain using complex queries. For instance, consider the statement: "Oceanography is Earth and Environmental Science." This statement can be easily expressed in the Resource Description Framework (RDF), a data model that is the de facto standard for semantic graph database development. Oceanography (a subject/an instance) and Earth and Environmental Science (a parent class) are uniquely identifiable "resources." Natural Science is the parent class of Earth and Environmental Science (itself a child class). Various methodologies can be used to develop formal ontologies [20][21][22]. The Noy and McGuinness methodology [23] was used for the ontology development in this study.
It is based on an iterative developmental approach and can be used for any kind of application domain and developmental tool. This methodology is influenced in part by the Protégé environment, which was also used in our study for ontology development. It provides the following seven developmental phases for ontology building: domain and scope specification, reuse of existing ontologies, enumeration of important terms with their properties, class definition and class hierarchy development, modeling of class properties, inclusion of details for properties, and instance modeling. The categories are based on the International Disaster Database EM-DAT [24], which is slightly customized for tsunami research, and the Library of Congress Recommended Formats Statement (2020-2021), providing categories of creative content [25].
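The RDF example discussed above can be sketched as plain subject-predicate-object triples with a minimal inference step over the subclass hierarchy; the concept names follow the example in the text, while the triple set itself is a deliberately simplified stand-in for a real RDF engine.

```python
# Subject-predicate-object triples following the text's example.
triples = {
    ("Oceanography", "subClassOf", "Earth and Environmental Science"),
    ("Earth and Environmental Science", "subClassOf", "Natural Science"),
    ("Japan Tsunami Trace Database", "instanceOf", "Data repository"),
}

def superclasses(cls):
    """Infer all (transitive) parent classes of a concept by walking
    subClassOf edges until no new parents are found."""
    parents = set()
    frontier = {cls}
    while frontier:
        nxt = {o for s, p, o in triples
               if p == "subClassOf" and s in frontier}
        nxt -= parents
        parents |= nxt
        frontier = nxt
    return parents

# Inference: Oceanography is also a Natural Science, although this fact
# is not stated explicitly in any single triple.
inferred = superclasses("Oceanography")
```

This is exactly the kind of fact "not explicitly stated in the ontology" that logical constraints allow a reasoner to derive.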

Results
This section presents the results. A list of identified repositories and data resources is provided, and each resource is evaluated on a pre-defined scale of alternatives. Then, an analysis of the existing formats is presented.

Repositories
Altogether, 60 repositories with tsunami-relevant datasets were identified. Table A1 in Appendix A provides basic information on the resources found. The acquired list reports the number of datasets that each repository lists as search results for the term tsunami. Therefore, only datasets with this word in the title, description, or keywords were counted. Many more useful datasets for tsunami research exist; however, they are not tagged with the tsunami keyword (e.g., general bathymetry data). Hence, this number is not conclusive, because none of the repositories offer a usability-for-tsunami filter. It serves solely as an approximate indicator of the orientation of the repository. If a repository lacks a search feature, then the number was not investigated unless the repository contains only a handful of datasets, allowing a manual count.
There are three types of repositories: catalog, database, and presentation. A catalog offers a sortable list of items, usually accompanied by search and filter features. The items in a catalog represent individual datasets, which are usually uploaded by multiple organizations. A database is also a sortable list of items, but these items represent individual records. A database can thus be perceived as a single structured dataset. Databases are usually focused on a narrow topic (e.g., a database of tsunamis and water levels), and the organization operating the database is responsible for inserting data into it. In this context, a presentation refers to a static web page presenting a single dataset or a non-sortable list of links that lead to projects or datasets. There are usually no searching or filtering features, as presentation pages show few items. Their purpose is to share results from a project or organization; therefore, the organization running the presentation page is responsible for the data as well.
From a general perspective, repositories incorporate distinct volumes of datasets, ranging from hundreds of thousands (global multidomain resources such as Data.gov, Mendeley, or OSF share) to single units (e.g., the Novosibirsk Tsunami Laboratory or the Japan Tsunami Trace Database). From the perspective of tsunami research, the volume of datasets is significantly lower, ranging from thousands (e.g., PANGAEA) or hundreds (e.g., Data.gov or Data World) to single units (e.g., the Queensland Government database or the Humanitarian Data Exchange). Repositories were created and maintained by private organizations, public institutions, or governmental bodies. Twenty-nine data repositories have been updated during the last three years, which indicates their general usability for current research. It is possible to find data on a global scale; that is, data are associated with various geographical locations from Australia to the Mediterranean Sea and the United States. Apparently, data related to tsunami-jeopardized regions with advanced technologies, such as Japan (IRIDeS), European countries (EMODnet), and the United States (NOAA, NASA), are available in large volumes. Multinational organizations such as the World Bank Group or the World Health Organization support tsunami-related research with data repositories. This is not the case for regions with less developed countries. Therefore, global technologically intensive initiatives and activities are crucial for tsunami research. NOAA or the Japan Tsunami Trace Database, with tens of thousands of records, can serve as examples. Unsurprisingly, considerable heterogeneity is the main attribute of the generated list. From the domain perspective, repositories contain data related to various disciplines, such as seismology, meteorology, hydrology, and bathymetry. This makes the list of data repositories difficult to compare, and evaluation bias is almost inevitable.
Most of the datasets had free access without any restrictions. Some repositories may be hidden behind a paywall or require logging in, for example, the European Marine Observation and Data Network. For restricted repositories, there may be different levels of access to the data. Moreover, websites such as the Study of the Tsunami Aftermath and Recovery (STAR) do not always respond. While searching and viewing dataset metadata may be free, downloading the data may require a paid account. Datasets need to be downloaded to obtain the data. However, some listed repositories do not enable downloads (e.g., the Japan Tsunami Trace Database) or are structured collections of links to datasets on other platforms. Here, the repositories contain only metadata. To download data, the user must acquire them from the linked storage. Data usability is likely to represent the most serious issue associated with datasets. Some repositories rate the usability of stored datasets based on various criteria. The objective of this rating is to indicate the level of potential obstacles during data processing. Standardized rating schemas exist (e.g., Tim Berners-Lee's Five Stars of Openness), but repositories often create their own systems to incorporate their specific features. Metadata represent the structured description of datasets. They may be downloadable as JSON or XML files (see the discussion section) or listed in a table on the dataset profile page. This feature allows the automatic processing of datasets outside a repository or the reading of additional details that are not explicitly stated in the dataset profile in a repository. Only a few repositories enable metadata download (e.g., the Queensland Government or the Spanish National Center of Geographic Information). However, these are mostly general data repositories with a small fraction of tsunami-focused datasets. This is also the case for repositories that only enable viewing of metadata.
As for the search and filtering abilities, the capability to search for terms in the title or metadata and to use filter queries are essential features of all databases and online catalogs. We identified only one repository that enabled filtering a search query by location, topic, file format, license, and time; however, most of its datasets were only in Portuguese. In contrast, some of the identified repositories were in the form of a simple list of links to datasets. These repositories usually contain only a few datasets and focus on presenting a project or organization rather than offering a structured catalog of many datasets. The evaluation of the repositories is presented in Table A2.

Data Formats
Repositories and datasets are associated with dozens of data formats. In this section, we introduce and explain their main characteristics. Table 1 presents all identified formats available for the tsunami topic as they occur in the two largest and most established data repositories, NOAA and Data.gov, together with their frequency of occurrence. Of these, eight formats proved to be the most common in the "tsunami" category (they represent more than 90% of the records in the particular databases). We provide examples of datasets using these data formats (all accessed on 8 August 2021).

General Formats

NetCDF
The Network Common Data Form (NetCDF) represents a community standard for sharing scientific data. It is a set of software libraries and machine-independent data formats. These formats support the creation, access, and sharing of array-oriented scientific data.
• https://data.noaa.gov/metaview/page?xml=NOAA/NESDIS/NGDC/Collection//iso/xml/NTWC_Waterlevel_Collection.xml&view=getDataView&header=none

HTML
Unlike other formats, this is usually not a downloadable file and links to project web pages, metadata, or even file downloads from different sources.
• https://catalog.data.gov/dataset/port-of-los-angeles-tsunami-evacuation-routes-signs
• https://data.doi.gov/dataset/sediment-grain-size-distributions-of-three-carbonatesand-layers-in-anahola-valley-kauai-hawaii

ZIP
This is a general archive file containing files of any other format (including formats unsupported by the repository). It includes batches of supplementary files for specialized software, an additional description of how to use the data, or all other files connected to a dataset for easy download.
• https://catalog.data.gov/dataset/port-of-los-angeles-tsunami-evacuation-routes-sig ns • https://data.doi.gov/dataset/radiocarbon-age-dates-for-sections-of-19-sediment-cor es-from-offshore-puerto-rico-and-the-u-s-v PDF This is a printable text/presentation file. Because of the structure of this format, its content is not designed for automatic processing. It is a paper or manual explanation dataset and other complementary or legal information.
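Returning to the ZIP format: such dataset bundles can be inspected programmatically with Python's standard zipfile module before committing to a full download workflow. A minimal sketch (the member file names are invented for illustration):

```python
import io
import zipfile

# Build a small in-memory archive standing in for a downloaded dataset
# bundle, then list its members and read one of them. The member names
# and contents are hypothetical.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("readme.txt", "How to use the runup measurements.")
    zf.writestr("runup_heights.csv", "station,height_m\nA,3.2\nB,7.9\n")

with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()        # ['readme.txt', 'runup_heights.csv']
    csv_text = zf.read("runup_heights.csv").decode("utf-8")
```

The same two calls, `namelist()` and `read()`, work on archives downloaded from any of the repositories above, which makes ZIP one of the easier formats to harvest automatically.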

Mapping Formats
These formats contain a data layer meant to be placed over a map to mark points or areas of interest with additional information. Some may be downloadable and viewable as XML, but they need to be displayed over a map with an appropriate tool for human readability. They provide an overview of historical tsunamis, sea levels, or other geospatial data on a map and offer additional information.

KML
Keyhole Markup Language (KML) is an XML notation made for 2D and 3D Earth browsers (e.g., Google Earth). It is one of the international standards of the Open Geospatial Consortium.
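Because KML is XML, its placemarks can be read with Python's standard library alone, without a GIS tool. A minimal sketch (the placemark content is invented for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny KML document marking one observation point; the placemark
# name and coordinates are hypothetical.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>2011 Tohoku-oki runup observation</name>
      <Point><coordinates>141.9,39.6,0</coordinates></Point>
    </Placemark>
  </Document>
</kml>"""

NS = {"kml": "http://www.opengis.net/kml/2.2"}
root = ET.fromstring(KML)
placemarks = []
for pm in root.iterfind(".//kml:Placemark", NS):
    name = pm.findtext("kml:name", namespaces=NS)
    # KML coordinates are "longitude,latitude[,altitude]".
    coords = pm.findtext(".//kml:coordinates", namespaces=NS).strip()
    lon, lat = (float(v) for v in coords.split(",")[:2])
    placemarks.append((name, lon, lat))
```

Note that the KML namespace must be handled explicitly; a bare `.//Placemark` query would find nothing.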

Discussion
This study builds on the work of Gusiakov, Dunbar, and Arcos [26], who outlined and discussed the existing issues with data compilation, cataloging, and distribution, as well as the incompleteness of certain types of data. Hence, we intend to support the improvement of data management in tsunami research. The importance of archiving data in this domain is in fact the same as in other disciplines: verification of published results, better meta-analysis, new questions, increased citation and credit, new opportunities for teaching and learning, and reducing the risk of loss [27].

Identified Issues and Perspectives
Sharing data usable in tsunami research has several advantages. Different interpretations or approaches to existing data contribute to scientific progress, especially in a multidisciplinary setting characteristic of tsunami research. Proper management and long-term preservation help retain data integrity. Furthermore, when data are available, repeated collection of data is minimized, optimizing resource use. Finally, the availability of data enables replication studies, which can be used as training tools for new tsunami researchers [28]. While sharing data is the first step toward reuse, it is also critical that the data be simple to understand and use [29]. However, proper data management is not a goal in itself, but rather the key conduit leading to knowledge discovery and innovation [30], and to subsequent data and knowledge integration and reuse by the community after the data publication process.
White et al. [29] suggested nine recommendations for improving data management in research: sharing your data, providing metadata, providing an unprocessed form of the data, using standard data formats, using good null values, making it easy to combine your data with other datasets, performing basic quality control, using an established repository, and using an established and open license. Unfortunately, this study reveals that most of these recommendations are not met in the case of tsunami-related data repositories. This is somewhat in contradiction to the opinions of scientists within specific disciplines [28].
The analysis of data repositories reveals issues with which researchers searching for reliable data must cope. First, the most common data management technique, metadata description, is insufficient for many datasets. For our analysis, its existence was not as necessary as it is for researchers who need data for their experiments or for decision-makers who need it for their decisions. There is also an overlap among various databases, which may seem to be an advantage. However, redundancy can lead to confusion, because researchers may need the latest version of a dataset or work in distributed teams; thus, coordination or synchronization might be a more significant issue than expected. Initially, there were other evaluation criteria on the list, which, in the end, remained unused. To give two examples, the detailed orientation of the repository would be an interesting piece of information; the problem is that it is not usually specified, and when no topic filter is available, it is necessary to browse through the datasets themselves to determine it. Furthermore, licenses associated with repositories are rarely specified, as they are usually a property of individual datasets. This criterion would be applicable only if the organization running the repository is also the author of its content.
We encountered issues during the evaluation process, which exhibited weak points of existing repositories:

1.
Data resources are heterogeneous and poorly arranged, which prevents automatic machine processing. Moreover, in some cases, even searching or filtering tools are missing, which significantly reduces the effectiveness of manual work with the source repository.

2.
Even the most significant actors in the field, such as NOAA or Data.gov, change the form of presentation or search in their repositories from time to time [31]. Although this issue seems minor, user interface design and usability play a significant role when a huge volume of data needs to be searched and processed.

3.
Research papers and studies refer to datasets that are not directly associated with tsunamis (e.g., general geography), but their data can be used, and it is impossible to identify them when searching with relevant search terms. This reveals that the demarcation line between the tsunami and non-tsunami fields of study is difficult to define. The multidisciplinary nature of tsunami-oriented research makes the analysis of datasets and repositories more complicated.

4.
The semantic differences among the concepts of datasets, data, resources, and repositories generate confusion, as these concepts are used in various contexts. The development of a virtual data collection system can help improve the organization of tsunami-related datasets.

5.
There are many deactivated, nonfunctional, or unavailable files, some discovered only when searching within a dataset. This issue is typical of the outcomes of research projects: project documents or data are available only within the sustainability period, after which websites or interfaces are no longer managed or maintained.

6.

Although some datasets offer one or more formats of the same data, there are specific data formats tied to specific software applications that are unreadable by standard available software solutions. Typically, old data prepared for obsolete applications are impossible to open on existing operating systems.

7.

Noise is often present in the data and must be filtered out, and void data need to be dealt with (at least on the modeling side).

8.

Not all datasets are primary resources; some only contain a reference. However, such resources can serve as catalogs or guideposts, as they often work with datasets more appropriately than the pages where the datasets were originally uploaded.

Demonstration of Ontology-Engineering Help
Various efforts toward improved data management have already been made. For instance, Murnane et al. [32] considered the lack of a consistent data structure, which hinders the development of tools that can be used with more than one set of data, and reported on an effort to solve these problems through the development of extensible, internally consistent schemas for risk-related data. This study contributes to these endeavors and outlines possible solutions in the form of ontology-based systems. In the domains of natural hazards, natural disasters, disaster management, and emergency management, ontologies appear in two lines of research:

1.
The first line of research is focused on the usage of ontologies for categorization of concepts related to the above-mentioned domains, sharing of these ontological structures between interested parties (humans, humans and computers, or between computers), and system interoperability.

2.
The second line illustrates how ontologies can be directly integrated or connected to the designed system.
As for the first line, the Wikipedia project provides a huge collection of information related to different application domains, including disaster or emergency management and the categorization of natural hazards and natural disasters. The issue is that the information presented by Wikipedia is not well machine processable, and more specific queries defined by the user often fail. The DBpedia project [33] solves this issue by encoding facts (found in the Wikipedia infoboxes) into more formal structures expressed in RDF; its content was automatically extracted from Wikipedia. Wikidata is a complementary project to DBpedia, continuously updated by users and bots (autonomous computational robots). The vocabularies most related to natural hazards or disasters are cross-domain or geography-related. For vocabularies related to emergency management, the Linked Open Vocabularies (LOV) web can be consulted. The vocabulary used for annotating dataset repositories expressed in RDF is available in [34]. The authors of this paper did not find any ontologies (vocabularies) used for annotating datasets (or dataset repositories) expressed in other formats.
As for the second line, an ontology-based conceptual framework has been proposed for improving shared situation awareness among teams of rescuers in emergency incidents; mass evacuation during a tsunami event serves as a case study for demonstrating the framework [35]. Infrastructure failures caused by natural hazard events were modeled using the InfraRisk ontology, and a software prototype using this ontology, which also provides a visualization of the data published with it, was introduced in [36]. Zhong et al. [37] presented a meteorological disaster system in which an ontological approach was used to model the domain knowledge of meteorological events, emergency management, disaster-specific knowledge, and geographical (geospatial) characteristics. Sermet and Demir [38] presented a different system, the flood artificial intelligence system. It is based on the flood ontology, which covers geological hazards, meteorological hazards, diseases, wildfires, floods, monitoring devices, and environmental concepts. It is a question-answering and decision support system that can provide factual responses using domain-specific ontological knowledge in flood-related events; voice-based and text-based communication channels are available to users. The unified knowledge-based Crisis Response Ontology (CROnto) was introduced in [39]. This ontology provides a sharable vocabulary for facilitating communication and problem-solving between emergency response organizations during disaster events. Obviously, during disaster hazards such as earthquakes, fast reactions are inevitable for mitigating damage to life and property. A formal ontology has been developed and integrated into a rule-based and case-based reasoning system that derives recommendations from similar cases (disaster events) from the past; ontology has also been used to manage earthquake data in intelligent systems [40]. Liu et al. [41] presented a knowledge model called the geologic hazard emergency response (GHER) model, used for modeling the emergency knowledge that is indispensable for providing a fast emergency response during geological hazards. This model has been implemented in the GHERS system.
The main purpose of the ontology is to provide a semantics-based structure that can help the user:

•
To gain fundamental insights into tsunami-related and indirectly tsunami-related data repositories.

•
To discover which characteristics are shared by multiple data repositories.

•
To explore the backbone of the ontology consisting of core ontological classes together with the relationships between them.

•
To ask concrete questions on data repositories and related facts.
We have developed a semantic graph database according to the methodology of Noy and McGuinness [23] (Section 3.3). All phases of the methodology are fulfilled, except that, instead of modeling definitions for all classes, only descriptions of selected classes were modeled. The semantic graph has 105 classes. These classes model the type of access to data repositories (Access class), the application domains of datasets (Domain class), the formats of datasets (Format class), the languages in which datasets are expressed (LanguageFamily class), the owners of data repositories (Owner class), the categories of data repositories (Repository class), possible datasets (Dataset class), the locations from which the datasets are received (Location class), and various types of disasters (Disaster class). The Domain, Format, Dataset, and Disaster classes are the most structured parts of the class-based layer; therefore, these parts are shown in Figures 2-5. The Domain class is extended by more specific application domains, with the main attention paid to the domains related to tsunamis. For the readability of the whole Domain class taxonomy, only classes without a data-based layer (instances) are presented.
More specific formats are added to the Format class. This categorization of formats is based on the Library of Congress Recommended Formats Statement (2020-2021) [25]. For the readability of the entire Format class structure, only classes without the data-based layer (instances) are visible. The Dataset class categorizes datasets according to their data, that is, what the content of the datasets is and what the user can directly find in them. The Dataset class is divided into the SeaDataset and AboveSeaLevelDataset classes. The SeaDataset class models kinds of data received from sea-level and below-sea-level measurements; the AboveSeaLevelDataset class models kinds of data measured above sea level, for example, from the atmosphere. The Disaster class extends the ontology with various disaster types, including tsunamis and meteotsunamis. The user can thus gain wider insight into how tsunamis can be classified next to other disasters and which relationships exist between them. The categorization of natural disasters is based on the International Disaster Database (EM-DAT) [24], slightly customized for tsunami research.
Formal statements are modeled using predicates (properties). The following are distinguished in the semantic graph database (Table 2): the object property hasTsunamiDataset is a general relationship indicating whether it makes sense to visit a dataset repository to browse tsunami-related datasets. If the user wants to know which dataset repositories include datasets providing, e.g., data collected by tsunami detection buoys or tide gauges, the rdf:type (is-a/instanceOf) relationship is used for this purpose. Datasets are categorized into more specific classes, based on which it is possible to determine the nature of a dataset, i.e., which types of data it includes. Figure 6 depicts the relationships between one concrete dataset repository (Kaggle) and one tsunami-related dataset (data related to a volcano-induced tsunami), volcano tsunami.csv. Figure 6 was prepared in the OntoGraf Protégé plugin, which does not display datatype properties. The datatype properties containsDataOfTimeScaleMin and containsDataOfTimeScaleMax (see Table 2 for a deeper explanation) are an inherent part of the dataset modeling in the ontology.
Figure 5. Categorization of natural disasters (adjusted according to [24]).

Domain-Specific Property | Type of Property | Purpose
containsDataOfTimeScaleMin | datatype property | When were the data of the dataset measured (min. year-month-day)?
containsDataOfTimeScaleMax | datatype property | When were the data of the dataset measured (max. year-month-day)?
alternativeName | annotation property | An alternative name for the data repository.
description | annotation property | More details about the data repository.
identifier | annotation property | Identifier of the data repository (if available).
url | annotation property | URL link to the data repository.
Water 2021, 13, x FOR PEER REVIEW
Figure 6. Relationships between dataset repositories and datasets (an example).
Figure 7 depicts a fragment of the RDF-based semantic graph database, in which 15 web-based data repositories are visible: Data.gov Catalog, US gov-Department of the Interior, OSF Share, OSF Home, Japan Tsunami Trace database, NCEI (formerly NGDC), Queensland Government, EM-DAT Public, Figshare, Science Data Bank, Kaggle, Data World, Harvard Dataverse, Google-Dataset Search, and World Bank Water data-Data catalog. Each data repository has links to various resources (individuals), which provide more detail about the repository.
Consequently, the SPARQL Protocol and RDF Query Language (SPARQL), a W3C standard for querying RDF graphs, can be used for "mining" the content of the semantic graph database with data repositories. Examples follow. Figure 8 shows how SPARQL-based querying is realized in the Protégé ontological editor. The user authors SPARQL queries in the SPARQL Query tab available in Protégé. The query is the input for the query engine, which searches the content of the ontological and data layers, and the results are provided in the resulting table according to the structure of the query.
We can zoom into the SPARQL query part of Figure 8 and uncover its structure. As an example, we would like to know which dataset repositories can provide datasets related to tsunamis, with the results ordered by the number of tsunami datasets in descending order; the first three results are presented in the output section. The SPARQL query begins with the namespaces, identified by prefixes, which tell us which external vocabularies must be used in the query; our ontological structure is identified by the dronto prefix. Results of the query are stored in the variables (identified by the "?" symbol) mentioned in the SELECT section (?repository, ?countOfTsunamiDatasets). The required graph pattern is written in the WHERE section, and additional modifiers can follow, such as ordering (ORDER BY) or limiting the number of visualized results (LIMIT). Figure 8 shows that only a fragment of the knowledge structure corresponds to the SPARQL query; this fragment is depicted as the gray dashed part on the left side of Figure 8 and visualized as the resulting table on its right side.
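Under the property names introduced above (hasTsunamiDataset, the dronto prefix), the described query can be sketched as follows. This is a reconstruction from the description, not the authors' verbatim query: the dronto namespace IRI is a placeholder, and the exact graph pattern is an assumption.

```sparql
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dronto: <http://example.org/dronto#>

# Which repositories offer tsunami-related datasets, and how many?
SELECT ?repository (COUNT(?dataset) AS ?countOfTsunamiDatasets)
WHERE {
  ?repository dronto:hasTsunamiDataset ?dataset .
}
GROUP BY ?repository
ORDER BY DESC(?countOfTsunamiDatasets)
LIMIT 3
```

The aggregate COUNT with GROUP BY produces one row per repository, and the ORDER BY/LIMIT modifiers yield the three repositories richest in tsunami datasets.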
The second example demonstrates how to answer the question of which data repositories are focused on meteorology and atmospheric sciences; the philosophy of querying is the same as in the previous example. The semantic graph database does not need to be interpreted only as a pure collection of statements about data repositories. An inference process can be used to discover hidden facts inside the database; for example, an inference engine can "ingest" the statements about data repositories and cluster them according to the application area of the data they provide. A semantically richer language is needed for this process, as briefly introduced above. The class-based layer is general, so it can be used as a backbone for projects not aimed only at data repositories and can be freely customized for other purposes.

Conclusions
Increased connectivity has accelerated progress in global research, and estimates indicate that the scientific output is doubling approximately every decade [42]. As science is becoming data-intensive and collaborative, this rise in research activity increases research data output [28,43]. However, efficient work with data is increasingly deemed an important part of the scientific process because approximately 80% of research data are inaccessible or unpublished [42]. Making data publicly available allows original results to be reproduced and new analyses to be conducted [27,29]. Nevertheless, there is a growing debate about how quickly scientific findings can and should influence disaster mitigation policies [44]. Although a relatively new research area, tsunami science depends on data from various disciplines falling within the scope of geosciences, oceanography, engineering, physics, mathematics, and disaster management, including politics, media, communication, and education [4]. Furthermore, tsunami hazard assessment and mitigation plans based on numerical modeling and simulation of tsunamis have gained increasing importance.
In the realm of tsunami research, a growing number of scientists are trained in surveying techniques; thus, more data will be collected [45]. Because data are the infrastructure of tsunami research, this study provides a unique, extensive review of tsunami-related datasets and repositories. Existing data repositories have several issues, ranging from missing updates or dysfunctional webpages to limited search filtering, metadata downloadability, and data usability. Although the list of repositories presented in this study is not exhaustive because of the applied methodological approach, diverse sets of experts and practitioners can take advantage of the repositories identified and evaluated here, as the datasets contain data related to volcanology, geoscience, water research, and civil engineering. Their teams and institutions will thus have at their disposal various types of datasets and repositories, which can be used for the analysis, modeling, simulation, or prediction of tsunami occurrence. In this way, the multidisciplinary research suggested by the latest studies [46] to design and propose practical solutions can be supported. Hopefully, the primary outcome of this study will catalyze the data lifecycle in tsunami research.
Indeed, there are ways to mitigate these issues. Computer science techniques and methods from artificial intelligence, such as k-Nearest Neighbors (k-NN) models [47], the C4.5 algorithm [48], the Random Forest method [49], the Bayes classifier [50], or non-linear parametric models [51], are worth mentioning; they all process data to acquire new knowledge and insights. We briefly outlined ontology engineering as an approach that we believe can help the global research community reduce the analytical workload of all stakeholders and enhance the quality of the recorded data. It plays the role of a metadata provider that can improve orientation in existing tsunami-related data repositories. The proposed solution is presented only as an outline; it shows the viability, usability, and feasibility of solving the issues identified in this review. Its development into a full version with all suggested functionalities can improve work with tsunami-related data repositories.

Acknowledgments: The VES20 Inter-Cost LTC 20020 project supported this research. The authors also express gratitude to the COST Action AGITHAR leaders and team members.

Conflicts of Interest:
The authors declare no conflict of interest.

EM-DAT Public
Note: EM-DAT was launched with the initial support of the World Health Organization (WHO) and the Belgian Government. The main objective of the database is to serve the purposes of humanitarian action at national and international levels. The initiative aims to rationalize decision making for disaster preparedness, as well as to provide an objective base for vulnerability assessment and priority setting. EM-DAT contains essential core data on the occurrence and effects of over 22,000 mass disasters in the world from 1900 to the present day.

Science Data Bank
Computer Network Information Center, Chinese Academy of Sciences https://www.scidb.cn/en Note: Science Data Bank (ScienceDB) is a public, general-purpose data repository aiming to provide data services (e.g., data acquisition, long-term preservation, publishing, sharing, and access). ScienceDB is devoted to becoming a repository for long-term data sharing and data publishing in China. According to its authors, key features of ScienceDB include, for example, data findability, open data sharing, data traceability, and permanent accessibility. ScienceDB offers 484 datasets; one of them is related to tsunamis.

Kaggle
Kaggle https://www.kaggle.com/datasets Note: Kaggle was founded in 2010 and is a subsidiary of Google LLC. Kaggle is an online platform for a community of data scientists, focused on finding and publishing datasets and models. Kaggle offers more than 70,000 datasets; 14 of them are related to tsunamis.
Data World
data.world, Inc. https://data.world/ Note: Data World was founded in 2016; it is a public benefit corporation that aims to make data easily understandable for the public. The company has three main goals: (1) build the most meaningful, collaborative, and abundant data resource in the world in order to maximize data's societal problem-solving utility; (2) advocate publicly for improving the adoption, usability, and proliferation of open data and linked data; and (3) serve as an accessible historical repository of the world's data. It offers more than 600 datasets related to tsunamis.
Google-Dataset Search Google https: //datasetsearch.research.google.com/ Note: Google-Dataset Search is a search engine for data sets created by Google, it was launched in 2018. Data Search has two main goal: (1) foster a data sharing ecosystem that will encourage data publishers to follow best practices for data storage and publication. (2) give scientists a way to show the impact of their work through citation of data sets that they have produced.