Global open data initiatives have received support from both public and private sector organizations in recent years. Changes can be seen in government policies, funding agency requirements, community guidelines, and technologies, as well as in facilities for data management, curation, and sharing. Efforts on mechanisms for data publication [1], data cataloging [2], data citation [3], and alternative metrics [4] are incubating a new socio-technical system that promotes both the culture and the practice of open data. Within such a system, data will be shared and reused across the boundaries of nations, sectors, disciplines, repositories, and formats, as well as between levels of detail. Data interoperability arises as a major challenge in those cross-boundary activities, which poses requirements for methods and technologies to make data discoverable, accessible, decodable, understandable, and usable [5].
The motivation of the research presented in this paper is to promote the decodability, understandability, and usability of data. Consider the scenario of a researcher wishing to use scientific data from an open data repository. Before retrieving a file from the data repository and using it, they will need to know the format, structure, parameters, and meaning of the data, and perhaps also the tools and services that can be used to process the data. In the world of open data, the researcher often receives no direct support or help from the data producers, which means that the metadata of the retrieved data may be the only source of the information needed. Among the various metadata elements available, such as those in the Dublin Core Metadata Elements [6] and the DataCite Metadata Schema [7], the elements describing data types are the most relevant ones for providing such information.
Data typing has been a research topic in computer programming for decades, whereby a data type is regarded as a collection of computational entities that share common properties. Data types in programming languages support three main uses: (1) naming and organizing concepts; (2) coordinating consistent interpretation of bit sequences in computer memory; and (3) providing information about data to the compiler [8]. There are primitive data types (e.g., integer, float, character, string, and Boolean), composite data types (e.g., array, union, set, and object), abstract data types (e.g., queue, stack, tree, and graph), as well as data types derived from the above types, such as utility types that address specific real-world uses. Knowledge of those data types can provide part of the information a researcher needs to work with the retrieved data, but it is insufficient to fully address the requirement of understanding the data. For example, suppose a researcher retrieves a table and learns from the table name that it is about the thermodynamics of a chemical. The researcher can read the words and numbers in the table but cannot understand the meaning of those records, because the title of each column contains only an acronym, without further definitions. Moreover, the relationships between those columns are not clear to the researcher. Can we extend the content coverage of data types so that they present an unambiguous, useful model of what the data represent? Under this scope, the specification of data types will be able to help the researcher understand the meaning of the data in a given science context.
The aim of this paper is to present our work on a conceptual model for the semantic specification of data types, as well as the implementation of the model in an existing production data portal for a decadal international science program: the Deep Carbon Observatory [10]. In this work, we regard a data type as the representation of particular qualities or features that a group of datasets shares, such as thermodynamics of chemicals and minerals, volcanic gas composition, or geologic contexts. Our model allows people to add domain-specific meanings to a data type, to register the data type as an object in a data portal, and to annotate a dataset by associating it with one or more data types. Each registered data type has a unique identifier that is resolvable on the Web, and the information describing the data types is machine-readable and accessible on the Web. The data type model adds new features to the data portal and enables better data curation and efficient data reuse. The remainder of the paper is organized as follows: Section 2 introduces the context of this work (i.e., the Semantic Web) and details of our model design. Section 3 describes the implementation of the model in a data portal and the new functions created for the portal. Section 4 compares this work with relevant studies and discusses directions for future work. Finally, in Section 5, we provide a concluding discussion of the work presented in this paper.
3. Implementation and Results
The DCO data portal adapts the VIVO platform [20] for metadata management and the Handle System [21] for assigning unique identifiers (i.e., the DCO ID) to all objects. The data portal also uses Drupal [22] to develop user-friendly front-end webpages for data resource navigation. Before the development of the model for data type, the DCO data portal already reused several ontologies in the DCO Ontology, such as the FOAF Ontology and the Bibliographic Ontology (see Table 1 for links to those ontologies). In order to link those ontologies to the provenance parts in the designed model for data type, we asserted a few existing classes as subclasses of corresponding classes in the PROV-O Ontology. For example, we added assertions “dco:DataType rdfs:subClassOf prov:Entity” and “foaf:Person rdfs:subClassOf prov:Agent” (Figure 1).
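The effect of these two subclass assertions can be illustrated with a minimal sketch in plain Python. It assumes a toy triple store (a set of tuples) and hand-written helpers `superclasses` and `types_of`; none of these are part of VIVO or the DCO data portal, and the instance names `ex:dt1` and `ex:alice` are hypothetical.

```python
# Toy triple store: prefixed names are plain strings (illustrative only).
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

triples = {
    # The two assertions quoted in the text.
    ("dco:DataType", SUBCLASS, "prov:Entity"),
    ("foaf:Person", SUBCLASS, "prov:Agent"),
    # Hypothetical instances, added only for illustration.
    ("ex:dt1", TYPE, "dco:DataType"),
    ("ex:alice", TYPE, "foaf:Person"),
}


def superclasses(cls, store):
    """Return cls plus every class reachable through rdfs:subClassOf."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in store:
            if s == current and p == SUBCLASS and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def types_of(resource, store):
    """Return the asserted and inferred classes of a resource."""
    result = set()
    for s, p, o in store:
        if s == resource and p == TYPE:
            result |= superclasses(o, store)
    return result
```

Under these assumptions, `types_of("ex:dt1", triples)` includes `prov:Entity`, which is what lets a registered data type take part in a provenance graph without any change to the instance itself.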
With the conceptual model of data type, we developed functions for data type registration, data type browsing, and dataset annotation in the DCO data portal. For data type registration, we used the default user interface of the VIVO platform, which follows the general workflow of creating an instance for any class. Once a data type instance is created, the data portal assigns a unique DCO ID to it. Then, on the VIVO profile of the data type, a user can fill in records for the properties describing the data type and the links that connect the data type with other objects. Figure 2 shows a part of the profile of the registered data type “Thermodynamics of chemicals and minerals.”
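The registration workflow described above (create an instance, receive an identifier, then fill in properties) can be sketched in plain Python. This is only an illustration under stated assumptions: the `DataTypeRegistry` class, its methods, the `example-prefix/` identifier scheme, and the `keyword` property are hypothetical stand-ins, not the VIVO interface or the actual format of a DCO ID.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class DataTypeRecord:
    identifier: str                      # stand-in for the DCO ID
    label: str
    properties: dict = field(default_factory=dict)


class DataTypeRegistry:
    """Hypothetical in-memory registry; the real portal relies on VIVO and the Handle System."""

    def __init__(self, prefix="example-prefix/"):
        self._prefix = prefix            # assumed identifier prefix
        self._counter = count(1)
        self._records = {}

    def register(self, label):
        # Step 1: create the instance and assign it a unique identifier.
        identifier = f"{self._prefix}{next(self._counter)}"
        record = DataTypeRecord(identifier, label)
        self._records[identifier] = record
        return record

    def describe(self, identifier, prop, value):
        # Step 2: add a property record to the data type's profile.
        self._records[identifier].properties.setdefault(prop, []).append(value)

    def resolve(self, identifier):
        # Look up a registered data type by its identifier.
        return self._records[identifier]


registry = DataTypeRegistry()
dt = registry.register("Thermodynamics of chemicals and minerals")
registry.describe(dt.identifier, "keyword", "thermodynamics")  # hypothetical property
```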
We also developed a faceted browser for all the registered data types by adapting Elasticsearch [23]. Figure 3 shows a screenshot of the faceted browser with a few data types we registered as test examples. On the left of the user interface there is a list of facets, which correspond to properties of the registered data types in the portal. A user can search among the data types by choosing records in those facets. A feature of the browser is that, once the chosen records in a facet are changed, all records in the other facets as well as the data type results change correspondingly. The user can make selections in several facets and/or enter text-based search terms to search for one or more particular data types. Figure 3 shows the two resulting data types that have “Mineralogy” as research area (i.e.
We can use the registered data types to annotate data resources such as datasets. All a user needs to do is fill in data type records for the property “dco:hasDataType” on the VIVO profile of a dataset. We also developed a faceted browser for datasets and listed “Data Types” as a facet in that browser. Figure 4 shows part of the returned datasets when “Thermodynamics of chemicals and minerals” is selected in the data type facet.
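The annotation and faceted filtering described above can be sketched in a few lines of plain Python. The sketch does not use Elasticsearch or VIVO; the dataset records, their titles, and the helpers `filter_by_facet` and `facet_counts` are hypothetical, and only the property name “dco:hasDataType” and the data type labels come from the text.

```python
from collections import Counter

# Hypothetical dataset records annotated through the "dco:hasDataType" property.
datasets = [
    {"title": "Dataset A",
     "dco:hasDataType": ["Thermodynamics of chemicals and minerals"]},
    {"title": "Dataset B",
     "dco:hasDataType": ["Volcanic gas composition"]},
    {"title": "Dataset C",
     "dco:hasDataType": ["Thermodynamics of chemicals and minerals",
                         "Geologic contexts"]},
]


def filter_by_facet(records, facet, selected):
    """Keep the records whose values for the facet include the selected value."""
    return [r for r in records if selected in r.get(facet, [])]


def facet_counts(records, facet):
    """Count how many records carry each value of the facet."""
    return Counter(value for r in records for value in r.get(facet, []))


hits = filter_by_facet(datasets, "dco:hasDataType",
                       "Thermodynamics of chemicals and minerals")
# Recompute the facet over the filtered results, mirroring how the other
# facets in a faceted browser update after a selection.
remaining = facet_counts(hits, "dco:hasDataType")
```

Recomputing the counts over the filtered records is one simple way to reproduce the behavior in which a selection in one facet changes the records offered in the others.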
Science today is increasingly facilitated by open data. As Fox [24] put it, “data science is doing science with someone else’s data.” Our work on the conceptual model of data type and its implementation in a data portal enriches the semantic description of datasets. The information in such a description is both human-readable and machine-readable, which will provide valuable help to people who access and use datasets retrieved from the world of open data.
There has been much work on machine-readable models of data types. In the RDF Schema (RDFS; follow the namespace URI in Table 1 for more details), there is a class rdfs:Datatype, and, in the concepts vocabulary of RDF, there are a few instances of it, such as rdf:HTML, rdf:langString, rdf:PlainLiteral, and rdf:XMLLiteral, which show that the work still focuses on the syntactic part of data types. The ISO standard ISO/IEC 11179-3:2013(E) [25] defines a data type as a “set of distinct values, characterized by properties of those values and by operations on those values.” This definition, as well as the definitions of primitive and composite data types in that standard, is compatible with definitions of similar concepts in programming languages [8]. In the world of open data, a convention is to publish metadata together with data to describe the structure and contents of the data. Examples can be seen in netCDF headers [26] and Data Packages [27]. Most recently, W3C released the recommendation of a metadata vocabulary to support annotating, discovering, and displaying tabular data on the Web [28]. The recommendation provides metadata items for describing objects at several levels of detail, such as groups of tables, inter-relationships between tables, single tables, and individual columns within a table. Much of the work in that recommendation can be adopted to extend the model in this paper to a finer scale, especially the part surrounding the class dco:DataTypeParameter. Previous work on markup languages for harmonizing heterogeneous datasets, such as the Ecological Metadata Language [29], the Earth Science Markup Language [30], and GeoSciML [31], can help provide use cases from the geoscience domain on how to represent data structures in a machine-readable format.
There has also been work on annotating datasets with domain-specific data type information. For example, the EarthChem data portal [32] proposed a hierarchical list of data types to be used for tagging a registered dataset. However, the data types in EarthChem are specified at the text level, i.e., as keywords. The work presented in this paper goes a step beyond the “keyword” level by enabling the semantic specification of data types through a conceptual model. Currently, we do not have a full list of data types that meets all requirements of the DCO community, and we do not intend to register all the data types by ourselves. Instead, the functionality we built for data type registration is open to the DCO community, and any user can register specific data types of interest. Besides data type registration, we can also organize the explicit relationships among registered data types, i.e., through the property “dco:sourceDataType.” We can also organize the implicit relationships between data types. For example, the keywords used to describe data types can provide clues on the categories or disciplines of data types.
Our work was initiated with the adoption of the output of the Data Type Registry (DTR) working group of the Research Data Alliance (RDA) [33]. Each DTR is a self-contained portal for data type registration and curation. A registered data type is assumed to be resolvable to some useful information about that type. According to the vision of the DTR working group, there will be multiple DTR instances, each governed by its own project, group, or community. All those DTR instances reuse some common basic types, which are called “primitives.” Those primitives will be registered in a type registry presumably managed by the Corporation for National Research Initiatives (CNRI). We can therefore expect a two-level hierarchical federation of the DTR: the higher level is a list of primitives, and the lower level is the specific data types defined within a DTR.
Our differentiation of basic data types and specific data types is comparable to the notions of primitives and specific data types in the DTR working group. However, we adopted a different technological approach to realize the data type registration. From the point of view of ontology engineering, both primitives and specific data types in the DTR design are at the instance level, i.e., they both are registered data types. In our work, the basic data types are at the class level, i.e., they are classes in ontologies, and data resources are instances of them. The specific data types in our work are at the instance level, i.e., they are all instances of the class dco:DataType. If we put the specific data types at the class level, we would need to update the ontology frequently to include each new data type class. Using a single class dco:DataType and making all specific data types instances of it significantly reduces the effort needed to update the ontology and to maintain the framework of the data portal.
In the case of the DCO data portal, because the portal is underpinned by ontologies, making the model a part of the DCO Ontology allowed us to deploy it in the portal quickly and smoothly. This shows the advantage of the conceptual model for data portals based on Semantic Web technologies. Mapping to the PROV-O Ontology allows the data type information to be connected with the broader provenance graph, as in the two example assertions “dco:DataType rdfs:subClassOf prov:Entity” and “foaf:Person rdfs:subClassOf prov:Agent” described in the previous section. We should note that the former assertion was unproblematic because we developed the DCO Ontology ourselves and can update it; the latter assertion, however, raised the issue of “ontology hijacking” [35], that is, a newly developed ontology redefining the semantics of existing concepts that reside in other ontologies. To reduce the negative impact in our work, we applied those “hijacking” assertions only within the DCO data portal and did not use them for other purposes.
Our work is just a first attempt at applying semantic technologies to enhance the meaning of, and relationships between, data types and datasets. We propose some possible future work that could further enhance the semantics of data types. First, we can extend and update the conceptual model to improve its ability to represent domain-specific meaning. We can do this by working with domain scientists within the DCO community, and the broader geoscience community, on the development of use cases specific to their science domains. A task of particular interest is to extend the specification of the class dco:DataTypeParameter and the relationships between parameter instances. A few existing works in the Semantic Web community (e.g., see [28]) can be adopted to accomplish this. We can even seek the opportunity to push the model to a more general level and build a data type ontology that can be used in various domains of study outside of the DCO community. Second, the “Data Types” facet in the dataset browser can be enriched with more features to help users find datasets of interest. For example, a visualization gadget may be added to show the interrelationships among registered data types. We can develop a way to compute the similarity [36] between a researcher’s interests and data types based on the researcher’s profile in the data portal. Then, the data portal will be able to recommend data types of interest to that researcher. Third, once a certain number of data types are registered, methods can be developed to use the keywords in their descriptions to explore the implicit relationships among those data types. Fourth, to leverage the value of data types and formalize their use and reuse, it is worth exploring the possibility of setting up an application programming interface (API) that serves (1) machine-readable information about the structure and contents of registered data types and (2) structured metadata for citation, both through the unique identifier of each data type. In other words, it is an effort to promote the data type to a first-class object in the world of open data.
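As a rough illustration of the proposed similarity-based recommendation, the sketch below ranks data types by keyword overlap with a researcher's interests. It assumes Jaccard similarity over keyword sets, which is a simple choice made here for illustration and not necessarily the measure intended in [36]; the functions `jaccard` and `recommend` and all keyword values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(interests, data_types, top_n=3):
    """Rank data types by keyword overlap with a researcher's interests."""
    scored = [(jaccard(interests, kws), name) for name, kws in data_types.items()]
    # Drop data types with no overlap, then sort by score (descending) and name.
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Hypothetical keywords; none of these values come from the portal.
data_types = {
    "Thermodynamics of chemicals and minerals": {"thermodynamics", "minerals", "chemistry"},
    "Volcanic gas composition": {"volcano", "gas", "chemistry"},
    "Geologic contexts": {"geology", "stratigraphy"},
}
interests = {"minerals", "thermodynamics"}
recommended = recommend(interests, data_types)
```

The same `jaccard` function could be applied between pairs of data types to explore the implicit keyword-based relationships mentioned in the third point.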
Science is, in large part, driven by data. Global open data have created great opportunities for science but present challenges in data interoperability. The clear identification of a meaningful data type is a key factor in achieving data interoperability across scientific domains. Conventionally, a data type is treated at the syntactic level, such as integer, float, Boolean, or string. Syntactic definitions of data types do not attach sufficient domain-specific meaning to the data types. In this work, we describe the application of Semantic Web technologies to the specification of data types. With this approach, a data type can convey complex meaning, such as who created the data type, the source standard that the data type derives from, the operations that can be performed on datasets of that data type, typical scientific domains, software programs and/or instruments that use the data type, and more. The implementation of our model in a production scientific data portal enables data producers to register data types and use them to annotate data resources. The data type information is both human-readable and machine-readable. For data users, the records of specific data types associated with a dataset make explicit the assumptions and information inherent in that dataset. In this way, they can quickly see and understand the details within a dataset without even downloading it.