Based on the considerations described in the previous section, we define the following requirements for a data management system to address the mentioned issues. For reference, we use the abbreviations from the original FAIR publication [5] to link our statements to the four principles:
The system must be built in a client/server architecture for separating the high-performance workload on the database and filesystem from the lightweight clients. Create/Read/Update/Delete (CRUD) transactions on the server side must be ACID¹ compliant in order to keep the structure consistent at any time. The communication Application Programming Interface (API) must be built around a transparent human-readable protocol with RESTful² ]. This API can then be used by libraries and clients that can be integrated into existing data management workflows. These requirements are needed to comply with A1–A1.2 of the FAIR principles.
Access control, file system. Heterogeneous scientific environments require fine-grained access control at the object level (A1.2). In order to integrate seamlessly into existing data acquisition and data analysis workflows, the system must be able to incorporate an existing file system with its grown folder structure. Separation between file system storage and (meta)data storage facilitates data management in compliance with A2.
Query language. One of the most important requirements is the query language, which has to fulfill several properties that guarantee that large amounts of heterogeneous data can be searched and retrieved easily. The logic behind the query language can also have a major impact on the data models used. To spell this out more precisely, the data model (for implementing F1–F3) and the query language (F4) must support:
Entities with subtyping
User-defined n-ary relationships and properties
Integration of files and directories as entities
Native support for primitive data types which include several numeric data types with their physical units and uncertainties, standard compliant date and time values, booleans, strings, and undefined values
Compound data types for lists, sets, tuples, and dictionaries
This general nature of the data model enables data management in compliance with I1–I3, R1, R1.2, and R1.3.
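As an illustration of the requirement for numeric values that carry a physical unit and an uncertainty, the following is a minimal sketch in Python. The class name `Quantity` and its fields are hypothetical and chosen only for this example; they are not part of any particular system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quantity:
    """Hypothetical numeric value with optional unit and uncertainty."""

    value: float
    unit: Optional[str] = None           # e.g. "K"; None for dimensionless values
    uncertainty: Optional[float] = None  # same unit as value; None if unknown

    def __str__(self):
        text = f"{self.value}"
        if self.uncertainty is not None:
            text += f" +/- {self.uncertainty}"
        if self.unit is not None:
            text += f" {self.unit}"
        return text


room_temperature = Quantity(293.15, unit="K", uncertainty=0.5)
print(room_temperature)  # 293.15 +/- 0.5 K
```

Keeping unit and uncertainty next to the value, rather than in free-text comments, is what allows a query processor to compare values later on.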
Extensibility. The system must be able to adapt to new software and hardware requirements. Furthermore, the system must be flexible enough to adapt to continuously changing scientific workflows. The simplest way to ensure this extensibility is to implement a server-side API for extensions and plug-ins.
Although we have shown that our requirements are suitable for implementing data management in accordance with the FAIR guiding principles, we acknowledge that data management standards might evolve in the future. Our requirements therefore have a strong focus on extensibility.
2.4. Data Model
CaosDB has a general purpose object-oriented data model, depicted in Figure 3, which is not tied to any particular scientific field or structure of data. It has a base object called Entity. Entities are either Record Types, Records, or Abstract Properties, and every Entity has a unique, server-generated Id.
Record Types and Abstract Properties are used to define the ontology for the particular domain in which the RDMS is used. Records store the actual data and therefore represent individuals or particular things, e.g., a particular experiment or a particular time series. Record Types define classes or types of things, e.g., persons, experiments, or time series. A Record can be viewed as a member of the class defined by its Record Type. These classes can contain Abstract Properties, which define key-value relationships for properties of the things along with the expected data type and possibly a default unit, a default value, or a range of permitted values. As files on the back-end file system are a major focus of this database management system, there is a special Entity, File, that encapsulates typical file properties like path, size, and checksum.
Entities can be related via binary, directed, transitive is-a relations, which model both subtyping and instantiation, depending on the relata. These relations construct a directed graph of the Entities. If A is-a B, we call A the child of B and B the parent of A. No strict restrictions are imposed on the relata of the is-a relation; thus, Entities can be children of multiple Entities.
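The directed, transitive is-a relation with multiple parents can be sketched as a small graph traversal. This is only an illustration: the class `Entity` below and the example names are hypothetical stand-ins and do not reproduce the CaosDB server or any client library.

```python
class Entity:
    """Hypothetical stand-in for an Entity (not a CaosDB API)."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # direct is-a targets; there may be several

    def is_a(self, other):
        """Return True if `other` is reachable via one or more is-a edges."""
        stack, seen = list(self.parents), set()
        while stack:
            node = stack.pop()
            if node is other:
                return True
            if id(node) not in seen:
                seen.add(id(node))
                stack.extend(node.parents)
        return False


experiment = Entity("Experiment")                              # a Record Type
simulation = Entity("Simulation")                              # a Record Type
hybrid = Entity("HybridExperiment", [experiment, simulation])  # subtyping, two parents
run_42 = Entity("run_42", [hybrid])                            # a Record (instantiation)

print(run_42.is_a(experiment))  # True: the relation is transitive
print(experiment.is_a(run_42))  # False: the relation is directed
```

The same edge type expresses subtyping (Record Type to Record Type) and instantiation (Record to Record Type); which one is meant follows from the kinds of Entities at both ends.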
Each Entity has a list of Entity Properties, or in short just Properties. An Entity Property is not an Entity of its own, but a triple of an Abstract Property, a value or Null, and an Importance. The values can be numbers, strings, dates, any other valid value that fits into one of several built-in data types, or, most notably, references to other Entities. The Importance is either obligatory, recommended, suggested, or fixed. A valid child of an Entity implicitly inherits its parent’s Properties according to their Importance, which means that it is obliged, recommended, or only suggested to have a Property with the same Abstract Property (or any subtype thereof). As opposed to Properties with other Importances, fixed Properties have no effect on the Entity’s children. During the creation or update of Entities, the Importances of the parents’ Properties are checked by the server. By default, missing obligatory Properties invalidate the transaction and result in an error. Missing recommended Properties result in a warning, but the transaction is considered valid. Entities with missing suggested Properties are silently accepted as valid.
This novel approach to ontology standardization is inspired by the operators from deontic logics, the logics of obligation and permission [13]. It is designed to guide the users without restricting them too heavily and ensures that they do not insert their data wrongly by accident. Furthermore, it helps them to find the most relevant or best fitting Properties for their Entity based on the supertype(s).
CaosDB thus facilitates the definition and observation of standards for data storage.
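The Importance rules described in this section can be summarized in a short sketch. It assumes a simplified setting: one parent, Properties identified by name only, and no matching of subtypes of an Abstract Property. The function `check_against_parent` and the example Property names are hypothetical.

```python
OBLIGATORY, RECOMMENDED, SUGGESTED, FIXED = (
    "obligatory", "recommended", "suggested", "fixed")


def check_against_parent(parent_properties, child_property_names):
    """Return (errors, warnings) for a child checked against one parent.

    parent_properties maps an Abstract Property name to its Importance.
    """
    errors, warnings = [], []
    for name, importance in parent_properties.items():
        if name in child_property_names:
            continue
        if importance == OBLIGATORY:
            errors.append(f"missing obligatory Property: {name}")
        elif importance == RECOMMENDED:
            warnings.append(f"missing recommended Property: {name}")
        # suggested: silently accepted; fixed: has no effect on children
    return errors, warnings


experiment_type = {
    "date": OBLIGATORY,
    "room temperature": RECOMMENDED,
    "comment": SUGGESTED,
    "schema version": FIXED,
}
errors, warnings = check_against_parent(experiment_type, {"comment"})
valid = not errors  # only missing obligatory Properties invalidate the transaction
print(errors)    # ['missing obligatory Property: date']
print(warnings)  # ['missing recommended Property: room temperature']
```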
2.5. Query Language
Existing data management technologies already provide very comprehensive and expressive query languages. Two prominent examples are SQL and SPARQL. SPARQL is a language for querying RDF triple stores and is therefore also suited for complex queries of semantic data models.
However, SPARQL statements for simple requests are often long and complex [8], which motivated the need for a simpler but similarly expressive query language that is also suited for scientists without a computer science background.
We would like to illustrate this with the following example:
Suppose we would like to retrieve all datasets from experiments that were conducted in 2017 at a room temperature of 293.15 K. This simple request would result in a highly complex SPARQL statement. The filter for the dates alone would read as:
(?date >= xsd:date("2017-01-01") && ?date < xsd:date("2018-01-01"))
The implementation of a temperature filter covering unit conversion would even rely on external unit conversion extensions.
In contrast to SPARQL, the CaosDB Query Language (CQL), which we implemented in CaosDB, allows for a much simpler expression of the whole request:
Find Experiment with date in 2017 and room temperature=293.15K
CQL is translated into SQL statements by the CaosDB server. These statements are then passed on to the MySQL backend which carries out the actual request.
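To make the date filter concrete: a condition such as "date in 2017" corresponds to the half-open interval used in the SPARQL filter above. The following sketch shows this expansion in Python. It is an illustration under that assumption only and does not reproduce how the CaosDB server generates its SQL statements; the function names are hypothetical.

```python
from datetime import date


def year_to_range(year):
    """Return the half-open interval [start, end) covering one calendar year."""
    return date(year, 1, 1), date(year + 1, 1, 1)


def in_year(value, year):
    """True if the date `value` lies within the given calendar year."""
    start, end = year_to_range(year)
    return start <= value < end


start, end = year_to_range(2017)
print(start.isoformat(), end.isoformat())  # 2017-01-01 2018-01-01
print(in_year(date(2017, 12, 31), 2017))   # True
print(in_year(date(2018, 1, 1), 2017))     # False
```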
CQL is designed to express simple questions with simple queries resembling English. Its syntax is illustrated in Figure 4. The language is case-insensitive, but for clarity some terms are explicitly spelled in upper or mixed case here.
The first term (query prefix in Figure 4) in a CQL expression is the desired return type of the query:
A query starting with Count returns a non-negative integer.
A query starting with Find returns a list of entities.
A query starting with Select returns a table containing the values of selected Properties.
This is optionally followed by an entity type which restricts the query to specific entities. The most important information searched for is probably the entity name which specifies the actual “thing” searched for. This term makes use of the object-oriented structure of the database and—in addition to searching for all entities having a specific name—also returns subtypes and Records being of that type. In a CQL expression, entity name is followed by a list of filters which are connected by filter separators. Filters can address any possible Property of an Entity and restrict the values to ranges or particular values, use a range of comparison operators, and even search with wildcards or regular expressions. Furthermore, relations between Entities can be expressed precisely. Filters can be combined with logical operators like And, Or, and Not. The query processor is able to interpret and convert physical units. This unique feature simplifies working with scientific data and sets CQL apart from SQL and various modern query languages for RDF(S), OWL or graph data.
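The benefit of unit-aware filters can be illustrated with a small sketch: a stored temperature given in degrees Celsius or Fahrenheit should match a query value given in kelvin. The conversion table, the unit labels, and the tolerance below are assumptions made for this example; they do not describe the actual CaosDB query processor.

```python
import math

# Hypothetical conversion table: unit label -> function converting to kelvin.
TO_KELVIN = {
    "K": lambda v: v,
    "degC": lambda v: v + 273.15,
    "degF": lambda v: (v - 32.0) * 5.0 / 9.0 + 273.15,
}


def temperature_equals(stored_value, stored_unit, query_value, query_unit):
    """Compare two temperatures after converting both to kelvin."""
    stored = TO_KELVIN[stored_unit](stored_value)
    wanted = TO_KELVIN[query_unit](query_value)
    # Absolute tolerance absorbs floating-point rounding in the conversion.
    return math.isclose(stored, wanted, rel_tol=0.0, abs_tol=1e-9)


print(temperature_equals(20.0, "degC", 293.15, "K"))  # True
print(temperature_equals(68.0, "degF", 293.15, "K"))  # True
print(temperature_equals(25.0, "degC", 293.15, "K"))  # False
```

Without such a conversion step, records entered in different units would have to be normalized by hand before they could be found with a single filter.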
We will illustrate the basic concepts by giving some typical examples:
COUNT Experiment with date in 2017
will return the number of experiments from 2017. In this query, Experiment is typically the name of a Record Type with a possibly large number of subtypes and instances. All Entities which have the name Experiment or have a parent with this name are filtered for those which have a Property with the name date and a date value in the year 2017.
CQL filters can also express the equivalent of complex SQL joins in an easily understandable syntax:
FIND Person which is referenced as an Author by an Article which has a Title like *terminating ventricular fibrillation*
In this example, Person is a Record Type. Article is another Record Type having an Author and a Title as Properties. The statement would therefore return all Records of type Person that are assigned as values of an Author Property of a Record of type Article with a matching title. Since the returned objects are themselves Records of Record Type Person, they have Properties, presumably a name, affiliation(s), and possibly an ORCiD, an email address, or some other contact information.
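The semantics of this query can be mimicked on in-memory data to show what the reference filter does. The sample persons and articles below are invented for illustration, and the wildcard match uses Python's `fnmatch` module as a stand-in for the CQL `like` operator; neither reflects how the server evaluates the query.

```python
from fnmatch import fnmatchcase

# Invented sample data: persons by id, and articles referencing them as Author.
persons = {
    1: {"name": "A. Author"},
    2: {"name": "B. Writer"},
    3: {"name": "C. Reader"},
}
articles = [
    {"Title": "Mechanisms terminating ventricular fibrillation", "Author": [1, 2]},
    {"Title": "An unrelated study", "Author": [3]},
]


def find_authors(title_pattern):
    """Return ids of persons referenced as Author by an article whose Title matches."""
    ids = set()
    for article in articles:
        if fnmatchcase(article["Title"], title_pattern):
            ids.update(article["Author"])
    return sorted(ids)


print(find_authors("*terminating ventricular fibrillation*"))  # [1, 2]
```

In SQL, the same request would require joining a person table with an article table through an authorship relation; in CQL the join is implied by the phrase "is referenced as an Author by".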
Another special feature is the Select query, which follows an SQL-like syntax and represents its results as a table, e.g., the result of
SELECT first name, family name from person with date of birth > 2000
will appear as an HTML table in the WebUI (downloadable as a TSV table) with three columns: id, first name, and family name. This feature is intended to provide one of the interfaces between CaosDB and existing scientific workflows.
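The tabular result of such a Select query, including the leading id column, can be sketched as follows. The sample records are invented, the date of birth is simplified to a year, and the helper functions `select` and `to_tsv` are hypothetical; the sketch only shows the shape of the output, not the server's implementation.

```python
import csv
import io

# Invented sample Records; "date of birth" is simplified to a year.
persons = [
    {"id": 101, "first name": "Ada", "family name": "Example", "date of birth": 2001},
    {"id": 102, "first name": "Bob", "family name": "Sample", "date of birth": 1990},
]


def select(records, columns, predicate):
    """Return one row per matching record: the id first, then the selected Properties."""
    return [[r["id"]] + [r.get(c) for c in columns] for r in records if predicate(r)]


def to_tsv(columns, rows):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id"] + list(columns))
    writer.writerows(rows)
    return buffer.getvalue()


columns = ["first name", "family name"]
rows = select(persons, columns, lambda r: r["date of birth"] > 2000)
print(to_tsv(columns, rows))
```

A tab-separated export of this kind can be read directly by spreadsheet programs and most data analysis libraries.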
CQL is inspired by SQL and therefore probably feels familiar to users with knowledge of prevalent database management systems. It should be clear from the aforementioned examples that the query language is structured, precise and powerful, but nevertheless resembles English sentences. This makes it easier to learn for users without SQL experience.
2.6. User Management and Access Control
CaosDB provides a fine-grained role-based access control system with access control lists. It is possible to define the permissions for insertion, update, retrieval and deletion of Entities, single Properties, and is-a relations, as well as the access to the transaction log and the user management.
CaosDB has a built-in user database where users can sign up or be registered by administrators. Furthermore, users can log in with the credentials of their user accounts from PAM (Pluggable Authentication Modules). Access roles, which are relevant for authorization, can be assigned to clients based on various criteria, including their authentication status, the Unix groups of the user (if PAM is used), and connection details such as the IP address. This makes it possible to share subsets of the database with collaborators and even with a wider audience of anonymous users.
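The interplay of role assignment and access control lists can be sketched in a few lines. All role names, permission names, and functions below are hypothetical and chosen only to illustrate the idea of role-based permissions on a single Entity; they do not reproduce the CaosDB permission model in detail.

```python
def assign_roles(authenticated, unix_groups=()):
    """Derive roles from the authentication status and (hypothetical) Unix groups."""
    if not authenticated:
        return {"anonymous"}
    roles = {"user"}
    roles.update(f"group:{group}" for group in unix_groups)
    return roles


# Hypothetical access control list of one Entity: role -> granted permissions.
acl = {
    "anonymous": {"retrieve"},
    "group:lab": {"retrieve", "update"},
}


def is_permitted(roles, acl, permission):
    """True if at least one of the roles is granted the permission."""
    return any(permission in acl.get(role, ()) for role in roles)


guest = assign_roles(authenticated=False)
member = assign_roles(authenticated=True, unix_groups=["lab"])
print(is_permitted(guest, acl, "retrieve"))   # True
print(is_permitted(guest, acl, "update"))     # False
print(is_permitted(member, acl, "update"))    # True
```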