CaosDB - Research Data Management for Complex, Changing, and Automated Research Workflows

Here we present CaosDB, a Research Data Management System (RDMS) designed to ensure seamless integration of inhomogeneous data sources and repositories of legacy data. Its primary purpose is the management of data from biomedical sciences, both from simulations and experiments, during the complete research data lifecycle. An RDMS for this domain faces particular challenges: Research data arise in huge amounts, from a wide variety of sources, and traverse a highly branched path of further processing. To be accepted by its users, an RDMS must be built around the scientists' workflows and practices and thus support changes in workflow and data structure. Nevertheless, it should encourage and support the development and observation of standards and furthermore facilitate the automation of data acquisition and processing with specialized software. The storage data model of an RDMS must reflect these complexities with appropriate semantics and ontologies while offering simple methods for finding, retrieving, and understanding relevant data. We show how CaosDB responds to these challenges and give an overview of the CaosDB Server, its data model and its easy-to-learn CaosDB Query Language. We briefly discuss the status of the implementation, how we currently use CaosDB, and how we plan to use and extend it.


Background
Despite the technological advances over the last decades, the scientific community still faces the problem of storing and accessing scientific data in a structured and future-proof manner [1,6,7,9]. Although principles for good scientific data management have since been formulated under the acronym FAIR [8] and are now widely recognized in the community, real-life obstacles tend to prevent their wide-spread adoption. Especially in cross-disciplinary environments, the interaction between different user groups, e.g. numerical scientists conducting simulation studies and experimenters working in the laboratory, often leads to highly inhomogeneous approaches to data management. For such heterogeneous data, inefficiencies become inevitable when different kinds of data have to be combined in a joint research project or when data has to be accessed by scientists who were not involved in the recording and storage procedure. In the worst case, this can lead to data being de facto inaccessible after their creators can no longer be reached.
The ongoing issues are rooted in some ubiquitous properties of scientific environments themselves:
• Scientists use special or customized tools, software and data formats with good reason. A research data management system (RDMS) must be built around their workflows and practices and be open to change.
• If the system imposes too many restrictions on individual scientists they are likely to be unwilling or even unable to use it.
• If the RDMS requires too much extra work for learning and understanding, it is likely that the individual scientist will be unable to use it efficiently or just be unwilling to use it at all. This holds in particular for the construction of queries which should retrieve data according to powerful criteria while being simple and intuitive at the same time.
• The system should strongly encourage the development and use of standards for workflows and data models without being overly restrictive. Users of the database can only profit from the system when the data is structured well enough that everyone can search and retrieve data easily and understand the structure of the data intuitively. At the same time, the database has to be prepared for constantly evolving data models and standards.
• File systems and other types of storage usually organize data into some kind of hierarchy which can be folders or projects. In scientific environments this can raise issues, especially when data belongs to multiple projects or is part of cooperations.

Requirements
Based on the considerations described in the previous section, we define the following requirements for a data management system to address the mentioned issues.
Architecture. The system must be built in a client/server architecture for separating the high-performance workload on the database and file system from the lightweight clients. Create / Read / Update / Delete (CRUD) transactions on the server side must be ACID compliant in order to keep the structure consistent at any time. The communication API must be built around a transparent human-readable protocol with RESTful identifiers [2]. This API can then be used by libraries and clients that can be integrated into existing data management workflows.
Access control, file system. Heterogeneous scientific environments require fine-grained access control on object level. In order to integrate seamlessly into existing data acquisition and data analysis workflows, the system must be able to incorporate an existing file system with its grown folder structure.
Query language. One of the most important requirements is the query language, which has to fulfill several properties that guarantee that large amounts of heterogeneous data can be searched and retrieved easily. The logic behind the query language can also have a major impact on the data models used. More precisely, the data model and the query language must support:
• Entities with subtyping
• User-defined n-ary relationships and properties
• Compound data types for lists, sets, tuples and dictionaries
Extensibility. The system must be able to adapt to new software and hardware requirements. The simplest way to ensure this extensibility is to implement a server-side API for extensions and plug-ins.

Implementation
CaosDB [3] is our in-house solution for fulfilling these conditions. To our knowledge, it is currently the only existing software that satisfies the mentioned requirements.
CaosDB is an object-oriented database with a powerful query language based on English natural language and a flexible and adaptive data model. For example, a typical query could look like this:
    SELECT flavour, rating, ingredients FROM Experiment WHICH HAS A room_temperature > 26 C AND WHICH IS REFERENCED BY ExperimentSeries WHICH HAS A name LIKE *ice cream testing*
It also integrates efficient management of large data files directly into the core functionality to accommodate specific requirements of the scientific users. CaosDB
• offers a flexible data model
• can be seamlessly integrated into existing workflows
• allows searching for values of specific fields (not just full text search), with automatic unit conversion and search for (back)references of linked objects.

Architecture
The software design follows a server/client architecture. The CaosDB server handles all CRUD requests, implements consistency checks and translates the requests into SQL commands which are redirected to the MySQL backend. It furthermore provides a transparent layer for interactions with the file system. The server frontend is written entirely in Java and is accessed using a RESTful API over HTTP with XML messages. The frontend also serves a web user interface (WebUI, shown in Fig. 1) written in XSLT, HTML and JavaScript that can be used for browsing data and maintenance operations.
The server is complemented by client libraries for Python and C++ that encapsulate the XML API for usage in scripting, data acquisition and data analysis tools. Fig. 2 gives an overview of the software architecture.
Figure 2: Overview of the software architecture. Client libraries can communicate with the server via the protocol API and provide interfaces in several programming languages for automatable data exchange with data acquisition software and analysis tools. The WebUI for convenient database access is directly integrated into the core application. These also facilitate the manual data exchange with non-customizable third-party tools and data sources.
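To make the idea of an XML-based API more concrete, the following minimal sketch shows how a client-side library might turn an XML response into plain Python objects. The element and attribute names (Response, Record, Property, id, name, unit) and the sample values are assumptions made for illustration only; they are not taken from the CaosDB protocol specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body. The layout is an assumption for illustration,
# not the documented CaosDB message format.
SAMPLE_RESPONSE = """<Response>
  <Record id="101" name="Experiment_A">
    <Property name="room_temperature" unit="C">27.5</Property>
  </Record>
  <Record id="102" name="Experiment_B">
    <Property name="room_temperature" unit="C">24.0</Property>
  </Record>
</Response>"""


def parse_records(xml_text):
    """Convert the XML message into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("Record"):
        # Each property becomes a (numeric value, unit) pair keyed by its name.
        props = {
            prop.get("name"): (float(prop.text), prop.get("unit"))
            for prop in rec.findall("Property")
        }
        records.append(
            {"id": int(rec.get("id")), "name": rec.get("name"), "properties": props}
        )
    return records


records = parse_records(SAMPLE_RESPONSE)
```

A human-readable message format of this kind can be inspected and debugged with standard tools, which is the motivation given above for a transparent protocol.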

Data Model
CaosDB has a general purpose object-oriented data model which is not tied to any particular scientific field or structure of data.
It has a base object called Entity. Entities are either Record Types, Records, or Abstract Properties and every Entity has a unique, server-generated Id.
Record Types and Abstract Properties are used to define the ontology for a particular domain in which the RDMS is used. Records are used to store the actual data and therefore represent individuals or particular things, e.g. a particular experiment, a particular time series, etc.
Record Types define classes or types of things, e.g. persons, experiments, time series, etc. A Record can be viewed as a member of the class defined by its Record Type. These classes can contain Abstract Properties which define key-value relationships for properties of the things along with the expected data type and possibly the default unit, a default value, or a range of permitted values. As files on the back-end file system are a major focus of this database management system, there is a special entity File that encapsulates typical file properties like path, size and checksum.
Entities can be related via binary, directed, transitive is-a relations which model both subtyping and instantiation, depending on the relata. These relations construct a directed graph of the Entities.
Each Entity has a list of Entity Properties, or in short just Properties. An Entity Property is not an Entity of its own, but a triple of an Abstract Property, a value or Null, and an Importance. The values can be numbers, strings, dates, any other valid value that fits into one of several built-in data types, or, most notably, references to other Entities. The Importance is either obligatory, recommended, suggested, or fix. A valid child of an Entity implicitly inherits its parent's Properties according to their Importance, which means that it is obliged, recommended or only suggested to have a Property with the same Abstract Property (or any subtype thereof).
As opposed to Properties with other Importances, Fixed Properties have no effect on the Entity's children. During the creation or update of Entities, the Importances of the parents' Properties are checked by the server. By default, missing obligatory Properties invalidate the transaction and result in an error. Missing recommended Properties result in a warning, but the transaction is considered valid. Entities with missing suggested Properties are silently accepted as valid.
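The inheritance rules described above can be sketched in a few lines of Python. This is a simplified model written for illustration, not the server's implementation: the class Entity, the function check and all example names are hypothetical, and Properties are matched by name only (the "or any subtype thereof" rule is omitted).

```python
from dataclasses import dataclass, field

OBLIGATORY, RECOMMENDED, SUGGESTED, FIX = "obligatory", "recommended", "suggested", "fix"


@dataclass(eq=False)
class Entity:
    name: str
    parents: list = field(default_factory=list)
    # Maps a property name to a (value, importance) pair.
    properties: dict = field(default_factory=dict)

    def ancestors(self):
        """Collect all Entities reachable through the transitive is-a relation."""
        found, stack = [], list(self.parents)
        while stack:
            parent = stack.pop()
            if parent not in found:
                found.append(parent)
                stack.extend(parent.parents)
        return found


def check(entity):
    """Return (errors, warnings): names of missing inherited Properties."""
    errors, warnings = [], []
    for ancestor in entity.ancestors():
        for prop_name, (_value, importance) in ancestor.properties.items():
            if prop_name in entity.properties:
                continue
            if importance == OBLIGATORY:
                errors.append(prop_name)      # would invalidate the transaction
            elif importance == RECOMMENDED:
                warnings.append(prop_name)    # transaction stays valid
            # Missing suggested Properties are accepted silently;
            # fix Properties do not affect children at all.
    return errors, warnings


# Hypothetical ontology: a Record Type, a subtype, and one Record.
experiment = Entity("Experiment", properties={
    "date": (None, OBLIGATORY),
    "room_temperature": (None, RECOMMENDED),
    "comment": (None, SUGGESTED),
    "description": ("Base type for all experiments", FIX),
})
ice_cream_experiment = Entity("IceCreamExperiment", parents=[experiment])
record = Entity("run_42", parents=[ice_cream_experiment],
                properties={"room_temperature": (27.0, FIX)})

errors, warnings = check(record)
```

In this toy example the Record lacks the obligatory date Property inherited from Experiment through IceCreamExperiment, so check reports it as an error, while the missing suggested and fix Properties pass without notice.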
This novel approach to ontology standardization is inspired by the operators from deontic logics, the logics of obligation and permission [4]. It is designed to guide the users without restricting them too heavily and ensures that they do not insert their data wrongly by accident. Furthermore, it helps them to find the most relevant or best fitting Properties for their Entity based on the supertype(s).
CaosDB thus facilitates the definition and observation of standards for data storage.

Query Language
The CaosDB Query Language (CQL) is designed to express simple questions with simple queries resembling English. Its syntax is illustrated in Fig. 3 using EBNF. The language is case-insensitive, but for clarity some terms are explicitly spelled in upper or mixed case here.
The first term (query prefix in Fig. 3) in a CQL expression is the desired return type of the query: • A query starting with Count returns a non-negative integer.
• A query starting with Find returns a list of entities.
• A query starting with Select returns a table containing the values of selected Properties.
The query prefix is optionally followed by an entity type which restricts the query to specific entities. The most important piece of information is probably the entity name, which specifies the actual "thing" searched for. This term makes use of the object-oriented structure of the database: in addition to all entities having the specified name, the query also returns subtypes and Records of that type.

    cql = query prefix , [ entity type ] , entity name , [ filter separator , filter ] ;
    query prefix = "FIND" | "COUNT" | select clause ;
    select clause = "SELECT" , field , { "," , field } , "FROM" ;
    entity type = "ENTITY" | "RECORDTYPE" | "RECORD" | "PROPERTY" | "FILE" ;
    entity name = ? any string ? ;
    filter separator = "WHICH" , [ "HAS A" ] | "WITH" ;
    filter = conjunction | disjunction | negation | property name , operator , value | back-reference | ...
    ...

Figure 3: The first levels of the CQL syntax in EBNF. This is only a schematic overview and does not include the syntactic sugar or white spaces. However, it should be noted that the top level of this syntax is not too complex and has only very few keywords. Yet even simple queries are very powerful, mainly due to the transitivity of the is-a relation.
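As an illustration of how small the top level of the grammar in Fig. 3 is, the following sketch splits a query string into its top-level parts with a single regular expression. It is a toy written for this purpose, not the CaosDB parser: the function name parse_top_level is hypothetical, the filter is returned as an unparsed string, and syntactic sugar beyond the keywords shown in Fig. 3 is not handled.

```python
import re

# Mirrors the first EBNF rules of Fig. 3; matching is case-insensitive.
CQL_TOP_LEVEL = re.compile(
    r"""
    ^\s*
    (?P<prefix>FIND|COUNT|SELECT\s+(?P<fields>.+?)\s+FROM)\s+         # query prefix
    (?:(?P<entity_type>ENTITY|RECORDTYPE|RECORD|PROPERTY|FILE)\s+)?   # entity type
    (?P<entity_name>.+?)                                              # entity name
    (?:\s+(?:WHICH(?:\s+HAS\s+A)?|WITH)\s+(?P<filter>.+?))?           # filter
    \s*$
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_top_level(query):
    """Split a CQL string into prefix, fields, entity type, entity name, filter."""
    match = CQL_TOP_LEVEL.match(query)
    if match is None:
        raise ValueError(f"not a valid top-level CQL expression: {query!r}")
    fields = match.group("fields")
    entity_type = match.group("entity_type")
    return {
        "prefix": match.group("prefix").split()[0].upper(),
        "fields": [f.strip() for f in fields.split(",")] if fields else [],
        "entity_type": entity_type.upper() if entity_type else None,
        "entity_name": match.group("entity_name"),
        "filter": match.group("filter"),  # left unparsed in this sketch
    }


parsed = parse_top_level("COUNT Experiment with date in 2017")
```

The filter part would need a recursive parser for conjunctions, disjunctions, negations and back-references, which is deliberately left out here.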
In a CQL expression, the entity name is followed by a list of filters which are connected by filter separators. Filters can address any possible Property of an Entity and restrict the values to ranges or particular values, use a range of comparison operators, and even search with wildcards or regular expressions. Furthermore, relations between Entities can be expressed precisely. Filters can be combined with logical operators like And, Or, and Not.
The query processor is able to interpret and convert physical units. This unique feature simplifies working with scientific data and sets CQL apart from SQL and various modern query languages for RDF(S), OWL or graph data.
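How a unit-aware comparison can work in principle is sketched below: both the stored value and the threshold from the query are converted to a common base unit before they are compared. The unit table, the function names and the conversion strategy are assumptions chosen for illustration; the text does not describe how the CaosDB query processor implements this internally.

```python
# Each unit maps to (dimension, scale, offset) such that
# base_value = value * scale + offset. The table is a small illustrative subset.
UNITS = {
    "K": ("temperature", 1.0, 0.0),
    "C": ("temperature", 1.0, 273.15),
    "m": ("length", 1.0, 0.0),
    "cm": ("length", 0.01, 0.0),
    "mm": ("length", 0.001, 0.0),
}


def to_base(value, unit):
    """Convert a value to the base unit of its dimension."""
    dimension, scale, offset = UNITS[unit]
    return dimension, value * scale + offset


def greater_than(stored, threshold):
    """Evaluate 'stored > threshold' for (value, unit) pairs with unit conversion."""
    stored_dim, stored_base = to_base(*stored)
    threshold_dim, threshold_base = to_base(*threshold)
    if stored_dim != threshold_dim:
        raise ValueError("cannot compare values of different dimensions")
    return stored_base > threshold_base


# A filter such as "room_temperature > 26 C" applied to a value stored in kelvin.
matches = greater_than((300.0, "K"), (26.0, "C"))
```

With such a conversion step, a query author does not need to know in which unit a particular Record stores its value.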
We illustrate the basic concepts with some typical examples. The query
    COUNT Experiment with date in 2017
returns the number of experiments from 2017. In this query, Experiment is typically the name of a Record Type with a possibly large number of subtypes and instances. All Entities which have the name Experiment or have a parent with this name are filtered for those which have a Property with the name date and a date value in the year 2017.
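The semantics of this COUNT query can be mimicked on a small in-memory data set. The type names, the records and the helper functions below are invented for illustration; the point is only that the transitive is-a relation makes Records of subtypes count as experiments as well.

```python
from datetime import date

# Hypothetical Record Types with their direct parents.
PARENTS = {
    "Experiment": [],
    "OpticalImaging": ["Experiment"],
    "Simulation": [],
}

# Hypothetical Records, each with its Record Type and a date Property.
RECORDS = [
    {"type": "Experiment", "date": date(2017, 3, 1)},
    {"type": "OpticalImaging", "date": date(2017, 11, 20)},
    {"type": "OpticalImaging", "date": date(2016, 5, 2)},
    {"type": "Simulation", "date": date(2017, 1, 15)},
]


def is_a(type_name, target):
    """True if type_name equals target or has target among its ancestors."""
    if type_name == target:
        return True
    return any(is_a(parent, target) for parent in PARENTS.get(type_name, []))


def count(entity_name, year):
    """Rough equivalent of: COUNT <entity_name> with date in <year>."""
    return sum(
        1
        for record in RECORDS
        if is_a(record["type"], entity_name) and record["date"].year == year
    )


n_experiments_2017 = count("Experiment", 2017)
```

Here the OpticalImaging Record from 2017 is counted together with the plain Experiment Record, whereas the Simulation Record is not, because Simulation is not a subtype of Experiment in this toy ontology.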
CQL filters can also express the equivalent of complex SQL joins in an easily understandable syntax:
    FIND Person which is referenced as an Author by an Article which has a Title like *terminating ventricular fibrillation*
In this example, Person is a Record Type. Article is another Record Type having an Author and a Title as Properties. The statement would therefore return all Records of type Person that are assigned as values of an Author Property of a Record of type Article with a specific title. Since the returned objects are themselves Records of Record Type Person, they have Properties, presumably a name, affiliation(s), possibly an ORCiD, an email address or some other contact information.
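The back-reference in this query corresponds to what would be a join in SQL. A rough in-memory equivalent is sketched below, with made-up persons and articles and a hypothetical helper function; wildcard matching uses the standard fnmatch module on lower-cased strings.

```python
from fnmatch import fnmatchcase

# Made-up Records of type Person, keyed by id.
PERSONS = {
    1: {"name": "Person A"},
    2: {"name": "Person B"},
    3: {"name": "Person C"},
}

# Made-up Records of type Article; the Author Property references Person ids.
ARTICLES = [
    {"Title": "On terminating ventricular fibrillation with low energy", "Author": [1, 2]},
    {"Title": "A different topic", "Author": [3]},
]


def find_referenced_persons(title_pattern):
    """Persons referenced as Author by an Article whose Title matches the pattern."""
    pattern = title_pattern.lower()
    person_ids = set()
    for article in ARTICLES:
        if fnmatchcase(article["Title"].lower(), pattern):
            person_ids.update(article["Author"])
    # Return full Person Records so that their Properties can be used further.
    return [{"id": pid, **PERSONS[pid]} for pid in sorted(person_ids)]


authors = find_referenced_persons("*terminating ventricular fibrillation*")
```

The CQL statement expresses the same lookup in one sentence, without the user having to know how the references are stored.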
Another special feature is the Select query, which follows an SQL-like syntax and represents its results as a table. For example, the result of
    SELECT first name, family name FROM person WITH date of birth > 2000
will appear as an HTML table in the WebUI (downloadable as a tsv table), with three columns: id, first name, and family name. This feature is intended to provide one of the interfaces between CaosDB and existing scientific workflows.
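The tabular result of such a Select query can be pictured as follows. The sketch uses only the Python standard library and a made-up Record; the function name select_to_tsv is hypothetical and the exact layout of the tsv export of the WebUI may differ.

```python
import csv
import io


def select_to_tsv(records, fields):
    """Render the selected fields of each record as a tab-separated table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # The id column is always included, followed by the selected fields.
    writer.writerow(["id", *fields])
    for record in records:
        writer.writerow([record["id"], *(record.get(name, "") for name in fields)])
    return buffer.getvalue()


# Made-up Record matching the filter "date of birth > 2000".
people = [
    {"id": 7, "first name": "Ada", "family name": "Example", "date of birth": "2001-05-17"},
]

table = select_to_tsv(people, ["first name", "family name"])
```

A table in this form can be read directly by spreadsheet programs and most analysis tools, which is what makes it useful as an interface to existing workflows.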
CQL is inspired by SQL and therefore probably feels familiar to users with knowledge of prevalent database management systems. It should be clear from the aforementioned examples that the query language is structured, precise and powerful, but nevertheless resembles English sentences. This makes it easier to learn for users without SQL experience.

User Management and Access Control
CaosDB provides a fine-grained role-based access control system with access control lists. It is possible to define the permissions for insertion, update, retrieval and deletion of Entities, single Properties, and is-a relations, as well as the access to the transaction log and the user management.
CaosDB has a built-in user database where users can sign up or be registered by administrators. Furthermore, users can log in with the credentials of their user accounts from PAM (Pluggable Authentication Modules). Access roles, which are relevant for the authorization, can be assigned to clients based on various criteria including their authentication status, the Unix groups of the user (if PAM is used), and connection details such as the IP address. This makes it possible to share subsets of the database with collaborators and even a greater audience of anonymous users.
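The interplay of role assignment and access control lists can be sketched as follows. All role names, entity names, the network range and the permission names are hypothetical; the sketch only illustrates the principle of deriving roles from authentication status, Unix groups and IP address and then checking them against a per-entity list.

```python
import ipaddress

# Hypothetical institute network used to assign an "intranet" role.
INTRANET = ipaddress.ip_network("10.0.0.0/8")

# Hypothetical access control lists: entity name -> role -> permitted actions.
ACL = {
    "Experiment_42": {
        "group:cardiac-lab": {"retrieve", "update", "delete"},
        "intranet": {"retrieve"},
    },
    "PublishedDataset_7": {"anonymous": {"retrieve"}},
}


def roles_for(client):
    """Derive the roles of a client from its authentication and connection details."""
    roles = {"anonymous"}
    if client.get("authenticated"):
        roles.add("authenticated")
        roles.update(f"group:{group}" for group in client.get("unix_groups", ()))
    if ipaddress.ip_address(client["ip"]) in INTRANET:
        roles.add("intranet")
    return roles


def is_permitted(client, entity_name, action):
    """True if any role of the client grants the action on the entity."""
    entity_acl = ACL.get(entity_name, {})
    return any(action in entity_acl.get(role, set()) for role in roles_for(client))


member = {"authenticated": True, "unix_groups": ["cardiac-lab"], "ip": "10.1.2.3"}
guest = {"authenticated": False, "ip": "203.0.113.5"}
```

In this toy setup the group member may update Experiment_42, while the anonymous guest can only retrieve the dataset that was explicitly opened to anonymous users.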

Discussion
CaosDB is currently in beta testing stage and handling around 40 TiB of experimental data from biomedical physics in 250,000 Files along with detailed metadata contained in about 320 Record Types and 95,000 Records. Data file types include video recordings from optical imaging, electrophysiological time series, scanned lab notes and image files. Furthermore, data and parameters from simulations and information about source code are stored along with analysis results of experimental and numerical data. The analysis results are thereby linked to the data from which they stem. Many file types are automatically parsed and integrated as parts of Records into our data model. The hash sums computed and stored for every file allow for comprehensive consistency checks. One advantage of our strict separation between file system and data model is that the system can be used directly on top of the established file system structure without the need to move or modify any existing file.
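A consistency check based on stored hash sums can be sketched with the standard hashlib module. The choice of SHA-256 and the function names are assumptions made for this example; the text does not state which hash algorithm CaosDB uses.

```python
import hashlib


def file_checksum(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Chunked reading keeps memory usage low for large data files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_inconsistent(stored_checksums):
    """Return the paths whose current checksum differs from the stored one.

    stored_checksums maps a file path to the hex digest recorded earlier.
    """
    return [
        path
        for path, expected in stored_checksums.items()
        if file_checksum(path) != expected
    ]
```

Calling find_inconsistent with the checksums recorded at insertion time would list every file that has been modified on the file system since then.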
For data analysis, mainly the CaosDB Python client is used, which can directly query and retrieve the relevant data and use it for more specific analyses. The power of CQL leads to much more complex SQL queries in the background than users would typically write manually, and consequently to responses that are perceived as slower. Still, it already proves to be faster than using simple file operations for finding specific data. However, the query language processing is still subject to algorithmic optimizations.
Our current efforts to improve the data model focus on connecting experimental data to intermediate and final results of data analysis, and the integration of data from cardiac simulations.

Conclusions
In this article we presented our approach to improving research data management in heterogeneous scientific environments. We presented our perspective on the current situation, proposed a list of requirements an RDMS must fulfill, and described how our research data management system lives up to these requirements.
The most important differences to existing solutions include a smart data model framework which enforces the development of standards while allowing for enough flexibility to adapt to rapidly changing scientific workflows. Furthermore, the powerful and intuitive query language enables quick access to data and simplifies data retrieval and data analysis.
We conclude that our database management system could provide a solution for ongoing issues with research data management in heterogeneous environments and promote the development of standards of data storage and retrieval. It will thereby improve the FAIRness of research data management.

Software availability
The public git repository with the source code is available at https://gitlab.gwdg.de/bmp-caosdb as version v0.1. The program version described here can be accessed at http://dx.doi.org/10.17617/3.1s [3]. The software requirements are: Java 1.7 or higher, MySQL or MariaDB, Python. The software is licensed under the GNU AGPLv3.