The RADAR Project — A Service for Research Data Archival and Publication

The aim of the RADAR (Research Data Repository) project is to set up and establish an infrastructure that facilitates research data management: the infrastructure will allow researchers to store, manage, annotate, cite, curate, search and find scientific data on a digital platform that is available at any time and can be used by multiple (specialized) disciplines. While appropriate and innovative preservation strategies and systems are in place for the big data communities (e.g., environmental sciences, space, and climate), the stewardship for many other disciplines, often called the “long tail research domains”, is uncertain. Funded by the German Research Foundation (DFG), the RADAR collaboration project develops a service-oriented infrastructure for the preservation, publication and traceability of (independent) research data. The key aspect of RADAR is the implementation of a two-stage business model for data preservation and publication: clients may either preserve research results for up to 15 years and assign graded access rights, or publish data with a DOI assignment for an unlimited period of time. Potential clients include libraries, research institutions, publishers and open platforms that desire an adaptable digital infrastructure to archive and publish data according to their institutional requirements and workflows.


Introduction: Digitalization of Research Workflows
In principle, the concept of data sharing and reuse is not new: before the digital revolution, journal articles could not feasibly include all the underlying data. Thus, a researcher who wanted access to such external data, e.g., from another research group, had to contact that group directly. The success of such a data request often depended on trust, reputation and the intentions for reuse. Hence, the sharing of research data along with other scientific outputs such as journal articles and software connects to the very basis of science: building on, reusing, and openly discussing and evaluating published scientific findings.
With digitalization and the resulting global connectivity, the way research is conducted has changed. Rapidly growing digital data production, accompanied by new developments in equipment, software and scientific methods, challenges the scientific community: scientists, policy makers, funding bodies, journalists and the interested general public alike have to keep up with scientific progress and the rapidly expanding availability of digital information. Until now, much work has been done to improve the accessibility of so-called "big data". Big data include extensive datasets produced through large science projects, such as the particle physics research carried out at the European Organization for Nuclear Research (CERN), e.g., using the Large Hadron Collider [1]. However, there are thousands of research studies that produce "smaller" datasets. In 2011, the journal Science surveyed 1700 researchers across disciplines. The survey found that 48.3% of respondents were working with datasets of less than 1 GB in size, and that over half of those polled store their data only in their laboratories [2]. Such heterogeneous data collections occur across various scientific disciplines and are often associated with the so-called "long tail" of science. Characteristic of these disciplines are hypothesis-driven studies led by small investigation groups that produce and analyze their own datasets [3,4]. In this long tail of scientific fields, a formal data culture is rarely found: the use of standards and best practices for data management depends heavily on the community, and datasets often lack a defined structure. Consequently, established data management infrastructures, such as appropriate data repositories, are few and far between in these fields.
One basic requirement for both small and big data that has emerged in recent years is the need for all scientific data to be "open": open to be searched, traced, cited and downloaded, and ideally supplied with a suitable license that indicates their potential for further reuse inside and outside the respective community boundaries and context. Sharing this point of view, funding agencies have introduced new mandates for formal data management. Such mandates have been implemented, e.g., by the National Science Foundation, which requires researchers to include a data management plan with their proposals for funding [5]. In Germany, similar developments are expressed, e.g., in the guidelines for "Safeguarding Good Scientific Practice" of the German Research Foundation [6]. As a result, the data underlying scientific studies are beginning to be recognized as a primary research output, and researchers are asked to take a more active part in research data management. This may include the establishment of data management plans at the beginning of research projects, a diligent search for and collection of scientific information from established literature sources and adequate repositories, and the publication of their own data, e.g., by depositing them in an appropriate repository in order to allow reuse and to complete the data lifecycle [7][8][9].
Consequently, research institutions, universities and libraries are becoming more interested in collecting and providing access to datasets produced at their institution that do not fall within the scope of big data or discipline-based repositories. In addition, researchers themselves are starting to look for data services. This situation brings new opportunities to establish a support and data service infrastructure, e.g., by forming new cooperations between data centers, research institutes and libraries. Such cooperation may result in a collaborative infrastructure that provides support services ranging from consultations for researchers and assistance with data management plans to the provision of actual storage space for data preservation and data publication services (including the assignment of persistent identifiers to datasets). Ideally, such data infrastructures allow research data to be stored, managed, annotated and curated in a digital repository that is available at any time and can be used by multiple disciplines.
This article presents the RADAR (Research Data Repository) project. RADAR will develop and establish a generic research data infrastructure for data preservation and publication in the above-mentioned fields of the long tail of science.

RADAR: Scope, Collaborations, Goals and Architecture
RADAR is an interdisciplinary digital data repository that provides both preservation and publication services, primarily for disciplines without a tradition of data sharing, including the fields of the so-called long tail. RADAR offers these services to academic, research and cultural heritage organizations as well as to industrial customers. RADAR welcomes data from specialized research disciplines of all areas, i.e., natural, life, economic, social and cultural sciences. With the start of the service in 2016, RADAR will focus on institutional users (i.e., typically libraries of universities and research organizations). Later on, the offering will be supplemented with solutions for research projects and scientific publishers. In the data lifecycle context, RADAR is a system with a service placed in the "Persistent Domain" of the conceptual data management model described in the "domains of responsibility" [7]. The domains of responsibility show the duties and responsibilities of the actors involved in research data management. At the same time, the domains outline the contexts of shared knowledge about data and metadata information, with the goal of a broad reuse of preserved and published research data.
For RADAR, we decided to introduce a service-oriented architecture that allows modular research data management. Users can therefore employ RADAR in several ways. The most obvious way is to rely on the hosted service, as described in the two-stage service model in Section 3. In addition, RADAR will offer API access, so that users can integrate the archival back end into their own systems and processes. A third option is to install the RADAR software locally, either deploying only the management and user interface part and archiving the data in the hosted RADAR service via the API, or running everything locally. Finally, users can run the complete software stack locally and use the hosted RADAR service as a replica storage solution.
The repository is being developed as part of a three-year project funded by the German Research Foundation from 2013 to 2016 (http://www.radar-projekt.org) and is placed within the program "Scientific Library Services and Information Systems (LIS)" on restructuring the national information services in Germany.

Collaboration
RADAR is developed as a cooperation project of five research institutes from the fields of natural and information sciences. The technical RADAR infrastructure is provided by FIZ Karlsruhe, Leibniz Institute for Information Infrastructure, and the Steinbuch Centre for Computing (SCC) at the Karlsruhe Institute of Technology (KIT). The sustainable management and publication of research data with DOI assignment is provided by the German National Library of Science and Technology (TIB). The Faculty for Chemistry and Pharmacy of the Ludwig-Maximilians-Universität Munich (LMU) and the Leibniz Institute of Plant Biochemistry (IPB) provide the scientific knowledge and specifications and ensure that RADAR services can be implemented as part of the scientific workflow of academic institutions and universities.

Goals
The goal of RADAR is to establish an interdisciplinary research data repository that is sustained by research communities and supported by a stable business model. The data management processes and tools needed for this are listed at the end of this article. The heterogeneity of research data is a serious issue for many research data repositories. RADAR addresses this problem by focusing on real scientific workflows and by elaborating a generic best-practice approach that is evaluated and tested with data provided by scientific partners from different research areas.

Architecture
E-research projects often require comprehensive collaborative features. These include data storage, access rights management and version control. In 2004, a collaboration of FIZ Karlsruhe and external partners developed the e-research platform eSciDoc, a flexible repository solution offering a wide range of functionality for global scientific collaboration. In 2014, the development of "eSciDoc Next Generation" began [10]; it is a fully revised version of eSciDoc and is used in the RADAR project. eSciDoc Next Generation no longer includes Fedora, but is (as is RADAR) open source and licensed under the Apache License 2.0 (ASL 2.0).
Within RADAR, data storage is managed using repository software that consists of two parts: the back end handles general tasks such as storage access, bitstream preservation and regular reports on data integrity, whereas the front end manages RADAR-specific workflows (Figure 1). These workflows include various data services: metadata management, access control, data ingest processes, as well as the licensing for reuse and the publishing of research data with DOI. The corresponding RADAR architecture is based on an expandable API structure, referred to as the "Archive API" (Figure 1). This structure allows the integration of multiple computing centers that use various storage systems (e.g., TSM, SamQFS, DMS, and HPSS). To provide a uniform archiving interface, the API hides these various storage systems and technologies.
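The idea of a uniform archiving interface that hides the storage systems behind it can be illustrated with a minimal Python sketch. All class and method names here (StorageBackend, InMemoryBackend, ArchiveAPI, archive, retrieve) are hypothetical and chosen for illustration; they are not taken from the RADAR or eSciDoc code base, and the in-memory backend merely stands in for an adapter to a real system such as TSM or HPSS.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class StorageBackend(ABC):
    """Hypothetical interface that every storage system adapter implements."""

    @abstractmethod
    def put(self, object_id: str, payload: bytes) -> None:
        """Store the payload under the given identifier."""

    @abstractmethod
    def get(self, object_id: str) -> bytes:
        """Return the payload; raise KeyError if it is unknown."""


class InMemoryBackend(StorageBackend):
    """Stand-in for a real adapter (e.g., one wrapping a tape archive)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, object_id: str, payload: bytes) -> None:
        self._objects[object_id] = payload

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


class ArchiveAPI:
    """Uniform facade: callers never see which storage system is used."""

    def __init__(self, backends: List[StorageBackend]) -> None:
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = backends

    def archive(self, object_id: str, payload: bytes) -> None:
        # Write one copy to every configured computing center.
        for backend in self._backends:
            backend.put(object_id, payload)

    def retrieve(self, object_id: str) -> bytes:
        # Return the first copy that can be read.
        for backend in self._backends:
            try:
                return backend.get(object_id)
            except KeyError:
                continue
        raise KeyError(object_id)
```

In this sketch, adding a further computing center would only require another StorageBackend subclass; code that calls ArchiveAPI stays unchanged.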
The RADAR architecture was developed in accordance with the OAIS reference model [11]. Within the architecture, Archival Information Packages (AIP) and Dissemination Information Packages (DIP) will be provided as a BagIt structure in a ZIP container; each container will include data and metadata. As part of the import/export strategy, RADAR will provide an API that allows the import and export of data as well as metadata. Metadata export will be available in different formats.
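As a rough illustration of what a BagIt structure inside a ZIP container looks like, the following Python sketch writes a bag declaration, a payload directory and a SHA-256 payload manifest using only the standard library. The function name write_bag_zip, the bag directory name and the location metadata/metadata.xml are assumptions made for this example; the actual layout of RADAR packages is not specified in this article.

```python
import hashlib
import zipfile
from typing import Dict


def write_bag_zip(zip_path: str, files: Dict[str, bytes],
                  metadata_xml: str, bag_name: str = "bag") -> None:
    """Write payload files and a metadata file into a ZIP laid out as a bag."""
    manifest_lines = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        # Bag declaration required by the BagIt specification.
        zf.writestr(
            f"{bag_name}/bagit.txt",
            "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        )
        for rel_path, content in sorted(files.items()):
            # Payload files live under data/ and are listed in the manifest.
            zf.writestr(f"{bag_name}/data/{rel_path}", content)
            digest = hashlib.sha256(content).hexdigest()
            manifest_lines.append(f"{digest}  data/{rel_path}")
        zf.writestr(
            f"{bag_name}/manifest-sha256.txt",
            "\n".join(manifest_lines) + "\n",
        )
        # Assumed location for the descriptive metadata (a tag file).
        zf.writestr(f"{bag_name}/metadata/metadata.xml", metadata_xml)
```

A recipient can recompute the checksums listed in manifest-sha256.txt to verify that the payload arrived unchanged.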

Metadata Schema
Metadata are essential to the traceability, access, and effective use of scientific data. In RADAR, submitted data must be accompanied by a set of basic descriptive metadata parameters that document and describe the respective resource. The RADAR metadata schema (Figure 2) aims to enhance the traceability and usability of research data by maintaining a discipline-agnostic character while still allowing a description of discipline-specific data. For this purpose, the schema includes a set of generic parameters that allow an accurate and consistent identification of a resource for citation and retrieval purposes, while at the same time meeting the requirements of more discipline-specific datasets.
The schema includes nine mandatory fields, which represent the general core of the schema. These mandatory fields contain the main requirements for DOI registration, in accordance with the DataCite Metadata Schema 3.1 [12]. Additionally, 12 optional metadata parameters serve the purpose of describing discipline-specific data. The parameters were implemented with a combination of controlled vocabularies and free-text entries, thereby covering heterogeneous data produced by a multitude of disciplines. The controlled vocabulary entries were defined with established regulations in mind (for example, ISO standards for the language and country of origin of the data). The discipline-agnostic applicability of the metadata schema has been tested with various data, including test datasets from the humanities, sport sciences and applied chemistry.
Clients who wish to enhance the prospects of their metadata being found, cited and linked to original research are strongly encouraged to submit the optional parameters in addition to the mandatory set of properties. The metadata of datasets that are published in RADAR will be available under the Creative Commons Zero license [13]. The schema will be accompanied by recommended use instructions with appropriate examples of how to efficiently describe research data from various disciplines, as well as by a support service for harvesting published metadata via an OAI-PMH interface.
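A check for mandatory metadata could look like the following Python sketch. The article does not enumerate the nine mandatory RADAR fields (they are shown in Figure 2), so the tuple below lists only the five properties that the DataCite Metadata Schema 3.1 marks as mandatory; it is an illustrative subset, and the function name and the record layout (a flat dictionary) are assumptions.

```python
from typing import Dict, List

# Properties that DataCite Metadata Schema 3.1 marks as mandatory.
# The full RADAR list of nine mandatory fields is not given in the text,
# so this is an illustrative subset only.
MANDATORY_FIELDS = (
    "identifier",
    "creator",
    "title",
    "publisher",
    "publicationYear",
)


def missing_mandatory(record: Dict[str, object]) -> List[str]:
    """Return the mandatory field names that are absent or empty."""
    missing = []
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        # Treat None and blank strings as "not provided".
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing
```

A submission workflow might reject a dataset, or keep it in an editable state, as long as this list is not empty.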

Two-Stage Service Model
The heterogeneity of research data is a significant challenge for many research data repositories. Depositing research data in RADAR ensures that the requirements of funding agencies and of Good Scientific Practice are met, and that other researchers are able to find, reuse and cite published data. To facilitate the submission and integration of research data into the digital repository, RADAR offers detailed author guidelines and step-by-step explanations on how to choose between the offered preservation and publication services and how to prepare and submit the data. Published data will be assigned a persistent identifier (DOI), which aids their citability as part of the researcher's publication record.
As a generic service, RADAR will accept all types of digital data that are collected in the course of scientific research studies. A dataset deposited in RADAR may comprise raw data, primary data (intermediate working data), secondary data, and files describing the data and documenting the research process. RADAR accepts both data underlying scientific articles and standalone data publications, e.g., "negative data". RADAR does not accept pre-prints, doctoral theses or other grey literature. However, where such material forms part of the raw data used for analyses, RADAR strongly encourages data depositors to provide information on the related content using the metadata schema. Data may be submitted in any file format. Recommendations reflecting the requirements for long-term accessibility of digital content will be provided in the author guidelines (e.g., the use of XML-based formats for text files).
A RADAR test system was designed and implemented in June 2015. Interested users and stakeholders are welcome to test and evaluate the provided RADAR services for their utility in daily scientific workflows. In this way, RADAR will be evaluated by researchers from different scientific domains. RADAR pursues a two-stage approach with a discipline-agnostic basic service for preserving research data (Section 3.1) and an extended service for data publication (Section 3.2). Detailed service concepts, roles and dataset statuses of the RADAR repository system are shown in Table 1.
Table 1. Service concepts, roles and dataset status of the RADAR repository system.

General Service Concepts in RADAR
(A) Contract: A customer (e.g., institution, publisher, research project) enters a contract and agrees to pay for services provided by RADAR.

(2.) Review: The dataset is "frozen" for the duration of the peer review process and receives a secure "review-URL".
(3.) Archived (= service: Data Preservation): The dataset is archived and identified via Handle. As soon as this status is chosen by the curator, no further editing within a dataset is possible.
(4.) Published (= service: Data Publication): The dataset is published and identified via DOI. As soon as this status is chosen by the curator, no further editing within a dataset is possible. A landing page is created and the corresponding metadata can be found using discovery services.
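The dataset statuses in Table 1 can be read as a small state machine. The Python sketch below models them under stated assumptions: Table 1 only says that editing is possible while a dataset is pending and impossible once it is archived or published, so the set of allowed transitions (for example, returning from review to pending) is a guess made for illustration, and the class and method names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    REVIEW = "review"
    ARCHIVED = "archived"
    PUBLISHED = "published"


# Assumed transitions; the article does not spell them out.
ALLOWED = {
    Status.PENDING: {Status.REVIEW, Status.ARCHIVED, Status.PUBLISHED},
    Status.REVIEW: {Status.PENDING, Status.PUBLISHED},
    Status.ARCHIVED: set(),   # final: no further editing
    Status.PUBLISHED: set(),  # final: no further editing
}


class Dataset:
    """Hypothetical dataset that enforces the status rules of Table 1."""

    def __init__(self) -> None:
        self.status = Status.PENDING
        self.files = {}

    def add_file(self, name: str, content: bytes) -> None:
        # Editing is only possible while the dataset is pending.
        if self.status is not Status.PENDING:
            raise PermissionError(
                f"dataset is {self.status.value}; editing is not possible"
            )
        self.files[name] = content

    def set_status(self, new_status: Status) -> None:
        # Reject any transition that is not listed in ALLOWED.
        if new_status not in ALLOWED[self.status]:
            raise ValueError(
                f"cannot change status from {self.status.value} "
                f"to {new_status.value}"
            )
        self.status = new_status
```

Keeping the transition table in one place makes it easy to adjust the sketch if the real workflow permits other paths.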

Basic Service: Data Preservation
For data providers, RADAR offers format-independent preservation services to store data in compliance with specified long-term storage periods (e.g., 10 years, according to DFG recommendations). This includes secure preservation for up to 15 years, with the data remaining unpublished, and requires a minimum set of metadata. By default, the data and associated metadata will not be published unless specified otherwise by the data provider. Flexible access management for data and metadata will be offered, so that data providers can share preserved datasets with other RADAR users if desired and manage the external visibility of the associated metadata. Bitstream preservation will produce backup copies of the resource to ensure that it is preserved.

Extended Service: Data Publication
To make data citable, traceable and reusable, RADAR offers a combined service of research data publication and permanent preservation. Datasets published in RADAR are identified by a DOI, which allows them to be referenced persistently and unambiguously. The service also includes an optional embargo period for the publication of submitted data, which can subsequently be prolonged if necessary. The metadata describing the dataset are already published during the embargo, and the dataset is allocated a DOI. This ensures that datasets can be found and cited as soon as they are deposited, while downloads only become possible once the embargo period has expired. Within the publication service, a peer review option may be used: in this case, the respective dataset is "frozen" for the duration of the peer review process and receives a secure "review-URL" provided by RADAR, which may be forwarded to an editor or reviewer responsible for a corresponding manuscript submission. In this way, manuscript and data may be inspected simultaneously during a review process.
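The embargo rule described above (metadata visible from the start, downloads only after the embargo has expired, prolongation possible) can be sketched in Python as follows. The class name, the attribute names and the convention that embargo_until is the last day of the embargo are assumptions made for this example, and the DOI used in the usage note is a made-up placeholder.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PublishedDataset:
    """Hypothetical record of a published dataset with an optional embargo."""

    doi: str
    embargo_until: Optional[date] = None  # last embargo day; None = no embargo

    def metadata_visible(self, today: date) -> bool:
        # Metadata are published already during the embargo.
        return True

    def download_allowed(self, today: date) -> bool:
        # Downloads become possible once the embargo period has expired.
        return self.embargo_until is None or today > self.embargo_until

    def prolong_embargo(self, new_end: date) -> None:
        # An embargo can only be extended, never shortened, in this sketch.
        if self.embargo_until is None:
            raise ValueError("dataset has no embargo to prolong")
        if new_end <= self.embargo_until:
            raise ValueError("new end date must be later than the current one")
        self.embargo_until = new_end
```

For example, a dataset created as PublishedDataset("10.1234/example", date(2016, 6, 30)) would be citable immediately but downloadable only from 1 July 2016, unless the embargo is prolonged.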
The two-stage service structure offers additional data management services (Table 1). To assist researchers in the submission of descriptive metadata (Figure 2), detailed author guidelines are provided along with appropriate examples from various research disciplines. RADAR also performs a technical quality control of research data and the corresponding metadata during the transfer of uploaded data objects into the repository. Data providers will be notified by e-mail and through their user account when the preservation and, if applicable, the publication process has been completed successfully. RADAR will not check datasets with regard to their scientific content. Thus, the responsibility for describing datasets and for comprehensibly documenting the research process remains with the data producers. The consistency of datasets preserved in RADAR is regularly checked and documented. Datasets can be retrieved by the data providers at any time after deposition.
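A regular consistency check of preserved datasets is commonly implemented as a fixity check: checksums recorded at ingest are compared with checksums recomputed from the stored files. The article does not state which mechanism RADAR uses, so the following Python sketch is only one plausible approach; the function names and the use of SHA-256 are assumptions.

```python
import hashlib
from typing import Dict, List


def sha256_of(content: bytes) -> str:
    """Return the hexadecimal SHA-256 checksum of the given bytes."""
    return hashlib.sha256(content).hexdigest()


def check_fixity(stored: Dict[str, bytes], recorded: Dict[str, str]) -> List[str]:
    """Return the names of files that are missing or no longer match.

    `recorded` maps file names to the checksums noted at ingest;
    `stored` maps file names to the bytes currently held in the archive.
    """
    damaged = []
    for name, expected in sorted(recorded.items()):
        content = stored.get(name)
        if content is None or sha256_of(content) != expected:
            damaged.append(name)
    return damaged
```

An empty result means that every recorded file is present and unchanged; any reported name would be a candidate for restoration from a backup copy.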

Data Management within RADAR: Access and Usage
As part of the RADAR service, data curators may receive regular reports on usage statistics, such as the number of downloads. The terms on which data may be reused depend on the copyright laws in effect and on the license that was assigned to the data upon deposit in a repository. For datasets published in RADAR, the use of standardized Creative Commons (CC) licenses (version 4.0) is recommended. However, customized licenses or other descriptions that specify the terms of reuse may also be given. The license information will be displayed to users on the landing pages, together with other descriptive metadata such as author(s), title, year of publication, related information and the download link.

The Business Model: Cost and Pricing
The services presented in Section 3 are part of the RADAR business model, which is intended to ensure a sustainable operating environment for RADAR and to give scientists a tool for applying for data management funds.
Who will pay for public access to research data? A service to preserve and publish research outputs can take up a significant part of an institutional strategy and budget planning process. Basic funding of data infrastructure may not keep pace with increasing costs. This forces the operators of data repositories to consider alternative cost recovery options and multiple revenue streams. The question is being addressed globally by different institutions and approaches, e.g., within the RDA/WDS Interest Group Publishing Data Cost Recovery for Data Centres [14] and the APARSEN project, which maps and compares the various models [15].
Many cost models have become available in recent years, and within the European 4C project [16] much valuable work has been done to analyze these models and to develop a generic cost comparison tool, the Curation Costs Exchange. A central finding from APARSEN [15] was that a common basis for the comparison and apportionment of costs across national and institutional borders is hard to find. Because cost models are quite specific to the organizations where they were created, they differ in terms of activities, services and workflows. The DANS [17] and DP4lib [18] models provide useful tools for developing third-party preservation services as additional revenue streams for repositories. For RADAR, we selected a "cost by service" approach. The calculation sheets depend on the three central phases (ingest, curation and access) of the repository service. This assures the project's financial independence from third-party companies or institutions and consequently its long-term sustainability.
Based on the expected costs, a RADAR pricing model was created. The pricing model includes yearly payment plans based on institutional contracts, depending on the required storage volume and duration. This provides flexibility for institutional customers, as contracts can be adjusted, e.g., to varying data volumes.
A cost estimation tool will enable both researchers and institutions to receive quotes before using any of the offered storage services. With this tool, RADAR provides the opportunity to analyze the costs of data preservation during the project planning phase and to include these estimates in data management plans. Furthermore, researchers are encouraged to include the quotes in grant proposals in order to receive funding for research data preservation and publication. With this approach, RADAR responds to the growing requirement that research outcomes be openly accessible. The pricing model for RADAR is available on the RADAR homepage: http://www.radar-projekt.org/display/RE/Home.
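Since the pricing model depends on storage volume and duration with yearly payments, a quote can in principle be computed along the lines of the Python sketch below. This is not the RADAR cost estimation tool: the function name, the linear formula and the rate parameter are assumptions, and the rate must be supplied by the caller because no actual RADAR prices are given in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    yearly_cost: float  # amount due per contract year
    total_cost: float   # amount due over the whole storage duration


def estimate_quote(volume_gb: float, years: int,
                   rate_per_gb_year: float) -> Quote:
    """Estimate costs as volume x rate per year, summed over the duration.

    `rate_per_gb_year` is a placeholder supplied by the caller; it does
    not represent a real RADAR price.
    """
    if volume_gb <= 0 or years <= 0 or rate_per_gb_year < 0:
        raise ValueError("volume and years must be positive, rate non-negative")
    yearly = round(volume_gb * rate_per_gb_year, 2)
    return Quote(yearly_cost=yearly, total_cost=round(yearly * years, 2))
```

A researcher could call such a function with the expected data volume and retention period during project planning and copy the resulting figures into a data management plan.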

Conclusions and Outlook
An important lesson learned is that enforced mandates in science are not helpful when attempting to enhance the usage of (data) repositories and the publication of research data. It became apparent that trust, along with knowledge of data curation and compliance with the rules of good scientific practice in research organizations such as the DFG (German Research Foundation), the HGF (Helmholtz Association) and the MPG (Max Planck Society), is the key to encouraging personal motivation to publish primary data and other products of the research cycle. Furthermore, scientific data should be citable as publications and be linked to the corresponding article. This would allow authors to benefit from being credited for data publications and from increased citation rates.
With RADAR, we present a project to establish an interdisciplinary research data infrastructure. The novel two-stage service and business model, combined with a trustworthy repository for researchers, librarians, institutions and publishers, will contribute to better availability, sustainable preservation and publishability of research data for present and future scientific communities. We aim to provide a functional research data repository system featuring the services described in this article by summer 2016. Open interfaces (API) of the repository may also promote the interoperability of RADAR on a disciplinary, national and international scale.

The data management processes and tools needed to reach the goals of RADAR include:
• guidelines for researchers to introduce and facilitate research data management in general and to store and/or publish their research data;
• a secure data preservation service including adequate storage periods (5, 10 and 15 years as well as permanent storage) by the use of distributed data storage mechanisms;
• (optional) data publication with Digital Object Identifier (DOI) assignment to secure traceability, access and citability; and
• technical implementation support for research institutions (e.g., by open API, the possibility of front end branding as well as the option for data peer review).

Figure 1. Scheme of the RADAR (Research Data Repository) architecture, showing the API structure.

Figure 2. Descriptive RADAR metadata schema including mandatory (left column) and optional (right column) parameters.

(B) Workspace: The workspace represents an organizational entity in RADAR where research data and metadata can be added, modified and structured.
(C) Dataset: Represents a collection of digital data that, with regard to content, belong together and form a logical entity.

Roles and responsibilities
(A) RADAR Administrator: Supervision of contracts and provision of admin accounts for customers (internal).
(B) Workspace Administrator: Creation of workspaces and assignment of curator(s), e.g., an institution, facility or project being a customer of RADAR.
(C) Curator: Management within associated workspace(s) and optional set-up of additional (sub-)curators, e.g., a scientist, data manager or editor within an institution.

Dataset status and two-stage service model
(1.) Pending: Initial mode of a dataset uploaded to RADAR; editing is possible (including modification, update,