Over recent years, a growing number of gatekeepers, such as funding organizations (the European Commission and, at the national level, the French ANR [1], US NIH [2], UK Wellcome Trust [3] and Austrian FWF [4]) and journals, have come to demand data sharing, data management and data stewardship according to the FAIR (Findability, Accessibility, Interoperability and Reusability) principles in order to ensure transparency, reproducibility and reusability of research [5]. In meeting those requirements, the implementation of FAIR principles also benefits different stakeholders: (i) researchers, who receive credit for their work and benefit from data shared by other researchers, (ii) funding agencies aiming for long-term data stewardship, (iii) professional data publishers, who get credit for their software, tools and services for data handling, and (iv) data science communities performing exploratory analyses. Making data FAIR requires skills and guidance [5
Although LS are quite advanced in developing disciplinary tools for data archiving and have established metadata standards for data reuse, we found a lack of tools supporting the active research process. As a result, ensuring that research data are FAIR after publication and that analyses are reproducible is tedious and demanding work. In search of a solution, we identified a US initiative from the University of Arizona called CyVerse US, which supports these processes from data generation, management, sharing and collaboration through to analytics. CyVerse US was originally created by the National Science Foundation in 2008 under the name iPlant Collaborative. From its inception, iPlant quickly grew into a mature organization providing powerful resources and offering scientific and technical support services to researchers nationally and internationally. In 2015, iPlant was rebranded as CyVerse US to emphasise an expanded mission to serve all LS.
Merely registering Austrian researchers on a US platform to enable collaborations, data sharing and analytics is not reasonably practicable for several reasons, among them GDPR restrictions in Europe on user access management, the general data sharing regulations for researchers at Austrian institutions [9] and the NSF funding required for access to CyVerse US. In addition, data-intensive HPC workloads on commercial cloud providers can be expensive due to data storage and transfer charges [11]. It therefore seemed natural to deploy a similar, local platform for LS researchers in Graz, Austria, based on local requirements and using institutional infrastructure. CyVerse Austria (hereafter CAT, https://cyverse.tugraz.at) is an extensible platform deployed within the framework of BioTechMed Graz to support LS researchers in Austria at Graz University of Technology (TUG), the University of Graz (KFUG) and the Medical University of Graz (MUG) in data management and complex bioinformatic analyses using high performance computing (HPC), with support for containerisation with Docker and Singularity.
Cyberinfrastructure (also known as CI or computational infrastructure) [13] provides solutions to the challenges of large-scale computational science. Analogous to physical infrastructure such as laboratories, which makes it possible to collect data, cyberinfrastructure and the hardware, software and people that comprise it make it possible to store, share and analyse data. To ensure reproducible research and data analytics and to avoid dependency hell (the frustration caused by software that depends on specific versions of other software packages) [14], CAT is based on Docker [15] technology. Using cyberinfrastructure, teams of researchers can attempt to answer questions that were previously unapproachable because the computational requirements were too large or too complex. Moreover, collaborations are strengthened by federated storage of research data from collaborating institutions. Finally, CAT enables FAIR data practices. Adding corresponding metadata to each dataset ensures that research data are documented. Metadata are searchable and connected to the dataset; data are therefore findable and accessible, meaning that users can search their own (meta)data as well as all (meta)data that collaborators have shared with them. In addition, CAT endorses and recommends the use of international standards for data and metadata to make them interoperable, and supports interoperability by providing standardised metadata templates. Metadata standards in CAT include common formats such as Dublin Core, Minimum Information for a Eukaryotic Genome Sequence (MIGS), Minimum Information for a Metagenomic Sequence (MIMS), NCBI BioProject, NCBI BioSample and Legume Federation. CAT users can decide whether to use the standard metadata templates or to establish their own structure for metadata files. Data sharing, together with adequate documentation, makes research data understandable and hence reusable. Apart from datasets, analytics tools are saved as Docker images in CAT, ensuring the accessibility, interoperability and reusability of code [16
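To make the role of metadata templates and searchable (meta)data more concrete, the following is a minimal sketch in Python. It is not the CAT implementation: the `Dataset` model, the choice of required fields, the paths and the user names are all hypothetical. The field names are taken from the Dublin Core element set, but which elements a given template requires is an assumption made here for illustration.

```python
from dataclasses import dataclass, field

# Four of the fifteen Dublin Core elements, treated here as required.
# Which elements a real template requires is an assumption.
REQUIRED_DC_FIELDS = ("title", "creator", "date", "identifier")


@dataclass
class Dataset:
    """A dataset path plus its attached key/value metadata (hypothetical model)."""
    path: str
    owner: str
    metadata: dict = field(default_factory=dict)
    shared_with: set = field(default_factory=set)


def missing_fields(dataset, required=REQUIRED_DC_FIELDS):
    """Return the required template fields that are absent or empty."""
    return [f for f in required if not str(dataset.metadata.get(f, "")).strip()]


def search(datasets, user, term):
    """Return paths of datasets visible to `user` whose metadata mention `term`."""
    term = term.lower()
    hits = []
    for ds in datasets:
        # A user sees their own datasets and those shared with them.
        visible = ds.owner == user or user in ds.shared_with
        matches = any(term in str(value).lower() for value in ds.metadata.values())
        if visible and matches:
            hits.append(ds.path)
    return hits


# Illustrative catalogue; all paths, names and values are made up.
catalogue = [
    Dataset(
        path="/cat/home/alice/rnaseq_run1",
        owner="alice",
        metadata={"title": "RNA-seq of root tissue", "creator": "Alice",
                  "date": "2021-03-01", "identifier": "run1"},
        shared_with={"bob"},
    ),
    Dataset(
        path="/cat/home/carol/proteomics",
        owner="carol",
        metadata={"title": "Proteomics pilot", "creator": "Carol"},
    ),
]
```

In this sketch, `missing_fields(catalogue[1])` reports the template fields still to be filled in, and `search(catalogue, "bob", "rna-seq")` only returns datasets that the user owns or that were shared with them.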
The infrastructure of CAT includes (i) a data storage facility with the possibility to integrate existing storage facilities, (ii) an interactive, web-based, analytical platform for Docker containers, (iii) web authentication and security services to allow the usage of existing authentication solutions, and (iv) support for scaling computational algorithms to run on HPCs, also by using existing HPC resources.
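As an illustration of how a containerised analysis step can be kept reproducible, the sketch below builds a `docker run` command line for a pinned image with a read-only input mount. It is a simplified example under stated assumptions, not a description of how CAT launches containers: the function names, the image reference, the host directories and the container mount points are hypothetical.

```python
def is_pinned(image):
    """Return True if the image reference carries a digest or a non-'latest' tag."""
    if "@sha256:" in image:
        return True
    # Drop registry and namespace, which may themselves contain a port colon.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name:
        return False
    return name.split(":", 1)[1] not in ("", "latest")


def docker_run_command(image, command, input_dir, output_dir):
    """Build an argv list that runs one analysis step in a throwaway container."""
    if not is_pinned(image):
        raise ValueError("pin the image by tag or digest for reproducibility")
    return [
        "docker", "run", "--rm",           # remove the container when it exits
        "-v", f"{input_dir}:/data/in:ro",  # input data mounted read-only
        "-v", f"{output_dir}:/data/out",   # results written back to the host
        image,
        *command,
    ]


# Hypothetical image reference and paths, for illustration only.
cmd = docker_run_command(
    "example.org/tools/fastqc:0.11.9",
    ["fastqc", "/data/in/reads.fastq", "-o", "/data/out"],
    "/home/alice/in",
    "/home/alice/out",
)
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` on a host with Docker installed. Refusing unpinned or `latest` image references is one simple way to make sure that re-running an analysis uses the same software stack.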
In summary, this project helped to enhance and simplify collaborations between universities in Graz. It (i) created a distributed computational and data management architecture, (ii) identified and incorporated relevant data from researchers in LS, and (iii) identified and hosted relevant tools, including analytics software, using Docker technology to ensure reproducible analytics for the researchers taking part in the initiative. In addition, it holds potential to serve other institutions as well as other disciplines. At this stage, CAT and CyVerse US differ in the management of storage and in the available HPC resources. CAT offers distributed storage at the participating universities, whereas CyVerse US provides federated storage for all its users. CAT also utilises the HPC resources available to affiliated universities by transferring and translating analysis jobs to these HPC systems, together with a core HPC system available to all researchers. In addition, CAT is not accessible from outside the university network, which is essential to align with university regulations on research data in Graz; non-local user authentication is one of the next steps in the development of CAT. In contrast, CyVerse US is accessible to everyone, but its usage depends on ongoing NSF funding. Currently, CAT offers a limited number of modules (Data Store, Discovery Environment); additional modules from CyVerse US (e.g., Atmosphere) will be implemented in the near future. Finally, CAT established a connection from HTCondor to an HPC cluster running the Son of Grid Engine resource management system, which provides valuable insight for the CyVerse US team and serves as a template for future HPC connections.
CAT is therefore a powerful platform that is useful for researchers in LS and beyond. It covers user requirements by (i) increasing efficiency in data management and sharing of data and tools, (ii) complying with funders’ requirements, (iii) scaling up analyses, (iv) ensuring reproducibility when re-running analyses, and (v) creating a network of researchers with different specialities.
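The translation of analysis jobs from HTCondor to a Son of Grid Engine cluster mentioned above can be pictured with the following minimal sketch. It is an assumption-laden illustration rather than the mechanism used in CAT: it handles only a handful of HTCondor submit keys, omits memory requests and file transfer, and the parallel environment name `smp`, the executable path and the file names are hypothetical and site-specific.

```python
def parse_condor_submit(text):
    """Parse 'key = value' lines of an HTCondor submit description into a dict."""
    job = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments and statements without '=' (e.g. 'queue').
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        job[key.strip().lower()] = value.strip()
    return job


def to_sge_script(job, parallel_env="smp"):
    """Render the parsed job as a Grid Engine batch script suitable for qsub."""
    lines = ["#!/bin/bash", "#$ -cwd"]  # run in the submission directory
    if "output" in job:
        lines.append(f"#$ -o {job['output']}")  # stdout file
    if "error" in job:
        lines.append(f"#$ -e {job['error']}")   # stderr file
    cpus = int(job.get("request_cpus", "1"))
    if cpus > 1:
        # The parallel environment name is defined per cluster (assumption).
        lines.append(f"#$ -pe {parallel_env} {cpus}")
    command = f"{job['executable']} {job.get('arguments', '')}".strip()
    lines.append(command)
    return "\n".join(lines) + "\n"


# Hypothetical submit description, for illustration only.
CONDOR_JOB = """
# example analysis job
executable = /opt/tools/run_analysis.sh
arguments = --input reads.fastq --threads 4
request_cpus = 4
output = job.out
error = job.err
queue
"""

job = parse_condor_submit(CONDOR_JOB)
script = to_sge_script(job)
```

Keeping the parsing and the rendering separate means that the same parsed job could, in principle, be rendered for a different resource manager by adding another rendering function.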
With this publication, we present the services CAT provides for researchers, introduce the available infrastructure in Graz and illustrate the value it adds for researchers with short use case descriptions. We address two audiences. On the one hand, we aim to inform the HPC community about state-of-the-art platforms that make HPC systems easily accessible to researchers. On the other hand, we believe that LS researchers are increasingly looking for new or already established solutions for RDM and reproducible analytics because publishers and funders are asking for FAIR solutions. We therefore expect this publication to be a useful resource for them in understanding the setup of a sophisticated RDM solution; the use cases should help them to recognise themselves as part of the target group.
This initiative introduced, locally in Graz for the BioTechMed universities, a platform that makes data sharing easy and straightforward and thereby fosters fruitful collaborations. In addition, it supports the handling of research data according to FAIR principles by providing a secure distributed data management system with metadata standards.
4.1. Benefits of CAT for LS and HPC
Taken together, our experience over the last few years shows different benefits that CAT has brought for LS researchers and for the HPC community. On the one hand, CAT removes the need for HPC know-how on the user side: LS researchers without extensive knowledge of how to connect to HPC systems can easily run their calculations on the resources available via CAT, which simplifies and speeds up the research process. On the other hand, HPC is moving more and more in the direction of big data, and big data researchers need more access to HPC resources because of the computational power required. By bringing researchers and HPC resources together, both fields benefit, through simplified research processes (LS) or an extended user community (HPC).
4.2. Future Direction of CAT
Due to the very general design of the platform, the DE has the potential to serve not only life scientists but also various other disciplines without the entire system having to be re-engineered. Beyond this, scaling up the existing platform opens up possibilities to expand the infrastructure to more Austrian universities and to other countries. One of the next steps could therefore be the connection of national HPC clusters, e.g., the Vienna Scientific Cluster. Another very important issue is easy data sharing for collaborations with external universities, institutions and partners, by providing access via iRODS to upload, download and manage data without needing to be fully integrated into the CyVerse landscape.
For the future, computer science and data science experts as well as domain experts need to develop a collaborative strategy to ensure and streamline usage of platforms such as CyVerse. Those approaches are becoming even more important with the continuing increase of interdisciplinary research.
Furthermore, a large global community is working on CyVerse and a lot of development is under way. The functionalities of CyVerse are continually adjusted to user requirements. This is also why a new version of the DE is currently under development and will be released by the end of this year. The new version of the CyVerse DE provides a responsive and modern design, in contrast to the current GWT-based version [19