1. Introduction
Research projects by their nature generate large amounts of data, and the volume increases over time as more advanced computational techniques become available [1]. Maintaining and managing research data is of great importance, allowing for easy access, sharing and reuse, in line with the FAIR (Findability, Accessibility, Interoperability, Reusability) principles [2]. In the age of data-driven science, the reuse of data and the compilation of existing data from observing infrastructures has become an integral part of research in the majority of scientific disciplines [3]. Especially in the context of interdisciplinary research, a common platform for the exchange of data is indispensable. To allow for data to be shared and found effectively, it must be kept in a centralized location and complemented with descriptive metadata. This centralized location is even more important when researchers work at different institutions, frequently in different countries and not at the same time. Furthermore, a central hub for research data is needed to ensure valuable research is not lost and can be reused. Nowadays, individual disciplines, universities or other institutions might already have systems for data sharing in place or are working on implementing them, such as efforts by the Nationale Forschungsdateninfrastruktur (NFDI) e.V. [4]. Likewise, in recent years, the number of research data repositories has constantly increased, as indicated by the Registry of Research Data Repositories (re3data; www.re3data.org). However, most literature on the topic of research data management (RDM) systems still focuses on organizational processes rather than actual implementations of RDM infrastructures [5].
In addition to a solid repository to store data and metadata, good RDM also requires that researchers know how to use the platform and have some motivation to do so. Lack of awareness about the benefits of RDM can be a major hurdle to adoption of such practices [6]. Therefore, a complete approach to RDM also needs to include training sessions and communication with the researchers to explain the importance of RDM to them.
In general, we see three overall approaches to managing research data in scientific research projects. Some projects use existing repositories, especially in fields where these are well established. An example in Earth Science would be Pangaea [7], a widely used public data repository. Another solution involves using existing software provided by third parties and modifying it to suit the specific needs of a project. The CRC806DB [8] used this method. More recently, the NFDI [4] has been making an effort to provide centralized tools for research data management in various fields. For example, the DFG project TRR341 [9] makes extensive use of these efforts. The third option, which we used (in part because the other options were not viable in 2007), is to build a custom solution from scratch using only basic technologies already available.
The development of the RDM system described in this paper started almost 20 years ago, as a self-built solution. It was established to support the interdisciplinary long-term research project Transregional Collaborative Research Center 32 (CRC/TR32) ‘Patterns in Soil-Vegetation-Atmosphere Systems: Monitoring, Modelling, and Data Assimilation’ funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) [
10]. The funding for these Collaborative Research Centres is split into four-year phases. After an initial review, a project is funded for a first phase. After each phase, the project is reviewed and the DFG may decide to fund it for another four-year phase, with each project getting up to three total phases [
11]. The DFG also requires that all research data produced by their projects be kept available at least 10 years after the end of the final funding period [
12]. As such, a reliable repository is necessary to host these data, which have to be available for up to 22 years (12 years of project duration and at least 10 years after its end) to allow for continuous use of valuable research.
To serve the needs of research projects and funding requirements for good scientific practice and research data management, the development of a comprehensive RDM platform was started in 2007 within the CRC/TR32 project and was launched as the CRC/TR32 database (TR32DB) in its first version in August 2007 [
13]. The entire system is based on a self-built solution that has been continuously expanded with additional features and functionalities over the years and has been adapted and used in three different DFG-funded research projects to date [
13]. In addition to the TR32DB, the system has been reused for the Collaborative Research Center 1211 (CRC1211) “Earth–Evolution at the dry limit” research project in an adapted version that was originally launched in 2016 as the CRC1211 database (CRC1211DB). As a third project, the CRC/TRR228 database (TRR228DB) has been further adapted to the requirements of the Collaborative Research Center CRC/TRR228 “Future Rural Africa: Future-Making and Social-Ecological Transformation” research project that started in 2018. All three databases have been registered as repositories on re3data.
Managing data from these three interdisciplinary environmental research projects is a significant challenge due to the diversity of data types involved [
14]. In our supported research projects, these range from geodata (satellite imagery, digital elevation models, topographical, geological, and soil maps), to measured datasets or samples collected in the field, data from external sources like weather data, modelling data or data generated from laboratory experiments, as well as social scientific data of surveys and interviews that require specific efforts in terms of data protection. Moreover, further project related research output like publications, presentation slides or posters, as well as images or videos, can be uploaded to the database. In order to serve its purpose as a central location for all data generated by the projects, the database needs to accept a wide range of datasets and have a sufficiently generic schema for metadata that can describe for example both a published paper and a collection of raw measurements.
Maintaining an RDM repository over a long period of time revealed several issues that need to be tackled, with the main aim always being to improve usability. The ultimate goal of these efforts is to provide a comfortable system that encourages use by the entire project. After all, the primary metric for success is the amount of data the researchers ultimately share in the database, thus enabling their download, reuse and citation. Since 2007, many coding conventions and paradigms have shifted. The project databases were built up over time with new features and extensions, written by a variety of developers over many years, each bringing their own style and preferences into the codebase. As a result, there was a lot of clutter, various parts were written with very different approaches, and parts of the code base were still built on paradigms from the late 2000s, a common issue with long-term software solutions. There were intermittent updates, such as the creation of a metadata editor in 2013, but some original code remained in place. By 2019, the system’s code base was again outdated and ongoing revisions had made it convoluted and cumbersome. Maintenance was made even more difficult by the fact that the requirements for these repositories partly changed over time. For example, in a new project phase, subprojects change their titles and members, or new subprojects are added. In addition, novel file types or entire features that could support a subproject may be required.
None of these are major issues by themselves, but an overly complicated structure made them time-consuming to resolve. Our aim is therefore twofold. First, we want to overcome this “tech debt” by rewriting the entire core of the database to standardize the code base, adhere to updated programming paradigms, and incorporate advances in technology. This also streamlines the programming work for future extensions to the system by providing a more solid base to work from and a consistent style and design. Second, the usability for scientists should be improved to incentivize the sharing of research data. This requires extensive changes to the user interfaces and underlying functionalities that make use of modern paradigms in interface design.
In line with these goals, this paper presents the update of several core features of the RDM system, such as the data upload, the input form to enter metadata and the core search functionality to find datasets stored in the database. To showcase the changes that have been made, all mentioned features will first be presented in their original state. Subsequently, the changes made to these functionalities are presented and discussed.
2. Database Infrastructure
2.1. CRC/TR32 Database
The RDM system, the CRC/TR32 database, was initially developed for the DFG-funded “Collaborative Research Center CRC/TR32” [15]. During the planning of the project, a central subproject for data management was initiated in the first phase (January 2007–December 2010), which was responsible for establishing a sustainable RDM infrastructure [10,13]. In the second and third funding phases, this was continued as the Information Infrastructure subproject (INF project). The CRC/TR32 research focused on the pattern-based prediction of states and fluxes of water, CO₂ and energy in terrestrial systems across scales [
16]. The project was a multidisciplinary long-term project with cross-institutional research between the German Universities of Aachen, Bonn, Cologne, and the Research Centre Jülich where around 190 scientists have been involved in total. It was funded for a full 12-year period from 2007 to 2019 and involved numerous subprojects from various scientific disciplines, including meteorology, geography, hydrology, geophysics, modelling, chemistry and soil science. During the course of the project, many heterogeneous datasets were created and submitted to the TR32DB, from collected field data from hydrological or meteorological monitoring to laboratory-based model simulations [
17]. For example, annual land use and crop classifications of the study area, based on spaceborne remote sensing imagery, have been conducted over a 10-year period (2008–2018) and are archived and accessible in the TR32DB [
18]. Moreover, further documents are stored in the database, such as publications, conference contributions, including posters and presentation slides, PhD reports, and images from field campaigns. By the time the project ended, the database held over 1800 datasets with corresponding metadata, and it remains available online at
https://www.tr32db.uni-koeln.de. The database launched in 2007 and thus has been active for almost 20 years, with no current plan for retirement. As stated before, projects funded by the DFG require data to be kept available for 10 years after the end of funding. We are currently on track for achieving this goal.
2.2. CRC1211 Database
The CRC1211 “Earth–Evolution at the dry limit” Collaborative Research Center (https://sfb1211.uni-koeln.de) has been funded since 2016 and was extended in 2020 and again in 2024, reaching the maximum 12-year funding period. The project focuses on researching the conditions of life in the driest hot deserts on Earth, the Atacama in Chile and the Namib in Namibia. It seeks to identify the key characteristics of biological activity in extremely water-limited habitats on Earth and to characterize surface processes that occur under nearly water-free conditions. This topic covers a multitude of fields, such as geology, geography, plant sciences, zoology, and meteorology, and a variety of different data are collected (ranging from drilling data to weather station data). Since the first funding phase, the subproject “Data Management and Spatial Data Analysis” has been established as a central INF project within the research project, with the aim of maintaining the project database CRC1211DB (https://www.crc1211db.uni-koeln.de) to store all related project data. At the start of the third funding period, the CRC1211DB hosted about 1000 datasets (complete with metadata), which are very heterogeneous in nature.
2.3. CRC/TRR228 Database
The Collaborative Research Center CRC/TRR228 “Future Rural Africa: Future-Making and Social-Ecological Transformation” (https://www.crc228.de) launched in 2018 and aims to understand African futures and how they are ‘made’ in rural areas by investigating change in land use and social-ecological transformation [19]. It investigates the connection between social-ecological changes in land use and future planning in rural Africa and employs scientists from both natural science and social science disciplines. The main topics of interest for this project are geography, agriculture, ecology, economy, infrastructure, politics, society, and culture. Accordingly, the data generated in the project are very diverse: qualitative and quantitative interviews are conducted alongside soil investigations and biodiversity monitoring. From the first funding phase onwards, the subproject “Data Management and Services” has been established as a central INF project within the CRC. Although the focus of this project differs slightly from that of the other described projects, the basic database infrastructure built for the other two projects could be adapted to fit this project as well, resulting in the launch of the TRR228DB (accessible at
https://www.trr228db.uni-koeln.de). The TRR228DB holds an extensive collection of household survey data conducted across various Sub-Saharan African countries for two different time periods. It also includes a set of historical topographic maps of Kenya from the mid-20th century, originally produced by the Directorate of Overseas Surveys, the General War Office (Geographic Section General Staff), and the Survey of Kenya that have been provided by the Bodleian Libraries of the University of Oxford. Some of these maps were used to create a detailed geo-dataset of the historical road network of Kenya from the time period of the 1950s to 1980s, which is likewise accessible through the database [
20]. This repository currently hosts more than 1300 datasets.
2.4. Overall System Architecture
To provide an adequate RDM infrastructure that serves the need of a long-term research project, various features needed to be implemented. The basic system design can be divided into three layers, as shown in
Figure 1. The overall architecture [
21] has basically remained unchanged (though modifications and additions have been made) since its initial development in 2007. The presentation layer comprises all public and internal functionalities implemented in an interactive user web-interface that are provided for the users and for administrative purposes. Because the presentation layer is the only layer that has direct contact with the user, it is also the most important to provide a smooth user experience. The elements of this layer should be intuitive and function as expected to encourage interaction with the database. Among other concerns, uploading data needs to be as easy as possible within the requirements of a good repository, searching data should be accessible and convenient and metadata editing should remove as many hurdles as can be removed while ensuring rich metadata is provided.
The application layer acts as an intermediary between the presentation and the data layers. Its main purpose is to pass along requests from the user (such as attempting to download a file), validate them (e.g., checking access permissions or passwords) and then hand the response back to the user. Since it is possible to limit downloads for individual datasets, and some parts of the website are only accessible for members of the corresponding project or even subproject, this layer plays an important role in maintaining security and privacy where necessary.
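The validation step described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the system itself uses PHP and JavaScript), and the permission levels, class names and the `may_download` function are assumptions made for this example, not the actual implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Permission(Enum):
    # Hypothetical permission levels; the text only states that downloads can
    # be limited per dataset and by project or subproject membership.
    PUBLIC = "public"
    PROJECT = "project"
    SUBPROJECT = "subproject"


@dataclass(frozen=True)
class User:
    name: str
    subprojects: FrozenSet[str]  # empty set: not a member of the project


@dataclass(frozen=True)
class Dataset:
    title: str
    subproject: str
    permission: Permission


def may_download(user: Optional[User], dataset: Dataset) -> bool:
    """Decide whether a download request is passed on to the data layer."""
    if dataset.permission is Permission.PUBLIC:
        return True
    if user is None or not user.subprojects:
        return False  # anonymous visitors and non-members see metadata only
    if dataset.permission is Permission.PROJECT:
        return True
    return dataset.subproject in user.subprojects
```

Keeping this decision in one function means that every route to a file (search result, map, quick access link) can apply the same rule.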
The final layer is the data layer, which represents the part of the architecture where any user-supplied data is actually stored in a persistent and sustainable manner. The files are stored in a hierarchical folder structure on an Andrew File System (AFS) server, operated and regularly backed up by the IT Center of the University of Cologne (ITCC), using Commvault Backup & Recovery to write backups to tape storage. These files can be accessed directly by the application layer. The folder structure is organized according to the structure of the research projects, subdivided into funding phases, project clusters and subprojects, and different data types. In total, the database system supports six different types of datasets, namely data, publications, presentations, pictures, reports, and geo-datasets. All uploaded data files are linked via their storage path to a MySQL database (using MariaDB), which holds all the corresponding metadata about the datasets. These metadata must be entered in a web-interface form after the upload process. Additionally, the MySQL database also handles all administrative information (e.g., user details) that is required by the database system.
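As an illustration of how such a hierarchy might translate into a storage path, the following Python sketch builds a path from funding phase, cluster, subproject and data type. The root directory, the folder names and the `storage_path` helper are hypothetical; the text does not specify the actual naming scheme.

```python
from pathlib import PurePosixPath

# The six supported dataset types named in the text (identifiers are assumed spellings).
DATA_TYPES = {"data", "publication", "presentation", "picture", "report", "geodata"}


def storage_path(root: str, phase: int, cluster: str, subproject: str,
                 data_type: str, filename: str) -> PurePosixPath:
    """Build the storage location for an uploaded file (illustrative layout)."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unsupported data type: {data_type!r}")
    # Keep only the final name component so a crafted file name
    # cannot point outside the intended folder.
    safe_name = PurePosixPath(filename).name
    if not safe_name:
        raise ValueError("file name must not be empty")
    return PurePosixPath(root) / f"phase{phase}" / cluster / subproject / data_type / safe_name


print(storage_path("/afs/example", 2, "B", "B3", "data", "soil.csv"))
# /afs/example/phase2/B/B3/data/soil.csv
```

The resulting path is the value that would be stored alongside the metadata record, linking the database entry to the file.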
2.5. Core Functionalities of the RDM System
This section briefly describes the overall functions of our RDM system and its web interface. The main interface of the TRR228DB can be seen in
Figure 2. Despite a different colour scheme, the style and structure of the two other databases’ websites is largely the same and therefore not explicitly shown here. User account management is handled by the ITCC (IT Center University of Cologne), meaning that we can use their high security standards for user accounts without having to handle passwords or any of the associated security risks related to a self-built user management or external libraries. Things like password resets and two-factor authentication are managed by the ITCC, and account provision is also handled on an individual level, making user accounts very secure.
To find datasets stored in the databases, several search functionalities have been implemented on the website. The Advanced Search is the general search page, which uses a filter-based system to allow fine-grained and very specific searches of several metadata types for datasets in a way that scientists would be familiar with from literature searches on other websites. Additionally, a Map Search is available that can be used for all datasets that have a geographic location given in the metadata. These datasets appear on a map at the set location and can be accessed by their geographic location. Finally, a variety of quick access links to all datasets sharing a common metadata value are provided, like seeing all uploads with the type “Presentation” or all datasets that are under the topic of “Remote Sensing”.
The core functionality of the system is the storage and provision of datasets, accompanied by metadata. It consists of a two-step submission system for the user to upload their data to the database. First, the upload of a dataset is required and second, metadata needs to be added to each uploaded dataset. To supply the data, there is an upload form for users to upload files to the AFS storage system and an extensive web-based editor that is used to input the metadata according to the schema. Both steps have to be completed until a dataset entry will be publicly visible and accessible on the website with its corresponding metadata. From the start, the system was designed to be file based, allowing users to upload datasets as files and then supply the corresponding metadata to describe those files. While it was originally written for the CRC/TR32 only, one advantage of such a generic approach is that it took relatively small modifications to use the same basic design for the other supported projects.
The input of metadata is enabled by an editor that provides a form of structured metadata fields, where the user can interactively insert all necessary information to appropriately describe the uploaded dataset. All entered metadata are openly accessible after their submission. While a narrowly focused project might predominantly produce DNA sequencing or rock sample analysis data, we could not rely on such specific data, and thus our metadata schema had to be generic enough to cover anything from climate simulations to interview transcripts, but specific enough to adequately describe different types of datasets. To describe these heterogeneous datasets, a custom metadata schema has been developed, which is based on standard metadata schemas, such as DataCite [22] and Dublin Core [23]. It is extended with metadata elements from ISO 19115 [24], INSPIRE, and DDI standards [25]. The metadata schema is publicly available for the TR32DB [26] and TRR228DB [27], with the metadata schema for the CRC1211DB being largely identical and not published separately.
The schema includes mandatory core metadata properties (e.g., title, author, description, keywords, geographic location, download permission, licence), optional fields, and automatically generated metadata. The metadata schema is structured in two levels to describe the various data types. The first level, the ‘General’ level, contains metadata properties divided into seven categories: ‘Identification’, ‘Responsible Party’, ‘Topic’, ‘File Details’, ‘Constraints’, ‘Geographic Information’, and automatically generated metadata details. This level gathers all fundamental metadata information categories common to the six data types that can be uploaded into the database (Data, Geodata, Report, Picture, Presentation, and Publication). The second level of the schema, the ‘Specific’ level, includes additional individual metadata properties unique to each of the six data types. Thus, these specific metadata are dependent on the type of data that has been uploaded and a dataset of the type ‘data’ gets different metadata than a dataset of the type ‘publication’ to provide individual descriptions.
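The two-level structure can be sketched as a simple lookup: a shared list of mandatory ‘General’ properties plus a per-type list of ‘Specific’ properties. The following Python sketch is an illustration only. The general field names follow the examples given above, while the type-specific field names and the `missing_fields` helper are hypothetical placeholders; the published schemas define the actual properties.

```python
# Mandatory 'General' properties as named in the text (spellings are assumed).
GENERAL_REQUIRED = (
    "title", "author", "description", "keywords",
    "geographic_location", "download_permission", "licence",
)

# 'Specific' properties per data type. These names are hypothetical
# placeholders, not the properties of the published schemas.
SPECIFIC_REQUIRED = {
    "data": ("measurement_method",),
    "geodata": ("coordinate_system",),
    "report": ("report_type",),
    "picture": ("recording_date",),
    "presentation": ("event",),
    "publication": ("publication_year",),
}


def missing_fields(data_type, metadata):
    """Return the required fields that are absent or empty for this data type."""
    if data_type not in SPECIFIC_REQUIRED:
        raise ValueError(f"unknown data type: {data_type!r}")
    required = GENERAL_REQUIRED + SPECIFIC_REQUIRED[data_type]
    return [name for name in required if not metadata.get(name)]
```

A check of this kind could also be run over existing records, for instance when migrating old datasets to a revised schema.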
After successfully submitting a dataset, the provided metadata of a dataset are visible on the website. The accessibility of the dataset itself depends on the download permission set in the metadata. The metadata of a dataset can be edited and updated at any time by the data provider. Additionally, a data provider can optionally apply a Digital Object Identifier (DOI) for their dataset. The DOI registration service is provided by the Helmholtz Centre for Geosciences (GFZ) Potsdam, which offers this service for selected external geoscientific data repositories [
28].
In addition to the core features of the database, several project-specific features have been implemented to serve the demands and needs of individual projects. The TR32DB hosts an additional repository of purchased weather and climate data from numerous climate stations across Germany provided by the German Weather Service (DWD). The CRC 1211 project has set up and maintains a number of weather stations in the Atacama Desert in Chile [
29]. The data from these stations is stored in the database and visualized on the website. Additionally, the TephataDB has been implemented into the CRC1211DB, which is a database of volcanic ash samples with some specific features for visualizing the composition of these samples and registering them with a persistent International Generic Sample Number (IGSN) identifier. In a similar fashion, the TRR228DB includes the Transdisciplinary Diary, an internal diary feature allowing researchers to discuss and share experiences in a blog style format that would not be suitable for regular scientific publication.
3. Software and Methods
Overall, the whole code base was rewritten and restructured. For instance, the project-specific features described above were integrated into the common code base during the rewrite, with each website individually knowing which parts to display and which to hide. Several core features of the database system have been largely overhauled and rewritten to move them onto a more modern codebase and to implement several convenience features, with the aim of enhancing the user experience and easing the use of the database. The following section describes some core functions in their original version, as well as the weaknesses that resulted from technical limitations in their regular use. Consequently, some substantial updates to these features have been made, and their rationales are described below.
The development team consisted of no more than five people, so only a small amount of coordination was necessary. The work was split up and common standards for data transfer between the different layers were agreed upon. We also decided where to use an object-oriented approach and where to stick to functional programming. Database tables were standardized (such as always using the same data size for the primary “ID” keys, having ID 0 always refer to a dummy entry and using a common naming scheme) and the PHP and JavaScript files were separated for ease of maintenance. This way, an error in JavaScript would always be found in the relevant .js file, while anything server side is located in the .php files. Due to the small team, only limited testing and code reviewing was possible. However, the migration process specifically ran all existing datasets through the new metadata validation function, ensuring compliance with the new system for all old data.
3.1. Data Upload and File Storage
When the TR32DB was launched, the data upload was handled by an AFS-based file upload procedure. A flowchart of the overall upload process from the point of view of the user, as it was before the update, can be seen in the top section of Figure 3. Smaller files could be supplied via a standard HTML form directly from the website, but the size limit was severe, allowing nothing larger than 25 MB per file. For example, the CRC1211DB hosts more than 150 files that exceed this limit. Due to restrictions by the IT Center and technical limitations at the time, from 2008 onwards, larger files needed to be manually transferred by the researcher via an SFTP client to the server’s AFS folder structure, from where they were moved to the proper target location overnight by a separate script. In contrast to small files, where the metadata could be inserted immediately after the upload process, for large files the researcher had to return to the website on the next day and add the metadata corresponding to the file. Although the size limit for the browser-based upload could be raised to 100 MB in 2020, the technical hurdles for larger datasets remained the same and the process had two major downsides: first, it required the use of additional software, and second, it inserted a delay between file upload and metadata input.
Thus, this function was redeveloped. With the release of HTML5 in 2014 and widespread browser support for the “file” API (application programming interface) [30], this upload process could be significantly improved by enabling the upload of all data records directly through the website, with immediate forwarding to the metadata input editor after completion of the upload. By 2016, all major desktop and mobile browsers supported direct file interactions. With this, the file can simply be selected through a web browser, and the upload process requires no additional software. This function now allows files of up to 6 GiB, a limit that is not imposed by the upload system itself but rather by the settings of the underlying file storage system on the server. A planned future transition of all project data from the currently used AFS to an S3-based storage solution will allow even larger files.
The library resumable.js (TwentyThree, Copenhagen, Denmark,
https://github.com/23/resumable.js (accessed on 23 April 2026)) is used to handle the transfer of the data to the server. The library enhances fault tolerance for uploading large files via HTTP by dividing each file into smaller chunks. If an upload of a chunk fails, the upload process is retried until successful completion. This mechanism allows uploads to automatically resume after a network disconnection, whether locally or on the server, as only the currently uploading chunks are affected, not the entire file. All uploaded chunks are temporarily stored in a specific folder on the server. Once all chunks have been successfully uploaded, they are reassembled and the file is moved to its final location. Simultaneously, an entry for the dataset is created in the MySQL database, which then allows metadata to be entered in the next step.
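The server-side half of this procedure (store each chunk, then reassemble once all chunks are present) can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the project's PHP code: the chunk naming scheme, the numbering from 1 and all function names are hypothetical, and the creation of the database entry is omitted.

```python
import shutil
import tempfile
from pathlib import Path


def chunk_path(chunk_dir: Path, number: int) -> Path:
    # Hypothetical naming scheme; chunks are numbered from 1 in this sketch.
    return chunk_dir / f"chunk.{number:06d}"


def store_chunk(chunk_dir: Path, number: int, payload: bytes) -> None:
    """Save one chunk; a retried chunk simply overwrites the earlier attempt."""
    chunk_dir.mkdir(parents=True, exist_ok=True)
    chunk_path(chunk_dir, number).write_bytes(payload)


def assemble(chunk_dir: Path, total: int, target: Path) -> bool:
    """Join chunks 1..total into target; return False while any chunk is missing."""
    parts = [chunk_path(chunk_dir, n) for n in range(1, total + 1)]
    if not all(part.is_file() for part in parts):
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    partial = target.with_name(target.name + ".part")
    with partial.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    partial.replace(target)   # move into place only once the file is complete
    shutil.rmtree(chunk_dir)  # the temporary chunk folder is no longer needed
    return True


def demo() -> bytes:
    """Upload two chunks out of order, then read back the assembled file."""
    with tempfile.TemporaryDirectory() as tmp:
        chunks, target = Path(tmp) / "chunks", Path(tmp) / "final" / "file.bin"
        store_chunk(chunks, 2, b"world")
        assert assemble(chunks, 2, target) is False  # chunk 1 is still missing
        store_chunk(chunks, 1, b"hello ")
        assemble(chunks, 2, target)
        return target.read_bytes()
```

Because each chunk is an independent file, a failed transfer only requires resending that chunk, which is the property that makes interrupted uploads resumable.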
3.2. Metadata Editor
Once the data has been uploaded, the next step for the researcher is to add metadata to their dataset. Describing the data with metadata is a very important step, for example, in order to cite data correctly.
Figure 4 illustrates the layout of the original editor, which is organized into several tabs, each addressing a specific thematic category of metadata. Metadata is mostly entered into text fields or selected from predefined options in drop-down menus. Certain lists, like the author list, are expandable by the user within the editor. However, other lists, such as keyword lists, are managed exclusively by the database administrator to prevent term duplication and cannot be expanded by users. The Geographical tab enables users to specify a geographical location for the dataset by setting a point location or using a rectangular selection. This can be done by manually entering the coordinates or choosing the area directly from the map interface.
Despite its functionalities, many of the technical implementations of the original editor launched in late 2013 were outdated by modern standards, and more advanced solutions are now available. When the original editor was written, JavaScript was viewed with suspicion and was not considered a safe technology, as can be seen from the search trends for “NoScript”, a formerly popular browser extension that disables JavaScript completely [31]. Following this trend, the editor was originally created using a standard HTML form and the interactivity was handled entirely by PHP on the server side. While these technologies can provide all the required functionality, the usability suffers. For example, we expect researchers to supply all the authors for a research paper they upload to the database. In some cases, this can mean inputting several names. Using only PHP, each time a new author is added, the entire page needs to reload, making the experience feel sluggish and tedious. This gave users a motive to omit secondary authors to reduce the hassle. The same was the case for providing multiple keywords, measuring instruments and various other metadata entries. In the original version of the editor, there was also no direct error checking when entering metadata. As a result, missing or incorrect entries were only marked in red after the submit button was pressed, requiring a full page reload. Furthermore, aside from highlighting the field in red, no guidance was provided on how to resolve the error.
However the usage and acceptance of JavaScript (as well as the language’s features) have since expanded. The use of JavaScript extended the possibilities of a more interactive metadata editor, which should enable users to use it more intuitively. In an attempt to combat the user experience issues and bring the editor up to a modern standard, most functionality was offloaded to JavaScript, a programming language that runs locally on the user’s own computer and can therefore provide interactivity without requiring server communication for every click. Taking advice from the design of commercial form expertise, certain design principles were implemented. One such feature is inline form validation. Designing interfaces that prevent errors from occurring in the first place is crucial. This can include features like disabling buttons when actions are not possible, which helps users avoid mistakes and frustration [
32].
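The inline validation idea can be sketched as a small set of pure functions. This is a minimal illustration, not the editor's actual code: the field names (`title`, `year`, `authors`), the rules and the messages are all assumptions made for the example.

```javascript
// Hypothetical validation rules; the field names are illustrative,
// not the repository's actual metadata schema.
const rules = {
  title: (v) => (v.trim().length > 0 ? null : "Please enter a title."),
  year: (v) => (/^\d{4}$/.test(v) ? null : "Use a four-digit year, e.g. 2021."),
  authors: (v) => (v.length > 0 ? null : "Add at least one author."),
};

// Validate every field and return a map of field name -> error message.
// Fields without an entry in the result are valid.
function validateForm(values) {
  const errors = {};
  for (const [field, check] of Object.entries(rules)) {
    const message = check(values[field]);
    if (message !== null) errors[field] = message;
  }
  return errors;
}

// The submit button would only be enabled when no errors remain.
function canSubmit(values) {
  return Object.keys(validateForm(values)).length === 0;
}

// Adding an author only changes local state; no page reload is involved.
function addAuthor(values, name) {
  return { ...values, authors: [...values.authors, name.trim()] };
}
```

In a browser, `validateForm` would typically run on each `input` event, with the messages shown next to the affected fields and the result of `canSubmit` used to set the submit button's `disabled` property.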
3.3. Data Search
Once metadata has been published, the dataset can be found on the website. To optimize the findability of stored datasets, an advanced search feature has been implemented in the database system. The initially implemented search functionality was a single form where a selection of metadata fields could be filtered, showing the matching results upon submitting the form (
Figure 5). As with the old editor, this added a delay between entering the search and seeing the results. This is especially inconvenient when starting a search from scratch, as each refinement requires a full page load. Furthermore, only a limited number of metadata criteria were searchable, and more complex queries were not possible with this system. For example, users could search for a single keyword but not for a combination of several keywords.
To improve this, the entire search functionality has been overhauled and a more interactive search function has been implemented. The aim was to build a search function similar to the filter-based searches that researchers already know from literature searches, keeping hurdles as low as possible. Based on this, the new search was implemented similarly to the search feature on Web of Science or arxiv.org. To feel more interactive, the search is now implemented using JavaScript calls to a server-based script. When a user sets up a search, their entire filter settings are sent to the server, which sends back a list of all datasets matching those filters. The table underneath the filters then updates in real time to show only the matching datasets. Even for databases with over 1000 datasets, the results are displayed essentially instantaneously, allowing a responsive user experience that works without page reloads.
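The matching step described above can be sketched as follows. The text does not describe the actual server-side script, so this is written in JavaScript purely for illustration; the record fields and filter names are assumptions, and a real implementation would query the database rather than an in-memory array.

```javascript
// Illustrative in-memory records; the real system queries a database.
const datasets = [
  { id: 1, type: "presentation", year: 2021, keywords: ["soil", "moisture"] },
  { id: 2, type: "picture", year: 2021, keywords: ["soil"] },
  { id: 3, type: "presentation", year: 2023, keywords: ["soil", "erosion"] },
];

// Return the datasets that satisfy every filter that is set.
// Unset filters (undefined, or an empty keyword list) do not restrict the result.
function searchDatasets(all, filters) {
  const wanted = filters.keywords ?? [];
  return all.filter(
    (d) =>
      (filters.type === undefined || d.type === filters.type) &&
      (filters.year === undefined || d.year === filters.year) &&
      // Several keywords are combined with AND: all must be present.
      wanted.every((k) => d.keywords.includes(k))
  );
}
```

Combining keywords with `every` is what allows a query for several keywords at once, which the old single-keyword form could not express.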
The website now uses a JavaScript library, “Tabulator” (
https://www.tabulator.info/), to display tables. This unifies the look across various pages, and the feature-rich library has reduced the amount of work required to display lists to the user, especially when their content is dynamic. Doing so with PHP is often tedious work which could be offloaded to an existing solution.
5. Discussion
5.1. Rewrites and Combining Codebases
To tackle the task of maintaining three separate RDM repositories with diverging components and features, it was decided to combine the internal code bases of the TR32DB, CRC1211DB and TRR228DB. This was expected to have several advantages. Importantly, a single system is easier to maintain and to fix in case of errors, and it significantly reduces the hassle resulting from the divergence introduced into the systems over time. By using a common basic infrastructure, it is also possible to maintain the (now unfunded) TR32DB website and to keep it up to date without investing development effort into that system specifically. This is especially relevant if a security issue is discovered (such as an unprotected database query) that requires urgent attention. By using the same code in all systems, a single fix can be quickly moved to all of them, reducing the time a vulnerability is active.
By rebuilding parts of the website and its functionalities, it was possible to unify the code into a coherent structure and to remove code that was outdated, duplicated or simply non-functional. For example, the system now uses the jQuery JavaScript library (maintained by the OpenJS Foundation, San Francisco, CA, USA) for its many convenient features. Previously, multiple different versions of this library were included by various parts of the website simply because each had been the most current version when that particular part was written. As a result, the user might download the same library multiple times, causing load times to become unnecessarily long. From a development perspective, there is also no good reason to use multiple versions of the same library. These (and other) library imports have been centralized, making it much easier to switch to a new version or revert to a previous one across the entire site without requiring changes to dozens of files.
Similarly, when originally written, the TR32DB used mainly PHP and did most of the processing on the server side. In 2013, this was a common and reasonable approach, as JavaScript was still seen as a somewhat risky technology by many users, who simply disabled it. However, with improvements in security and advances in browser technology, JavaScript has become one of the most popular languages in use today (see for example [
33], though JavaScript is regularly in the top 5 languages in such rankings), and it allows for a much more interactive user experience. The amount of processing power available to the users on their local machines (e.g., laptops or smartphones) has also increased, so there are no issues with JavaScript causing sluggish behaviour by the website.
The updated search with its instant results display has made searching for datasets easier, but the main benefit has actually been to the developers. The pre-defined searches on the main database page (such as listing all pictures or presentations) used to link to dedicated pages, each with a distinct PHP file handling the fetching of relevant data and displaying them. In the rewritten system, they instead link to the advanced search page with a filter predefined that corresponds to the relevant data type. This way, a single file is responsible for listing datasets, making it much more convenient to change anything related to the display of search results. Additionally, this allows all the sorting and filtering offered by the new search system to be used to further narrow down the lists, such as displaying all presentations from a certain year.
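A predefined search of this kind amounts to a link that carries the filter in its query string. The following is a minimal sketch under stated assumptions: the path `/search` and the parameter names (`type`, `year`, `keyword`) are hypothetical and not taken from the actual system.

```javascript
// Build a link to the search page with filters encoded as query parameters.
// "/search" and the parameter names are assumptions for illustration.
function searchLink(filters, basePath = "/search") {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(filters)) {
    // A filter may hold one value or an array of values.
    for (const v of [].concat(value)) params.append(key, String(v));
  }
  const query = params.toString();
  return query ? `${basePath}?${query}` : basePath;
}

// Read the filters back on the search page, collecting repeated
// parameters (such as several keywords) into arrays.
function filtersFromQuery(query) {
  const filters = {};
  for (const [key, value] of new URLSearchParams(query)) {
    if (!filters[key]) filters[key] = [];
    filters[key].push(value);
  }
  return filters;
}
```

With this approach, a link such as "all presentations" needs no dedicated page: it is simply `searchLink({ type: "presentation" })`, and the user can refine the resulting list with any further filter.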
During various training sessions with the researchers, it also became clear that the less tech-savvy users in particular struggled with uploading larger files to the system. To ease this process, the browser-based upload system was rewritten to allow larger files to be supplied by the user through a standard form interface, similar to the upload system of video hosting sites. In fact, the relevant JavaScript library, resumable.js, is maintained by a video hosting website and designed to securely handle the upload of larger files. The option to pause and resume uploads is particularly helpful for larger datasets or for users with weak internet connections, as it reduces the time lost to disconnections or other technical issues during the upload process.
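The principle behind pausing and resuming can be illustrated without reference to any particular library. The sketch below is not the resumable.js API; it only shows the general idea of splitting a file into numbered chunks and resending the ones that were not confirmed. The function names and the chunk size are assumptions.

```javascript
// Split a file of `size` bytes into chunks of at most `chunkSize` bytes.
// Each chunk covers the byte range [start, end).
function planChunks(size, chunkSize) {
  const chunks = [];
  for (let start = 0, n = 1; start < size; start += chunkSize, n++) {
    chunks.push({ number: n, start, end: Math.min(start + chunkSize, size) });
  }
  return chunks;
}

// After an interruption, only chunks the server has not confirmed are resent.
function remainingChunks(chunks, confirmedNumbers) {
  const done = new Set(confirmedNumbers);
  return chunks.filter((c) => !done.has(c.number));
}
```

Because each chunk is small, a dropped connection costs at most the chunk in flight rather than the whole upload.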
Despite being the main data repository for three large multi-year projects, the websites have a small user base in the low hundreds, which makes it difficult to quantitatively assess the success of these changes. Getting statistically reliable feedback on changes is therefore challenging at best and impossible at worst. When large (especially commercial) websites make changes to their interface, they obtain a lot of feedback from the high number of users. For a comparatively small website, understanding whether a modified feature actually is an improvement would require surveying the researchers regularly, and response rates are usually low. From the perspective of data reusability, the number of data downloads has increased over the last five years. In the TRR228DB, there were 426 dataset downloads in 2021 and 1368 in 2025, more than a threefold increase, indicating increased interest in the repository. The corresponding numbers for the CRC1211DB are 350 and 963, showing a similar increase. Although it cannot be determined to what degree this increase is due to the changes made to the system, the figures suggest that the website has become more usable and fit for purpose with the updates described above. While our main focus is on providing the physical infrastructure necessary to facilitate the provision of research data, we regularly provide training sessions to researchers to explain the usage of our system and the importance of RDM. User feedback from such in-person meetings and training sessions additionally reflects the success of the implementations.
As DFG-funded interdisciplinary projects are relatively long-term projects with individual funding phases lasting for four years [
11] and the total project duration being potentially up to 12 years, the programming work for the supporting RDM system took place across multiple phases in multiple projects, which also meant many different student programmers contributed to the codebase. This resulted in a varied development process and makes sustainable coding practices especially important. Since the rewrite, some new programmers have been added to the development team. This has allowed us to confirm that the new internal structure has dramatically increased the onboarding speed, thus making it much quicker for a new developer to start improving the codebase rather than spending weeks reading it before any contributions can be made. And while the sample size is fairly small and thus no statistically significant conclusions can be drawn, at least subjectively the number of reported issues about the system has decreased while the incoming flow of data has increased.
Finally, while we try to anticipate and respond to users’ needs wherever possible, the nature of large research projects leads to unusual requests from time to time. No amount of improvement can remove the need for occasional manual intervention. For example, because a data repository needs to be persistent (especially since datasets can be associated with a DOI), there is no way for a user to delete a file they accidentally uploaded. We do provide various features to enable uploading a new or updated version of a dataset and setting a relation between the old and new version (using the “replaces” and “is replaced by” metadata entries). In such cases, it is also possible to copy over the metadata from the old dataset to avoid having to input the mostly identical information twice. It also happens on occasion that a research group produces files of such volume that no standard website can handle them. For such cases, it will always be necessary to intervene manually.
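The versioning workflow can be sketched as a function that copies the metadata and records the relation in both directions. Only the relation names “replaces” and “is replaced by” come from the text; the object layout and field names are assumptions made for this illustration.

```javascript
// Create a draft for a new version by copying the old metadata and
// recording the relation in both directions. The old dataset is kept,
// since a persistent repository never deletes published entries.
// Object layout and field names are hypothetical.
function createNewVersion(oldDataset, newId) {
  // Deep copy, so later edits to the draft do not change the old record.
  const copiedMetadata = JSON.parse(JSON.stringify(oldDataset.metadata));
  const newDataset = {
    id: newId,
    metadata: copiedMetadata,
    relations: [{ type: "replaces", target: oldDataset.id }],
  };
  const updatedOld = {
    ...oldDataset,
    relations: [
      ...oldDataset.relations,
      { type: "is replaced by", target: newId },
    ],
  };
  return { newDataset, updatedOld };
}
```

The researcher then only has to adjust the few metadata fields that differ between the two versions.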
Using a fully custom-built solution also has some significant downsides. It must be acknowledged that a professional service that is well established and has a large user base would have a bigger development team and could build new features faster. Despite our best efforts to standardize the programming and infrastructure, some technical debt remains and there simply is no way to put a robust, test-based code review process in place. Even though it looks likely that we will be able to get the system into a state that is cheap to maintain in the long term, it is nonetheless possible that some unknown glitch could force it offline after there is no longer an active team around to fix it, leading to a loss of the data. Transitioning to S3 as a storage backend will help mitigate this, as it is hosted entirely separately from our project and has some basic abilities to keep the files publicly available. It should also be noted that we really only rely on PHP and MariaDB, which have historically been steady and reliable. Especially when no external repositories are available that fully match the requirements of a project, we find this approach easier than hosting a solution based on existing third-party software, as we are not at the mercy of external updates.
5.2. Increasing Responsiveness and Interactivity
The original website was launched in 2007 and has had minor and major updates since, with a significant relaunch in 2014. Nonetheless, by 2020 it was outdated in terms of the underlying code, especially since not all code had been fully rewritten in those earlier updates, meaning there was a need for continuing work and updating components. While an RDM website does not need to be particularly flashy, it must nonetheless be modern and responsive to ease its use by researchers. Ideally, the website should be intuitive and comfortable to use in order to encourage the sharing of research data by scientists. Additionally, a computer system facing the internet needs to be kept up to date simply to counter security threats, especially when hosting files that may contain sensitive or personal information, which may be the case with research data that concerns human subjects.
Good design principles focus on creating user-friendly interfaces that make it easy for users to navigate and interact with a system. This leads to a more enjoyable experience, encouraging users to engage more with the product [
34]. Jongmans et al. (2022) [
35] showed that visually appealing and user-friendly interfaces considerably improve user engagement and retention, and that pleasure is the most important driver of positive website evaluation. Such interfaces also reduce users’ reluctance to use a website. Consequently, our goal in changing the user interface was to increase “conversion”, a term used in commercial settings to describe a user going from potential customer to actual customer. While we are not a commercial entity and therefore cannot (and should not) use all such suggestions, the techniques used elsewhere (especially in a commercial setting) may in part still be useful to an academic platform. We used findings from such commercial endeavours to inform our design decisions.
The general issue of making a website inviting and motivating for users to interact with has previously been considered mostly by commercial entities. A lot of research has gone into making it more likely for people to buy things from online shops, and websites like amazon.com closely monitor any changes they make to measure their impact on shopping behaviour. Of course, an RDM system is not a commercial endeavour, but at the core, we still want users to engage with the system and “entice” them to deal with the forms and features we design. For us, this does not mean increased revenue but an improvement in the number of useful datasets and metadata we store, which increases the scientific value of the project databases. Advice on increasing engagement can often be found in lengthy lists on commercial developers’ blogs, and care must be taken because some points on these lists are not relevant to our projects. We do not need to implement changes that make people want to give us their credit card details, but we do want them not to close the metadata editor halfway through entering the dataset information because the experience became too frustrating.
Historically, the old upload method of using third-party software and waiting for the file to be transferred overnight led to a number of files being uploaded without ever having metadata added to them. This issue appears to have completely disappeared with the new uploader because the overnight delay has been removed. Of course, uploading a large file inevitably takes time, but once the upload is complete, the user can now immediately provide the metadata as well, preventing the previously common “I’ll do it tomorrow” issue. The change also completely removed the need to help researchers with using an SFTP client, instead allowing them to use a web browser that anyone would already be familiar with.
The advantages of a self-built solution are the ability to closely match the requirements of the individual projects and adapt the system according to requests by researchers for changes. Furthermore, we do not rely on external companies to fix errors and have a high degree of control over the website without being constrained by the capabilities of an out-of-the-box software system licensed from a third party. In particular, any security issues that arise can be solved quickly in-house, rather than waiting for an external company to patch them. Additionally, the risk of software updates to some component breaking the whole website is significantly decreased. However, it also means that any such work needs to be done by us, so the effort is increased, especially when the system was first created.
Researchers often feel conflicted about sharing their data because they recognize its benefits, such as increased transparency, scientific progress, and career advantages, but also fear potential downsides like extra workload, loss of control, and data misuse [
36]. This ambivalence can lead to delays, frustration, or even avoidance. These multiple hurdles often prevent researchers from sharing their data [
37], and we can only tackle some of them, but nonetheless a smooth system that supports rather than hinders the process is helpful. In highly fragmented research areas like large-scale interdisciplinary environmental studies, establishing unified data management standards is particularly challenging. Although a comprehensive analysis of collected data is only feasible when all data and metadata are available on a shared platform, not every researcher contributing data to this platform participates in such comprehensive analyses or requires access to data from many other researchers within the consortium. This lack of necessity diminishes the intrinsic motivation to adhere to a common data management framework [
38,
39]. We cannot address a researcher’s fear of letting go of their data through website design, but we can lower the technical hurdle of actually supplying the data once they have decided to do so. We therefore deliberately treated the more psychological causes of submission reluctance as not solvable through technological means. No matter our website design, if a researcher does not want to provide their data, they will find excuses not to do so, such as lack of time, the complexity of the archive or inability to find the data [
37].
The task of adding metadata to a dataset can be time consuming for researchers. There is a lot of information that needs to be provided to adequately describe a dataset. The previous metadata editor, while functional, added some hurdles to this process, which could be removed with the advances in technology. The time for the main editor to load fully dropped from over 1000 ms to about 500 ms. A large factor in achieving this decrease is that the map system loads separately from the rest of the page, allowing for interaction before it has finished loading. Since the map is not needed until a later stage of metadata editing, this delayed load is purely advantageous. The main improvement to user comfort is that the editor no longer needs repeated reloading. Even if the load times had not improved at all, adding 10 keywords or authors would have come at the cost of 10 reloads in the old version. This incentivised the user to provide only the absolute minimum of required metadata, while we want the metadata to be as rich as possible. Of course, we cannot reduce the actual effort required to input the required metadata, but that is not something any software could solve. As this step is of major importance, it was the first feature to be rewritten to fit more modern standards. Internally, this meant switching from largely PHP-based code to almost the entire functionality being handled by client-side JavaScript code. This allows for a much more fluid user experience that reduces the time needed for this step because most interactions no longer require a server request to be sent. Furthermore, missing or erroneous entries are visible in real time and can be addressed directly by the user without the need to reload the page.
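The separate loading of the map can be sketched with a Promise: the form is usable immediately, and the map is attached whenever it becomes available. This is an illustration only; `loadMap` is a hypothetical function standing in for whatever mechanism actually fetches the map component (for example a dynamic import or a script tag).

```javascript
// Start the editor: the form becomes usable immediately, while the map
// component is loaded in the background. `loadMap` is a hypothetical
// function that returns a Promise resolving to the map component.
function startEditor(loadMap) {
  const state = { formReady: true, mapReady: false, map: null };
  const mapLoaded = loadMap().then((map) => {
    state.map = map;
    state.mapReady = true;
  });
  // The caller can use the form right away and await `mapLoaded`
  // only when the map step of the editor is reached.
  return { state, mapLoaded };
}
```

Since the map is only needed late in the editing process, the time spent loading it overlaps with the time the user spends filling in the earlier fields.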
From the perspective of developing the editor further or making changes to it, the new codebase is more organized and has been documented, where previously it was split into multiple files. Since the rewrite, multiple new programmers have been brought on, and the onboarding process, which largely involves making them familiar with the metadata editor, could be sped up significantly.
5.3. Enabling FAIRness in Our Repository
Our database makes it possible to share data in a FAIR way [
2]. On findability, we support DOI registration for datasets, giving them a globally unique identifier that links back to our repository. Data also cannot be published without substantial metadata, which is linked via our database to the underlying files. Our search feature also ensures that datasets are findable according to their metadata. Accessibility is also ensured. The metadata can be viewed in a human-friendly interface, as well as in a machine-readable XML format, both in our custom metadata schema and converted to a DataCite-compatible format. All metadata in the repositories is public (even if the data itself has restrictions applied to it). Using XML to represent metadata makes it interoperable, as does our usage of keywords and vocabularies from established metadata standards. Where desired, metadata can be linked to other datasets, using relations defined in the DataCite standard. Finally, regarding reusability, the most common license used in our systems is our data policy, which generally governs data produced in the projects, but other licenses (Creative Commons/Open Data Commons) are available to select when entering metadata. Versioning of data is supported as well, and the (meta)data is linked to the person or persons responsible for it.
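To illustrate what a conversion to a DataCite-compatible format involves, the sketch below serializes a minimal record using element names from the DataCite Metadata Schema 4. It is not the repository's converter: the input field names are assumptions, the DOI in the usage example is a placeholder, and only a few of the mandatory DataCite properties are shown.

```javascript
// Escape characters that have a special meaning in XML.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Serialize a minimal record using element names from the DataCite
// Metadata Schema 4. The input field names are hypothetical.
function toDataCiteXml(record) {
  const creators = record.creators
    .map((name) => `    <creator><creatorName>${escapeXml(name)}</creatorName></creator>`)
    .join("\n");
  return [
    '<resource xmlns="http://datacite.org/schema/kernel-4">',
    `  <identifier identifierType="DOI">${escapeXml(record.doi)}</identifier>`,
    "  <creators>",
    creators,
    "  </creators>",
    `  <titles><title>${escapeXml(record.title)}</title></titles>`,
    `  <publisher>${escapeXml(record.publisher)}</publisher>`,
    `  <publicationYear>${escapeXml(record.year)}</publicationYear>`,
    '  <resourceType resourceTypeGeneral="Dataset">Dataset</resourceType>',
    "</resource>",
  ].join("\n");
}
```

A call such as `toDataCiteXml({ doi: "10.1234/example", creators: ["Doe, Jane"], title: "Soil & water", publisher: "Example Repository", year: 2021 })` yields a small XML document in which the ampersand in the title is escaped.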
Through these features, FAIR data sharing is possible. However, the nature of the projects may preclude some datasets from being fully compatible with these principles. For example, data provided and licensed by third parties may not be available for public download. Another case is personal data containing sensitive information, which cannot be published without risks to the people involved. Such concerns are not uncommon in research projects [
40] and in some cases, a public download option simply cannot be provided under the applicable licenses. However, the TRR228DB data policy states that all data not explicitly excluded by these exceptions will be made public two years after the end of the project, and the CRC1211DB data policy contains similar terms. Additionally, visitors who do not have direct access to a file can still request access from the metadata creator, who can then decide whether an exception can be made (such as releasing a dataset to a researcher where only commercial use is forbidden).
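The access rules described above can be summarized in a small decision function. Only the two-year period and the option to request access come from the text; the field names, the return values and the dates in any usage are assumptions made for this sketch.

```javascript
// Decide how a visitor can obtain a file. Metadata is always public;
// this function only concerns the data itself.
// Field names and return values are hypothetical.
function accessFor(dataset, projectEnd, today) {
  // Excluded data (e.g. third-party licences, personal data) is never
  // released automatically; access can only be requested.
  if (dataset.excluded) return "request";
  if (dataset.public) return "download";
  // All other data becomes public two years after the project ends.
  const release = new Date(projectEnd);
  release.setUTCFullYear(release.getUTCFullYear() + 2);
  return today >= release ? "download" : "request";
}
```

In the "request" case, the decision rests with the metadata creator, so the function only determines which option the website offers to the visitor.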
6. Conclusions and Outlook
The recent updates to our database system, finalized in 2022, have introduced significant improvements, enhancing the user experience and simplifying the process for developers and administrators to maintain and update the system. These enhancements provide a robust foundation for the system’s successful operation over the next decade and support long-term data availability in line with the FAIR principles. However, managing such a system for over a decade has shown that it is not sufficient to implement a functionality just once and never touch it again. It is crucial to continuously update the RDM infrastructure. Consequently, this system will likely never be ‘finished’, and the databases discussed here will require ongoing adjustments to align with current technologies and developments, as well as to respond to security threats. For example, the underlying file system used to host the actual files uploaded by researchers is the Andrew File System (AFS), which was chosen largely because the university computing centre, the ITCC, offered and supported this file system. However, this is now considered a legacy solution that will need to be phased out in favour of a more modern file storage solution to ensure the long-term availability of the stored datasets. The ITCC offers such a system, based on the Amazon Simple Storage Service (S3), hosted in-house and available for free to the project. While this will require a significant rewrite of the file interactions for the website, it will also ensure that the database can be maintained with only minimal maintenance work for a long time, even after funding for the original project has run out.
The upside from an RDM perspective is that increased “conversion” leads to more and better datasets in the repository, with richer metadata supplied for those datasets, while the user has the benefit of being able to use the editor more comfortably and is thus less likely to avoid it in future. Beyond the immediate benefits, the changes made here should have created a basis that will make it possible to significantly reduce the revision effort in the near future. Because the end of the funding period also means a significant reduction in available support personnel, a stable solution should also not rely on any system that requires frequent updates or intervention from IT personnel.
With the advent of AI coding, it might be possible in future to speed up development of platforms like ours despite the small amount of developer time that is available due to funding constraints. As it stands, this technology is still too immature to be usable in a web-facing project due to the inherent risk of running code that the developer has not fully read or understood. This is especially dangerous with novice programmers who might take such AI-generated code and push it into production because “it works” and not notice security flaws or less obvious bugs. However, considering the rapid advances in this realm, it may well be possible to take advantage of these emergent technologies in a few years.
Another feature that could be of great value in the future is a section of “related datasets” in the metadata overview of a dataset. Linking metadata this way could greatly improve findability, and our metadata may be rich enough to draw meaningful links between individual entries. Implementing such a feature is not trivial, however. Once again owing to the fairly small user base, it would require specific efforts to make such a system robust against manipulation, as we do not want researchers to add meaningless metadata entries to their datasets simply to artificially boost the links to other datasets.
In conclusion, ongoing efforts to maintain and update our architecture have enabled us to keep it available for almost 20 years, and by making a major effort to modernize the technological basis, we are set to continue offering our repositories for the foreseeable future.