The goal of computational simulation research is to imitate a physical phenomenon to learn how it behaves under specific conditions. For this, a mathematical model of the phenomenon is designed in the form of code. Testing a simulation involves running the model iteratively, each time using different input configurations that can result in numerous and large-sized output files. Those may be graphed or transformed into visualizations for interpretation. One of the main challenges of curating and publishing understandable and reusable simulation datasets is to represent the research process just summarized.
This paper describes the design, evaluation and enhancement of an interactive curation pipeline and corresponding publication representation for simulation datasets. The effort was undertaken for DesignSafe (DS), a web platform that offers end-to-end data management and computational services to enable lifecycle natural hazards engineering research [1]. Researchers in the space generate vast and complex datasets derived from the investigative methods they use, which include experiments, field reconnaissance, simulations, hybrid simulations, and social science studies. Building on more than a decade of work curating and publishing natural hazards engineering datasets [2], DS expanded capabilities, adding open science high performance computing (HPC) infrastructure to enable computational research and to improve scalability. The availability of such infrastructure allows implementing data curation as a lifecycle process, through interconnected workspaces and interactive functions that carry data from the research planning phase through computation and into publication [3]. In the platform, data curation activities are operationalized as selecting, organizing, describing, preserving, sharing, and checking for data and metadata completeness. Those can be done from the simulation project planning, in tandem with conducting simulations, and in preparation for publication. To achieve integration of data curation, we designed and implemented data and metadata models that correspond to the steps and processes of the research methods used in the natural hazards engineering space [4]. In this paper, we focus on simulation data. We define the work as a data design, which includes the methods and technical elements used to implement and evaluate curation interactivities and a data publication representation.
Administratively, DS is a virtual organization formed by multi-disciplinary teams with different and related functions. The group involved in the simulation data design was formed by research and development technical staff, and by members of the simulation requirements team. Technical staff includes software engineers, curators, interface designers and user experience experts, charged with designing, developing and testing the platform’s functionalities. The simulation requirements team is formed by domain scientists in a natural hazard type (e.g., geotechnical, wind, structural, and storm surge) whose role is to identify needs, establish policies, and provide guidance regarding simulations.
Our goal in the data design was to reflect how simulation researchers conceptualize their investigative process, so that curation in the platform felt intuitive and intertwined with active research processes, and so that the publication represented the work accomplished. With the help of the simulation requirements team experts, we completed the data design using diverse research and development methods, including modeling research narratives and iterative interface development and testing. In the design, we embedded lifecycle curation best practices such as enabling paths between active, curated, and public data; recording data provenance; allowing general and domain specific metadata to be added; assuring long-term data permanence; and implementing different data navigation strategies to facilitate access. A main concern was designing for big data, so that managing large numbers of files is not a burden. Integrating curation and simulation services was also key and required balancing curation and publishing workflows with computational functionalities through the platform’s underlying data management technology. Flexibility and consistency in data representation were also important. While simulation researchers use diverse types of software and their datasets result in unique structures, the curation functionalities and the general look and feel of the publication landing pages are the same across projects. Beyond the technical challenges, achieving a seamless research flow within the platform required bridging differences in professional practices and expertise. This needed a strategy to effectively involve the natural hazards experts in the data design. We captured and bridged the knowledge of experts and technical staff through a unique simulation data and metadata model and interactive mockups that were used to show and tell.
Designing for flexibility entails continuous evaluation. Upon the first release of the curation and publication interfaces, we evaluated their fitness by observing how users curated their projects, which data they selected to publish, and how they documented their work. We also attended to their questions and concerns during virtual curation office hours. This feedback informed a round of design changes, which were mapped to real data cases and tested by users prior to their implementation.
In this paper, we present the research and development methodology used to design a simulation data and metadata model that was implemented as interactive interfaces for curation and publication. The following sections include: a review of related work; a description of the research and development methods used in the simulation data design; an account of how evaluations led to enhancements; and conclusions and discussion of future work. The enhancements are currently implemented, and we point to simulation data publications that illustrate them.
2. Related Work
Many institutional data repositories, such as those based on Dataverse [5], have generic data and metadata models and policies that facilitate depositing datasets derived from different domains and generated by all types of research methods. Moreover, many domain specific repositories map to a generic metadata standard [6], as does the CyVerse platform, which combines data analysis and repository functions for plant genomics data [8]. While generic metadata schemas allow interoperability with other repositories and aggregators [9], they fall short on describing the complexity of simulation projects. Simulation data accessible in these repositories are represented as flat lists of files; their relations to the processes from which they derive, and any more specialized descriptions, may be included in the file names, in the abstract, in a readme file, or in an adjacent publication. There are hybrid metadata models, such as the Core Scientific Metadata Model [10] and the one developed for the Digital Rocks Portal [11], that accommodate experiments and simulations that use material samples as a point of departure. Such models allow representing complex data structures including multiple experiments and/or simulations. A simpler hybrid data structure combining general information with project specific parameters was developed for the DataCenterHub, which hosts experimental and numerical simulation datasets [12]. Starting from a landing page with general information, metadata and the different data components are displayed in tables. In DS, we saw the need and the opportunity to develop a model specifically to represent all the components and complex structure of simulation data. The methods used to develop it were similar to those followed to design experimental and field research models and services in DS [13] and in other data cyberinfrastructure projects [11].
The majority of open repositories receive data at the end of a research project’s lifecycle. Thus, curation happens outside of the repository and further interactive engagement consists of uploading files and adding metadata to a form. Dallmeier-Tiessen and colleagues stress the importance of connecting research workflows to data publication both upwards and downwards, to improve documentation and long-term preservation prospects [15]. They review projects that connect research workflows to publication through loosely coupled modules, a model they recommend in the form of different, centralized services that can be called from diverse points in a research workflow to assure FAIR data publications. At the same time, Goble et al. cautioned about the current excess of curation web services and the difficulties of integrating them as useful and understandable workflows [16]. As an end-to-end data platform, DS presents a self-contained case with guided, interactive curation services to move, unzip, categorize, tag, check metadata and data completeness, and publish data through a simulation research workflow.
Within a platform in which data is managed both for conducting simulations and for publishing, each data design consideration is twofold. For example, in relation to how to organize data, active research and publication steps have different requirements. While conducting simulations, researchers organize their files as hierarchies, in many cases using naming conventions that are acted upon by the simulation software, and that structure gets adopted when they move into storage systems [17]. In turn, there is no consensus on how to organize simulation datasets for publication [18]. While hierarchical file arrangement is adequate for staging and computing simulations, it does little to support the understandability and discovery of large datasets. Moreover, there are different and sometimes conflicting notions, including issues of storage limitations, over which of the many files used as simulation inputs and those derived as outputs should be published [19]. In DesignSafe, we address these issues through a mix of policies and technical solutions. First, we identified which data and metadata components represent a complete data publication and considered the need to make available generous storage space. Simulation research modeling was the starting point to design functionalities that allow flexible data organization and description in the platform, combining the known folder system with categories and tags that point to the main structure of the dataset. Informed by the authors of [23], we created a browsing interface that allows navigation between different categories that point to the data provenance.
4. Discussion: Evaluation of the Interactive Curation and Representation Interfaces
The data design process was iterative and done in collaboration with the simulation requirements team experts. We generated multiple opportunities to make changes to the interactive mockups based on their feedback and on the real data mapping cases. Evaluation became particularly informative once the first interface was implemented and a number of users had published datasets. We observed this first group of simulations focusing on how users categorized their data, the types of files they selected to publish, and how they described them. With all this information, we decided what aspects of the data and metadata model were suitable, and what to modify. Next, we present the outstanding issues that emerged during the observations, each illustrated by a simulation publication, and describe how we addressed them in the interface enhancements. The solutions are reflected in the figures that illustrate Section 3.3, and in new simulation publications.
The category simulation model was interpreted differently by users. While some added graphs, formulas, and map files [36] to the category (which matched what we intended for it), a majority had a hard time figuring out which files to include. Users from the latter group resorted to describing the software that contains the model [37] or included a final report with a detailed explanation about the project, and one of them categorized the same readme file as both simulation model and report [38]. A publication in which the user included a database belonging to a Federal Emergency Management Agency (FEMA) hydrography model confirmed that the user considered the input files as part of the simulation model [39]. Through further discussions with the experts, we concluded that the simulation model is both the concept of the simulation and the files required to run it. Currently, the simulation model category is still required as a space to clarify the specifics of the model used and how it was run, especially for projects with more than one model configuration or set of parameters. However, we removed the requirement to upload files, which can be added to the input category. We will continue monitoring usage as we consider further changes to the simulation model category, including merging it with simulation input and facilitating a description of the model as an abstract at the simulation project level.
The requirement to publish both input and output files did not present issues to the users, and the experts agreed that a simulation publication is complete when it is clear which files are used to run it and what its outputs are. However, we noted that users may not publish every simulation run but only those they want to highlight, and that relations between inputs and outputs are not necessarily one to one. In some simulations, each input file may have a corresponding output file, while in others there could be multiple input files and one output [41]. As noted in the related work review, we corroborated that a user conceived the publication as an extension of how the data were computed and organized the files following the same structure in which the simulation was run [38]. A more confusing aspect is that, depending on the software used to conduct the simulation, all input files may have the same file name, although their contents and roles are different [37]. Our data design is flexible to allow users to organize their files as they see fit. Categorization allows distinguishing inputs from outputs, and file tags allow describing their contents, enhancing the understandability of the simulation results (see Figure 8 and Figure 9).
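The role of categories and tags described in this paragraph can be pictured with a small Python sketch. It is an illustration only and not DesignSafe's implementation: the names `Category`, `CuratedFile`, and `SimulationRun`, as well as the sample paths and tag values, are hypothetical. The sketch shows two input files that share a file name but are told apart by their tags, and a run in which several inputs relate to a single output.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    # Category names follow those discussed in the text.
    SIMULATION_MODEL = "simulation model"
    INPUT = "simulation input"
    OUTPUT = "simulation output"
    REPORT = "report"


@dataclass(frozen=True)
class CuratedFile:
    path: str                       # location in the user's own folder structure
    category: Category              # what role the file plays in the simulation
    tags: frozenset = frozenset()   # descriptive tags (hypothetical values below)


@dataclass
class SimulationRun:
    name: str
    files: list = field(default_factory=list)

    def by_category(self, category):
        """Return the files of this run that belong to one category."""
        return [f for f in self.files if f.category is category]


# Two inputs with the same file name but different contents, and one output.
run = SimulationRun("run-01", [
    CuratedFile("case_a/input.dat", Category.INPUT, frozenset({"mesh"})),
    CuratedFile("case_b/input.dat", Category.INPUT,
                frozenset({"boundary conditions"})),
    CuratedFile("results/surge.csv", Category.OUTPUT,
                frozenset({"water level"})),
])

inputs = run.by_category(Category.INPUT)
outputs = run.by_category(Category.OUTPUT)
```

Because the folder path is kept alongside the category and tags, a structure like this leaves the user's original hierarchy intact while still exposing the input and output roles to a browsing interface.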
Simulations often reuse data as input files [37]. Many of these data are open, from agencies such as FEMA and NOAA [40]. To facilitate citing reused data in the landing pages, we implemented the Related Work metadata element, which allows inserting citations for reused data with respective DOIs, or with a URL when available (see Figure 10). Data reuse is related to data licensing. We offer different licenses for users to choose from and provide explanations about the impact of their licenses in their new publications in a Frequently Asked Questions section whose link is available throughout the curation and publication interfaces.
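As a rough illustration of the Related Work element, the following Python sketch models a citation for reused data that carries a DOI or a URL. The class name `RelatedWork`, its field names, and the placeholder DOI are assumptions made for this example; they do not reproduce the platform's actual metadata schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RelatedWork:
    citation: str               # human-readable citation of the reused data
    doi: Optional[str] = None   # e.g. "10.0000/example" (placeholder, not real)
    url: Optional[str] = None   # used when no DOI is given

    def link(self) -> Optional[str]:
        """Prefer a resolvable DOI link, otherwise fall back to the URL."""
        if self.doi:
            return f"https://doi.org/{self.doi}"
        return self.url

    def render(self) -> str:
        """Return the text that a landing page could display."""
        link = self.link()
        return f"{self.citation} {link}" if link else self.citation


entry = RelatedWork("Agency hydrography dataset (hypothetical).",
                    doi="10.0000/example")
rendered = entry.render()
```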
To simulation users, the concept of a static data publication is not as familiar as the process of publishing a paper. Frequently, users ask the curator whether, after publication, they will be able to amend data or metadata, upload more documentation, or publish a new simulation under the same project. While we remind them about data permanence and authenticity, we acknowledge the need for data publications to be managed over time for data versioning and corresponding metadata changes, and we are currently working on functionalities that enable both.
Lastly, we noticed inconsistency in the clarity and depth of the projects’ descriptions. Depending on the level of expertise of the reader, one or more points can be unclear or too detailed. In discussions, the experts told us that they direct their data descriptions to other researchers or professionals who understand acronyms and terminology. To enable a broader public understanding of the research, we included simple, at-hand suggestions about how to write each category description (see Figure 4) and further guidance in our tutorials to use a language that engages both professionals and a broader audience, including hints of how the data can be reused. For users requiring detailed explanations, we decided to make the report or help-me file a required category. The goal is that data consumers achieve an overall understanding of the project through the publication representation, which invites them to dig deeper into previewing the files and reading the report if the project fits their interest.
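Taken together, the requirements discussed in this section (a described simulation model whose files are optional, published inputs and outputs, and a required report) amount to a completeness check before publication. The Python sketch below is one hypothetical way to express such a check; the function name, the category labels, and the message strings are assumptions, and the platform's actual checks may differ.

```python
# Categories that, per the policy described in the text, need at least one file.
REQUIRED_FILE_CATEGORIES = ("simulation input", "simulation output", "report")


def missing_for_publication(model_description, files_by_category):
    """List what is still missing before a simulation can be published.

    model_description: free-text description of the simulation model
    files_by_category: dict mapping a category label to a list of file paths
    """
    missing = []
    # The simulation model must be described, but uploading model files
    # is optional, so only the description is checked here.
    if not model_description.strip():
        missing.append("simulation model description")
    for category in REQUIRED_FILE_CATEGORIES:
        if not files_by_category.get(category):
            missing.append(f"{category} files")
    return missing


draft = {
    "simulation input": ["case_a/input.dat"],
    "simulation output": ["results/surge.csv"],
}
problems = missing_for_publication("Storm surge model, two configurations.", draft)
```

Returning the full list of missing items, rather than stopping at the first one, lets an interface show the user everything that remains to be done in one pass.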
We came full circle after releasing the interface enhancements. The work in [43] is a simulation data publication with the improved curation interactivities. This is a very large project entailing seven bridge classes and 7000 realizations, each with an output of 10–12 files. The project’s goal was to achieve statistical, highly parametrized bridge models that can be run in other scenarios (use of input data), and to use the results in ML applications (use of output data). This reuse scenario required publishing all the simulation components, including inputs, code, and the large number of outputs. In this case, the simulation model category was used to include model files that are common to all the simulation runs. The possibility of using both folder structure and categorization aided the organization and clarity of presentation of the dataset. Curation and publication were facilitated by the fact that all the processes, including planning documentation, computing, interactive curation and publication, were completed in the platform. The dataset representation in the landing page, including the indentation and the tree view, provides an overview of the publication that clearly shows the relations between the categories (model, inputs and outputs), so that the provenance of each realization’s results can be understood.
5. Conclusions and Discussions of Future Work
This work contributes to the fields of curation and open repositories by introducing a generalizable simulation data and metadata model, interactive curation services, and a navigable data representation that focuses on data structure and provenance. It also advances natural hazards engineering, a field that lacked a data model and functionalities to organize data from simulations, one of the main research methods used in the space. Without much precedent of simulation data curation services in open repositories, we modeled simulation research, designed curation interactions and interfaces based on the model, and evaluated the results through use cases and observations of data publications in DesignSafe. Most open repositories receive data at the end of the research lifecycle, and their interactive functions are limited to uploading files and filling metadata forms. Our curation service is integrated with data management and an HPC environment, for which we balanced technology and professional culture. We completed a data design process that included opportunities for a multidisciplinary team to learn about simulation research workflows and about data curation. Demonstrating curation activities for purposes of obtaining feedback from simulation experts required interactive interface mockups to visualize the steps involved in data curation. The data model and its interface implementation are flexible, allowing different simulation project configurations, while assuring a consistent representation across simulation data publications. We suggest that the basic model, with the option of building layers to describe specific characteristics, can be useful to simulations in many domains. In the publication representation, we emphasized making the structure of the dataset self-explanatory and navigable. Versioning features and the possibility to amend certain metadata fields for error correction after publication have been prototyped and are in line for development.
In the self-publishing context of open repositories, clarity and completeness of published datasets are often irregular. We mentioned that there are conflicting notions about what constitutes a complete simulation publication, and much speculation about how large those could be. Our policy about what constitutes a complete simulation data publication is in line with a recent recommendation that it is the role of the experts to decide what to keep from a simulation project [44]. As a team, we decided that users should be able to publish projects that include documentation of the simulation model and clearly identify inputs and outputs. Through the first release of the interface, we observed how users interpreted the data and metadata model as they were using it to publish data. Accordingly, we made changes during the first year of production.
Providing a model and policies for organizing files, and consistent interfaces for navigation and browsing across simulation projects, is a step forward. Our data model emphasizes broad categories to accommodate and structure projects with many realizations/runs and large numbers of files [43]. We also introduced facilities to make categorization and tagging less time consuming, but we already know that those are not enough, as users do not want to manually tag thousands of files, even if they can do it in bulk. Thus, we note key research gaps such as the design of big data interfaces and the need to make curation of large datasets semi-automated and more efficient. To better understand how to present big datasets to users, in collaboration with our USEX consultants, we will conduct studies focusing on understandability of our large simulation data representations, including aspects of accessibility and data reuse. We also see promise in the prospect of using ontologies based on the data model and machine learning to categorize files for curation purposes, and plan to formalize a research project and pursue funding for it. As simulation data publications increase, our plan is to continue evaluating the fitness of our design. We will continue tracking the amounts and types of simulation data that the researchers select to publish to inform publication policies and share this information with the engineering and curation communities.