Metadata Stewardship in Nanosafety Research: Community-Driven Organisation of Metadata Schemas to Support FAIR Nanoscience Data

The emergence of nanoinformatics as a key component of nanotechnology and nanosafety assessment for the prediction of engineered nanomaterial (NM) properties, interactions, and hazards, and for grouping and read-across to reduce reliance on animal testing, has put the spotlight firmly on the need for access to high-quality, curated datasets. To date, the focus has been on what constitutes data quality and completeness, on the development of minimum reporting standards, and on the FAIR (findable, accessible, interoperable, and reusable) data principles. However, moving from the theoretical realm to practical implementation requires human intervention, which will be facilitated by the definition of clear roles and responsibilities across the complete data lifecycle and a deeper appreciation of what metadata is, and how to capture and index it. Here, we demonstrate, using specific worked case studies, how to organise the nano-community's efforts to define metadata schemas, by organising the data management cycle as a joint effort of all players (data creators, analysts, curators, managers, and customers) supervised by the newly defined role of data shepherd. We propose that once researchers understand their tasks and responsibilities, they will naturally apply the available tools. Two case studies are presented (modelling of particle agglomeration for dose metrics, and consensus for NM dissolution), along with a survey of the metadata schemas currently implemented in existing nanosafety databases. We conclude by offering recommendations on the steps forward and the workflows needed for metadata capture to ensure FAIR nanosafety data.

No definition of terms / not available in any ontology repository, for example: rpm (revolutions per minute)
Please rank your database's stance on meta-data description:

Essential to have
Taking into account the different types of metadata: bibliographical (dataset owner(s), contact information, etc.), descriptive (dataset abstract, ontologies used, revisions, data format etc.), technical (the methods and protocols used to produce the data, instrument details and settings), which types of metadata are included in your database? (Select all that apply)

Bibliographical, Descriptive, Technical
For the types of metadata included, do you have a metadata QA/QC tool (e.g. common system, unified methodology) to replace manual evaluation?

Yes
You answered Yes to the question above. Can you please provide some more details?
For some fields, when the metadata is collected we use EMBL-EBI's Ontology Lookup Service, as mentioned previously. However, the difficulty for the user is deciding which term to use when it is listed in more than one repository.
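For context, the Ontology Lookup Service mentioned above can also be queried programmatically. The sketch below is a minimal illustration, assuming the current OLS4 REST `/api/search` route and its `q`/`rows` query parameters; it is not part of any database's documented workflow.

```python
import json
import urllib.parse
import urllib.request

# Assumed OLS4 search endpoint (EMBL-EBI Ontology Lookup Service).
OLS_SEARCH = "https://www.ebi.ac.uk/ols4/api/search"

def ols_search_url(term: str, rows: int = 5) -> str:
    """Build an OLS free-text search URL for a term such as a unit name."""
    query = urllib.parse.urlencode({"q": term, "rows": rows})
    return f"{OLS_SEARCH}?{query}"

def ols_search(term: str) -> list:
    """Query OLS and return (label, ontology, IRI) for each hit.

    Assumes the Solr-style response envelope {"response": {"docs": [...]}}.
    Requires network access.
    """
    with urllib.request.urlopen(ols_search_url(term), timeout=10) as resp:
        docs = json.load(resp)["response"]["docs"]
    return [(d.get("label"), d.get("ontology_name"), d.get("iri")) for d in docs]

# Example (network required), e.g. looking up a unit with no agreed term:
# for label, onto, iri in ols_search("revolutions per minute"):
#     print(label, onto, iri)
```

Returning all matching ontologies, rather than a single hit, surfaces exactly the difficulty described above: the curator still has to choose among repositories listing the same term.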

Does your database link to underlying protocols used to generate the data?
Yes
What are the main challenges you experience in relation to metadata?
- Creation of inter-dependencies between different pieces of metadata information as it becomes more complex.
- Users do not always complete all metadata fields (reasons: time constraints, information not available, etc.). For this reason, not all metadata fields were made obligatory, so completeness relies on the owners' willingness to fill them in.
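The non-obligatory-fields problem described above can at least be measured. A minimal sketch, with hypothetical field names, of reporting how complete a record's optional metadata is:

```python
# Hypothetical optional metadata fields for a dataset record.
OPTIONAL_FIELDS = ["abstract", "ontology_terms", "instrument_settings", "protocol_link"]

def completeness(record: dict) -> float:
    """Fraction of optional metadata fields that are filled in (non-empty)."""
    filled = sum(1 for f in OPTIONAL_FIELDS if record.get(f))
    return filled / len(OPTIONAL_FIELDS)

record = {"abstract": "TiO2 NP dissolution study", "ontology_terms": ["NPO_199"]}
print(f"{completeness(record):.2f}")  # prints 0.50 (2 of 4 fields filled)
```

Reporting such a score back to data owners, rather than making fields obligatory, is one way to encourage completion without blocking submission.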
Do you have examples of best practice from your database or elsewhere that should be widely adopted? Please describe and add link / screenshot etc.
Use for data selection / filtering, automatic analysis, included in the analysis reports.
What could you do with metadata that you are not currently doing? Are there any plans to work with the existing metadata (e.g. statistical analysis)?
In progress: using the metadata for search across the entire database, additional selection and filter options, and full exploitation during analysis, especially when different data sets need to be combined and compared.
Can your database handle raw data / images / code etc.?

Yes
Do you consider that images could themselves be a type of metadata?
No
Would you consider integrating FAIRness scores for data into your database?

Yes
Are there any success stories regarding the use of metadata you would like to share?
An initial example of a general workflow on how the metadata and data could be used: https://www.linkedin.com/pulse/workflow-collect-

Nanotechnology Working Group
Metadata questionnaire to database owners
A manuscript is under development by members of the Nanomaterials Data Curation Initiative (NDCI) that will explain to the greater informatics community the role of metadata in the current efforts of data curation in nanotechnology. The NDCI was started within the National Cancer Institute's (NCI's) National Cancer Informatics Program's (NCIP's) Nanotechnology Working Group (Nano WG), but is open to the broader scientific community's participation.
Stakeholders for this paper include groups that currently curate nanomaterials data and metadata and strategic thinkers involved in developing best practices for metadata and data curation.
Please fill out the survey below to the best of your knowledge. Your response will contribute greatly toward a better understanding of the current state of utilisation of metadata as part of data curation for nanotechnology and will help advance the field of nanoinformatics.
Thank you!

Dear Iseult
We have dealt with a series of databases across several projects. As you are probably aware, we have more recently, in pretty general terms, moved from using our own management database and data-gathering templates (as discussed in our Nanosolutions/NanoMILE "collaboration", and via NSC presentations etc.) to use of the (more or less mandated) eNanoMapper database framework, though still using our templates, updated and more aligned to the JRC ones, at the same time for experimental data gathering.
We have not answered the questionnaire for each different db, but have tried to give some more generic answers and observations on experiences etc, across these, that are currently relevant.
Filled in the Word questionnaire, but the formatting went awry and the tick boxes didn't work, so there is some untidiness below. Hope it helps anyway.
9. Do the available data originate from extracted literature data, experimental data (e.g. raw, processed, from images, directly from instruments), computational or simulation data or from all three?
Please tick all that apply.
Extracted literature data

X Experimental data
Simulation or computational data
10. For extracted literature data, is the extraction a passive or interactive process?
Passive Interactive Both
11. How was/is the database populated?
3rd parties

X Internally
Both
12. Do you use a standardised curation method? Please provide some details.
Is there a standardised curation method? Not sophisticated to date: we have used IOM FP7 templates in recent projects, and now also some JRC NanoReg templates. Templates and meta-information are gathered together into an "admin" database and selected data extracted to a related "results" database, the latter now being done via the eNM database parsing and loading.
13. Is the data in the database FAIR (Findable, Accessible, Inter-operable and re-usable)?

X Technical
None of the above
20. For the types of metadata included, do you have a metadata QA/QC tool (e.g. common system, unified methodology) to replace manual evaluation?
X Yes, to some extent as principles, but also X No, not as a formal tool: minimum-requirements QA/QC are developed and applied, but only partially used due to the workloads and time required, I believe.
a. If yes, please describe / give details
21. Does your database link to underlying protocols used to generate the data?
High-quality "intelligent" metadata, where it is available, could greatly enhance reuse of data for analysis, modelling, grouping, read-across, safe-by-design frameworks, etc.

E.g. use in data collection templates for labelling/linking to ontologies etc. Better search indexing for findable data.
26. Can your database handle raw data / images / code etc.?

Yes. Older databases are flexible and could be adapted to include, or directly link to, raw data and image files etc.
No
27. Do you consider that images could themselves be a type of metadata?
Yes. Images can contain information on where and when they were captured, and also meta-information about the image itself in image-file-inherent metadata (if preserved by the technology/workflow). Some are results in themselves and may be scaled; many/some are more qualitative in nature.

Would you consider integrating FAIRness scores for data into your database?
Yes, probably for the majority; it is certainly anticipated that in time we will use such scores anyway.

Yes. We use the NanoInformatics Knowledge Commons - Instance Organizational Structure (NIKC-IOS), which informs how we structure the data that are curated into the NIKC database. Researchers populate an Excel template incorporating the NIKC-IOS, which enables users to capture experimental and bibliographic metadata. The NIKC database was designed to capture as much metadata as is necessary to make curated datasets re-usable. The data is mapped before it is uploaded into the NIKC Excel template, to help organise data and metadata into the NIKC-IOS.
Is the data in the database FAIR (Findable, Accessible, Inter-operable and reusable)?
No
a. What are the re-use conditions?
We currently do not have re-use conditions.

b. Who owns the data once in the database?
The NIKC database is meant to be a repository for researchers' datasets. The individual or individual's organization continues to own the data once uploaded onto the database. Any use or sharing of the data must be permitted by the curator of the data, even with other NIKC database users.

Does the database have a Quality Management System (e.g. ISO9000/9001)?
We are still developing guidelines for best practices.

What do you do with metadata (handle, analyse, exploit)?
We are currently using metadata for app development.
What could you do with metadata that you are not currently doing? Are there any plans to work with the existing metadata (e.g. statistical analysis)?
Our plan is eventually to use metadata for analysis (statistical, machine learning, best practices for regulation, further app development, and assay development).
We request users of our data to reference the DB portal (s2nano.org) and published articles.
Do you DOI your datasets or use a unique referencing/ID system?

Do you have a licensing system for your data?
No
Do the available data originate from extracted literature data, experimental data (e.g. raw, processed, from images, directly from instruments), computational or simulation data or from all three? Please tick all that apply.

All of the above
For extracted literature data, is the extraction a passive or interactive process?

Passive
How was/is the database populated?
Both
Do you use a standardised curation method? Please provide some details.
We have used the specific physicochemical (PChem) score screening and nano-specific data-gap-filling method proposed by S2NANO for data curation [1,2,3]. The PChem score screening system evaluates the quality and completeness of PChem data, while the nano-specific data-gap-filling method replaces missing values with manufacturer's specifications and/or estimations. The quality and completeness of PChem data were determined by a set of rules that gave a score for each PChem attribute (i.e. core size, hydrodynamic size, surface charge and specific surface area). The PChem score for each attribute is composed of two sub-scores: one for the reliability of the data source and another for the reliability of the measurement method.
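The two-sub-score structure described above can be sketched as follows. This is a minimal illustration only: the reliability categories and numeric values are hypothetical, not the published S2NANO scoring rules.

```python
# Hypothetical sub-score tables: one for data-source reliability,
# one for measurement-method reliability (NOT the published S2NANO values).
SOURCE_SCORE = {"measured_in_study": 2, "manufacturer_spec": 1, "estimated": 0}
METHOD_SCORE = {"standardised": 2, "in_house": 1, "unreported": 0}

def pchem_attribute_score(source: str, method: str) -> int:
    """Score one PChem attribute as source sub-score + method sub-score."""
    return SOURCE_SCORE[source] + METHOD_SCORE[method]

def pchem_total(attributes: dict) -> int:
    """Sum attribute scores, e.g. over core size, hydrodynamic size,
    surface charge and specific surface area."""
    return sum(pchem_attribute_score(s, m) for s, m in attributes.values())

record = {
    "core_size": ("measured_in_study", "standardised"),
    "hydrodynamic_size": ("measured_in_study", "in_house"),
    "surface_charge": ("manufacturer_spec", "unreported"),
    "specific_surface_area": ("estimated", "unreported"),
}
print(pchem_total(record))  # prints 8 (4 + 3 + 1 + 0)
```

The point of the structure is that a value measured in-study with a standardised method outscores the same value taken from a manufacturer's specification, so gap-filled records are automatically flagged as lower confidence.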

Yes
You answered Yes to the question above. Can you please provide some more details?
We have a scoring system for QC of metadata, such as an INFO score from bibliographical metadata (journal name, journal information, etc.), and PChem and Tox scores from technical metadata (the methods and protocols used to produce the data, instrument details and settings).
Does your database link to underlying protocols used to generate the data?

No
What are the main challenges you experience in relation to metadata?
In our experience, the completeness and quality of data are the challenges. Concerns about completeness and quality apply not only to the physicochemical data of nanomaterials but also to in vitro toxicity data. Data completeness refers to missing data: some groups perform an experiment to measure a parameter while other groups do not. Concerns about data quality relate to standard measurements (e.g. Good Laboratory Practice, ISO protocols), which not all groups follow.
What do you do with metadata (handle, analyse, exploit)?
We mostly use our metadata for pre-processing purposes, to generate datasets for predictive model development. For example, we use metadata to sort, filter and screen the original dataset to generate higher-quality or fit-for-purpose datasets.
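The sort/filter/screen step described above can be sketched as a simple metadata-driven selection. Field names, score values and thresholds are illustrative assumptions, not this database's actual schema:

```python
# Hypothetical records carrying quality-score metadata alongside the data.
records = [
    {"id": "NM-101", "pchem_score": 8, "tox_score": 6},
    {"id": "NM-102", "pchem_score": 3, "tox_score": 7},
    {"id": "NM-103", "pchem_score": 9, "tox_score": 2},
]

def fit_for_purpose(data, min_pchem=5, min_tox=5):
    """Screen records by minimum quality scores, then sort the survivors
    by physicochemical score (best first)."""
    kept = [r for r in data
            if r["pchem_score"] >= min_pchem and r["tox_score"] >= min_tox]
    return sorted(kept, key=lambda r: r["pchem_score"], reverse=True)

print([r["id"] for r in fit_for_purpose(records)])  # prints ['NM-101']
```

Raising or lowering the thresholds trades dataset size against quality, which is exactly the "fit-for-purpose" decision mentioned above.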
What could you do with metadata that you are not currently doing? Are there any plans to work with the existing metadata (e.g. statistical analysis)?
We hope to expand the current collection of metadata in scope as well as quantity, and would like to perform exploratory data analysis for new model developments.
Can your database handle raw data / images / code etc.?
Yes
Do you consider that images could themselves be a type of metadata?

No
Would you consider integrating FAIRness scores for data into your database?
Each database entry is manually checked. However, no strict criteria have been applied.

Comments/Notes
Is the data in the database FAIR (Findable, Accessible, Inter-operable and reusable)?
No
a. What are the re-use conditions?
None: once the database is exported to eNanoMapper, anyone can use the data.

b. Who owns the data once in the database?
The database is currently owned by RIVM but will be made publicly available.

Does the database have a Quality Management System (e.g. ISO9000/9001)?

No
Is there a Data Management Plan (DMP) and if yes can you provide some details / a link?
No
Do you use specific ontologies to annotate your data and metadata?

No
Have you experienced difficulties due to different definitions of key terms?

No
Please rank your database's stance on meta-data description:

Nice to have
Taking into account the different types of metadata: bibliographical (dataset owner(s), contact information, etc.), descriptive (dataset abstract, ontologies used, revisions, data format etc.), technical (the methods and protocols used to produce the data, instrument details and settings), which types of metadata are included in your database? (Select all that apply)

Bibliographical, Technical
For the types of metadata included, do you have a metadata QA/QC tool (e.g. common system, unified methodology) to replace manual evaluation?

No
Does your database link to underlying protocols used to generate the data?

No
What are the main challenges you experience in relation to metadata?
The main challenge is simply to obtain the metadata.
What do you do with metadata (handle, analyse, exploit)?
These are included in the database as separate entries.
What could you do with metadata that you are not currently doing? Are there any plans to work with the existing metadata (e.g. statistical analysis)?
Fill in data gaps.
Can your database handle raw data / images / code etc.?
No
Do you consider that images could themselves be a type of metadata?
No
Would you consider integrating FAIRness scores for data into your database?