The VAMDC Portal as a Major Enabler of Atomic and Molecular Data Citation

Abstract: VAMDC bridged the gap between atomic and molecular (A&M) data producers and users by providing an interoperable e-infrastructure connecting A&M databases, as well as tools to extract and manipulate those data. The current paper highlights the usage of the VAMDC Portal, recalls how data citation is implemented within VAMDC, and provides insights into how using VAMDC will increase the citation impact of A&M data producers and offer a more reliable citation of the A&M datasets used in application fields.


Introduction: State-of-the-Art
The "Virtual Atomic and Molecular Data Centre Consortium" (VAMDC Consortium, http://www.vamdc.eu) [1] is a worldwide consortium which federates atomic and molecular databases through an e-science infrastructure and an organisation to support this activity (http://www.vamdc.org/structure/how-to-join-us/). About 90% of the interconnected databases handle data that are used for the interpretation of astronomical spectra and for the modelling of media in many fields of astrophysics. Other application fields include atmospheric physics, plasmas, fusion and radiation damage.
The current VAMDC e-infrastructure interconnects about 30 atomic and molecular databases that cover atomic and molecular spectroscopy and processes. VAMDC offers a common entry point to all incorporated databases through the VAMDC portal (http://portal.vamdc.eu). VAMDC also develops standalone tools to retrieve and handle the data; the SPECTCOL tool [2,3] is an example (http://www.vamdc.eu/software). VAMDC also provides software [4] and support in order to include new databases within the VAMDC e-infrastructure. One feature of the VAMDC e-infrastructure is the constrained environment for the description of data, in particular VAMDC-XSAMS, a standard XML file format (XML Schema for Atomic Molecular and Solid Data) 1 , and other standardized protocols (http://www.vamdc.eu/standards) that ensure a higher quality for the distribution of data. VAMDC-XSAMS is an evolved version of the XSAMS schema presented by [5] and it should be used together with the description of molecular quantum numbers provided by what we call the case-by-case schema 2 . Our recent publication [1] provides details about VAMDC-XSAMS and about the main databases included in the VAMDC e-infrastructure.
In 2016 the VAMDC Consortium started to collaborate with the Research Data Alliance (RDA) 3 and to work in its Data Citation Working Group 4 . Indeed the VAMDC Consortium intended to find a way for users to cite the datasets that the infrastructure provides. The RDA Data Citation Working Group provided the researcher and data centre communities with a recommendation to identify and cite dynamic data [6]: the proposed solution relies on a query-centric view and the set-up of a Query Store. Data should be stored in a versioned, time-stamped manner and accessed through queries. The Query Store stores all the identified and time-stamped queries, together with relevant metadata and the ability to recover the data as it existed at the time when a given query was executed.
Recently VAMDC, commissioned by the Research Data Alliance, has implemented the recommendations [6] of the RDA Data Citation Working Group. Indeed VAMDC provides an interesting science use case of a typical distributed infrastructure with geographically distributed databases, registries (meaning "yellow pages") and access tools. Within this context a first piece of work dealt with the provenance of datasets [7], meaning that versioning and data time-stamping have been included in the VAMDC-XSAMS schema. The second piece of work, for which RDA provided funding to the VAMDC Consortium, implemented the concept of the Query Store, which impacts both the node software [4] and the registries, and which has been integrated into the VAMDC Portal (http://portal.vamdc.eu). The technical description of the Query Store, which stores the time-stamped queries submitted to the VAMDC infrastructure, can be found in [8].
The current paper aims at presenting the VAMDC Portal coupled to the Query Store. We recall the key features of the VAMDC e-infrastructure, we show how the VAMDC Portal provides the users with the ability to access the Query Store and thus create DOIs for their datasets, and finally we discuss the potential impact of the VAMDC citation features for data users by presenting different science use cases.

Results: VAMDC Portal and Query Store
The VAMDC portal uses the VAMDC standards and technological developments to provide seamless access to the interconnected VAMDC databases. Through this single interface, displayed in Figure 1, a user can query any database that is a member of the VAMDC infrastructure and retrieve data in the common file format VAMDC-XSAMS 5 .
In order to be visible from the portal, as from any VAMDC tool or user, each database must provide a VAMDC compatible access thanks to the node software (as described in Section 2.1).

Connecting Heterogeneous Databases into the VAMDC Interoperable Infrastructure: The Role of the Node Software
The e-infrastructure connects in an interoperable way about 30 heterogeneous atomic and molecular databases. By providing data producers and compilers with a large dissemination platform for their work, VAMDC is successful in removing the bottleneck between data producers and the wide body of A&M data users. The "V" of VAMDC stands for "virtual" in the sense that the e-infrastructure does not contain data: it is a wrapper that exposes a set of heterogeneous databases in a unified way. An ad hoc generic wrapping software, called the node software [4], transforms an autonomous database into a VAMDC federated database, called a data-node. Each data-node accepts queries submitted in a standard grammar (cf. Section 2.3) and provides output formatted in VAMDC-XSAMS 6 . Each database must then be registered in a central repository called a registry, which provides a standardized application programming interface (API) to explore its content and to discover the available VAMDC resources (cf. Section 2.2).
Figure 1. VAMDC Portal welcome page: users see that they may access data using a simplified or an advanced interface, that they may save queries and have access to tools, and that they must accept the citation policy and the conditions of the disclaimer. Information is provided to users in real time about the number of databases interconnected by the VAMDC infrastructure.

Registry
The VAMDC registry is located at http://registry.vamdc.eu/registry-12.07/main/index.jsp. It is based on the work of the Astrogrid project 7 , which was the UK's Virtual Observatory development project from 2001 to 2010 [9]. Astrogrid developed a registry whose interface is based on the International Virtual Observatory Alliance standard [10]. Thus any user or service can easily know how to write queries to find the services. Each VAMDC service is registered in the VAMDC registry as a VOResource [11] that describes its metadata (service name, address, query parameters, type of data). To simplify access to this registry, the VAMDC consortium provides a Java library 8 that is used by the VAMDC portal, among other applications.

Query Language
Another key element used by the VAMDC portal, and central to the VAMDC infrastructure, is the query language, which we chose to be a subset of SQL, called VAMDC SQL Subset 9 , and which allows the user to query multiple databases simultaneously. This query language is understood by the data access protocol VAMDC-TAP 10 implemented on each database. The protocol accepts such queries and returns files in the VAMDC-XSAMS format. VAMDC-TAP is a simplified version of the TAP protocol from the IVOA [12]. VAMDC-TAP exposes only one table, which simplifies the query by removing the join part. To achieve this result, a dictionary has been defined 11 . Each quantity that any database can potentially return is defined in the dictionary with a given keyword. The node software then maps this quantity to an actual column in the database. For example, a request returning all the data related to the helium atom is written: Select all where AtomSymbol = 'he'
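As an illustration, the helium request above could be turned into a synchronous VAMDC-TAP call along the following lines. This is a minimal sketch and not the portal's code: the node address is hypothetical, and the "sync" path and the LANG, REQUEST and FORMAT parameter values reflect our reading of the VAMDC-TAP standard and should be checked against it before use. The helium symbol is written with its conventional capitalisation.

```python
from urllib.parse import urlencode

# Hypothetical node address; real node URLs are listed in the VAMDC registry.
EXAMPLE_NODE_TAP_URL = "https://node.example.org/tap"


def build_vamdc_tap_url(tap_base_url: str, query: str) -> str:
    """Return a synchronous VAMDC-TAP request URL for a query string.

    The parameter names and values below are assumptions based on the
    VAMDC-TAP standard, not something stated in this paper.
    """
    params = {
        "LANG": "VSS2",        # assumed identifier of the VAMDC SQL Subset
        "REQUEST": "doQuery",  # assumed TAP request type
        "FORMAT": "XSAMS",     # ask the node for VAMDC-XSAMS output
        "QUERY": query,
    }
    # urlencode percent-encodes the query so that it is safe inside a URL.
    return f"{tap_base_url.rstrip('/')}/sync?{urlencode(params)}"


helium_url = build_vamdc_tap_url(
    EXAMPLE_NODE_TAP_URL, "SELECT ALL WHERE AtomSymbol = 'He'"
)
print(helium_url)
```

Because every node exposes the same single table and the same dictionary keywords, the same query string can be sent unchanged to each node; only the base URL differs.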

Species Database
Being able to search by species name is a key element of an efficient search environment. However necessary it may be, it has proven to be a complex matter.
Looking for an atom is simple, as it is efficiently identified by its name or its symbol. For a molecule, though, things become more complex. A molecule's formula can be written in several ways, and the molecule can also be searched for by its stoichiometric formula or by one of the many standards or classifications that have been developed (SMILES, InChI, CAS number, ...).
To overcome this difficulty, the VAMDC infrastructure created a centralized chemical species repository, called the species database. Updated daily, it contains the list of all the species in each of the VAMDC databases. Each species is identified uniquely by an InChIKey, an identifier generated from an InChI description 12 . In the species database each InChIKey is associated with the different ways to identify a species, e.g., its chemical names, formula, stoichiometric formula and CAS number. By adding a REST API 13 and a web graphical interface to this species database (https://species.vamdc.eu), we provide a versatile tool to explore the species content of the atomic and molecular VAMDC connected databases.
In addition to relying on this REST API of the species database, the VAMDC portal provides both an auto-completion possibility and an isotopologues discovery feature. So it becomes possible to specify very precisely which species is the most relevant to your search.
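The idea of keying every species on its InChIKey while allowing look-ups through any other identifier can be sketched as follows. This is an illustrative in-memory model only; the class and field names are hypothetical and do not describe the actual species database schema or its REST API.

```python
from dataclasses import dataclass


@dataclass
class SpeciesRecord:
    """One species, uniquely identified by its InChIKey."""
    inchikey: str
    names: list[str]
    formula: str
    stoichiometric_formula: str
    cas_number: str


class SpeciesIndex:
    """Hypothetical index resolving any alias to the matching InChIKeys."""

    def __init__(self) -> None:
        self._by_inchikey: dict[str, SpeciesRecord] = {}
        self._alias_to_inchikeys: dict[str, set[str]] = {}

    def add(self, record: SpeciesRecord) -> None:
        self._by_inchikey[record.inchikey] = record
        aliases = [record.inchikey, record.formula,
                   record.stoichiometric_formula, record.cas_number,
                   *record.names]
        for alias in aliases:
            # A set is used because one alias (e.g., a stoichiometric
            # formula) may correspond to several distinct species.
            self._alias_to_inchikeys.setdefault(alias.lower(), set()).add(
                record.inchikey)

    def find(self, term: str) -> list[SpeciesRecord]:
        keys = self._alias_to_inchikeys.get(term.strip().lower(), set())
        return [self._by_inchikey[key] for key in sorted(keys)]


index = SpeciesIndex()
index.add(SpeciesRecord("XLYOFNOQVPJJNP-UHFFFAOYSA-N", ["water"],
                        "H2O", "H2O", "7732-18-5"))
index.add(SpeciesRecord("UGFAIRIUMAVXCW-UHFFFAOYSA-N", ["carbon monoxide"],
                        "CO", "CO", "630-08-0"))

print([record.inchikey for record in index.find("Water")])
```

Searching by name, formula or CAS number returns the same record, which is what makes the InChIKey a convenient pivot between databases that label species differently.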

Portal Search Interface
There are two search interfaces available, as displayed in Figure 1. Both of them are graphical tools that build VAMDC-TAP requests and thus hide the complexity of the internal language from the users.
The first graphical interface follows a step-by-step approach, as displayed in Figure 2. Each time a user chooses an option, a limited set of new options appears. It is particularly recommended for people discovering the portal, as the user is guided throughout the selection process.
The second interface, which is called "advanced", lets the users choose their search criteria by themselves. They can compose their own request by combining "Species", "Processes" and "Environment" characteristics, as displayed on the left-hand side of Figure 3. Each time the content of the query is updated, the list of databases that can answer is displayed in green on the right of Figure 3.

Portal Results Display, Query Store and Data Citation
Once a request has been completely parameterized by the user, it is converted into a VAMDC-TAP request and is sent to each database that knows about the dictionary keywords it contains (cf. the Query Language section). The results are displayed in a table where each line is the answer from one database, as shown in Figure 4.
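The selection of the databases that can answer a request can be pictured as a simple set comparison between the keywords used in the query and the keywords each node declares. The sketch below is an assumption about how such a filter could work, not the portal's implementation; the node names and their keyword lists are invented for illustration.

```python
import re

# Hypothetical registry extract: node name -> dictionary keywords it supports.
NODE_KEYWORDS = {
    "node_atoms": {"AtomSymbol", "RadTransWavelength"},
    "node_molecules": {"MoleculeStoichiometricFormula", "RadTransWavelength"},
    "node_collisions": {"AtomSymbol", "MoleculeStoichiometricFormula"},
}

# A restriction keyword is taken to be an identifier followed by a
# comparison operator (a simplification of the full query grammar).
_KEYWORD_PATTERN = re.compile(r"\b([A-Za-z]\w*)\s*(?:<=|>=|<>|!=|=|<|>)")


def keywords_in_query(query: str) -> set[str]:
    """Return the dictionary keywords used as restrictions in a query."""
    return set(_KEYWORD_PATTERN.findall(query))


def nodes_able_to_answer(query: str, node_keywords: dict) -> list[str]:
    """Return the nodes that declare every keyword used in the query."""
    needed = keywords_in_query(query)
    return sorted(name for name, supported in node_keywords.items()
                  if needed <= supported)


query = "SELECT ALL WHERE AtomSymbol = 'He' AND RadTransWavelength < 5000"
print(nodes_able_to_answer(query, NODE_KEYWORDS))
```

A node that does not declare one of the keywords is skipped rather than queried, which avoids sending requests that cannot be answered.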
For each database we provide a summary of the returned content, i.e., the number of species, of states and of processes, the latter being described as collisional, radiative or non-radiative transitions following the VAMDC-XSAMS schema. The user can then click on the "XSAMS File" button in order to download the data in the VAMDC-XSAMS format. A VAMDC-XSAMS file can be loaded into other tools, such as the SPECTCOL tool [2,3] for collisional and radiative transitions, or can be viewed with any editor.
The second column of the table, called "View data", provides a list of transformation options for the VAMDC-XSAMS file. The display options can convert the XML contained in the VAMDC-XSAMS file to another format, such as an HTML page with export functions, a HITRAN format [13] for molecular data, or a BibTeX file generated from the references attached to the data. With those processors the user can use the data directly without having to use any other tools.
Finally, under the "XSAMS file" button, a "Citation link" is displayed. This citation link leads to the Query Store, as displayed in Figure 5. The node software [4] has been upgraded to be the bridge between the client software used by the final user, here the VAMDC Portal, and the Query Store. The node software generates a token that is notified to the Query Store along with the request, and this token is also returned to the VAMDC Portal. More technical details can be found in [8]. The VAMDC Portal uses the token to get the persistent identifier of the request from the Query Store. Once it has received it, it displays the "Citation link" in the result page of Figure 4; this link goes to the landing page at https://cite.vamdc.eu/persistentId, as represented in Figure 6.
This landing page (Figure 6) is the typical human-readable Query Store landing page that a user reaches when resolving the persistent identifier associated with a given query. The landing page stores the persistent identifier (named "Query identifier" in Figure 6) associated with the query, the query itself, the name and version of the node answering the query, and the dataset produced by the node while processing the query, together with the bibliographic references extracted from the dataset. This bibliographic information displays the references cited in the Source element of the VAMDC-XSAMS file, and those references might be associated with any type of data included in the dataset.
For a finer-grained understanding of how these references span the different elements of the dataset, one must investigate the VAMDC-XSAMS file generated by the query. In the future we will improve the management of this fine granularity.
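The token exchange between the node software, the Query Store and the portal can be summarised with a small in-memory simulation. This is a sketch under stated assumptions: the class and method names are hypothetical, the landing-page address is a placeholder, and the real services communicate over HTTP as described in [8].

```python
import uuid
from datetime import datetime, timezone

# Placeholder address pattern, not the real Query Store URL scheme.
LANDING_PAGE_TEMPLATE = "https://cite.example.org/{persistent_id}"


class QueryStore:
    """Hypothetical stand-in for the Query Store service."""

    def __init__(self) -> None:
        self._id_by_token: dict[str, str] = {}
        self._records: dict[str, dict] = {}

    def notify(self, token: str, node_name: str, node_version: str,
               query: str) -> None:
        """Register a time-stamped query announced by a node."""
        persistent_id = str(uuid.uuid4())
        self._records[persistent_id] = {
            "query": query,
            "node": node_name,
            "node_version": node_version,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        self._id_by_token[token] = persistent_id

    def persistent_id_for(self, token: str) -> str:
        return self._id_by_token[token]

    def landing_page(self, persistent_id: str) -> dict:
        return self._records[persistent_id]


class DataNode:
    """Hypothetical node: answers a query and announces it to the store."""

    def __init__(self, name: str, version: str, store: QueryStore) -> None:
        self.name, self.version, self.store = name, version, store

    def execute(self, query: str) -> tuple[str, str]:
        token = str(uuid.uuid4())
        self.store.notify(token, self.name, self.version, query)
        data = "<XSAMSData/>"  # placeholder for the VAMDC-XSAMS output
        return data, token     # the token travels back to the client


def citation_link(store: QueryStore, token: str) -> str:
    """Portal side: exchange the token for the persistent identifier."""
    persistent_id = store.persistent_id_for(token)
    return LANDING_PAGE_TEMPLATE.format(persistent_id=persistent_id)


store = QueryStore()
node = DataNode("example-node", "1.0", store)
data, token = node.execute("SELECT ALL WHERE AtomSymbol = 'He'")
print(citation_link(store, token))
```

The token only pairs the portal's request with the record created by the node; the identifier that users cite is the persistent identifier issued by the Query Store.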
Currently the citation feature is implemented on about a third of the VAMDC connected databases, as displayed in real time in Figure 5 below "Accessed resources". Once a node has implemented the Query Store feature, any request to that node is registered in the Query Store. The Query Store content may be explored directly at the URL "https://cite.vamdc.eu". By clicking on the Query tab, the results represented in Figure 5 are displayed: the user may filter the results by date or by database (Accessed resources) and search for a persistent identifier, to which the corresponding landing page is attached. With the Query Store, the VAMDC infrastructure has a way to remember queries permanently. The user can then use the uniquely generated and persistent link to view the details of their query at a later date and even download the data again.
Figure 6. Human-readable Query Store landing page, obtained while resolving the persistent identifier associated with a query. In the present case the persistent identifier is: "17053a9a-e56e-451b-9bd2-8e0cddda0d5d". The database is STARK-B [14].

Beyond the Actual Data Citation Processes Used by Editors: The Scholix Initiative and the Query Store
Nowadays editors have essentially two basic mechanisms for linking articles to data repositories. One is entity linking, where the journal picks up a unique identifier or code in an article and establishes a deep link to the underlying data deposited elsewhere. The other one is banner linking, where an automated query is sent to the external database each time a new article is published, in order to check whether there is data available for this article. We explain the limitations of these approaches with the following examples:

• Let A be an article referencing a dataset D. If the data repository containing D has no a priori knowledge that the article A exists, it has to search the whole Internet and all the services of the existing journals to find out whether a paper citing D exists. The data traffic generated worldwide by this approach is enormous, if we consider that there are many data centres containing thousands of datasets and that each data centre, for each dataset it contains, will ask the same question to the same journals (currently approximately 8M articles are registered in the different editors' online services).
• Let us consider a data centre where a dataset D contains references to an article A. How may the publisher of A know that D exists?
These two examples show that the data-citation model currently adopted by the editors is not sustainable and does not meet the needs of the data-driven science community. The Scholix initiative [15] succeeded in establishing a high-level interoperability framework for exchanging information about the links between scholarly literature and data. It has been adopted by existing hubs or global aggregators of data-literature link information such as DataCite, CrossRef, OpenAIRE and EMBL-EBI, together with the Elsevier and Springer editors.
The Scholix recommendation is not implemented directly in the Query Store, but is a consequence of the interlinking between the Query Store and the Zenodo open science repository 14 . The link between the Query Store and Zenodo is implemented using the Zenodo public REST API 15 in the Query Store software. The landing page of Figure 6 displays a button "Get a DOI" if the query has not already been assigned a DOI. By clicking on this button, the user triggers the Zenodo registration process: the file associated with the query is uploaded to Zenodo using the Data Set upload type and all the query-associated metadata are copied to the corresponding Zenodo fields. When the upload process finishes successfully, Zenodo provides the Query Store with a DOI, which is stored in the Query Store and associated with the query. When a user displays a landing page/query which has already been copied to Zenodo, the button "Get a DOI" is replaced by a DOI badge.
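The copy of query metadata into Zenodo fields might look like the mapping below. This is a hedged sketch: the structure of the query record is hypothetical, the field names ("upload_type", "creators", "references", ...) follow our understanding of the Zenodo deposit metadata and must be checked against the Zenodo REST API documentation, and no network call is made here.

```python
import json


def zenodo_metadata_for_query(record: dict) -> dict:
    """Map a (hypothetical) Query Store record to a Zenodo deposit payload."""
    description = (
        f"Dataset returned by the node {record['node']} "
        f"(version {record['node_version']}) for the query: {record['query']}"
    )
    return {
        "metadata": {
            "title": f"VAMDC query {record['query_id']}",
            "upload_type": "dataset",  # the 'Data Set' upload type
            "description": description,
            "creators": [{"name": name} for name in record["creators"]],
            # Bibliographic references extracted from the VAMDC-XSAMS file.
            "references": list(record["references"]),
        }
    }


# Illustrative record; the creator and reference strings are placeholders.
example_record = {
    "query_id": "17053a9a-e56e-451b-9bd2-8e0cddda0d5d",
    "query": "SELECT ALL WHERE AtomSymbol = 'He'",
    "node": "example-node",
    "node_version": "1.0",
    "creators": ["Example, Author"],
    "references": ["Placeholder reference extracted from the dataset"],
}

payload = zenodo_metadata_for_query(example_record)
print(json.dumps(payload, indent=2))
```

In a real deployment this payload would be sent with an authenticated request to the Zenodo deposit endpoint, followed by the upload of the data file and the publication step that mints the DOI.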
Zenodo is indexed in OpenAIRE 16 , and OpenAIRE implements Scholix through its Data-Literature Interlinking Service 17 ; therefore all the VAMDC queries registered by the Query Store in Zenodo are included in those infrastructures. When some data extracted from VAMDC are cited (in papers and/or other datasets) through the DOI obtained from the Query Store/Zenodo pair, the authors of the works referenced by the VAMDC data receive credit automatically. Indeed the scholarly links harvested from the Data-Literature Interlinking Service 18 flow to the hubs implementing Scholix (e.g., CrossRef and DataCite). In addition, members of the SAO/NASA Astrophysics Data System (ADS) are active members of the RDA-Scholix Working Group and are working on implementing Scholix in ADS; therefore, once this is implemented, credits will flow automatically to the ADS system.

Discussion: Science Use Cases for Data Citation Impact
Therefore the VAMDC Portal provides access to the landing pages identified with Unique Identifiers. As all queries are stored in the Query Store for a period of time, users can find the landing page at a later stage using the Unique Identifier that they can store. We present below several science use cases where the implemented citation features can be used to identify the datasets that are kept in their final analysis.

Analysis of Astrophysical Spectra
An example of usage is the need to query atomic or molecular lines in a given frequency or wavelength range for Local Thermodynamic Equilibrium analysis. The VAMDC infrastructure greatly facilitates the retrieval of large quantities of data, to which an equally large quantity of references is attached. The usual strategy of users is to cite the databases that have been queried, but rarely the original authors of the papers. Before VAMDC it was very painful to collect all the references, so this was understandable. With VAMDC the references are readily available, but the pages of citations would exceed the content of the paper. We believe that this problem is now solved with our system. Users can query any database, download the data files and use them in various applications; prior to publication of their spectral analysis, they will decide which spectroscopic data have been the most relevant to their study. For those data they will assign a DOI to each dataset through the Query Store and will cite the DOIs in their publication. As explained above (see Section 2.6), the authors of the works referenced by the VAMDC data receive credit automatically.
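A wavelength-range request of the kind described above could be composed as in the following sketch. The helper is hypothetical; "AtomSymbol" and "RadTransWavelength" are assumed to be the relevant dictionary keywords, the wavelength unit is assumed to be the one defined by the VAMDC dictionary, and the availability of the ">=" and "<=" operators should be verified in the query language standard.

```python
def build_line_query(atom_symbol: str, wavelength_min: float,
                     wavelength_max: float) -> str:
    """Compose a query selecting the lines of one atom in a wavelength range.

    The keyword names and comparison operators are assumptions about the
    VAMDC dictionary and query language.
    """
    if not atom_symbol.isalpha():
        # Reject anything that is not a plain element symbol, so that
        # no quoting issue can arise in the generated query.
        raise ValueError(f"invalid atom symbol: {atom_symbol!r}")
    if wavelength_min > wavelength_max:
        raise ValueError("wavelength_min must not exceed wavelength_max")
    return (
        f"SELECT ALL WHERE AtomSymbol = '{atom_symbol}' "
        f"AND RadTransWavelength >= {wavelength_min} "
        f"AND RadTransWavelength <= {wavelength_max}"
    )


iron_query = build_line_query("Fe", 5000.0, 5100.0)
print(iron_query)
```

Each such query, once executed on a node implementing the Query Store, is recorded with its own persistent identifier, so the datasets retained in the final analysis can be cited individually.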

A Scientific Use Case: Intercomparison of Databases
We provide below a scientific use case that could be attractive to atmospheric physicists and planetologists. These communities usually use the HITRAN 160-character file format [13] as an input for their radiative transfer or atmospheric modelling codes. Thus, the original VAMDC-XSAMS output format produced by the data-nodes 19 is neither directly nor easily usable by these users. Within this context, we developed a new tool that converts to the HITRAN file format any VAMDC-XSAMS file produced by any data-node containing molecular data; this application follows the VAMDC-XSAMS consumer protocol 20 standard of VAMDC. In order to reach this new level of interoperability, several databases had to make some adjustments to their contents. For instance, the HITRAN intensity unit (at 296 K) was added to the VAMDC-XSAMS standard; the CDMS [16][17][18] and JPL [18,19] databases added a new field for this intensity unit in their VAMDC-XSAMS processor.
As already mentioned, the HITRAN conversion (see Figure 4) is available for all the molecular databases already integrated into the VAMDC infrastructure (e.g., HITRAN [20], CDMS [16][17][18], JPL [18,19], MeCaSDa [21,22], ...): this allows quick and direct comparisons between them. To achieve this aim, a graphical chart can be produced by loading two HITRAN output files, either coming from the VAMDC portal thanks to the above-mentioned converter or from the HITRAN online service 21 . The following on-line tool 22 might be used for that purpose. Such a combination of tools may help data producers to check the consistency of their data and to point out differences in database content. For instance, some databases include only experimental, fitted or calculated ab initio data, or mixtures of these different sources. Also, some databases include isotopic abundance factors and others do not. Figure 7 gives an example of a comparison between the HITRAN and JPL databases: such a comparison would have been very cumbersome in the past. We mention that we intend to provide the same tools for the GEISA [23] file format as soon as possible.
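To make the comparison step concrete, the sketch below reads the leading fixed-width fields of HITRAN 160-character records and pairs up lines from two files that share the same molecule, isotopologue and (within a tolerance) wavenumber. It is not the on-line tool mentioned above: the column positions follow our reading of the HITRAN format description, the tolerance is arbitrary, and the two records are synthetic values generated only to exercise the code.

```python
from typing import NamedTuple


class HitranLine(NamedTuple):
    molecule_id: int
    isotopologue: str   # kept as text: the field is a single character
    wavenumber: float   # transition wavenumber
    intensity: float    # line intensity at 296 K


def parse_hitran_line(record: str) -> HitranLine:
    """Parse the first four fields of one 160-character record."""
    record = record.rstrip("\r\n")
    if len(record) != 160:
        raise ValueError(f"expected 160 characters, got {len(record)}")
    return HitranLine(
        molecule_id=int(record[0:2]),
        isotopologue=record[2:3],
        wavenumber=float(record[3:15]),
        intensity=float(record[15:25]),
    )


def match_lines(lines_a, lines_b, tolerance=1e-3):
    """Pair lines of the same species whose wavenumbers agree.

    Returns (line_a, line_b, intensity ratio b/a) tuples.
    """
    matches = []
    for a in lines_a:
        for b in lines_b:
            same_species = (a.molecule_id, a.isotopologue) == (
                b.molecule_id, b.isotopologue)
            if same_species and abs(a.wavenumber - b.wavenumber) <= tolerance:
                matches.append((a, b, b.intensity / a.intensity))
    return matches


def make_record(molecule_id, isotopologue, wavenumber, intensity):
    """Build a synthetic record; the unused columns are left blank."""
    head = f"{molecule_id:2d}{isotopologue}{wavenumber:12.6f}{intensity:10.3e}"
    return head.ljust(160)


# Synthetic values, not taken from any database.
file_a = [parse_hitran_line(make_record(5, "1", 2147.081139, 1.0e-20))]
file_b = [parse_hitran_line(make_record(5, "1", 2147.081200, 2.0e-20))]
for a, b, ratio in match_lines(file_a, file_b):
    print(a.wavenumber, b.wavenumber, ratio)
```

An intensity ratio far from unity for a matched pair is the kind of discrepancy that can reveal, for instance, that one database includes isotopic abundance factors while the other does not.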
While VAMDC already provided interoperable access to all its data-nodes, the described HITRAN processor provides the community with an easy tool for comparing and cross-matching data coming from heterogeneous molecular databases. Molecular scientists may particularly welcome these new features. As the potential adoption of this tool is wide, embedding the citation feature into the VAMDC portal is very important for data producers and providers: for a given analysis, the relevant and interesting data can be uniquely identified through the VAMDC citation feature. The combined usage of the processor for visualisation and of the data-citation feature of the VAMDC portal provides a new paradigm for carrying out the analysis of spectra.

Prospective Use of VAMDC in Numerical Codes Packages: Example of Cloudy and PDR Code
Cloudy is a non-local thermodynamic equilibrium spectral synthesis and plasma simulation code designed to simulate astrophysical environments and predict their spectra. A recent effort [24] has been to move Cloudy's atomic and molecular data into external databases. They use external databases such as CHIANTI 23 and LAMDA 24 . For some ions they indicate using data from version 7.1.4 of the CHIANTI database. For versioning of LAMDA it is indicated that they downloaded the LAMDA data on 30 June 2015. For data citation they advise users to go to the websites of the CHIANTI [25] and LAMDA [26] databases.
19 We recall (cf. Section 2.1) that we call data-node a database which joined the interoperable VAMDC e-infrastructure.
20 https://standards.vamdc.eu/#xsams-processor-service
21 http://hitran.org
22 http://www.vamdc.org/hitran-display/
23 http://www.chiantidatabase.org/
24 http://www.strw.leidenuniv.nl/~moldata/
It should be noted that CHIANTI is part of VAMDC, whereas LAMDA is not. Nevertheless the collisional data of LAMDA are in the BASECOL [27] database. The BASECOL 25 database has a version attached to each of its datasets and can be queried via the VAMDC portal or with the SPECTCOL tool [2,3]. BASECOL will be referenced in the Query Store.
The Meudon PDR code 26 [28], that can be used to study the physics and chemistry of diffuse clouds, photodissociation regions (PDRs), dark clouds, is another example of numerical code where the atomic and molecular data are externalized.
By using VAMDC facilities to query data, it will be possible to uniquely identify the data in time and to cite the DOI in the code; again the producers of atomic and molecular data will be acknowledged through the pipeline that we have put together.

Prospective Use of VAMDC for Secondary Databases
The VAMDC e-infrastructure could be heavily used to produce secondary databases, for example to prepare the analysis of specific space missions or for other specific purposes. For example, the Belgian repository of fundamental atomic data and stellar spectra (BRASS) [29] aims to provide the largest systematic and homogeneous quality assessment of atomic data to date in terms of wavelength, atomic and stellar parameter coverage. To do so, its authors retrieved atomic data from repositories and cross-matched the data. They mention that the majority of repositories were retrieved via the VAMDC e-infrastructure and that they are grateful for the current efforts of the VAMDC team in homogenising the repositories, as this has helped to expedite their comparisons and cross-match work. This is of course one of the main achievements of VAMDC: to facilitate the retrieval and comparison of data. The new citation feature will allow them to trace the data that they queried in the VAMDC repositories.

Conclusion and Future Work
From the start of the VAMDC project in 2009, one of our goals has been to increase the citation impact of data producers. Indeed we find that the current practice when citing spectroscopic data is to cite the database. For example, a search of the NASA ADS system with the word "CDMS" in the text shows that only the CDMS database's [16][17][18] URL and/or reference are provided in the majority of the astrophysical papers. It should be stressed that atomic and molecular data require months to be either measured or calculated, and therefore it is a loss of visibility and recognition that only databases are cited in users' papers. We believe that the Query Store coupled to the VAMDC portal now allows this flaw to be overcome, even if additional refinements need to be carried out. This paper encourages users to explore the different tools and to provide the VAMDC collaboration with feedback on their usage in order to improve the system.
While writing this paper we found another key interest of the Query Store, which is to give a better view of the references attached to a given set of data. Of course references have been present in the VAMDC-XSAMS files and accessible from the visualisation tools for years. Nevertheless, only with the Query Store did we find out that the references attached to some files were totally incoherent (we thank one of the referees for pointing out such incoherence). This is not related to the Query Store, but to the VAMDC-XSAMS files provided by the node. This shows that a large scientific survey must be carried out using the Query Store in order to improve the output VAMDC-XSAMS files. Another interesting remark from one of the referees concerned the absence of any selection in the reference list: the references might be attached to different quantities in the VAMDC-XSAMS files, and the user might not want to use all of them. This issue will be addressed in the future.
Currently only a few VAMDC connected databases have implemented the Query Store feature; Figure 5 shows the list of databases that had implemented it at the time of publication. Future work includes the implementation of the citation feature on all the VAMDC connected databases by the end of 2019.
Some additional future technical work will be to group the queries so that only one DOI is assigned to several queries. Finally we are currently implementing the Query Store feature into SPECTCOL [2,3], another VAMDC tool that can query molecular spectroscopic and collisional databases, and then display and match the molecular data.

Materials and Methods
The VAMDC portal is fully operational and the Query Store has been implemented as described above. Users can now access the citation features, and the number of nodes implementing them will increase.

Conflicts of Interest:
The authors declare no conflict of interest.