Consistency, Inconsistency, and Ambiguity of Metabolite Names in Biochemical Databases Used for Genome-Scale Metabolic Modelling

Genome-scale metabolic models (GEMs) are manually curated repositories describing the metabolic capabilities of an organism. GEMs have been successfully used in different research areas, ranging from systems medicine to biotechnology. However, the different naming conventions (namespaces) of databases used to build GEMs limit model reusability and prevent the integration of existing models. This problem is known in the GEM community, but its extent has not been analyzed in depth. In this study, we investigate the name ambiguity and the multiplicity of non-systematic identifiers and we highlight the (in)consistency in their use in 11 biochemical databases of biochemical reactions and the problems that arise when mapping between different namespaces and databases. We found that such inconsistencies can be as high as 83.1%, thus emphasizing the need for strategies to deal with these issues. Currently, manual verification of the mappings appears to be the only solution to remove inconsistencies when combining models. Finally, we discuss several possible approaches to facilitate (future) unambiguous mapping.


Introduction
Genome-scale metabolic models (GEMs) combine available metabolic knowledge of an organism in a consistent and structured way that allows prediction and simulation of metabolic phenotypes [1]. GEMs have been successfully used in different research areas, ranging from biotechnology to systems medicine, often resulting in new insights on metabolic processes in living organisms [2][3][4][5]. GEMs may differ in content and scope, and can contain anything from a few hundred to a few thousand reactions and metabolites. However, the structure of the model remains similar regardless of the application: the main components are metabolites, metabolic reactions, enzymes and the corresponding encoding genes.
The construction of a GEM includes three main steps [6,7]. First, the genome of the organism considered is functionally annotated to identify enzymes and the associated reactions and metabolites. Second, the list of enzymes and reactions is converted into a mathematical model, a so-called draft model, in the form of a stoichiometric matrix to which constraints are added to account for reaction reversibility and uptake and secretion of metabolites. Last, the model is manually curated using experimental data (such as growth data), information from literature, and/or expert knowledge.
Manual curation involves human workload and entails the verification of each reaction in the model and its corresponding constraints, which is a very time-consuming task. Tools and pipelines (such as, for example, the SEED [8], Pathway Tools [9], and the Raven toolbox [10]) have been developed to automatize the annotation, draft the reconstruction and to aid high-throughput creation of genome-scale draft models [11].
The tools for automated draft reconstruction rely on biochemical databases that are used to find reactions associated with the enzymes identified in the genome through annotation. In general, different tools use different databases. For instance: the SEED uses its own naming system [8], Pathway Tools [9] uses MetaCyc [12], and Raven [10] uses KEGG [13]. Every database uses its own namespace which is a particular set of identifiers (such as numerical tags or names) for metabolites and reactions: because of this, it can happen that the same metabolites and reactions have different naming conventions when different tools are used to generate draft GEMs. To complicate the matter further, researchers often tend to use their own naming conventions such as custom abbreviations for metabolites or consecutive numbering for metabolites and reactions and this adds up to the observed heterogeneity of names and identifiers found in GEMs available in the literature [14]: the use of unique identifiers, independent from the particular databases used, such as InChI [15,16], or references to interlink different namespaces, have been suggested as an essential and fundamental part of GEM [17] but this is seldom implemented.
GEMs are manually curated knowledge repositories integrating information from independent (organism-specific) sources and thereby provide a comprehensive representation of what is presently known about the metabolism of the modelled organism. There is often the need to combine the information stored in individual GEMs to arrive to a consensus metabolic model for a given organism [18,19]. The use of different namespaces limits the reusability of a GEM and often makes it impossible, or extremely laborious, to combine two GEMs. Furthermore, it often hampers model expansion, which is the addition of new reactions and/or metabolites to an existing model because if different namespaces are used the same metabolite can be added many times with different names and, consequently, considered as different chemical entities which can, in the worst case, invalidate the model. In principle, different GEMs can be combined into a community model (partially) representing the different organisms present in a microbial community, with the aim of modelling community metabolic interactions such as cross feeding or substrate competition [20].
Since mapping manually different namespaces is highly laborious and practically unfeasible for large models [18], the only viable solution to integrate different GEMs has often been to rebuild de novo the required models [21,22]. However, while this approach leads to models that can be easily combined, it causes the loss of all the expert knowledge introduced in the manual curation process.
Naive direct comparison of names using string algorithms is often insufficient [23] and to help mapping among different namespaces in a more systematic way tools for consensus model generation and for automatic translation have been introduced [19,24], together with databases such as MNXRef from MetaNetX [25] and MetRxn [26], developed to provide cross-linking among the identifiers in the namespaces of different databases.
In fact, mapping different namespaces using metabolite or reaction identifiers is not a trivial task because researchers often refer to compounds with many different names and abbreviations and the namespaces reflect this ( Figure 1A). Often in GEMs different chemical entities (like, for instance, citrate and citric acid) are used as exchangeable names and may end up in databases such as Biochemical, Genetic and Genomic (BiGG) (which harvest reactions which have been used in metabolic modelling) resulting in imprecise, misleading, and sometimes incorrect synonyms. Similarly, GEMs are often built featuring reactions using generic compound classes (such as 'Lipids' or 'Protein'). When these are included in GEMs databases they cause the same compound to be linked to different identifiers.
Internal database inconsistency is also often caused by ambiguous abbreviations, with the same shorthand used for different compound ( Figure 1B). To make the matter worse, the same abbreviation can refer to different compounds in different databases (see Figure 1C). The problems deriving from the inconsistency and the ambiguity in the namespaces of reaction databases used to build GEMs have been mentioned before [27][28][29][30] and are a well-known source of complaints in the modelling community. However, since the extent of the namespace mapping problem has not been so far analyzed in depth, we investigate the level of inconsistency and ambiguity encountered when (i) mapping metabolites within a database and (ii) mapping metabolites between two databases. To this task, we analyzed and compared naming and identifier conventions in 11 biochemical databases commonly used for metabolic modelling and metabolomic data analysis. Similar research has been done for small-molecule databases that have been used in pharmaceutical research but did not consider databases used for metabolic modelling [31]. With this work we aim at raising awareness on this problem within the modelling community; provide a framework for evaluating when (or whether) GEMs and databases can be combined, suggest practices for dealing with this issue on the short term and outline a strategy for a long-term solution.

Results
To avoid ambiguity, we explicitly define the specific terms used in this study as follows: • Identifier (ID): Identifiers are strings of alpha-numeric characters used to identify uniquely a metabolite or a reaction in a database. Examples are C00001 in KEGG or ATPM in BiGG. • Name: Here we use name to refer not only to the chemical name, but also to the set of aliases, synonyms, and abbreviations that are often included in a database as other names of the compound. For instance, the KEGG ID C00001 is associated with the name 'water'. • Multiplicity: describes the case on which a single ID is linked to multiple names. For instance, the KEGG ID C00001 is associated with the names 'water' and 'H20'; therefore, we state that this ID has a multiplicity of 2. • Ambiguous. The Merriam-Webster dictionary defines ambiguous (second entry) as 'capable of being understood in two or more possible senses or ways'. Here, we use ambiguous (and its derivatives) to refer to the case on which the same name links to more than one ID in the same database. An example is shown in Figure 1B, where the name 'H' links to the MetaCyc IDs 'PROTON' and 'HIS', associated with 'hydrogen ion' and 'L-histidine', respectively. • Consistency: We use consistency (and its derivatives such as consistent) to refer to mappings on which a molecular entity is mapped to itself. It follows that inconsistency is used to indicate a mapping or a database on which a molecular entity is associated with a different one.
We have analyzed 11 biochemical databases for their consistency, and we have performed pairwise comparisons to investigate the degree of inter-database consistency. These databases were chosen for this study, primarily, because they were integrated in MetaNetX which facilitates data retrieval. Many of them (BiGG, KEGG, SEED, HMDB, ChEBI and MetaCyc) are commonly often used for metabolic model reconstruction [14,32]; HMDB is the reference database for metabolomic studies.

Name Ambiguity
We calculated the average number of IDs per compound name for each of the 11 databases: results are summarized in Table 1. With ChEBI and Reactome as exceptions, in most databases the average ID number is around 1: however, there is a low consistency. Reactome has the lowest consistency: nearly 30% of compounds are associated with more than one ID, metabolites with generic descriptive names such as 'secretory granule lumen proteins', 'secretory granule membrane proteins', and 'ficolin-rich granule lumen proteins' associate to 34 different IDs; there are also more specific names, such as 'hydron', 'water' and 'ATP' associated with 21, 14, and 11 IDs, respectively. In the latter cases the cause is that different IDs are used to indicate the same metabolite in different subcellular compartments, although they all get assigned to the same name, for example the ID 5278291 indicates water in the cytoplasm while water in extracellular compartment is identified as 109276.
Overall, the most ambiguous metabolite name is 'lecithin', which is associated with 921 different IDs in the Human Metabolome database (HMDB). In this database, the most ambiguous names are general compound classes such as 'diacylglycerol', 'PPP' and 'pyridin-3-ylboronic acid'.
The overall consistency of HMDB is very high, as only 1.7% of names are linked to multiple IDs, followed by ChEBI and KEGG, where 14.8% and 13.3% of names map to multiple IDs; also in ChEBI 'lecithin' is the most ambiguous compound, linked to 413 IDs; other ambiguous names are, again, generic names such as 'Diglyceride', 'Diacylglycerol', 'Triglyceride' and 'Triacylglycerol' (see Figure 2A). Also in KEGG the most ambiguous names refer to generic compounds such as 'DS-18' with 16 corresponding IDs. Furthermore, this compound shares ID with 'Chondroitin 4-sulfate' which is a sulfated glycosaminoglycan while DS-18 generally refers to glycan, which further complicates metabolite characterization, as shown in Figure 2B.
EnviPath and SLM databases have also relatively low consistency with 7% names being ambiguous. SLM is the largest database considered (>1.2 × 10 6 entries) and the most ambiguous name refers to 'Triacylglycerol'. In enviPath the most ambiguous compound is 'compound 0044249', with SMILES representation CC1=CC=C(C=C1O)O that corresponds to 4-methyl-1,3-benzenediol. In this database, many metabolites are renamed with numbers, i.e., 'P06','M320I23', or 'compound 869', which makes it cumbersome to the human user to identify them.
Other databases, namely SABIO-RK, MetaCyc, and LIPID MAPS are highly consistent, with SABIO-RK containing only 8 metabolites with ambiguous names.

ID Multiplicity and Use of Synonyms
In an effort to increase readability of entries in the database, often multiple names are linked to the same ID, i.e., IDs have a multiplicity larger than 1. Please note that multiplicity is different from ambiguity as defined at the start of the Results section. Multiplicity increases human readability and is beneficial, as long as the alias, names, and synonyms describe the same metabolite. Table 2 presents the average ID multiplicity for the 11 databases considered. Table 2. ID multiplicity in each database: number of IDs in each database, average number of names per ID (average multiplicity), percentage and number of IDs that associate to more than one name, and highest number of names an ID links to; s.d. stands for standard deviation. Blue boxes are used to highlight highest numbers.

Database
#ID Average Multiplicity ± s.d. BiGG is the only exception to this rule. Every metabolite identifier is associated with one and only one metabolite name, but, as shown in Table 1, the contrary does not hold true. BiGG is the smallest database here considered (with only 5102 metabolite names and 5174 metabolite IDs), although it should be stressed that this database has been built by integrating reactions and metabolites appearing in several published and manually curated genome-scale metabolic networks.

% of IDs with
All other databases have some extent of multiplicity: in ChEBI, HMDB, MetaCyc, SLM and LIPID MAPS nearly 100% of IDs are linked to more than 1 name. The use of multiple names is intended to increase usability of the database. However, inconsistencies might arise when ambiguous names are linked to IDs with high multiplicity, as illustrated in Figure 2C,D. This can result in errors and mismatches when identifying compounds. A most extreme case is Reactome identifier reactome:5278291 which is linked to 1106 difference names (see Figure 2C), among them 'H2O', 'water', 'phys-ent-participant60981' and 'phys-ent-participant63109'. The latter two names are linked to identifiers pointing to 'diphosphate' and 'pyruvate', which means that within this database is possible to map 'water' to 'pyruvate'. Other striking examples can be found in Table 3. When mapping with these compounds extra care needs to be taken. MNXRef is a common namespace derived from MetaNetX and has been developed to combine namespaces from multiple databases and provides links between compounds (and identifiers) from different databases, the overarching goal is to enable bringing together GEMs.
We found that each of the IDs in the 11 databases link to a MNXRef ID; however, as shown in Table 4, one MNXRef ID can connect to several IDs within a database resulting in a multiplicity larger than 1. This happens, for instance, when MetaNetX associates one ID to several compound synonyms. This might be due to conscious modelling-specific decisions. For instance, it would make sense to combine citrate/citric acid identifiers in different databases to deal with protonation state differences. Thus, linking several IDs to the same MNXRef ID addresses the multiplicity present in the database. However, this also generates errors if the ID links to ambiguous names. The most striking case is observed when mapping Reactome and MetaNetX: 2058 MetaNetX IDs are associated with Reactome IDs and 41.93% of them link to more than one Reactome ID.

Namespace Mapping between Databases
To study namespace consistency between databases, we performed a pairwise mapping of the 11 databases. We performed the mapping using both the names in the corresponding database and MNXRef identifiers. Table 5 shows the results of pairwise database mapping using metabolite names. Here, we map IDs in the databases using associated names. The databases have different metabolite coverages, for instance SLM contains 1218750 names while BiGG only 5102, this is because some are specific for a certain class of compounds (like SLM for lipids) while others aim to be comprehensive and do not describe all compound classes in exhaustive details (like HMDB for lipids). The difference in coverage and multiplicity of names associated with IDs (previously presented in Tables 1 and 2) can cause the mapping between two databases not to be symmetric as evident from Table 5.

Mapping between Databases Using Metabolite Names
In all comparisons, the fraction of compounds sharing the same name is rather limited. Overall, except for mapping from SEED to KEGG and ChEBI with 60.1% and 57.2% overlap, respectively, all databases have less than 50% of compound names in common. The namespace of ChEBI has the largest overlap with other namespaces: around 40% towards MetaCyc, Reactome, and KEGG can be mapped to ChEBI. The namespaces of SLM, enviPath, and LIPID MAPS have the smallest overlap with other namespaces, which is most likely because these are very specific databases. The low ratios in Table 5, indicate that mapping using string algorithms is not effective since trivial differences in the names (such as the use of underscore and hyphen) can results in mismatches.
Ambiguous naming, i.e., one name associated with more than one ID, can result in mapping inconsistencies where one ID in the first database, gets mapped to multiple IDs in the second database. The fraction of non-univocal mappings is indicated in Table 6. Hence, although 40.2% of the Reactome IDs can be mapped to ChEBI (see Table 5), 81.3% of the successfully mapped Reactome IDs are ambiguously mapped to multiple ChEBI IDs.
In some cases, more than 50% of the mappings are non-unique. The highest fractions of non-unique ID mapping occurs when mapping to ChEBI, although when mapping from ChEBI to the other databases, this fraction reduces significantly. When considering Reactome, both mappings to and from this database lead to relatively high number of non-univocal assignments. SLM and SABIO-RK have a significant low ambiguity when mapping from other databases, although as shown in Table 5, only a small fraction in these databases can be mapped from other databases.

Mapping between Databases Using MNXRef ID
Another approach to map IDs from different databases is to use MetaNetX/MNXRef as a bridge. Table 7 shows the fraction of IDs in each database pair that can be mapped through MetaNetX/MNXRef. Again, the differences in coverage between the databases cause this table to be non-symmetric. Figure 3 shows that mapping via MNXRef ID results in more identified mappings than the previous approach that used names. Nevertheless, the overall map is also not high. None of tested databases maps higher than 70% either to or from other databases. The highest match is 67.7% when mapping MetaCyc to SEED. SEED can be mapped fairly well from BiGG, Reactome and KEGG with more than 40% match. Please note that these are all databases specialized in reactions and metabolic pathways. There is almost no overlap between SEED and SLM, the latter specialized in lipids. Databases with overall good match are ChEBI, KEGG, and MetaCyc. Among them, ChEBI has the highest overlap with other databases. Almost 50% of IDs in SEED, Reactome, MetaCyc, KEGG, SABIO-RK, and BiGG can be mapped to ChEBI. However, there is not so much overlap when mapping enviPath (12.8%), LIPID MAPS (13.5%) and especially SLM (0.8%) to ChEBI. The remaining databases have a significant low overlap percentage when mapping via IDs. Especially SLM, there is just a minor part of the database that can be mapped to other databases.
This approach also results in instances of one ID from the first database associated with more than one ID in the target database, an example is provided in Figure 4 and Table 8 summarizes the identified cases.    Name ambiguity and non-unique ID mapping between databases can lead to inconsistencies (different metabolites considered to be equivalent) and included as such in the metabolic model. Table 9 lists some illustrative examples. These examples show that automatic mapping (manual mapping is impossible for large scale models) of compounds between or within databases can lead to introduction of unrealistic reactions that can potentially reduce the accuracy of the predictions of the model.

Discussion
GEMs aim to be comprehensive representations of the metabolism of one organism. They are often built based on more than one database. As explained, the initial step of model constructions is typically automated model drafting. Tool selection will determine with which namespace the model is associated. For instance, modelSEED uses SEED as a reference reaction database while Pathway Tools uses MetaCyc. In the next step in the model building process-manual curation-gap-filling is possibly the most important task. Tools for gap-filling often systematically explore the GEM to identify possible gaps [33]. Other methods rely on additional experimental data such as measured metabolites to identify the gaps [34]. In this step, researchers may use different sources and databases to identify reactions and associated metabolites. Errors might arise due to inconsistencies in this mapping.
A second application of GEMs is the integration and contextualization of 'omics' data such as transcriptomic, proteomic, metabolomic and/or fluxomic data. These applications often require a mapping of metabolite identifiers to match the namespace of the model and that of the database that has been used in the data generation process. Both applications may imply potential problem(s) caused by ambiguous names or identifiers.
Among the 11 explored databases, KEGG, BiGG, ChEBI, MetaCyc, HMDB, and SEED are the most commonly used in metabolic modelling. We calculated the ambiguity of names and the multiplicity of non-systematic identifiers within and between 11 databases. Within the same database, the percentage of identifiers with multiplicity larger than one varies from 0% to 100%, whereas the ambiguity of names ranges from 0.07% to 29.4%. When mapping between databases, these ambiguities and multiplicities lead to larger inconsistencies, and this agree with previous observations regarding small molecules databases [31,35]. The inconsistencies when mapping using metabolite names range from 0% to 81.2%. Similar results are obtained while mapping via MNXRef ID, between databases, as the number of inconsistencies varies from 0% to 83%; however, on average, better results are obtained. Mapping with the databases with the highest number of ambiguous names also results in higher number of inconsistencies than when mapping between other databases. Among the 11 tested databases, Reactome, HMDB, ChEBI, and KEGG are those that show the highest intraand inter-database ambiguity.
Most of the ambiguous names are associated with general compounds such as triacylglycerol, glycan, or protein. These names and IDs represent classes of compounds rather than metabolites with defined structures and are included in metabolic models as they have a clear biological interpretation. However, care should be taken when introducing them in databases and these names should not be included in the list of synonyms for specific compounds, as mentioned in [35]. Using abbreviations to refer to compounds is also highly ambiguous as the same abbreviations can represent different compounds in the same or in different databases.
Our findings show that compound names or IDs cannot be clearly mapped automatically. Even if we use non-ambiguous identifiers, many mappings are still inconsistent because they can link to ambiguous names. MetaNetX solved some of the issues as shown in Table 9. However, not all compounds in the 11 tested databases can be mapped with MNXRef. Mapping from MetaCyc to SEED, SEED to ChEBI, and SEED to KEGG using MNXRef give the highest number of matches, but still only around 60% of compounds matched. Other databases show much lower coverage.
To use MetaNetX/MNXRef ID to map compounds in a GEM, the namespace of the model needs to be related to at least one of the 11 databases considered. However, many models use custom-made naming conventions [14]. For these models, mapping through name is the only option.
Ambiguous namespaces also hamper the (re)use of models from different research labs or organisms. Due to a low level of interoperability, in practice it is impossible to directly compare models, as metabolites can hardly be cross-mapped, which in turn makes it impossible to compare reactions in both models, see examples in Table 9. Nevertheless, comparing models is important and necessary: it helps to reduce the time to build models for closely related species; to combine efforts from different research groups that study the same organism; and to study the metabolic differences between different organisms. In addition, microbial communities are notoriously difficult to characterize. While transcriptomic and proteomic measurements can be associated (to a great extent) to the originating microorganism, it is not possible to do this for metabolites. Therefore, there is a need for models that can help combine both types of measurements. As a result, there are on-going efforts to define modelling frameworks, based on combining GEMs of individual organisms, to characterize the behaviour of the community [21,22,36,37]. Enabling unambiguous mapping will be required to take full advantage of these on-going developments.
Below we have enumerated several recommendations that may increase the level of interoperability of GEMs, facilitating unambiguous mapping • Limit the use of aliases, i.e., compound classes or abbreviations, as synonyms in databases. These aliases increase human readability, but should be clearly distinguished from names and synonyms in the databases and should not be used for mapping. • In the context of metabolic modelling it is frequent and desirable to use compound classes to identify generic compounds [28]. Compounds such as 'biomass' or 'lipid' are often used in GEMs; this does not affect the use of the model, except when predicting or simulating the production (of a specific component) of generic compounds, i.e., when 'lipids' are the main focus of the model. In fact, it is often better to use generic compounds whenever a specific compound is not needed, as they can be universal. For instance, 'biomass' has been used as a standard among the modelling community as an artificial compound that represents the growth objective of the cell [6,17]. Another reason is that often the precise identity of the compounds is not needed and there is a lack of experimental data for their characterization. Therefore, when using generic compounds, it is desirable to add extensive annotation to the model to clearly state which compounds they represent, and for which purpose they are used in the model. These generic compounds are among the most ambiguous entities in the 11 analyzed databases and we therefore advise to exclude them from any automatic mapping process. • Avoid using highly ambiguous names as the sole description of the compound in the model. When referring to these compounds, clear annotation needs to be included to prevent mismatches and inconsistencies. • In addition to human-readable identifiers and database-dependent identifiers, include database-identifiers, such as InChI [15,16], whenever possible for compounds with defined structures. Using InChI can help to fully automate the mapping [28]. However, it should be taken into account that mismatches and errors can also happen because identifiers can also link to incorrect InChI as shown in [29,35,38]. • Model mapping only based on metabolite information can imply certain mismatch due to differences in namespaces, even if systematic identifiers were used. Hence, different mapping strategies, i.e., mapping through encoding genes and network topology [19], should be used to complement name or identifier-based mapping. • GEMs also need to have a unique standard annotation so that they generate the same output even when different tools are used for the simulation. Neal et al. [39] suggest that semantic annotation can help to store and combine models, but these models need to stick to a unique standard annotation format.
Simply deciding a standard database/identifier/annotation to represent metabolites in models will also not help to improve the situation, as they will limit the available model construction tools. Nevertheless, while increasing the level of interoperability none of the presented approaches above can by itself ensure automated mapping without errors. Different approaches need to be combined when translating between namespaces. Manual curation is still required, at least for compounds with highly ambiguous names.
We did analyze the (in)consistency of databases (commonly) used in metabolic modelling but we did not analyze the (in)consistency of GEMs built using different databases. However, since every metabolite in a GEM is usually associated with at least one identifier from biochemical databases such as KEGG, BiGG, SEED, or MetaCyc, every GEM can be considered as a small subset of the identifiers and names from those database(s). Hence, the ambiguity of the compound names in GEMs can be considered to be equivalent the ambiguity of the compound names in the tested databases. Moreover, it should be noted that some databases (such as BiGG) aggregate compounds names used in deposited GEMs and thus mapping of these databases against other databases provide an overall, direct measure of the ambiguity of the compound names in GEMs. In addition to solving mapping inconsistencies, GEM namespace translation can be further improved by using tools that analyze the consistency of the generated models [19].
Finally, our analysis has some limitations. It should be noted that the list of inconsistencies provided represents just an upper bound to the number of possible errors when changing namespaces. We have only studied non-systematic identifier and names. We did not use structure data such as MOL files, we cannot evaluate how many of the consistent mappings are actually correct. We have not included such information in our analysis because it is not often found in metabolic models. In any case, the inconsistencies here described pertain automatic mapping and most (or all) of them should be fixable upon manual curation. Comparing names between databases is not trivial due to heterogeneity issues: our approach may be over simplified, which may reflect in the results shown. It should be noted that in some databases, synonyms are clearly differentiated, in this case, the inconsistency will not arise. However, in many databases considered in this study, synonyms are not well distinguished. For instance, H in MetaCyc belong to the synonyms list of both proton and L-histidine. This is one of the primary causes of ambiguous mapping. In addition, MetaNetX data that was downloaded at the moment of conducting this study contained data from the originating databases that was produced in 2017 and some in 2016 (see Section 4.1 for more detail). As databases change over time, a similar analysis with the most recent database updates might lead to different results. Stat Roma pristina nomine, nomina nuda tenemus.

Data Collection and Preprocessing
Data about compound identifiers and synonyms were downloaded from MetaNetX [25]. MetaNetX is a repository of GEMs and biochemical pathways. It contains entries from some of the most relevant databases that have been used in GEMs construction and simulation such as KEGG, BiGG, MetaCyc and SEED [32]. The platform (http://www.metanetx.org/) allows access to these databases as well as provides tools to map/translate them. In this study, the chem_xref.tsv file was downloaded from the MetaNetX website on 31 October 2018. In the following, we provide a brief description of the content of these databases.
BiGG models [40] is a knowledge database of genome scale metabolic models (GEMs). Currently, it contains 85 high-quality, manually curated GEMs, 24,311 reactions, and 7339 metabolites (data retrieved on 30 November 2018 from http://bigg.ucsd.edu/). In BiGG, the metabolite is identified as the abbreviation of its name. For example, '10fthf' for 10-Formyltetrahydrofolate. MetaNetX obtained data from BiGG on 11 April 2017.
Model SEED [41] is a platform to construct GEMs that uses its own database for metabolites and reactions. This database combines information from KEGG and existing metabolic models in a non-redundant set of reactions. In this database, metabolite identifiers start with "cpd" and followed by a 5 digits number. For example, D-Glucose-1-Phosphate is cpd00089. The database can be downloaded from https://github.com/ModelSEED/ModelSEEDDatabase/tree/master/Biochemistry. MetaNetX obtained data from SEED on 13 April 2017.
HMDB [44], (http://www.hmdb.ca), is a comprehensive and curated collection of human metabolite and human metabolism data. Data in MetaNetX was obtained on 12 April 2017.
KEGG [13] (http://www.KEGG.jp). The Kyoto Encyclopedia of Genes and Genomes is a resource that provides information about pathways and reactions in organisms. In KEGG, metabolites started with a letter 'C' (compound) and followed by 5-digit numbers. For example, D-Glucose-1-Phosphate is identified as C00103. Data in MetaNetX were obtained on 12 April 2017.
LIPID MAPS [45]. (http://www.lipidmaps.org). Is a database that contains structures and annotations of biologically relevant lipids. Data in MetaNetX were obtained on 13 April 2017.
MetaCyc [12]. (http://metacyc.org). Is a curated database of metabolic pathways. All data in MetaCyc are experimentally validated. The metabolite is identified by its full name. For example, D-glucose-1-Phosphate is D-glucose-1-phosphate. The database can be downloaded here http: //bioinformatics.ai.sri.com/ptools/flatfile-format.html. Data in MetaNetX were obtained on 13 April 2017.
Reactome [46]. The original data file was modified prior to analyzing. The modification includes the removal of the description part, of IDs starting by bigg:M as they are not real compound ID in BiGG, and the removal of 'biomass' compounds. Data from MetaNetX were organized in four columns in this order: compound ID in original database with database indicator in front, for example bigg:10fthf, corresponding compound IDs in MetaNetX, evidence and description (name).

Intra-Database Analysis
For intra-database consistency analysis, the first, the second and the last column of the MetaNetX data file were used for mapping. Name ambiguity was calculated as the number of ID each name links to. Similarly, the name multiplicity of each ID was calculated as the number of names it refers to.

Inter-Database Analysis
We mapped compound IDs between databases. A direct map between IDs in the database is not possible. The tested databases use different system for compound identifiers. For instance, in KEGG, the compound ID is a capital 'C' following by a 5-digit numbers, i.e., 'C00002' for ATP. In contrast, in BiGG, the compound is identified as abbreviation of its name, for example, 'atp' for ATP. Therefore, to map from one ID in database 1 to other ID in database 2, we used either the associated compound name or the associated MNXRef ID. That is also what MNXRef is meant for, as a link between databases.
Mappings via name were done by link from name to name in one database to the other. We first identified all compound names from one database, i.e., database A. From this list, we counted the number of IDs in the second database, i.e., database B, that link to each name in the database A. It means in this case, we did not use any string processing algorithm, i.e., processing case sensitive, underscore, or brackets, the name was mapped as exact match. Ambiguous names were treated as normal name in the database. In other words, we did not distinguish ambiguous names from unambiguous names from the mapping.

Conflicts of Interest:
The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.