Nanoinformatics: Emerging Databases and Available Tools

Nanotechnology has arisen as a key player in the field of nanomedicine. Although the use of engineered nanoparticles is rapidly increasing, safety assessment is also important for the beneficial use of new nanomaterials. Considering that the experimental assessment of new nanomaterials is costly and laborious, in silico approaches hold promise. Several major challenges in nanotechnology indicate a need for nanoinformatics. New database initiatives such as ISA-TAB-Nano, caNanoLab, and Nanomaterial Registry will help in data sharing and developing data standards, and, as the amount of nanomaterials data grows, will provide a way to develop methods and tools specific to the nanolevel. In this review, we describe emerging databases and tools that should aid in the progress of nanotechnology research.


Introduction
Nanotechnology enables the design and assembly of components with dimensions ranging from 1 to 100 nanometers (nm) [1]. The word nano, which is derived from the Greek word for dwarf, has been part of the scientific nomenclature since 1960 [2]. Macroparticles, which are particles with dimensions greater than 100 nm, have properties that are comparable to those of bulk materials; however, when particles are smaller than 100 nm, their behavior is influenced by atomic, molecular, and ionic interactions. Nanoparticles show novel and changeable properties due to quantum effects [3].

[Figure caption, incomplete: ... [20] and GNU Image Manipulation Program (GIMP) [21]. All the images in this figure were obtained from Wikimedia Commons.]

Growth of Bioinformatics and Applications for Nanoinformatics
In the 1980s, it became mandatory to deposit all published DNA sequences in a central repository. At the same time, a standard was adopted by journals to make gene and protein sequences freely accessible to all and to compile all data in the form of openly available databases. The open accessibility of DNA sequences in databases such as GenBank [22] and the availability of protein structures in databases such as the Protein Data Bank [23] have motivated many researchers to develop powerful methods, tools, and resources for large-scale data analysis [24]. Simultaneously, growth in computational speed and memory storage capacity has led to a new era in the analysis of biological data. Bioinformatics has emerged as a powerful discipline and almost 1000 databases are currently publicly available [25] along with a large number of bioinformatics tools [26]. The growth and development of bioinformatics can provide valuable lessons for applying the same practice to benefit nanoinformatics development [27].

Databases
An overview of nanomaterial-related databases is provided in Table 1. The emergence of recent databases such as Investigation Study Assay (ISA) tab-delimited (TAB) format (ISA-TAB-Nano) [19], Cancer Nanotechnology Laboratory (caNanoLab) [17], and Nanomaterial Registry seems encouraging and is discussed below.

Table 1 (nanomaterial-related databases; number, name, description, reference)
1. [name not recovered]: The knowledgebase serves as a repository for annotated data on nanomaterial characterization (purity, size, shape, charge, composition, functionalization, and agglomeration state), synthesis methods, and nanomaterial-biological interactions. [28]
2. InterNano: InterNano supports the information needs of the nanomanufacturing community by bringing together resources related to advances in applications, devices, metrology, and materials that will facilitate the commercial development and/or marketable application of nanotechnology. [29]
3. Nano-EHS Database Analysis Tool: This web tool provides a quick and thorough synopsis of the Environment, Health and Safety Database. [30]
4. Nanoparticle Information Library (NIL): The goal of the NIL is to help occupational health professionals, industrial users, worker groups, and researchers organize and share information on nanomaterials, including their health and safety-associated properties. [31]

ISA-TAB-Nano
Motivated by an absence of standardization in representing and distributing nanomaterial data, ISA-TAB-Nano was established to promote public data sharing [19]. The ISA-TAB-Nano file format offers a common framework to record and integrate nanomaterial descriptions and builds on the open-source ISA-TAB file format. ISA-TAB-Nano defines four file formats for the distribution of the data: (1) the Investigation file; (2) the Study file; (3) the Assay file; and (4) the Material file. The Investigation file provides reference data about each investigation, study, assay, and protocol. The Study file contains the names and attributes of procedures applied for preparing samples for examination (sources and characteristics of biospecimens). The Assay file contains the values of measured endpoint variables and references to external files for each analyzed sample. The Material file describes the material sample and its structural and chemical components. File format templates are available [36], which also provide information on using ontology terms to support standardized descriptions and to assist in searching for and incorporating information. An ontology is a formal representation of knowledge abstracted from the growing body of biological science in a coded form that can be translated into a programming language [43,44]. One benefit of ontologies is that they support description logic, which can be used to query a dataset. They also make it possible to analyze datasets that would otherwise be difficult to search and compare. Ontologies are used in databases in various ways; for example, they are expressed in a computer-readable form that can be interpreted by computers as well as by domain experts. Thomas et al. have provided a review of ontology terms used for standardized descriptions [45]. A list of ontology-related resources for nanotechnology is shown in Table 2.
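As a rough illustration of the four-file layout described above, the following Python sketch classifies files by kind and reads a tab-delimited table. The filename prefixes (i_, s_, a_, m_) reflect the usual ISA-TAB naming convention as we understand it and should be treated as an assumption here; the function names, column headers, and values are hypothetical and are not taken from the ISA-TAB-Nano templates [36].

```python
import csv
import io
from enum import Enum


class FileKind(Enum):
    # Prefixes are an assumed naming convention, not quoted from the specification.
    INVESTIGATION = "i_"
    STUDY = "s_"
    ASSAY = "a_"
    MATERIAL = "m_"


def classify(filename):
    """Return the FileKind implied by the filename prefix, or None."""
    for kind in FileKind:
        if filename.startswith(kind.value):
            return kind
    return None


def read_tab_table(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical Material-file content; the headers are illustrative only.
example = "Material Name\tMaterial Type\nNP-001\tfullerenol\n"
rows = read_tab_table(example)
kind = classify("m_example.txt")
```

Because each file is plain tab-delimited text, a generic reader of this kind is enough for a first inspection; validating the content against the templates would require the actual ISA-TAB-Nano specification.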

caNanoLab
caNanoLab is a data repository that researchers can use for the submission and retrieval of information on nanoparticles, including their composition, function (e.g., therapeutic, targeting, diagnostic imaging), physical (e.g., size, molecular weight), and in vitro experimental characterizations (e.g., cytotoxicity, immunotoxicity) [17]. It also encourages data sharing and analysis across the cancer community to accelerate and authenticate the use of nanoparticles in biomedicine. To facilitate data submission, web-based forms are offered so that submitters can restrict the visibility of their records to be private, to be distributed to particular collaboration groups, or to be public. In addition, caNanoLab can be utilized for discovery purposes by using it to search publicly accessible physical and in vitro characterizations while providing access to the associated publications. Moreover, the query results are downloadable in spreadsheet-based formats. The caNanoLab project team is also currently engaged with ISA-TAB-Nano to facilitate the submission and exchange of data across the breadth of nanotechnology information. Future features of caNanoLab may include support for the validation, import, and export of ISA-TAB-Nano files by means of customized ISA-TAB tools. A summary of data available from caNanoLab is presented in Figure 2.

Nanomaterial Registry
For data to be useful to a wide community of researchers, the data must be easily available in a usable form. However, providing usable data is a task that becomes increasingly difficult as the quantity of information grows. Curation is particularly important for nanomaterials because the experimental conditions and the process of sample preparation can affect the measured data. Data curation is best achieved through increasing proficiency in assessing a specific feature of a specific dataset. To increase the efficiency of curation, the Nanomaterial Registry developed a method to assist the combinatorial analysis of a variety of datasets [16]. The Nanomaterial Registry is a web-based database that offers curated data for characterized nanomaterials. Its web-based curation tool supports both (a) manual data entry and (b) programmatic data upload. The curation tool resides in a protected environment that contains user accounts, roles, and permissions. User roles are supported by an administration tool that offers access, controls, and permissions for curators. As the curation technique improves and the dataset grows, these resources will become increasingly available to researchers and information analysts. Data available from the Nanomaterial Registry are shown in Figure 3.

Text Mining
Text mining is a computerized process for utilizing the enormous amount of knowledge existing in the literature [56,57]. Three important approaches to text mining are currently prevalent in the field of biology: (1) co-occurrence-based methods; (2) rule-based or knowledge-based approaches; and (3) statistical or machine learning-based approaches. Co-occurrence-based methods scan for specific terms or concepts that occur in the same sentence, or sometimes in the same abstract, and then hypothesize an association between them. Rule-based systems make use of general knowledge about how language is structured, specific knowledge about how biologically significant information is stated in the literature, and the relationships that these two types of knowledge can have with one another. In contrast, statistical or machine learning-based systems function by building classifiers that may operate at any level. A practical obstacle for all three approaches is that free full-text access to a large proportion of scientific journals is still unavailable. In several disciplines, such as chemistry, it is still hard to find enough abstracts for large-scale analysis [56,57]. Although the application of text mining to nanotechnology is at an early stage, some recent studies have been published [58][59][60]. A few of the available text-mining tools are as follows: (1) the Google Scholar search engine [61]; (2) the GoPubMed engine [62]; (3) Textpresso [63]; and (4) BioRAT [64]. When literature is available for analysis, it is possible to predict interactions via text mining, for example, using iHOP [65], an online service through which the names of genes and proteins in sentences from PubMed abstracts are hyperlinked and assembled into a conceptual network. Likewise, it might be possible to use text mining to predict nanoparticle-protein interactions for nanotechnology.
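The co-occurrence idea can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (naive sentence splitting and case-insensitive substring matching); the function names and the toy text are invented for the example and do not correspond to any of the tools cited above.

```python
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Very naive sentence splitter on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def cooccurrences(text, terms):
    """Count how often each pair of terms appears in the same sentence."""
    counts = Counter()
    lowered_terms = sorted(t.lower() for t in terms)
    for sentence in split_sentences(text):
        s = sentence.lower()
        # Substring matching: crude, but sufficient for a sketch.
        present = [t for t in lowered_terms if t in s]
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Invented toy sentences, used only to exercise the function.
toy_text = (
    "Fullerenol binds lysozyme in this toy sentence. "
    "Lysozyme is mentioned alone here. "
    "Fullerenol and lysozyme appear together again."
)
pair_counts = cooccurrences(toy_text, ["fullerenol", "lysozyme", "albumin"])
```

A pair that co-occurs often is only a candidate association; a real pipeline would need proper tokenization, synonym handling, and some statistical test before hypothesizing a nanoparticle-protein interaction.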

Molecular Modeling
The structures of nucleic acids and many proteins have been determined by crystallography, nuclear magnetic resonance, electron microscopy, and many other techniques. Structural biology provides information on the static structures of biomolecules. However, in reality, biomolecules are highly dynamic, and their motion is important to their function. Different experimental techniques are available to help study the dynamics of biomolecules. Computational power continues to increase, and the development of new theoretical methods offers hope of solving scientific problems at the molecular level. All the theoretical methods and computational techniques that are used to model the behavior of molecules are collectively termed molecular modeling. In practice, macromolecules can be studied only with molecular mechanics, since quantum methods based on the Schrödinger equation, such as ab initio, semi-empirical, and density functional theory (DFT) methods, require extensive computational time. Molecular mechanics uses classical physics and relies on a force field. Ab initio and semi-empirical methods are based on approximate solutions of the Schrödinger equation. In addition, semi-empirical methods need empirical parameters (i.e., parameterization) for the calculations. DFT methods are based on electron density. Novel molecules are best studied with ab initio or possibly DFT calculations, since the parameterization intrinsic to molecular mechanics or semi-empirical methods makes them unreliable for molecules that are different from those used in the parameterization. The energies of molecules such as proteins or DNA can be calculated using molecular mechanics; small molecule energies can be calculated using ab initio and semi-empirical methods [66,67]. An extensive body of three-dimensional molecular structures, which is essential to molecular modeling, can be visualized using freely available software.
A list of software for visualizing macromolecules is presented in Table 3.

Docking
Docking is an important computational tool in the drug discovery process and is used specifically to predict protein-ligand interactions. The two basic features of docking software are docking accuracy and scoring reliability [84]. Docking accuracy indicates how similar the predicted ligand binding pose is to the ligand conformation that is determined experimentally, whereas scoring reliability indicates how well ligands are ranked according to their affinities. Docking accuracy and scoring reliability are used to assess the searching algorithm and the scoring functions, respectively, of docking software. The numerous searching algorithms used in docking software vary with respect to randomness, speed, and the area covered. Most of the searching algorithms perform well when tested against known structures. In contrast, scoring functions are rarely successful, and a number of issues need to be addressed to improve this aspect of docking. Many types of docking software are currently available; AutoDock is widely used and freely available [85]. At present, as protein-nanoparticle complexes are difficult to study using experimental approaches, computational tools show promise. Structural models for carbon nanomaterials such as carbon nanotubes and fullerenes are available, and thus many protein-nanoparticle interactions can be studied computationally. Fullerenols, which are derivatives of fullerenes, are currently in trial for diagnostic and therapeutic uses, although insufficient information is available about the structural interactions and toxicity of fullerenols in biosystems. Yang et al. computationally studied the interactions in the fullerenol-lysozyme complex and compared the computational results with experimental results [86]. The structure of fullerenol was constructed using Chem3D Ultra and subjected to blind docking in the active site of chicken lysozyme. The fullerenol bound close to Tryptophan 62, and a π-π stacking interaction was observed. 
The predicted computational binding result was consistent with all known experimental results. As a result, new proteins that can interact with fullerenol can be identified for future applications as drug targets [87]. The addition of molecular dynamic (MD) simulations to docking studies can also assist in studying interactions. A list of well-known docking software is provided in Table 4.
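The two evaluation criteria described above can be illustrated with a short Python sketch. It assumes that docking accuracy is quantified as the root-mean-square deviation (RMSD) between predicted and experimental atom positions, which is a common choice but is our assumption rather than something stated in the cited studies, and that lower scores indicate stronger predicted binding. The coordinates, ligand names, and scores are invented, and no structural superposition is performed.

```python
import math


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(predicted, reference):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(predicted))


def rank_by_score(scores):
    """Order ligand names from best to worst, assuming lower score = better."""
    return sorted(scores, key=scores.get)


# Invented coordinates and scores, for illustration only.
reference_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted_pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
pose_rmsd = rmsd(predicted_pose, reference_pose)
ranking = rank_by_score({"ligand_a": -7.2, "ligand_b": -9.1, "ligand_c": -5.0})
```

A small RMSD speaks to the quality of the searching algorithm, whereas agreement between the computed ranking and measured affinities speaks to the scoring function; the two can fail independently.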

Quantitative Structure-Activity Relationships (QSAR)
QSAR is based on the idea that similar structural properties generate similar biological effects [94]. It is widely used in drug discovery to establish associations between the structural and electronic properties of potential drug candidates and their binding affinities for common macromolecular targets. QSAR studies were originally based on a single physicochemical property such as the solubility or the pKa (acid dissociation constant) value to elucidate the biological effect of a molecule (one-dimensional QSAR). Two-dimensional QSAR further examines the connectivity of a compound by considering the physicochemical properties of single atoms and functional groups and their roles in biological activity. Current models contain three-dimensional structural descriptors such as the length or width of a substituent. With a suitable force field, it is theoretically possible to determine the feasible binding mode of any given existing molecule to a target protein. Most three-dimensional QSAR models depend on the superposition of ligands, the identification of which is impossible without sufficient structural information for the target protein. The results of QSAR can be used to understand the interactions between functional groups in the molecules and those of their target. QSAR methods are fast and can deal with real complex biological systems. QSAR models provide an alternative to animal testing and offer the advantages of reduced time for experimentation and minimized costs [95]. A few QSAR studies related to nanoparticles have been published to date [96][97][98][99][100]. QSAR can also predict the toxicity of nanoparticles. In addition, it can support targeting and filling gaps in knowledge for known nanoparticles [101]. QSAR tools are presented in Table 5. Recent reviews provide clear future directions for the field [102,103].
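The simplest case described above, a model built on a single physicochemical property, can be sketched as an ordinary least squares fit. The Python example below is a minimal illustration, not a method taken from the cited studies; the function names are hypothetical, and the descriptor and activity values are synthetic numbers chosen to lie on a straight line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new descriptor value."""
    slope, intercept = model
    return slope * x + intercept


# Synthetic numbers chosen to lie exactly on y = 2x + 1; not measured data.
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [3.0, 5.0, 7.0, 9.0]
model = fit_line(descriptor, activity)
estimate = predict(model, 5.0)
```

Real QSAR models use many descriptors and require validation on compounds outside the training set; the sketch only shows the structure-to-activity mapping at its smallest scale.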

High-Throughput Screening Data Analysis Tools (HDAT)
HDAT is a set of web-based high-throughput screening (HTS) data analysis tools [110,111]. The use of HTS in toxicity studies of engineered nanomaterials requires tools for quick and consistent processing and analyses of large HTS datasets. To facilitate this, a web-based platform was developed that presents statistical methods appropriate for toxicity data on engineered nanomaterials. It also offers different plate normalization methods, different HTS summarization statistics, self-organizing map (SOM)-based clustering analysis, and visualization of raw and processed information using heat maps and SOMs. HDAT has been successfully applied in the analysis of a number of HTS studies on the toxicity of engineered nanomaterials, thus enabling analysis of toxicity mechanisms and the development of data-driven structure-activity relationships for nanomaterial toxicity. HDAT offers online instructions and video demonstrations of basic operations such as data formatting and data uploading.
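To give a sense of what plate normalization involves, the following Python sketch applies z-score normalization to the wells of a single plate. Z-scoring is one common normalization choice and is used here as an assumption; we do not claim that it is the method HDAT implements. The function name, well labels, and readings are hypothetical.

```python
import statistics


def zscore_normalize(plate):
    """Rescale well readings to zero mean and unit (population) standard deviation."""
    values = list(plate.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("plate has no variation; z-scores are undefined")
    return {well: (v - mean) / sd for well, v in plate.items()}


# Invented raw readings for four wells of a hypothetical plate.
raw_plate = {"A1": 2.0, "A2": 4.0, "B1": 4.0, "B2": 6.0}
normalized = zscore_normalize(raw_plate)
```

Normalizing each plate separately makes readings comparable across plates before summarization or clustering; robust variants based on the median are often preferred when a plate contains many active wells.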

MD Simulations
The most broadly used computational technique is MD simulation, which elucidates the structure-function relationships in biomolecules. In MD simulation, atoms and molecules are allowed to interact over time at a given temperature and are assumed to follow the laws of classical mechanics. MD simulation offers a detailed description of atomic motion, and the forces acting on the atoms are computed using a model known as a force field [158]. Such a simulation may serve as a "computational microscope" [159] that reveals interactions that may be difficult to observe experimentally. Dror et al. explained MD simulation as follows: "Static structural information might be likened to a photograph of a football game; to understand more readily how the game is played, we want a video recording" [159]. Computer simulation of biomolecule-nanomaterial interactions is gaining popularity as a complement to experimental techniques. On entering the human body, nanoparticles interact with biomolecules and begin a series of nanoparticle-biological interactions that rely on forces as well as dynamic biophysicochemical interactions [11]. Many interactions such as hydrogen bonds, van der Waals (VDW) forces, and electrostatic, hydrophobic, and π-π stacking interactions contribute to these protein-nanoparticle complexes [11]. The VDW attraction increases as the atoms at the interface come closer together, up to a distance close to the sum of the VDW radii of the two atoms. The VDW force acts at a short range and decreases significantly as the contacting atoms move apart (Lennard-Jones potential or 6-12 potential). Hydrogen bonding is another important player in intermolecular interactions. Although the hydrogen bond is much stronger than the VDW interaction, the number of hydrogen bonds connecting proteins and nanoparticles is much smaller than the number of VDW interactions. 
Hydrophobic interactions are an entropic effect originating from the repulsion of ordered water molecules from a nonpolar surface. The π-π stacking is an attraction involving aromatic rings. However, as only a few amino acids contain aromatic rings and the rings are often buried within the hydrophobic core of the proteins, only specific interactions are possible. A recently published review of molecular modeling in structural nanotoxicology discusses methods for predicting interactions between nanomaterials and biomolecules [160]. The review also discusses how nanoparticles of different sizes, shapes, structures, and chemical properties are able to influence the organization and function of nanomachinery in cells. In summary, computer modeling studies have become a dominant tool in risk assessment studies. The molecular modeling tools currently available are shown in Table 6.
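The short-range character of the VDW interaction mentioned above can be made concrete with the Lennard-Jones 12-6 potential, U(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The Python sketch below evaluates the potential and the corresponding radial force in reduced units (ε = σ = 1); the function names and the choice of units are ours and are meant only for illustration, not as a description of any particular force field.

```python
def lennard_jones(r, epsilon, sigma):
    """12-6 potential: U(r) = 4*epsilon*((sigma/r)**12 - (sigma/r)**6)."""
    if r <= 0:
        raise ValueError("distance must be positive")
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lennard_jones_force(r, epsilon, sigma):
    """Radial force F(r) = -dU/dr; positive values are repulsive."""
    if r <= 0:
        raise ValueError("distance must be positive")
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


# Reduced units (epsilon = sigma = 1), chosen for illustration.
r_min = 2.0 ** (1.0 / 6.0)           # separation at the energy minimum
u_at_sigma = lennard_jones(1.0, 1.0, 1.0)
u_at_min = lennard_jones(r_min, 1.0, 1.0)
f_at_min = lennard_jones_force(r_min, 1.0, 1.0)
u_far = lennard_jones(3.0, 1.0, 1.0)
```

The potential crosses zero at r = σ, reaches its minimum of −ε at r = 2^(1/6)σ, where the force vanishes, and is already less than 1% of ε in magnitude at r = 3σ, which is why MD codes commonly truncate this term beyond a cutoff distance.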

Imaging
Medical imaging technologies are used for diagnosis as well as for many other biomedical applications. The use of nanomaterials in imaging applications is growing rapidly. Molecular imaging can assist in early diagnosis and also provides information on pathological processes. Compared with currently used fluorescent proteins and small-molecule dyes, nanotechnology imaging probes can offer signals that are several-fold brighter and more stable. Nanoparticles, predominantly those made of precious metals, enhance signals in imaging approaches. For example, semiconductor quantum dots are tiny light-emitting particles that are used as fluorescent probes for molecular imaging and medical diagnostics. However, there are still many disadvantages associated with imaging technologies, such as tissue specificity and systemic toxicity. Progress in nanotechnology may overcome these challenges and offer more sensitive and specific information [177]. In vivo and in vitro imaging applications of nanomaterials have been reviewed [3,178]. The new informatics methods link tissue banks with histology data in order to offer enhanced image annotation down to the nano level [13]. Due to the increase in the use of image analysis tools and methods in medical applications, well-supported tools and databases that can be used for nanotechnology have become increasingly available; these are shown in Table 7 [179]. Image databases and tools used in fluorescence microscopy can also be used to enhance image annotation with novel methods as the amount of experimental data at the nanolevel grows.

Computational Design of Biomolecular Nanostructures
DNA and protein self-assemblies are used to build nanostructures because DNA and proteins are programmable and because an abundance of chemical techniques is available for their manipulation. For example, DNA is a valuable self-assembly material because, with appropriate design of its base sequences, assembly is enabled through specific interactions among complementary base pairs [180]. Many self-assembling molecules that can be used as vehicles for drug delivery are built using DNA [181]. Some of the software currently available for designing three-dimensional structures of carbon nanotubes and for DNA origami design is listed in Table 8.
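The base-pairing rule that underlies DNA self-assembly can be expressed in a few lines of Python. The sketch below computes the reverse complement of a strand and checks whether two strands are exact Watson-Crick partners; it is a toy illustration with hypothetical function names and an arbitrary sequence, and it is not how the design software in Table 8 works internally.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the strand that would pair with `seq`, written 5' to 3'."""
    try:
        return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))
    except KeyError as err:
        raise ValueError(f"unexpected base: {err.args[0]}") from None


def can_hybridize(strand_a, strand_b):
    """True if strand_b is the exact reverse complement of strand_a."""
    return reverse_complement(strand_a) == strand_b.upper()


# Arbitrary toy sequence, not taken from any published design.
partner = reverse_complement("ATGC")
```

Design tools build on this rule at scale, choosing sequences so that each strand has one intended partner and as few unintended partial matches as possible.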

Specific Challenges and Opportunities in Nanoinformatics
The databases and tools presented here highlight the increasing resources that are available to users. Their improvement is accompanied by the continuous expansion in the number of databases (Table 1) and various computer programs (Tables 6 and 7). The new database initiatives such as ISA-TAB-Nano, caNanoLab, and Nanomaterial Registry will facilitate data sharing, data standards, and, depending on the growth of nanomaterials data, the development of methods and tools specific to the nanolevel. Moreover, the growth of these fields requires that the ISA-TAB-Nano standard be adopted by journals and other organizations to ensure consistent representation of nanotechnological data [19]. It should be noted that open accessibility and the liberty to use published DNA sequences in databases such as GenBank have encouraged scientists to build powerful methods, tools, and resources that have substantially enriched the field of bioinformatics [24]. It should also be noted that nanomaterial data are much more complex than sequence information or molecular data. An additional challenge is the limited availability of scientific literature from subscription-based journals in nanotechnology, particularly in chemistry, for which even abstracts are inaccessible for large-scale analyses; this is a major drawback for text mining and nanoinformatics analysis [193]. With respect to QSAR, a recent review provided a summary of the advances and a clear roadmap for future research and also identified major gaps in the field [102]. Molecular modeling and imaging in nanotoxicology appear promising, as many software tools mentioned in this review are free under the General Public License, so that there is freedom to use, study, share, and modify the software. Although modeling the behavior of nanomaterials presents different challenges compared to those for drugs, it is nonetheless possible to adapt new methods and develop specific tools from existing tools. 
There are also anticipated beneficial side effects of nanoinformatics development, such as the possible minimization of animal studies. Recent initiatives and efforts emerging from the nanotechnology community are encouraging.

Conclusions
The databases and tools presented herein will be helpful to the nanotechnology community. We highlight the following recommendations that we hope will assist this community: (i) Consensus standards and ontology terms should be used in nanotechnology-related research; (ii) Journals should adopt formats to ensure uniform representation of information; (iii) Authors should submit their data to the databases; (iv) Similar to the current model of molecular biology databases, free and unrestricted access to nanomaterial data would enable nanoinformatics researchers to create tools that will benefit the broad nanotechnology community; (v) Open access and freely available nanotechnology literature will also benefit the nanotechnology community.